Uberblic.org Release
In early 2009 I had the idea for a linked data integration platform, capable of importing and mapping all of the publicly available data sets in the Linked Data cloud into a single repository, represented in a coherent ontology, reconciled and accessible through central APIs.
What started out as an idea a year ago has now become reality. Today, we have released uberblic.org, a service for integrating the web of data.
The project announcement has a summary and screencast of what Uberblic is and how it works.
I’m very excited about today’s release! Many thanks to my friends at Talis for their help and support!
The harm of “non-commercial”
Many data sources on the web are licensed under non-commercial licenses. There are understandable reasons for data publishers to choose non-commercial licensing: Their data might have been funded with tax money, or they license the content from other content creators under contracts that only allow non-commercial republishing.
The intention is that the data should be available to anyone who doesn’t earn any income by using the data.
But there’s a huge problem: Not earning any income isn’t equivalent to non-commercial usage. A consultation done by Creative Commons has shown that a large share of content publishers consider all services and web sites run by a for-profit organisation to be commercial usage, even if those offerings do not generate any income at all.
That creates great uncertainty for companies and startups: You don’t know whether a data license actually permits the use of content in your free service. Instead, you need to sign custom agreements with each data publisher, confirming that your definition of non-commercial matches theirs.
I doubt that was the intention behind data licenses such as Creative Commons Noncommercial…
Let me cite from CC-BY-NC 3.0: “You may not exercise any of the rights granted to You in Section 3 above in any manner that is primarily intended for or directed toward commercial advantage or private monetary compensation.”
Both “primarily” and “commercial advantage” are quite vague, and that’s the main harm:
License terms should not be vague!
DBpedia Ontology – designed to break?
One of the bigger changes of the upcoming DBpedia 3.4 release is the ontology’s new URI schema: Property URIs are now partitioned by the property’s domain. While before it was http://dbpedia.org/ontology/architect, now it is http://dbpedia.org/ontology/Building/architect. In the past, there was a statement that http://dbpedia.org/ontology/architect has the rdfs:domain http://dbpedia.org/ontology/Building; now this fact is additionally encoded in the URI.
Looks OK? Maybe at first sight. But in my opinion, it’s a big mistake.
Let’s first look at the reasons for that change in the URI schema. It aims to provide a solution for semantically ambiguous properties. For example, the word “length” can be used to describe the long dimension of an object, like the length of a bridge. But it can also be used as a synonym for the runtime of a song or movie (like 90 minutes). Now, with only the one URI http://dbpedia.org/ontology/length, it’s unclear whether the range of that property is measured in metres or minutes (let alone inches, feet, and miles). So in order to properly represent the two different semantics, we need two different URIs. Consensus so far…
Now there are two possible solutions: Either you use two different property IDs (such as length and movie_length), or you use two different namespaces. The DBpedia team chose the latter. The problem is that they did the partitioning for every single property, even the unambiguous ones. And since the DBpedia ontology wasn’t carefully designed upfront, but instead grows through community refinement, that leaves us with URIs that will most probably break in the future.
See for example http://dbpedia.org/ontology/ceo. Its domain is http://dbpedia.org/ontology/SoccerClub, which seems kind of strange, but is due to the way the ontology was created: The Infobox Soccer Club was probably the only one with a ceo property, so the domain of http://dbpedia.org/ontology/ceo became SoccerClub. Clearly, once the community starts refining the ontology, the domain will be changed to Company or something similarly reasonable. That wouldn’t be much of a deal, since we’d only see a changed statement about the ceo property. But with the new URI schema, SoccerClub is part of the property’s URI (http://dbpedia.org/ontology/SoccerClub/ceo). We have a problem…
If we changed the property URI to http://dbpedia.org/ontology/Company/ceo now, all existing queries would break. If we leave the URI as it is, the encoded SoccerClub becomes misleading. It would even have been better to use http://dbpedia.org/ontology/abc123/ceo instead, since that string isn’t supposed to carry any meaning.
The simple, straightforward solution would have been: use different IDs to disambiguate properties which actually need to be disambiguated. Or to do it like Freebase and partition very carefully by topic…
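To make the breakage concrete, here is a minimal sketch in Python. The helper names flat_uri and partitioned_uri are made up for this illustration and are not part of DBpedia; the change of the domain from SoccerClub to Company is the hypothetical refinement described above.

```python
ONTOLOGY = "http://dbpedia.org/ontology/"


def flat_uri(prop: str) -> str:
    """Old scheme: the domain is a separate statement, not part of the URI."""
    return ONTOLOGY + prop


def partitioned_uri(prop: str, domain: str) -> str:
    """New scheme: the property's domain is encoded in the URI."""
    return ONTOLOGY + domain + "/" + prop


# URIs that a client stored in its queries while the domain was SoccerClub.
stored_flat = flat_uri("ceo")
stored_partitioned = partitioned_uri("ceo", "SoccerClub")

# Hypothetical refinement: the community changes the domain to Company.
new_domain = "Company"

# Does the stored URI still match the URI the ontology uses afterwards?
flat_still_valid = stored_flat == flat_uri("ceo")
partitioned_still_valid = stored_partitioned == partitioned_uri("ceo", new_domain)

print(flat_still_valid)         # True
print(partitioned_still_valid)  # False
```

With the flat scheme only the rdfs:domain statement changes, so stored queries keep working. With the partitioned scheme the identifier itself changes.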
Linked Data at the New York Times
Good news for the Linked Open Data community: Evan Sandhaus of the New York Times announced today at the International Semantic Web Conference in Washington their new service http://data.nytimes.com. Through that service, the NY Times is publishing their large annotation vocabulary which is used to tag all their articles as Linked Data, and those tags link to DBpedia and Freebase. So it will be possible in the future to combine news articles about a subject with structured data from the Linked Data cloud.
The real beauty will come in when other large publishers such as the BBC follow, enabling developers to gather news stories about a topic across news sites. And I’m sure this will happen very soon…
What’s wrong with the Linked Data world, part 1 – Keyword Search
Usually I don’t publicly point my finger at the work of others that I think is wrong. Still, I believe it’s time for an analysis of what’s going on in the Linked Data world, and that will include pointing out what I believe is wrong with some of the work that lots of effort has been invested in. I don’t mean to offend anyone, and I’ve certainly made lots of mistakes in my own work. But certain elephants in the room need to be pointed out.
So, let’s begin today with the topic that’s been around since the very first days of the Linked Data movement: Search. By that I mean the services that let the user enter a keyword and get back a set of Linked Data URIs, crawled from the whole Linked Data cloud, that contain that keyword. To make my criticism easier to understand, I’d first like to describe what I see as the motivation to build such a service.
Whenever talking about the killer apps of the internet, search comes up first. Without Altavista, Yahoo and Google, the web would probably never have been as successful as it is today. Search enables people to find information from the web of distributed and linked documents, by giving them a little search box for keywords. So it seems obvious to take that approach, adapt it to the web of linked data, and you get another killer app. Unfortunately, it’s a little bit too obvious, leading people to copy instead of adapt. What people did was copy the methods that Google uses, i.e. providing a search box for keywords, instead of adapting the principle behind them.
What’s that principle? Building a centralized index of the greatest common denominator of documents on the web, i.e. the words occurring in these documents – our beloved keywords. Within a large set of documents, the words are the content, and the only commonality is the occurrence of words. Natural language processing isn’t mature enough yet to see commonalities in the meaning of texts, so we have to stick with plain words for a while. It’s our best shot.
So what’s wrong with linked data search engines going the same route, providing keyword search on linked data and calling that a killer app? It completely misses the point. As I said, keyword search was the best shot for documents. Words make up the content of text documents, and documents have no real commonalities other than their words. Both of these things are very different for linked data.
Firstly, linked data could be highly useful without any words (i.e. literals or strings) at all. Imagine linked statistics just with IDs and numbers. Sure, the names of people, places, companies, movies, etc in the linked data cloud are very useful, but the main content lies in links and numbers. Secondly, interconnected datasets have much more in common than words. They have structure. And structured data can be mapped, merged, aggregated. The greatest common denominator of two datasets (in the same domain) is one merged dataset with the integrated data from both in one coherent data schema.
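As a rough sketch of what “mapped, merged, aggregated” means in practice, the following Python snippet integrates two tiny datasets that consist of nothing but IDs and numbers. All IDs, property names, values and the two mapping tables are invented for this illustration.

```python
# Two made-up sources about the same domain, using different property names.
source_a = {"id:1": {"pop": 100, "area": 10}, "id:2": {"pop": 250}}
source_b = {"id:1": {"inhabitants": 100, "founded": 1800},
            "id:3": {"inhabitants": 40}}

# Hand-written schema mappings into one coherent target schema.
MAPPING_A = {"pop": "population", "area": "area"}
MAPPING_B = {"inhabitants": "population", "founded": "founded"}


def to_common_schema(dataset, mapping):
    """Rename each source property to its name in the target schema."""
    return {
        entity: {mapping[prop]: value for prop, value in props.items()}
        for entity, props in dataset.items()
    }


def merge(*datasets):
    """Merge datasets entity by entity; later sources win on conflicts."""
    merged = {}
    for dataset in datasets:
        for entity, props in dataset.items():
            merged.setdefault(entity, {}).update(props)
    return merged


merged = merge(to_common_schema(source_a, MAPPING_A),
               to_common_schema(source_b, MAPPING_B))

# Aggregation only becomes possible once the schemas are integrated.
total_population = sum(
    props["population"] for props in merged.values() if "population" in props
)
print(merged["id:1"])
print(total_population)  # 390
```

No keyword is involved anywhere: the integration works purely on identifiers and structure.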
Now when linked data search engines provide keyword search on the large blob of unintegrated data, just crawled and thrown together into a large document index, they miss not only the opportunity of merging the data. By failing to do so, they miss the main principle of structured data. And what’s been the proposed solution to turn those keyword search apps into something useful? Plugging a faceted search feature in. That’s not going to help.
The name faceted search itself is already misleading. It gives the impression that the two functions search and faceted filtering together create something different than just the combination of the two. But that’s not the case. Faceted filtering refers to the concept of filtering a set of things (such as documents in a search result) by multiple different dimensions. But faceted filtering can only do… filtering. And in order to filter, you need to have a set of things that is coherent enough to be filtered. Keyword search is the method seen by many to create this coherence. But then, it has to create coherence within its results. When the user searches for Berlin, and a service returns a list of things named Berlin, including cities, people and music albums, that doesn’t create coherence on the data schema level. It can only filter the initial set of things/documents, so the level of coherence is the same before and after the search. Go to Amazon, search for Toshiba, and see what you get. The facets you’ll see are those which apply to the commonality of all things in Amazon’s index: products. The real magic of faceted filtering starts at the point where you’re down to just TVs, when you see facets such as the display size, the panel technology, etc. As long as we have a set of loosely coherent things, faceted filtering can’t do its trick.
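The Toshiba example can be sketched in a few lines of Python. The product records and the common_facets helper are invented for this illustration; the simplifying assumption is that a facet is only usable if every item in the current set has that property.

```python
# Made-up search result for "Toshiba": a loosely coherent set of products.
items = [
    {"type": "TV", "brand": "Toshiba", "display_size": 32, "panel": "LCD"},
    {"type": "TV", "brand": "Toshiba", "display_size": 42, "panel": "Plasma"},
    {"type": "Laptop", "brand": "Toshiba", "ram_gb": 4},
]


def common_facets(things):
    """Return the properties shared by every item in the set."""
    if not things:
        return set()
    return set.intersection(*(set(thing) for thing in things))


# On the mixed result set only the generic product facets are available.
print(sorted(common_facets(items)))

# Once the set is coherent (TVs only), the interesting facets appear.
tvs = [item for item in items if item["type"] == "TV"]
print(sorted(common_facets(tvs)))
```

Filtering never adds coherence by itself. The useful facets only show up after the set has been narrowed down to things that share a schema.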
And the problem with creating coherence within a set of linked data sources? If you don’t integrate the schemas, you’ll never get coherence. Web search engines have succeeded in the integration of documents, because they integrate the content of documents according to their commonality: words. Linked Data search engines have failed with that integration task, because they tried to integrate words where those don’t make the real content of linked data. And they didn’t try to integrate what’s really the content: structured data.
Access to integrated information has always been the killer application of the Web. The Web of Data doesn’t need to look for a different killer app, just for a working adaptation of that principle.
Restoring my blog
It’s sad when a server goes down, and it’s stupid not to have a backup of your blog running on that server. But luckily, my Google Reader cache at least contained my few blog posts, which I could restore that way. Sadly, all comments are still gone.
But anyway, it gave me a little boost to start blogging again, and so you’ll see more posts coming over the next weeks.
Curated vs. Linked Data
In a very interesting article, the German news site spiegel.de analyses the recently launched Wolfram Alpha. Their analysis hits the bull’s-eye: the Wolfram team will never be able to curate all the information needed to make W|A really useful. Especially compared to Wikipedia, which is maintained by such a large community.
Linked Data is the model for decentralized data publication that can solve this problem. I’m just wondering how much curation it will require…
DBpedia Lookup – Find me some URIs
The Linked Data Web is about reusing and linking URIs. With DBpedia, we provide URIs for a broad range of topics: People, organizations, countries, cities, rivers, mountains, music albums, films, books, buildings, etc. 2.6 million URIs overall.
But so far it hasn’t been easy enough to find a DBpedia URI for a given keyword. DBpedia Lookup aims to fill that gap. It provides a service to find the most likely DBpedia URIs for a given keyword. The algorithm ranks DBpedia resources based on their relevance in Wikipedia and includes synonyms in the underlying Lucene index.
Try the terms “Shakespeare”, “EU”, “Eiffel”, or “Cambridge” and see for yourself if the results you’d expect show up at the top. The result ranking is different from, and supposed to be more useful than, a simple full-text search or a SPARQL query with an embedded regular expression for matching labels.
There is a web service available as well. You can use the KeywordSearch method for searching full terms, and the PrefixSearch method for an autocompletion-style interface such as the one you see here. The web service returns a list of resource URIs with English abstracts, DBpedia classes and categories.
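To illustrate the general idea of ranking by relevance with synonyms (not the actual DBpedia Lookup implementation, which runs on a Lucene index), here is a toy sketch in Python. The example.org URIs, the surface forms and the relevance scores are all invented for this illustration.

```python
# Invented relevance scores, standing in for "relevance in Wikipedia".
RELEVANCE = {
    "http://example.org/resource/European_Union": 0.9,
    "http://example.org/resource/Eu_(town)": 0.2,
}

# Toy index: labels and synonyms (lowercased) pointing to resource URIs.
SURFACE_FORMS = {
    "european union": ["http://example.org/resource/European_Union"],
    "eu": [
        "http://example.org/resource/Eu_(town)",
        "http://example.org/resource/European_Union",
    ],
}


def lookup(keyword):
    """Return candidate URIs for a keyword, most relevant first."""
    candidates = SURFACE_FORMS.get(keyword.strip().lower(), [])
    return sorted(candidates,
                  key=lambda uri: RELEVANCE.get(uri, 0.0),
                  reverse=True)


for uri in lookup("EU"):
    print(uri)
```

A plain label match would treat both candidates for “EU” equally. The relevance score is what pushes the resource most people mean to the top.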
Feel free to use the service as you like. If you plan to use it in a production system or to run a high-load batch process, please drop me a message to let me know.
I hope that DBpedia Lookup is useful, and I’d appreciate any feedback.
Many thanks to the semantic web folks at the BBC for their support and feedback on the development of DBpedia Lookup.
OpenCalais joining the cloud
Reuters just announced that they’re going to fully support Linked Data with the upcoming release of OpenCalais 4.0. And that they’ll join the cloud by linking to DBpedia. Finally.
Anybody still doubting that Linked Data is ready for prime time?
Welcome to the Linked Data Cloud, Freebase
A week ago, Freebase announced that its rich dataset is now available as Linked Data. Congratulations, Metaweb! So it is finally the time to welcome Freebase to the Linked Data Cloud.
But wait, isn’t Linked Data about… linking data? Unfortunately, while Freebase now does export RDF, its concepts aren’t linked to other datasets in “the cloud”. No problem, we at DBpedia can help out.
So I’m glad to announce that 2.5 million DBpedia resources are now interlinked with Freebase via owl:sameAs links!
http://dbpedia.org/resource/Berlin owl:sameAs http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000094d6
You can grab a dump here.
So all your concepts which are interlinked with DBpedia can from now on also benefit from the data in Freebase.
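A small Python sketch of how such links can be used: it reads statements in the shorthand form shown above (subject, predicate, object separated by whitespace, which is an assumption and not necessarily the format of the dump) and builds an index to hop between the two datasets. The function names are made up for this illustration.

```python
SAME_AS = "owl:sameAs"

# The example link from above, in the same shorthand notation.
lines = [
    "http://dbpedia.org/resource/Berlin owl:sameAs "
    "http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000094d6",
]


def parse_statement(line):
    """Split a whitespace-separated statement into its three parts."""
    subject, predicate, obj = line.split()
    return subject, predicate, obj


def build_same_as_index(statements):
    """Index owl:sameAs links in both directions, since identity is symmetric."""
    index = {}
    for line in statements:
        subject, predicate, obj = parse_statement(line)
        if predicate == SAME_AS:
            index.setdefault(subject, set()).add(obj)
            index.setdefault(obj, set()).add(subject)
    return index


index = build_same_as_index(lines)
print(index["http://dbpedia.org/resource/Berlin"])
```

Anything that already points to the DBpedia URI for Berlin can follow the index to the Freebase identifier and fetch the additional data from there.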
And the Semantic Web continues to grow…