Usually I don’t publicly point my finger at work by others that I think is wrong. Still, I believe it’s time for an analysis of what’s going on in the Linked Data world, and that will include pointing out what I believe is wrong with some of the work that people have invested a lot of effort in. I don’t mean to offend anyone, and I’ve certainly made lots of mistakes in my own work. But certain elephants in the room need to be pointed out.

So, let’s begin today with a topic that’s been around since the very first days of the Linked Data movement: search. By that I mean the services that let the user enter a keyword and get back a set of Linked Data URIs, crawled from the whole Linked Data cloud, that contain that keyword. To make my criticism easier to understand, I’d first like to describe what I see as the motivation for building such a service.

Whenever people talk about the killer apps of the internet, search comes up first. Without Altavista, Yahoo and Google, the web would probably never have been as successful as it is today. Search enables people to find information in the web of distributed and linked documents by giving them a little search box for keywords. So it seems obvious to take that approach, adapt it to the web of linked data, and get another killer app. Unfortunately, it’s a little bit too obvious, leading people to copy instead of adapt. What people did was copy the methods that Google uses, i.e. providing a search box for keywords, instead of adapting the principle behind them.

What’s that principle? Building a centralized index of the greatest common denominator of documents on the web, i.e. the words occurring in these documents: our beloved keywords. Within a large set of documents, the words are the content, and the only commonality is the occurrence of words. Natural language processing isn’t mature enough yet to see commonalities in the meaning of texts, so we have to stick with plain words for a while. It’s our best shot.

So what’s wrong with linked data search engines going the same route, providing keyword search on linked data and calling that a killer app? It completely misses the point. As I said, keyword search was the best shot for documents. Words make up the content of text documents, and documents have no real commonalities other than their words. Both of these points are very different for linked data.

Firstly, linked data could be highly useful without any words (i.e. literals or strings) at all. Imagine linked statistics just with IDs and numbers. Sure, the names of people, places, companies, movies, etc in the linked data cloud are very useful, but the main content lies in links and numbers. Secondly, interconnected datasets have much more in common than words. They have structure. And structured data can be mapped, merged, aggregated. The greatest common denominator of two datasets (in the same domain) is one merged dataset with the integrated data from both in one coherent data schema.
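To make the idea of one merged dataset a bit more concrete, here is a minimal sketch in Python. Everything in it is assumed for illustration only: the example URIs, the property names, the `PROPERTY_MAP` and `SAME_AS` tables, the `merge` helper, and the numbers are hypothetical and not taken from any real dataset or tool. It only shows the principle of mapping two source schemas onto one coherent schema and unifying the identifiers.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific property names to one shared schema.
PROPERTY_MAP = {
    "a:label": "name",
    "a:inhabitants": "population",
    "b:name": "name",
    "b:areaKm2": "area_km2",
}

# Hypothetical identity links (in the spirit of owl:sameAs):
# source URI -> canonical URI.
SAME_AS = {
    "http://example.org/b/Berlin": "http://example.org/a/berlin",
}


def merge(*datasets):
    """Merge (subject, property, value) triples into one record per entity."""
    merged = defaultdict(dict)
    for triples in datasets:
        for subject, prop, value in triples:
            canonical = SAME_AS.get(subject, subject)       # unify identifiers
            merged[canonical][PROPERTY_MAP.get(prop, prop)] = value  # unify schema
    return dict(merged)


# Two made-up sources describing the same city under different schemas.
# The values are illustrative, not real statistics.
dataset_a = [
    ("http://example.org/a/berlin", "a:label", "Berlin"),
    ("http://example.org/a/berlin", "a:inhabitants", 3500000),
]
dataset_b = [
    ("http://example.org/b/Berlin", "b:name", "Berlin"),
    ("http://example.org/b/Berlin", "b:areaKm2", 891.7),
]

integrated = merge(dataset_a, dataset_b)
print(integrated)
```

The result is a single record for the city that carries the properties of both sources under one schema, which is something a keyword index over the unintegrated triples cannot offer.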

Now when linked data search engines provide keyword search on the large blob of unintegrated data, just crawled and thrown together into a large document index, they miss not only the opportunity of merging the data. By failing to do so, they miss the main principle of structured data. And what’s been the proposed solution to turn those keyword search apps into something useful? Plugging a faceted search feature in. That’s not going to help.

The name faceted search itself is already misleading. It gives the impression that the two functions, search and faceted filtering, together create something more than just the combination of the two. But that’s not the case. Faceted filtering refers to the concept of filtering a set of things (such as documents in a search result) along multiple different dimensions. But faceted filtering can only do… filtering. And in order to filter, you need to have a set of things that is coherent enough to be filtered. Keyword search is seen by many as the method to create this coherence. But then, it has to create coherence within its results. When the user searches for Berlin, and a service returns a list of things named Berlin, including cities, people and music albums, that doesn’t create coherence on the data schema level. Keyword search can only filter the initial set of things/documents, so the level of coherence is the same before and after the search. Go to Amazon, search for Toshiba, and see what you get. The facets you’ll see are those which apply to the commonality of all things in Amazon’s index: products. The real magic of faceted filtering starts at the point where you’re down to just TVs, when you see facets such as the display size, the panel technology, etc. As long as we have a set of only loosely coherent things, faceted filtering can’t do its trick.
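The Toshiba example can be sketched in a few lines of Python. This is only an illustration under assumptions: the `catalog` records, their property names, and the helpers `common_facets` and `filter_by` are made up, and real shops certainly compute their facets differently. The sketch treats a facet as a property that every item in the current result set has.

```python
def common_facets(items):
    """Return the property names shared by every item in the set."""
    if not items:
        return set()
    shared = set(items[0])
    for item in items[1:]:
        shared &= set(item)  # keep only properties present in all items
    return shared


def filter_by(items, **criteria):
    """Keep the items whose properties match all given criteria."""
    return [
        item for item in items
        if all(item.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical, hand-made product records.
catalog = [
    {"brand": "Toshiba", "category": "tv", "display_size": 40, "panel": "LCD"},
    {"brand": "Toshiba", "category": "tv", "display_size": 32, "panel": "LED"},
    {"brand": "Toshiba", "category": "laptop", "ram_gb": 8},
]

hits = filter_by(catalog, brand="Toshiba")   # the "keyword search" step
print(sorted(common_facets(hits)))           # only generic product facets

tvs = filter_by(hits, category="tv")         # a coherent subset
print(sorted(common_facets(tvs)))            # TV-specific facets appear
```

On the mixed result set only the generic properties survive as facets; the display size and panel technology show up only once the set has been narrowed to things that share a schema.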

And the problem with creating coherence within a set of linked data sources? If you don’t integrate the schemas, you’ll never get coherence. Web search engines have succeeded in the integration of documents, because they integrate the content of documents according to their commonality: words. Linked Data search engines have failed at that integration task, because they tried to integrate words, which don’t make up the real content of linked data. And they didn’t try to integrate what really is the content: structured data.

Access to integrated information has always been the killer application of the Web. The Web of Data doesn’t need to look for a different killer app, just for a working adaptation of that principle.
