Curated Data vs. Linked Data

In a very interesting article, the German news site spiegel.de analyses the recently launched Wolfram Alpha. Their analysis hits the bull’s-eye: the Wolfram team will never be able to curate all the information needed to make W|A really useful, especially compared to Wikipedia, which is maintained by a large community.

Linked Data is the model for decentralized data publication that can solve this problem. I’m just wondering how much curating it will require…


DBpedia Lookup – Find me some URIs

The Linked Data Web is about reusing and linking URIs. With DBpedia, we provide URIs for a broad range of topics: People, organizations, countries, cities, rivers, mountains, music albums, films, books, buildings, etc. 2.6 million URIs overall.

Until now, though, it wasn’t easy to find a DBpedia URI for a given keyword. DBpedia Lookup aims to fill that gap. It provides a service to find the most likely DBpedia URIs for a given keyword. The algorithm ranks DBpedia resources based on their relevance in Wikipedia and includes synonyms in the underlying Lucene index.

Try the terms “Shakespeare”, “EU”, “Eiffel”, or “Cambridge” and see for yourself whether the results you’d expect show up at the top. The result ranking is different from – and supposed to be more useful than – a simple full-text search or a SPARQL query with an embedded regular expression for matching labels.

There is a web service available as well. You can use the KeywordSearch method for searching full terms, and the PrefixSearch method for an autocompletion-style interface such as the one you see here. The web service returns a list of resource URIs with English abstracts, DBpedia classes and categories.
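A request to either method might be assembled as in the sketch below. The base URL and the parameter names (`QueryString`, `QueryClass`, `MaxHits`) are assumptions made for illustration and are not confirmed by this post, so check the service description before relying on them. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed base URL of the Lookup web service; verify before use.
LOOKUP_BASE = "http://lookup.dbpedia.org/api/search.asmx"


def lookup_url(method, keyword, query_class="", max_hits=5):
    """Build the request URL for the KeywordSearch or PrefixSearch method."""
    if method not in ("KeywordSearch", "PrefixSearch"):
        raise ValueError("unknown method: " + method)
    # Parameter names are assumptions, not taken from the post.
    params = {
        "QueryString": keyword,
        "QueryClass": query_class,
        "MaxHits": max_hits,
    }
    return LOOKUP_BASE + "/" + method + "?" + urlencode(params)


if __name__ == "__main__":
    print(lookup_url("KeywordSearch", "Shakespeare"))
    print(lookup_url("PrefixSearch", "Eiff", max_hits=10))
```

KeywordSearch fits a complete term typed by the user, while PrefixSearch fits the partial input of an autocompletion box.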

Feel free to use the service as you like. If you plan to use it in a production system or to run a high-load batch process, please drop me a message to let me know.

I hope that DBpedia Lookup is useful, and I’d appreciate any feedback.

Many thanks to the semantic web folks at the BBC for their support and feedback on the development of DBpedia Lookup.


OpenCalais joining the cloud

Reuters just announced that they’re going to fully support Linked Data with the upcoming release of OpenCalais 4.0. And that they’ll join the cloud by linking to DBpedia. Finally.
Anybody still doubting that Linked Data is ready for prime time?


Welcome to the Linked Data Cloud, Freebase

A week ago, Freebase announced that its rich dataset is now available as Linked Data. Congratulations, Metaweb! So it is finally time to welcome Freebase to the Linked Data Cloud.

But wait, isn’t Linked Data about… linking data? Unfortunately, while Freebase now does export RDF, its concepts aren’t linked to other datasets in “the cloud”. No problem, we at DBpedia can help out. :-)

So I’m glad to announce that 2.5 million DBpedia resources are now interlinked with Freebase via owl:sameAs links!

http://dbpedia.org/resource/Berlin owl:sameAs http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000094d6

You can grab a dump here.
So all your concepts which are interlinked with DBpedia can from now on also benefit from the data in Freebase.
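As a rough sketch of how such links could be consumed, the snippet below reads owl:sameAs statements into a lookup table. It assumes the dump is in N-Triples format with one statement per line, which this post does not state, and the function name `same_as_links` is made up for illustration.

```python
import re

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Matches one N-Triples statement whose subject, predicate and object are IRIs.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def same_as_links(lines):
    """Map each subject IRI to its owl:sameAs target (the last one wins)."""
    links = {}
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == OWL_SAME_AS:
            links[match.group(1)] = match.group(3)
    return links


if __name__ == "__main__":
    sample = [
        "<http://dbpedia.org/resource/Berlin> "
        "<http://www.w3.org/2002/07/owl#sameAs> "
        "<http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000094d6> .",
    ]
    for dbpedia_uri, freebase_uri in same_as_links(sample).items():
        print(dbpedia_uri, "->", freebase_uri)
```

With such a table, a dataset that already points at a DBpedia URI can follow the link to the matching Freebase resource.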

And the Semantic Web continues to grow…


DBpedia – Rethinking Wikipedia infobox extraction

Wikipedia consists mostly of unstructured information and free text, and those tiny bits of structure represented in templates are inaccessible unless we deploy a semantic markup extension to Wikipedia and start from scratch.

Wrong.

DBpedia and Freebase have shown that it’s possible to make all the structured data hidden in Wikipedia articles accessible, queryable and reusable without any change to the Wikipedia codebase. And there are many more of these structured bits in Wikipedia than we thought when we started DBpedia.

While the general DBpedia codebase did evolve over the last 18 months, with a new extraction framework, different special-purpose extractors, multiple languages and connectors for extracting live data from Wikipedia, our approach to extracting infobox data did not change. As DBpedia data came to be used by more projects and people, we faced the same sort of questions about infobox extraction problems over and over again on our mailing list.

We tried to address these issues with a new way of handling the infobox extraction, and so today I’m glad to announce a completely revised DBpedia infobox extraction approach.


Our former infobox extraction approach

First, let’s have a look at how the old extraction approach worked and what those problems were. Here are two Wikipedia infoboxes with wiki markup (screenshots from the corresponding Wikipedia edit pages), one for the scientist Albert Einstein and one for the tennis player Andre Agassi:

Albert Einstein Infobox

Andre Agassi Infobox


As you can see, those two infobox templates, Infobox Scientist and Infobox Tennis player, use different property names for the “same” attribute. While the scientist has name, birth_date and birth_place, the tennis player has playername, datebirth and placebirth.

Our old infobox extraction parsed those templates and turned the template property names as they were into RDF predicates. So we ended up with http://dbpedia.org/property/birth_place for resources with a scientist infobox, and with http://dbpedia.org/property/placebirth for resources with a tennis player infobox.

There were no formal definitions of scientist, tennis player or the birthplace/placebirth properties. All that made it very hard to write queries like “give me all people born in Berlin”. You would have had to figure out all the different RDF predicates for the “place of birth”. And there was no formal definition of the class “person”. Sure, we had Yago, but it wasn’t that intuitive to use either. And even if you had found all the predicates, the corresponding SPARQL query would have been quite ugly.
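To illustrate the point, here is a small sketch that generates such a query: every predicate variant needs its own UNION branch. The two predicate URIs are the ones named above; the helper `born_in_query` is hypothetical, and a real query would need one branch for every infobox template that records a birth place.

```python
def born_in_query(place_uri, predicates):
    """Build a SPARQL query with one UNION branch per birth-place predicate."""
    if not predicates:
        raise ValueError("at least one predicate is required")
    branches = [
        "  { ?person <%s> <%s> }" % (predicate, place_uri)
        for predicate in predicates
    ]
    body = "\n  UNION\n".join(branches)
    return "SELECT DISTINCT ?person WHERE {\n" + body + "\n}"


if __name__ == "__main__":
    print(born_in_query(
        "http://dbpedia.org/resource/Berlin",
        [
            "http://dbpedia.org/property/birth_place",
            "http://dbpedia.org/property/placebirth",
        ],
    ))
```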


The new template mapping approach

So what we needed were formal definitions of classes and properties, and then mappings of infobox templates to those classes.

There’s the computer scientist way of deriving a class hierarchy from Wikipedia: do machine learning and use an existing upper ontology like WordNet or OpenCyc.

And there’s the pragmatic way: build your own from scratch, one that actually works, structures data that’s actually available and makes sense to people.

As you might know, my educational background is not in computer science but in economics. Hence, I strongly prefer the pragmatic approach. No offense at all to all the great people doing fantastic work with machine learning. But since DBpedia is – in my very personal opinion – a project which aims to make real data actually available and usable for common users, I’d rather pass on the scientific contribution here and create something simple and simply usable. The question is: If you were a website developer, would you show the terms “living thing”, “human being” or “agent” (which are very common in upper ontologies) on your website? I doubt it.

So what we did was build a very flat DBpedia ontology by looking at the available data in Wikipedia. And unlike Freebase’s topic system, we built a class hierarchy with inheritance.

There’s only one artificial sounding class: Resource. It’s the ontology’s root class. We tried to align all other classes in a way that makes sense to common people. The path to Andre Agassi is Resource->Person->Athlete->TennisPlayer. That’s it. For Albert Einstein, we’ve got Resource->Person->Scientist. We’ve defined those properties “name”, “birthplace” and “birthdate” on the Person class, and hence Athlete and Scientist inherit them.
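The inheritance described here can be modelled in a few lines. The sketch below is a toy model of the idea and not the actual ontology format; the dictionary layout and the function names are assumptions for illustration.

```python
# Parent of each class; Resource is the root and has no parent.
PARENT = {
    "Person": "Resource",
    "Athlete": "Person",
    "TennisPlayer": "Athlete",
    "Scientist": "Person",
}

# Properties declared directly on a class.
DECLARED = {
    "Person": ["name", "birthplace", "birthdate"],
}


def path_to(cls):
    """Return the classes from the root down to cls."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def properties_of(cls):
    """Collect the properties declared on cls and on all of its ancestors."""
    props = []
    for ancestor in path_to(cls):
        props.extend(DECLARED.get(ancestor, []))
    return props


if __name__ == "__main__":
    print("->".join(path_to("TennisPlayer")))
    print(properties_of("Scientist"))
```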

And lastly, we mapped Wikipedia infobox templates to DBpedia classes. If an article uses the Infobox Scientist, the derived DBpedia resource is of class dbpedia:Scientist#Class. And the properties “birth_date” in Infobox Scientist as well as “datebirth” in Infobox Tennis Player map to dbpedia:Person#birthdate.
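A mapping of this kind boils down to two lookup tables, sketched below. The birth date entries follow the mapping described above; the name and birth place entries are the obvious analogues and are assumptions, as are the table layout and the function `map_infobox`.

```python
# Which DBpedia class an infobox template maps to.
TEMPLATE_CLASS = {
    "Infobox Scientist": "Scientist",
    "Infobox Tennis player": "TennisPlayer",
}

# Which ontology property a (template, template property) pair maps to.
PROPERTY_MAP = {
    ("Infobox Scientist", "name"): "name",
    ("Infobox Scientist", "birth_date"): "birthdate",
    ("Infobox Scientist", "birth_place"): "birthplace",
    ("Infobox Tennis player", "playername"): "name",
    ("Infobox Tennis player", "datebirth"): "birthdate",
    ("Infobox Tennis player", "placebirth"): "birthplace",
}


def map_infobox(template, fields):
    """Return the class and the mapped properties; unmapped fields are dropped."""
    cls = TEMPLATE_CLASS[template]
    mapped = {}
    for key, value in fields.items():
        target = PROPERTY_MAP.get((template, key))
        if target is not None:
            mapped[target] = value
    return cls, mapped


if __name__ == "__main__":
    print(map_infobox("Infobox Tennis player",
                      {"playername": "Andre Agassi", "placebirth": "Las Vegas"}))
```

Both templates end up using the same property names, so a single query can cover scientists and tennis players alike.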

This way, we’ve mapped 350 Wikipedia templates with 2,200 properties to 170 DBpedia classes and 900 class properties. By hand.

There’s a new extractor in the DBpedia extraction framework processing all that information. We’ve run it on the complete English Wikipedia, and here are some numbers (instances of classes) resulting from the new mapping-based infobox extraction:

  • 817,000 resources overall
  • 200,000 people (70,000 athletes, 65,000 artists, 18,000 office holders)
  • 193,000 places (100,000 areas, 40,000 cities, 10,000 rivers)
  • 187,000 works (71,000 music albums, 24,000 singles, 31,000 films, 15,000 books)
  • 87,000 species
  • 70,000 organizations (20,000 educational institutions, 18,000 companies, 12,000 radio stations)
  • 22,000 buildings (8,000 airports, 5,000 stations, 2,000 stadiums, 1,000 bridges)
  • 12,000 planets
  • And more… (events, diseases, proteins, drugs, aircraft, automobiles, ships, astronauts, architects, scientists)

You can download the new infobox dataset here, and the corresponding (forward-chained) rdf:type statements here. This data will be available in our DBpedia SPARQL endpoint soon, and then I’ll show you some demos. And you can have a look at the full class hierarchy here.
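“Forward-chained” means that a resource gets an explicit rdf:type statement for its own class and for every superclass, so queries need no reasoner. The sketch below shows the idea on a fragment of the hierarchy; the ontology namespace, the resource URI in the demo and the function name are assumptions for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ONTOLOGY = "http://dbpedia.org/ontology/"  # assumed namespace for class URIs

# Parent of each class; Resource is the root and has no parent.
PARENT = {
    "Person": "Resource",
    "Athlete": "Person",
    "TennisPlayer": "Athlete",
    "Scientist": "Person",
}


def forward_chained_types(resource_uri, cls):
    """Emit one rdf:type triple for cls and one for each of its ancestors."""
    triples = []
    while cls is not None:
        triples.append(
            "<%s> <%s> <%s%s> ." % (resource_uri, RDF_TYPE, ONTOLOGY, cls)
        )
        cls = PARENT.get(cls)  # None once the root has been emitted
    return triples


if __name__ == "__main__":
    for triple in forward_chained_types(
        "http://dbpedia.org/resource/Andre_Agassi", "TennisPlayer"
    ):
        print(triple)
```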

UPDATE: For the most up-to-date version of this dataset, please download the “Ontology Infoboxes” from the DBpedia Downloads Site.


Extraction code

I’ve completely rewritten the infobox extraction code to work with those template mappings. The goal was to make the infobox extraction as granular as possible. Every single property in an infobox can have its own parsing rule. There are specific parsers for dates, currencies, measurement units, strings, resource links etc. and rules for how to parse specific templates like Infobox Musical Artist, which is used for solo artists as well as for bands. All that is still under development, and there’s a lot more work to do. But I hope that you already get the idea and that those initial datasets are already useful.
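The per-property idea can be pictured as a table that assigns a parser to each template property. The sketch below is not the actual extraction code: the input formats are heavily simplified (real infobox values are messier), and all function names and rules are assumptions for illustration.

```python
import re
from datetime import date


def parse_date(raw):
    """Parse a 'YYYY-MM-DD' value; return None if it is not a valid date."""
    match = re.fullmatch(r"\s*(\d{4})-(\d{2})-(\d{2})\s*", raw)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None


def parse_link(raw):
    """Turn a '[[Target|label]]' wiki link into a DBpedia resource URI."""
    match = re.fullmatch(r"\s*\[\[([^\]|]+)(?:\|[^\]]*)?\]\]\s*", raw)
    if not match:
        return None
    title = match.group(1).strip().replace(" ", "_")
    return "http://dbpedia.org/resource/" + title


def parse_string(raw):
    """Fallback rule: keep the value as a trimmed string."""
    return raw.strip()


# One parsing rule per (template, property); anything else falls back to a string.
RULES = {
    ("Infobox Scientist", "birth_date"): parse_date,
    ("Infobox Scientist", "birth_place"): parse_link,
}


def parse_property(template, key, raw):
    """Apply the rule registered for this template property."""
    parser = RULES.get((template, key), parse_string)
    return parser(raw)


if __name__ == "__main__":
    print(parse_property("Infobox Scientist", "birth_date", "1879-03-14"))
    print(parse_property("Infobox Scientist", "birth_place", "[[Ulm]]"))
```

Keeping the rules in a table means that a new template or property needs only a new entry, not a change to the extraction loop.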


And who’s going to maintain all that?

As I said, there’s a whole bunch of template mappings and fine-grained parsing rules that we have created by hand. But that’s just the beginning. Infobox templates will change over time, some mappings are still missing, the ontology is incomplete, and it’s nearly impossible for our small group to maintain all that on our own.

So a very legitimate question is: who’s going to maintain all that in the future?

My answer is quite simple: hopefully, you are. All of you. Well, those of you who have an interest in using a particular subset of DBpedia data are hopefully going to help us maintain those subsets. Are you building a web application using DBpedia data about British football players? Help us improve and complete the ontology class definition, the mappings and the parsing rules for football players. I will soon show you a way to easily contribute. So stay tuned.


Hello again

It’s been a while since my last post on my blog. Actually it’s been over 18 months, and I think it is time for a redesign and a fresh start.

In the meantime, I’ve finished my master’s degree, worked at HP Labs in Bristol, moved back to Berlin to work as a researcher at the Free University, got to know great people doing Semantic Web work at the BBC, collaborated with them on a Linked Data project and continued to work on DBpedia.

So this is my “hello (again) world” post, and in the near future I’m going to publish some stuff about my work on DBpedia, Linked Data, graph-based user interfaces and more general Semantic Web stuff.

Cheers!
