What can Linked Data do for the enterprise? Can it solve the CIO’s headaches around data integration? That topic comes up more and more often in the Linked Data community. Where does Linked Data fit into the enterprise? Let’s explore that by looking at conventional enterprise data integration first.

I stumbled upon a blog post describing the challenges of providing a single view of enterprise data sources. That post and the previous one in its series describe typical corporate IT: SOAs, SAPs, IBMs, multi-million dollar integration projects, legacy systems all over the place, tons of code for special-purpose application integration, etc. It is the usual ecosystem that has grown over decades, which from the outside might look like a mess but cannot be replaced or consolidated. Consolidating all those systems would be unaffordable and sometimes politically difficult to pursue, because for each individual department and application, things mostly work well enough the way they are. It’s the cross-application (and hence often cross-department) integration that causes headaches.

My friends at the BBC and I wrote a paper describing the challenges at the BBC and how Linked Data can help (read chapter one). Enterprise-wide IT system consolidation is not an option. The desired solution is something on top of existing systems that provides integrated views of all data repositories and workflows without the risk of breaking existing applications.

The blog post above describes a case of “small pieces of information about customers [being] littered throughout the data center”, in the billing system, marketing system, CRM system, support system, etc. The suggested solution is data virtualization (or federation): software that allows users to define aggregated views over the different pieces of data, so that analysts and application developers query those views instead of querying the multiple data sources directly. The integration layer acts as middleware and retrieves the information from these disparate systems in real time when requested. Whether to virtualize or materialize the integration layer (i.e. whether data from sources is retrieved at query time, or gets replicated into a central repository) depends on the concrete case. Complex data joins are difficult to virtualize, and query execution time is often much worse than in materialized systems. Virtualized views also have an availability problem: if one data source is unavailable, your queries will not execute, so you’d want proxies anyway.
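To make the trade-off concrete, here is a minimal sketch in Python. The three source functions and both view implementations are hypothetical stand-ins and not any product's API. The virtualized view queries every source at request time, so one unavailable source makes the whole query fail, while the materialized view answers from whatever was replicated earlier.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for enterprise systems: each source is a function
# that returns the fields it holds for a given customer.
Source = Callable[[str], Dict[str, str]]


def billing(customer_id: str) -> Dict[str, str]:
    return {"balance": "120.00"}


def crm(customer_id: str) -> Dict[str, str]:
    return {"name": "Anna Smith"}


def support(customer_id: str) -> Dict[str, str]:
    # Simulates a source that is currently unavailable.
    raise ConnectionError("support system is down")


def virtual_view(sources: Dict[str, Source], customer_id: str) -> Dict[str, str]:
    """Virtualized view: every source is queried at request time."""
    record: Dict[str, str] = {}
    for fetch in sources.values():
        # A failure in any single source propagates to the caller.
        record.update(fetch(customer_id))
    return record


class MaterializedView:
    """Materialized view: data is replicated into a central store up front."""

    def __init__(self) -> None:
        self.store: Dict[str, Dict[str, str]] = {}

    def refresh(self, sources: Dict[str, Source], customer_id: str) -> None:
        record = self.store.setdefault(customer_id, {})
        for fetch in sources.values():
            try:
                record.update(fetch(customer_id))
            except ConnectionError:
                # Keep whatever was replicated before; the data may be stale.
                pass

    def query(self, customer_id: str) -> Dict[str, str]:
        # Answers from the central store without touching the sources.
        return dict(self.store.get(customer_id, {}))
```

The materialized variant buys availability and query speed at the price of possibly stale data, which is why the choice depends on the concrete case.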

The power of these integration layers is that they separate the business logic from the data sources. IT staff can focus on the data they need and don’t have to deal with the different systems in which that data is stored. They can start gluing data together instead of gluing systems together in a point-to-point manner.

Now, where does Linked Data fit in? Linked Data enables you to push the integration down to the level of the actual data. Think of it as a network (or web) of all the little pieces of information: one customer in your CRM system links to her doppelgänger in the billing system. So the data object about Anna Smith in the customer database links to the corresponding Anna Smith in the billing system. And that Anna Smith links to her doppelgänger in the support system, and so on. Applications can follow these links through the different systems and that way get all the data about Anna Smith. The beauty of that? The links do not have to stop at your firewall. Your data objects can link to data sources on the Web or to your suppliers as well. You can link to whatever other Linked Data source you want; the technical barriers disappear.
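Following links through the systems can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `example.org` identifiers, the `LINKS` and `DATA` dictionaries and the `collect` function are made up, and a real deployment would dereference HTTP URIs instead of reading in-memory dictionaries.

```python
from collections import deque
from typing import Dict, List

# Hypothetical identifiers for the same person in three systems.
CRM_ANNA = "http://crm.example.org/customer/17"
BILLING_ANNA = "http://billing.example.org/account/9001"
SUPPORT_ANNA = "http://support.example.org/user/anna-smith"

# Which data object links to which doppelgänger.
LINKS: Dict[str, List[str]] = {
    CRM_ANNA: [BILLING_ANNA],
    BILLING_ANNA: [SUPPORT_ANNA],
    SUPPORT_ANNA: [],
}

# What each system knows about its own data object.
DATA: Dict[str, Dict[str, str]] = {
    CRM_ANNA: {"name": "Anna Smith"},
    BILLING_ANNA: {"balance": "120.00"},
    SUPPORT_ANNA: {"open_tickets": "2"},
}


def collect(start: str) -> Dict[str, str]:
    """Follow links breadth-first and merge the data found along the way."""
    seen = {start}
    queue = deque([start])
    merged: Dict[str, str] = {}
    while queue:
        uri = queue.popleft()
        merged.update(DATA.get(uri, {}))
        for target in LINKS.get(uri, []):
            if target not in seen:  # guards against link cycles
                seen.add(target)
                queue.append(target)
    return merged
```

Starting from the CRM record, `collect(CRM_ANNA)` gathers the name, the balance and the ticket count without the application knowing up front which systems hold data about Anna.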

Disambiguation is another important aspect. There are probably many Anna Smiths in any large enterprise customer database. With Linked Data, every object gets an unambiguous identifier, much like a database ID. Anna Smith lives in Cambridge? There are distinct identifiers for the Cambridge in Massachusetts and the Cambridge in England, so Anna can link to the correct one. And if you link to a Linked Data source on the Web, like Geonames, your applications can fetch information about Cambridge from there. Once the disambiguation is done and links between objects are established, they are available for all your applications to use.
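A small Python sketch shows why identifiers matter. The URIs below are hypothetical placeholders on `example.org` (they are not real Geonames identifiers), and the customer records are invented for illustration.

```python
from typing import Dict, List

# Hypothetical identifiers for the two cities that share a name.
CAMBRIDGE_UK = "http://places.example.org/cambridge-england"
CAMBRIDGE_US = "http://places.example.org/cambridge-massachusetts"

PLACES: Dict[str, Dict[str, str]] = {
    CAMBRIDGE_UK: {"label": "Cambridge", "country": "United Kingdom"},
    CAMBRIDGE_US: {"label": "Cambridge", "country": "United States"},
}

# Two different customers who share a name and a city label.
CUSTOMERS: Dict[str, Dict[str, str]] = {
    "http://crm.example.org/customer/17": {
        "name": "Anna Smith",
        "lives_in": CAMBRIDGE_UK,
    },
    "http://crm.example.org/customer/42": {
        "name": "Anna Smith",
        "lives_in": CAMBRIDGE_US,
    },
}


def find_by_name(name: str) -> List[str]:
    """Value-based lookup: may return several candidates."""
    return sorted(uri for uri, c in CUSTOMERS.items() if c["name"] == name)


def country_of(customer_uri: str) -> str:
    """Identifier-based lookup: follows the link to exactly one place."""
    place_uri = CUSTOMERS[customer_uri]["lives_in"]
    return PLACES[place_uri]["country"]
```

Looking up "Anna Smith" by name yields two candidates, while each customer identifier resolves to exactly one Cambridge.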

All that makes data integration much more lightweight and agile, and at the same time much more powerful. Your integration layer software can do much cleverer things with less effort. Is there still a need for that integration layer? Yes, there is. The integration layer becomes the place where the links between data objects get managed, where data collections get built and curated, where it gets defined which data sources and pieces of information to trust for which use case, and where the data from all your enterprise and web data sources gets consolidated. It provides the single point of access into the web of data that exists in your enterprise and on the Web.
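One of those responsibilities, deciding which source to trust for which use case, could look roughly like this in Python. The `TRUST` table, the source names and the sample values are assumptions made up for this sketch; it is one possible shape for such a policy, not a description of any existing integration product.

```python
from typing import Dict, List

# Hypothetical trust policy: sources per use case, most trusted first.
TRUST: Dict[str, List[str]] = {
    "invoicing": ["billing", "crm"],
    "marketing": ["crm", "billing"],
}

# Hypothetical, partly conflicting values held by each source.
VALUES: Dict[str, Dict[str, str]] = {
    "billing": {"email": "anna@billing.example.org"},
    "crm": {"email": "anna@crm.example.org", "phone": "555-0100"},
}


def consolidate(use_case: str) -> Dict[str, str]:
    """Merge all sources so that the most trusted one wins on conflicts."""
    merged: Dict[str, str] = {}
    # Apply the least trusted source first; more trusted ones overwrite it.
    for source in reversed(TRUST[use_case]):
        merged.update(VALUES.get(source, {}))
    return merged
```

With this policy, invoicing sees the e-mail address from the billing system, marketing sees the one from the CRM, and both still get the phone number that only the CRM holds.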

Comments

[...] This post was mentioned on Twitter by Georgi Kobilarov, Richard Cyganiak, Uwe Stoll, Peter Haase, Uwe Stoll and others. Uwe Stoll said: RT @gkob: new blog post: "Linked Data and Enterprise Data Integration" http://bit.ly/b1bGQn #enterprise #linkeddata [...]

Tweets that mention Georgi Kobilarov » Linked Data and Enterprise Data Integration -- Topsy.com added these pithy words on Mar 24 10 at 5:02 pm

[...] Georgi Kobilarov » Linked Data and Enterprise Data Integration – Interesting angle on linked data – i.e. joining up the data silos within the enterprise. [...]

Communities and Collaboration » Bookmarks for March 31st through April 5th added these pithy words on Apr 05 10 at 12:05 pm

Enterprise Data Integration is about Data Virtualization, been so for eons. The main adoption impediment has always boiled down to the lack of standardized mechanisms for making the Enterprise Conceptual Model (that’s existed forever in IT) real.

What HTTP based Linked Data offers is the ability to finally deliver platform agnostic Data Virtualization such that the following basic issues are addressed, without compromising security or performance:

1. Identifiers for each Data Object (Customers, Contacts, Orders, Invoices, Competitors, Employees)

2. Access to Disparately Located and Shaped Data Sources (ODBC, JDBC, Web Services, Hypermedia Resources)

3. Dirty Data — eliminating value based data joining (typical RDBMS realm approach) via Identifier based relations e.g. owl:sameAs style co-references

4. Locale Issues — units of measurement is a classic example

5. Policy Based Data Access — leveraging Identifiers as basis for Data Access policies (again this ultimately benefits from OWL based reasoning within context of ACLs).

It’s nice to see that one of the most important aspects of Linked Data is finally getting attention re. value proposition articulation, especially as it allows enterprises to do much more with what they already have, without expensive and disruptive overhaul etc.

Kingsley Idehen added these pithy words on Mar 24 10 at 3:13 pm

First of all, a good insightful post (once again).

I do not completely agree regarding what/when usage of Linked Data infrastructure is recommended to CIOs here.

One headache of CIOs is for sure not having completely managed all data integration issues, but it would be one more headache to add an additional layer on top w/o addressing the underlying data quality problems.

So, after some quick wins from Linked Data usage (meshups etc.), I see a roadmap to clarify and define which data source is the leading one for which information –> this enables you to address quality issues in a manageable/traceable way.

Merely putting a clean (on the surface) data access layer on top does not reduce the likelihood that wrong data or (as Kingsley pointed out) wrongly interpreted data is used in business processes.

This does not reduce the value of the technological arguments that both Georgi and Kingsley pointed out.

In this context I like very much the idea to see Linked Data infrastructures employed as a further development after the SOA wave: to be the authoritative source for defined branches of enterprise data for mashups/meshups to consume and act on them.

The discussion point for the CIO would be: what is the cure for data quality problems, after getting quick wins/showcases? (Think e.g. of SAP Master Data Management accessing a Linked Data source… wouldn’t that be nice?)

Daniel added these pithy words on Mar 24 10 at 6:33 pm

Daniel,

Yes, the Master Data Management realm is basically yet another moniker for the intrinsic value that Linked Data delivers to enterprises.

Re. SOA, yes, if we treat those SOAP Web Services as Data Source conduits, and simply transform their output (ETL style) or place real-time RDF Views atop them (either can be done using Virtuoso’s middleware layer), you also fix the excruciating pain laid down by SOA’s failure to address Data Access and Representation the right way (i.e., with the EAV model granularity of Linked Data via use of HTTP scheme based Identifiers for Data Objects).

Kingsley Idehen added these pithy words on Mar 24 10 at 7:39 pm

Thank you Georgi for these brilliant posts. Daniel, do you see the “Establishing Trust by Describing Provenance” approach as something that could work? For details see Jeni Tennison: http://www.jenitennison.com/blog/node/133

JoukoSalonen added these pithy words on Mar 24 10 at 8:27 pm

Hi Georgi

One thing that did strike me about this is that business organisations are not static. The problems you describe are mainly about integrating data across businesses that have grown organically. An extension of this is businesses acquiring other businesses. I’ve worked for a fairly large organisation that was going through an acquisition phase. The first part was always about redrawing reporting lines and business unit boundaries; then moved on to aligning business functions and finally got round to aligning business systems and data. So you’d have one merged HR department running 2 different HR systems with 2 lots of data for months, sometimes years. Obviously any means of consolidating data in a low friction way without merging systems is useful here.

But depending on the state of the economic cycle, businesses tend to be either acquiring or divesting. There’s often a lot of attention paid to the cost of merging systems and data when they’re in the acquisition phase. I know from experience the pain points of being acquired and having to rewrite systems into the mother company’s “global platform”.

What isn’t often mentioned is the cost of divesting. Once you’ve written all that code to merge 2 businesses, migrated to one “common platform” and consolidated the data into one store, 4 years down the line the mother business decides to sell you off. And then there’s all the pain of separating the data *and* the systems so that daughter business can once again function as a separate entity.

In reality what’s usually required in acquisition is consolidated data. Consolidated systems are just seen as a price you pay for that. But when you want to unconsolidate (is that a word?) your data you end up with more pain cos you also have to unconsolidate your systems. And often that’s more expensive than migrating to the common platform in the first place.

The point I’m trying to make is being able to consolidate data without building a cathedral of code is good. But being able to deconsolidate without having to dismantle that cathedral is also not worth nothing… :-)

Michael Smethurst added these pithy words on Mar 24 10 at 10:49 pm


© Copyright 2010 Georgi Kobilarov . Thanks for visiting!