I have been seeing an enormous amount of momentum in the library industry toward “linked data”, often in the form of a fairly ambitious collective project to rebuild much of our infrastructure around data formats built on linked data.
I think linked data technology is interesting and can be useful. But I have some concerns about how it appears to me it’s being approached. I worry that “linked data” is being approached as a goal in and of itself, and what it is meant to accomplish (and how it will or could accomplish those things) is being approached somewhat vaguely. I worry that this linked data campaign is being approached in a risky way from a “project management” point of view, where there’s no way to know if it’s “working” to accomplish it’s goals until the end of a long resource-intensive process. I worry that there’s an “opportunity cost” to focusing on linked data in itself as a goal, instead of focusing on understanding our patrons needs, and how we can add maximal value for our patrons.
I am particularly wary of approaches to linked data that seem to assume from the start that we need to rebuild much or all of our local and collective infrastructure to be “based” on linked data, as an end in itself. And I’m wary of “does it support linked data” as the main question you asked when evaluating software to purchase or adopt. “Does it support linked data” or “is it based on linked data” can be too vague to even be useful as questions.
I also think some of those advocating for linked data in libraries are promoting an inflated sense of how widespread or successful linked data has been in the wider IT world. And that this is playing into the existing tendency for “magic bullet” thinking when it comes to information technology decision-making in libraries.
This long essay is an attempt to explain my concerns, based on my own experiences developing software and using metadata in the library industry. As is my nature, it turned into a far too long thought dump, hopefully not too grumpy. Feel free to skip around, I hope at least some parts end up valuable.
What is linked data?
The term “linked data” as used in these discussions basically refers to what I’ll call an “abstract data model” for data — a model of how you model data.
The model says that all metadata will be listed as a “triple” of (1) “subject”, (2) “predicate” (or relationship), and (3) “object”.
1. Object A [subject] 2. Is a [predicate] 3. book [object] 1. Object A [subject] 2. Has the ISBN [predicate] 3. "0853453535" [object] 1. Object A 2. has the title 3. "Revolution and evolution in the twentieth century" 1. Object A 2. has the author 3. Author N 1. Author N 2. has the first name 3. James 1. Author N 2. has the last name 3. Boggs
Our data is encoded as triples, statements of three parts: subject, predicate, object.
Linked data prefers to use identifiers for as many of these data elements as possible, and in particular identifiers in the form of URI’s.
“Object A” in my example above is basically an identifier, but similar to the “x” or “y” in an algebra problem, it has meaning only in the context of my example; someone elses “Object A” or “x” or “y” in another example might mean something different, and trying to throw them all together you’re going to get conflicts. URI’s are nice as identifiers in that, being based on domain names, they have a nice way of “namespacing” and avoiding conflicts, they are global identifiers.
# The identifiers I'm using are made up by me, and I use # example.org to get across I'm not using standard/conventional # identifiers used by others. 1. http://example.org/book/oclcnum/828033 [subject] 2. http://example.org/relationship/is_member_of_class [predicate] 3. http://example.org/type/Book [object] # We can see sometimes we still need string literals, not URIs 1. http://example.org/book/oclcnum/828033 2. http://example.org/relationship/has_title 3. "Revolution and evolution in the twentieth century" 1. http://example.org/book/oclcnum/828033 2. http://example.org/relationship/has_author 3. http://example.org/lccn/79128112 1. http://example.org/lccn/79128112 2. http://example.org/relationship/is_member_of_class 3. http://example.org/type/Person 1. http://example.org/lccn/79128112 2. http://example.org/relationship/has_name 3. "Boggs, James"
I call the linked data model an “abstract data model“, because it is a model for how you model data: As triples.
You still, as with any kind of data modeling, need what I’ll call a “domain model” — a formal listing of the entities you care about (books, people), and what attributes, properties, and relationships with each other those entities have.
In the library world, we’ve always created these formal domain models, even before there were computers. We’ve called it “vocabulary control” and “authority control”. In linked data, that domain model takes the form of standard shared URI identifiers for entities, properties, and relationships. Establishing standard shared URI’s with certain meanings for properties or relationships (eg `http://example.org/relationship/has_title` will be used to refer to the title, possibly with special technical specification of what we mean exactly by ‘title’) is basically “vocabulary control”, while establishing standard shared URI’s for entities (eg `http://example.org/lccn/79128112`) is basically “authority control”.
You still need common vocabularies for your linked data to be inter-operable, there’s no magic in linked data otherwise, linked data just says the data will be encoded in the form of triples, with the vocabularies being encoded in the form of URIs. (Or, you need what we’ve historically called a “cross-walk” to make data from different vocabularies inter-operable; linked data has certain standard ways to encode cross-walks for software to use them, but no special magic ways to automatically create them).
For an example of vocabulary (or “schema”) built on linked data technology, see schema.org.
You can see that through aggregating and combining multiple simple “triple” statements, we can build up a complex knowledge graph. Through basically one simple rule of “all data statements are triples”, we can build up remarkably complex data, and model just about any domain model we’d want. The library world is full of analytical and theoretically minded people who will find this theoretical elegance very satisfying, the ability to model any data at all as a bunch of triples. I think it’s kind of neat myself.
You really can model just about any data — any domain model — as linked data triples. We could take AACR2-MARC21 as a domain model, and express it as linked data by establishing a URI to be used as a predicate for every tag-subtag. There would be some tricky parts and edge cases, but once figured out, translation would be a purely mechanical task — and our data would contain no more information or utility output as linked data than it did originally, nor be any more inter-operable than it was originally, as is true of the output of any automated transformation process.
You can model anything as linked data, but some things are more convenient and some things less convenient. The nature of linked data as being building complex information graphs based on simple triples can actually make the linked data more difficult to deal with practically, as you can see looking at our made up examples above and trying to understand what they mean. By being so abstract and formally simple, it can get confusing.
Some things that might surprise you are kind of inconvenient to model as linked data. It can take some contortions to model an ordered sequence using linked data triples, or to figure out how to model alternate language representations (say of a title) in triples. There are potentially multiple ways to solve these goals, with certain patterns as established as standards for inter-operability, but they can be somewhat confusing to work with. Domain modeling is difficult already — having to fit your domain model into the linked data abstract model can be a fun intellectual exercise, but the need to undertake that exercise can make the task more difficult.
Other things are more convenient with linked data. You might have been wondering when the “linked” would come in.
Modeling all our data as individual “triples” makes it easier to merge data from multiple sources. You just throw all the triples together (You are still going to need to deal with any conflicts or inconsistencies that come about). Using URI’s as vocabulary identifiers means that you can throw all this data together from multiple sources, and you won’t have any conflicts, you won’t find one source using MARC tag 100 to mean “main entry” and another source using the 100 tag to mean all sorts of other things (See UNIMARC!).
Linked data vocabularies are always “open for extension”. let’s see we established that there’s a sort of thing as a `http://example.org/type/Book` and it has a number of properties and relationships including `http://example.org/relationship/has_title`. But someone realizes, gee, we really want to record the color of the book too. No problem, they just start using `http://mydomain.tld/relationship/color`, or whatever they want. It won’t conflict with any existing data (no need to find an unused MARC tag!), but of course it won’t be useful outside the originator’s own system unless other people adopt this convention, and software is written to recognize and do something with it (open for extension, but we still need to adopt common vocabularies).
And using URI’s is meant to make it more straightforward to combine data from multiple sources in another way, that an http URI actually points to a network location, that could be used to deliver more information about something, say, `http://example.org/book/oclcnum/828033`, in the form of more triples. Mechanics to make it easier to assemble (meta)data from multiple sources together.
There are mechanics meant to support aggregating, combining, and sharing data built into the linked data design — but the fundamental problems of vocabulary and authority control, of using the same or overlapping vocabularies (or creating cross-walks), of creating software that recognizes and does something useful with vocabulary elements actually in use, etc, — all still exist. So do business model challenges with entities that don’t want to share their data, or human labor power challenges with getting data recorded. I think it’s worth asking if the mechanical difficulties with, say, merging MARC records from different sources, are actually the major barriers to more information sharing/coordination in the present environment, vs these other factors.
“Semantic web” vs “linked data”? vs “RDF”?
The “semantic web” is an older term than “linked data”, but you can consider it to refer to basically the same thing. Some people cynically suggest “linked data” was meant to rebrand the “semantic web” technology after it failed to get much adoption or live up to it’s hype. The relationship between the two terms according to Tim Berners-Lee (who invented the web, and is either the inventor or at least a strong proponent of semantic web/linked data) seems to be that “linked data” is the specific technology or implementations of individual buckets of data, while the “semantic web” is the ecosystem that results from lots of people using it.
RDF, which stands for “Resource Description Framework”, and is actually the official name of the abstract data model of “triples”. Whereas then “linked data” could be understood as data using RDF and URI’s, and the “semantic web” the ecosystem that results from plenty of people doing it. Similarly, “RDF” can be roughly understood as a synonym.
Technicalities aside, “semantic web”, “linked data”, and “RDF” can generally be understood as rough synonyms when you see people discussing them — whatever term they use, they are talking about (meta)data modeled as “triples”, and the systems that are created by lots of such data integrated together over the web.
So. What do you actually want to do? Where are the users?
At a recent NISO forum on The Future of Library Resource Discovery, there was a session where representatives from 4(?) major library software vendors took Q&A from a moderator and the audience. There was a question about the vendor’s commitment to linked data. The first respondent (who I think was from EBSCO?) said something like
[paraphrased] Linked data is a tool. First you need to decide what you want to do, then linked data may or may not be useful to doing that.
I think that’s exactly right.
Some of the other respondents, perhaps prompted by the first answer, gave similar answers. While others (especially OCLC) remarked of their commitment to linked data and the various places they are using it. Of these though, I’m not sure any have actually resulted in any currently useful outcomes due to linked data usage.
Four or five years ago, talk of “user-centered design” was big in libraries — and in the software development world in general. For libraries (and other service organizations), user-centered design isn’t just about software — but software plays a key role in almost any service a contemporary library offers, quite often mediating the service through software, such that user-centered design in libraries probably always involves software.
For academic libraries, with a mission to help our patrons in research, teaching, and learning — user-centered design begins with understanding our patrons’ research and leaning processes. And figuring out the most significant interventions we can make to improve things for our patrons. What are their biggest pain points? Where can we make the biggest difference? To maximize our effectiveness when there’s an unlimited number of approaches we could take, you want to start with areas you can make a big improvement for the least resource investment.
Even if your institution lacks the resources to do much local research into user behavior, over the past few years a lot of interesting and useful multi-institutional research has been done by various national and international library organizations, such as reports from OCLC [a] [b], JISC [a], and Ithaka [a], [b], as well as various studies done by practitioner and published in journals.
To what extent is the linked data campaign informed by, motivated by, or based on what we know about our users behavior and needs? To what extent are the goals of the linked data campaign explicit and specific, and are those goals connected back to what our users need from us? Do we even know what we’re supposed to get out of it at all, beyond “data that’s linked better”, or “data that works well with the systems of entities outside the library industry”? (And for the latter, do we actually understand in what ways we want it to “work well”, for what reasons, and what it takes to accomplish that?) Are we asking for specific success stories from the pilot projects that have already been done? And connecting them to what we need to do provide our users?
To be clear, I do think goals to increase our own internal staff efficiency, or to improve the quality of our metadata that powers most of our services are legitimate as well. But they still need to be tied back to user needs (for instance, to know the metadata you are improving is actually the metadata you need and the improvements really will help us serve our users better), and be made explicit (so you can evaluate how well efforts at improvement are working).
I think the motivations for the linked data campaign can be somewhat unclear and implicit; when they are made explicit, they are sometimes very ambitious goals which require a lot of pieces falling into place (including third-party cooperation and investment that is hardly assured) for realization only in the long-term — and with unclear or not-made-explicit benefits for our patrons even if realized. For a major multi-institution multi-year resource-intensive campaign — this seems to me not sufficiently grounded in our user’s needs.
Is everyone else really doing it? Maybe not.
At another linked data presentation I attended recently, a linked data promoter said something along the lines of:
[paraphrased] Don’t do linked data because I say so, or because LC says so. Do it because it’s what’s necessary to keep us relevant in the larger information world, because it’s what everyone else is doing. Linked data is what lets Google give you good search results so quickly. Linked data is used by all the major e-commerce sites, this is how they do can accomplish what they can.
The thing is, from my observation and understanding of the industry and environment, I just don’t think it’s true that “everyone is doing it”.
Google does use data formats based on the linked data model for it’s “rich snippets” (link to a 2010 paper). This feature, which gives you a list of links next to a search result, is basically peripheral to the actual Google search.
Google also uses linked data to a somewhat more central extent in it’s Knowledge Graph feature, which provides “facts” in sidebars on search results. But most of the sources of data Google harvests from for it’s Knowledge Graph aren’t actually linked data, rather Google harvests and turns them into linked data internally — and then doesn’t actually expose the linked-data-ified data to the wider world. In fact, Google has several times announced initiatives to expose the collected and triple-ified data to the wider world, but they have not actually turned into supported products. This doesn’t necessarily say what advocates might want about the purported central role of linked data to Google, or what it means for linked data’s wider adoption. As far as I know or can find out, linked data does not play a role in the actual primary Google search results, just in the Knowledge Graph “fact boxes”, and the “rich snippets” associated with results.
In a 2013 blog post, Andreas Blumaeur, arguing for the increased use of linked data, still acknowledges: “Internet companies like Google and Facebook make use of linked data quite hesitantly.”
My sense is that the general industry understanding is that linked data has not caught on like people thought it would in the 2007-2012 heyday, and adoption has in fact slowed and reversed. (Google trend of linked data/semantic web)
An October 2014 post on Hacker News asks: ” A few years ago, it seemed as if everyone was talking about the semantic web as the next big thing. What happened? Are there still startups working in that space? Are people still interested?”
In the ensuing discussion on that thread (which I encourage you to read), you can find many opinions, including:
- “The way I see it that technology has been on the cusp of being successful for a long time” [but has stayed on the cusp]
- “A bit of background, I’ve been working in environments next to, and sometimes with, large scale Semantic Graph projects for much of my career — I usually try to avoid working near a semantic graph program due to my long histories of poor outcomes with them. I’ve seen uncountably large chunks of money put into KM projects that go absolutely nowhere and I’ve come to understand and appreciate many of the foundational problems the field continues to suffer from. Despite a long period of time, progress in solving these fundamental problems seem hopelessly delayed.”
- “For what it’s worth, I spent last month trying to use RDF tooling (Python bindings, triple stores) for a project recently, and the experience has left me convinced that none of it is workable for an average-size, client-server web application. There may well be a number of good points to the model of graph data, but in practice, 16 years of development have not lead to production-ready tools; so my guess is that another year will not fix it.”
- But also, to be fair: “There’s really no debate any more. We use the the technology borne by the ‘Semantic Web’ every day.” [Personally I think this claim was short on specifics, and gets disputed a bit in the comments]
At the very least, the discussion reveals that linked data/semantic web is still controversial in the industry at large, it is not an accepted consensus that it is “the future”, it has not “taken over.” And linked data is probably less “trendy” now in the industry at large than it was 4-6 years ago.
Talis was a major UK vendor of ILS/LMS library software, the companies history begins in 1969 as a library cooperative, similar to OCLC’s beginnings. In the mid-2000’s, they started shifting to a strategic focus on semantic web/linked data. In 2011, they actually sold off their library management division to focus primarily on semantic web technology. But quickly thereafter in 2012, they announced “that investment in the semantic web and data marketplace areas would cease. All efforts are now concentrated on the education business. ” They are now in the business of producing “enterprise teaching and learning platform” (compare to Blackboard, if I understand correctly), and apparently fairly succesful at it — but the semantic web focus didn’t pan out. (Wikipedia, Talis Group)
In 2009, The New York Times, to much excitement, announced a project to expose their internal subject vocabulary as linked data in. While the data is still up, it looks to me like was abandoned in 2010; there has been no further discussion or expansion of the service, and the data looks not to have been updated. Subject terms have a “latest use” field which seems to be stuck in May or June 2010 for every term I looked at (see Obama, Barak for instance), and no terms seem to be available for subjects that have become newsworthy since 2010 (no Carson, Ben, for instance).
In the semantic web/linked data heydey, a couple attempts to create large linked data databases were announced and generated a lot of interest. Freebase was started in 2007, acquired by Google in 2010… and shut down in 2014. DBPedia was began much earlier and still exists… but it doesn’t generate the excitement or buzz that it used to. The newer WikiData (2012) still exists, and is considered a successor to Freebase by some. It is generally acknowledged that none of these projects have lived up to initial hopes with regard to resulting in actual useful user-facing products or services, they remain experiments. A 2013 article, “There’s No Money in Linked Data“, suggests:
….[W]e started exploring the use of notable LD datasets such as DBpedia, Freebase, Geonames and others for a commercial application. However, it turns out that using these datasets in realistic settings is not always easy. Surprisingly, in many cases the underlying issues are not technical but legal barriers erected by the LD data publishers.
In Jan 2014, Paul Houle in “The trouble with DBpedia” argues that the problems are actually about data quality in DBPedia — specifically about vocabulary control, and how automatic creation of terms from use in wikipedia leads to inconsistent vocabularies . Houle thinks there are in fact technical solutions — but he, too, begins from the acknowledgement that DBPedia has not lived up to it’s expected promise. In a very lengthy slide deck from February 2015, “DBpedia Ontology and Mapping Problems”, vladimiralexiev has a perhaps different diagnosis of the problem, about ontology and vocabulary design, and he thinks he has solutions. Note that he too is coming from an experience of finding DBPedia not working out for his uses.
There’s disagreement about why these experiments haven’t panned out to be more than experiments or what can be done or what promise they (and linked data in general) still have — but pretty widespread agreement in the industry at large that they have not lived up to their initial expected promise or hype, and have as of yet delivered few if any significant user-facing products based upon them.
It is interesting that many diagnoses of the problems there are about the challenges of vocabulary control and developing shared vocabularies, the challenges of producing/extracting sufficient data that is fit to these vocabularies, as well as business model issues — sorts of barriers we are well familiar with in the library industry. Linked data is not a magic bullet that solves these problems, they will remain for us as barriers and challenges to our metadata dreams.
Semantic web and linked data are still being talked about, and worked on in some commercial quarters, to be sure. I have no doubt that there are people and units at Google who are interested in linked data, who are doing research and experimentation in that area, who are hoping to find wider uses for linked data at Google, although I do not think it is true that linked data is currently fundamentally core to Google’s services or products or how they work. What they have not done is taken over the web, or become a widely accepted fact in the industry. It is simply not true that “every major ecommerce site” has an architecture built on linked data. It is certainly true that some commercial sector actors continue to experiment with and explore uses of linked data.
But in fact, I would say that libraries and the allied cultural heritage sector, along with limited involvement from governmental agencies (especially in the UK, although not to the extent some would like, with 2010 cancellation of a program) and scholarly publishing (mainly I think of Nature Publishing), are primary drivers of linked data research and implementation currently. We are some of the leaders in linked data research, we are not following “where everyone else is going” in the private sector.
There’s nothing necessarily wrong with libraries being the drivers in researching and implementing interesting and useful technology in the “information retrieval” domain — our industry was a leader in information retrieval technology 40-80 years ago, it would be nice to be so again, sure!
But we what we don’t have is “everyone else is doing it” as a motivation or justification for our campaign — not that it must be a good idea because the major players on the web are investing heavily in it (they aren’t), and not that we will be able to inter-operate with everyone else the way we want if we just transition all of our infrastructure to linked data because that’s where everyone else will be too (they won’t necessarily, and everyone using linked data isn’t alone sufficient for inter-operability anyway, there needs to be coordination on vocabularies as well, just to start).
My Experiences in Data and Service Interoperability Challenges
For the past 7+ years, my primary work has involved integrating services and data from disparate systems, vendors, and sources, in the library environment. I have run into many challenges and barriers to my aspired integrations. They often have to do with difficulties in data interoperability/integration; or in the utility of our data, difficulties in getting what I actually need out of data. These are the sorts of issues linked data is meant to be at home in.
However, seldom in my experience do I run into a problem where simply transitioning infrastructure to linked data would provide a solution or fundamental advancement. The barriers often have at their roots business models (entities that have data you want to interoperate with, but don’t want their data to be shared because it keeping it close is of business value to them; or that simply have no business interest in investing in the technology needed to share data better); or lack of common shared domain models (vocabulary control); or lack of person power to create/record the ‘facts’ needed in machine-readable format.
Linked data would be neither necessary nor sufficient to solving most of the actual barriers I run into. Simply transitioning to a linked data-based infrastructure without dealing with the business or domain model issues would not help at all; and linked data is not needed to solve the business or domain model issues, and of unclear aid in addressing them: A major linked data campaign may not be the most efficient, cost effective, or quickest way to solve those problems.
Here are some examples.
What Serial Holdings Do We Have?
In our link resolver, powered by Umlaut, a request might come in for a particular journal article, say the made up article “Doing Things in Libraries”, by Melville Dewey, on page 22 of Volume 50 Issue 2 (1912) of the Journal of Doing Things.
I would really like my software to tell the user if we have this specific article in a bound print volume of the Journal of Doing Things, exactly which of our location(s) that bound volume is located at, and if it’s currently checked out (from the limited collections, such as off-site storage, we allow bound journal checkout).
My software can’t answer this question, because our records are insufficient. Why? Not all of our bound volumes are recorded at all, because when we transitioned to a new ILS over a decade ago, bound volume item records somehow didn’t make it. Even for bound volumes we have — or for summary of holdings information on bib/copy records — the holdings information (what volumes/issues are contained) are entered in one big string by human catalogers. This results in output that is understandable to a human reading it (at least one who can figure out what “v.251(1984:Jan./June)-v.255:no.8(1986)” means). But while the information is theoretically input according to cataloging standards — changes in practice over the years, varying practice between libraries, human variation and error, lack of validation from the ILS to enforce the standards, and lack of clear guidance from standards in some areas, mean that the information is not recorded in a way that software can clearly and unambiguously understand it.
This is a problem of varying degrees at other libraries too, including for digitized copies, for presumably similar reasons. In addition to at my own library, I’d like my software to be able to figure out if, say, HathiTrust has a digitized copy of this exact article (digitized copy of that volume and issue of that journal). Or if nearby libraries in WorldCat have a physical bound journal copy, if we don’t here. I can’t really reliably do that either.
We theoretically have a shared data format and domain model for serial holdings, Marc Format for Holdings Data (MFHD). A problem is that not all ILS’s actually implement MFHD, but more than that, that MFHD was designed in a world of printing catalog cards, and doesn’t actually specify the data in the right way to be machine actionable, to answer the questions we want answered. MFHD also allows for a lot of variability in how holdings are recorded, with some patterns simply not recording sufficient information.
In 2007 (!) I advocated more attention to ONIX for Serials Coverage as a domain model, because it does specify the recording of holdings data in a way that could actually serve the purposes I need. That certainly hasn’t happened, I’m not sure there’s been much adoption of the standard at all. It probably wouldn’t be that hard to convert ONIX for Serials Coverage to a linked data vocabulary; that would be fine, if not neccesarily advancing it’s power any. It’s powerful, if it were used, because it captures the data actually needed for the services we need in a way software can use, whether or not it’s represented as linked data. Actually implementing ONIX for Serials Coverage — with or without linked data — in more systems would have been a huge aid to me. Hasn’t happened.
Likewise, we could probably, without too much trouble, create a “linked data” translated version of MFHD. This would solve nothing, neither the problems with MFHD’s expressiveness nor adoption. Neither would having an ILS whose vendor advertises it as “linked data compatible” or whatever, make MFHD work any better. The problems that keep me from being able to do what I want have to do with domain modeling, with adoption of common models throughout the ecosystem, and with human labor to record data. They are not problems the right abstract data model can fix, they are not fundamentally problems of the mechanics of sharing data, but of the common recording of data in common formats with sufficient utility.
Lack of OCLC number or other identifiers in records
Even in a pre-linked data world, we have a bunch of already existing useful identifiers, which serve to, well, link our data. OCLC numbers as identifiers in the library world are prominent for their widespread adoption and (consequent) usefulness.
If several different library catalogs all use OCLC numbers on all their records, we can do a bunch of useful things, because we can easily know when a record in one catalog represents the same thing as a record in another. We can do collection overlap analysis. We can link from one catalog to another — oh, it’s checked out here, but this other library we have a reciprocal borrowing relationship with has a copy. We can easily create union catalogs that merge holdings from multiple libraries onto de-duplicated bibs. We can even “merge records” from different libraries — maybe a bib from one library has 505 contents but the bib from doesn’t, the one that doesn’t can borrow the data and know which bib it applies to. (Unless it’s licensed data they don’t have the right to share, a very real problem, which is not a technical one, and linked data can’t solve either).
We can do all of these things today, even without linked data. Except I can’t, because in my local catalog a great many (I think a majority) of records lack OCLC numbers.
Why? Many of them are legacy records from decades ago, before OCLC was the last library cooperative standing, from before we cared. All the records missing OCLC numbers aren’t legacy though. Many of them are contemporary records supplied by vendors (book jobbers for print, or e-book vendors), which come to us without OCLC numbers. (Why do we get from there instead of OCLC? Convenience? Price? No easy way to figure out how to bulk download all records for a given purchased ebook package from OCLC? Why don’t the vendors cooperate with OCLC enough to have OCLC numbers on their records — I’m not sure. Linked data solves none of these issues.)
Even better, I’d love to be able to figure out if the book represented by a record in my catalog exists in Google Books, with limited excerpts and searchability or even downloadable fulltext. Google Books actually has a pretty good API, and if Google Books data had OCLC numbers in it, I could easily do this. But even though Google Books got a lot of it’s data from OCLC Worldcat, Google Books data only rarely includes OCLC numbers, and does so in entirely undocumented ways.
Lack of OCLC numbers in data is a problem very much about linking data, but it’s not a problem linked data can solve. We have the technology now, the barriers are about human labor power, business models, priorities, costs. Whether the OCLC numbers that are there are in a MARC record in field 035, or expressed as a URI (say, `http://www.worldcat.org/oclc/828033`) and included in linked data — are entirely irrelevant to me, my barriers are about lack of OCLC numbers in the data, I could deal with them in just about any format at all, and linked data formats won’t help appreciably, but I can’t deal with the data being absent.
And in fact, if you convert your catalog to “linked data” but still lack OCLC numbers — you’re still going to have to solve that problem to do anything useful as far as “linking data”. The problem isn’t about whether the data is “linked data”, it’s about whether the data has useful identifiers that can be used to actually link to other data sets.
As you might guess from the fact that so many records in our local catalog don’t have OCLC numbers — most of the records in our local catalog also haven’t been updated since they were added years, decades ago. They might have typos that have since been corrected in WorldCat. They might represent ages ago cataloging practices (now inconsistent with present data) that have since been updated in WorldCat. The WorldCat records might have been expanded to have more useful data (better subjects, updated controlled author names, useful 5xx notes).
Our catalog doesn’t get these changes, because we don’t generally update our records from WorldCat, even for the records that do have OCLC numbers. (Also, naturally, not all of our holdings are actually listed with WorldCat, although this isn’t exactly the same set as those that lack OCLCnums in our local catalog). We could be doing that. Some libraries do, some libraries don’t. Why don’t the libraries that don’t? Some combination of cost (to vendors), local human labor, legacy workflows difficult to change, priorities, lack of support from our ILS software for automating this in an easy way, not wanting to overwrite legacy locally created data specific to the local community, maybe some other things.
Getting our local data to update when someone else has improved it, is again the kind of problem linked data is targeted at, but linked data won’t necessarily solve it, the biggest barriers are not about data format. After all, some libraries sync their records to updated WorldCat copy now, it’s possible with the technology we have now, for some. It’s not fundamentally a problem of mechanics with our data formats.
I wish our ILS software was better architected to support “sync with WorldCat” workflow with as little human intervention as possible. It doesn’t take linked data to do this — some are doing it already, but our vendor hasn’t chosen to prioritize it. And just because software “supports linked data” doesn’t guarantee it will do this. I’d want our vendors focusing on this actual problem (whether solved with or without linked data), not the abstract theoretical goal of “linked data”.
Difficulty of getting format/form info from our data, representing what users care about
One of the things my patrons care most about, when running across a record in the catalog for say, “Pride and Prejudice”, is format/genre issues.
Is a given record the book, or a film? A film on VHS, or DVD (you better believe that matters a lot to a patron!)? Or streaming online video? Or an ebook? Or some weird copy we have on microfiche? Or a script for a theatrical version? Or the recording of a theatrical performance? On CD, or LP, or an old cassette?
And I similarly want to answer this question when interrogating data at remote sources, say, WorldCat, or a neighboring libraries catalog.
It is actually astonishingly difficult to get this information out of MARC — the form/format/genre of a given record, in terms that match our users tasks or desires. Why? Well, because the actual world we are modeling is complicated and constantly changing over the decades, it’s unclear how to formally specify this stuff, especially when it’s changing all the time (Oh, it’s a blu-ray, which is kind of a DVD, but actually different). (I can easily tell you the record you’re looking at represents something that is 4.75″ wide though, in case you cared about that…)
It’s a difficult domain modeling problem. RDA actually tried to address this with better more formal theoretically/intellectually consistent modeling of what form/genre/format is all about. But even in the minority of records we have with RDA tags for this, it doesn’t quite work, I still can’t easily get my software to figure out if the item represented by a record is a CD or a DVD or a blu-ray DVD or what.
Well, it’s a hard problem of domain modeling, harder than it might seem at first glance. A problem that negatively impacts a wide swath of library users across library types. Representing data as linked data won’t solve it, it’s an issue of vocabulary control. Is anyone trying to solve it?
Workset Grouping and Relationships
Related to form/format/genre issues but a distinct issue, is all the different versions of a given work in my catalog.
There might be dozens of Pride and Prejudices. For the ones that are books, do they actually all have the same text in them? I don’t think Austen ever revised it in a new edition, so probably they all do even if published a hundred years apart — but that’s very not true of textbooks, or even general contemporary non-fiction which often exists in several editions with different text. Still, different editions of Pride and Prejudice might have different forwards or prefaces or notes, which might matter in some contexts. Or maybe different pagination, which matters for citation lookup.
And then there’s the movies, the audiobooks, the musical (?). Is the audiobook the exact text of the standard Pride and Prejudice just read aloud? Or an abridged version? Or an entirely new script with the same story? Are two videos the exact same movie one on VHS and one on DVD, or two entirely different dramatizations with different scripts and actors? Or a director’s cut?
These are the kinds of things our patrons care about, to find and identify an item that will meet their needs. But in search results, all I can do is give them a list of dozens of Pride and Prejudices, and let them try to figure it out — or maybe at least segment by video vs print vs audio. Maybe we’re not talking search results, maybe my software knows someone wants a particular edition (say, based on an input citation) and wants to tell the user if we have it, but good luck to my software in trying to figure out if we have that exact edition (or if someone else does, in worldcat or a neighboring library, or Amazon or Google Books).
This is a really hard problem too. And again it’s a problem of domain modeling, and equally of human labor in recording information (we don’t really know if two editions have the exact same text and pagination, someone has to figure it out and record it). Switching to the abstract data model of linked data doesn’t really address the barriers.
The library world made a really valiant effort at creating a domain model to capture these aspects of edition relationships that our users care about: FRBR. It’s seen limited adoption or influence in the 15+ years since it was released, which means it’s also seen limited (if any) additional development or fine-tuning, which anything trying to solve this difficult domain modeling problem will probably need (see RDA’s efforts at form/format/genre!). Linked data won’t solve this problem without good domain modeling, but ironically it’s some of the strongest advocates for “linked data” that I’ve seen arguing most strongly against doing anything more with adoption or development of FRBR ; as far as I am aware, the needed efforts to develop common domain modeling is not being done in the library linked data efforts. Instead, the belief seems to be if you just have linked data and let everyone describe things however they want, somehow it will all come together into something useful that answers the questions our patrons need, there’s no need for any common domain model vocabulary. I don’t believe existing industry experience with linked data, or software engineers experience with data modeling in general, supports this fantasy.
Multiple sources of holdings/licensing information
For the packages of electronic content we license/purchase (ebooks, serials), we have so many “systems of record”. The catalog’s got bib records for items from these packages, the ERM has licensing information, the link resolver has coverage and linking information, oh yeah and then they all need to be in EZProxy too, maybe a few more.
There’s no good way for software to figure out when a record from one system represents the same platform/package/license as in another system. Which means lots of manual work synchronizing things (EZProxy configuration, SFX kb). And things my software can do only with difficulty or simply can’t do at all — like, when presenting URLs to users, figure out if a URL in a catalog is really pointing to the same destination as a URL offered by SFX, even though they’re different URLs (epnet.com vs ebscohost.com?).
So one solution would be “why don’t you buy all these systems from the same vendor, and then they’ll just work together”, which I don’t really like as a suggested solution, and at any rate as a suggestion is kind of antithetical to the aims of the “linked data” movement, amirite?
So the solution would obviously be common identifiers used in all these systems, for platforms, packages and licenses, so software can know that a bib record in the catalog that’s identified as coming from package X for ISSN Y is representing the same access route as an entry in the SFX KB also identified as package X, and hey maybe we can automatically fetch the vendor suggested EZProxy config listed under identifier X too to make sure it’s activated, etc.
Why isn’t this happening already? Lack of cooperation between vendors, lack of labor power to create and maintain common identifiers, lack of resources or competence from our vendors (who can’t always even give us a reliable list in any format at all of what titles with what coverage dates are included in our license) or from our community at large (how well has DLF-ERMI worked out as far as actually being useful?).
In fact, if I imagined an ideal technical infrastructure for addressing this, linked data actually would be a really good fit here! But it could be solved without linked data too, and coming up with a really good linked data implementation won’t solve it, the problems are not mainly technical. We primarily need common identifiers in use between systems, and the barriers to that happening are not that the systems are not using “linked data”.
Google Books won’t link out to me
Google Scholar links back to my systems using OpenURL links. This is great for getting a user who choses to use Google Scholar for discovery back to me to provide access through a licensed or owned copy of what they want. (There are problems with Google Scholar knowing what institution they belong to so they can link back to the right place, but let’s leave that aside for now, it’s still way better than not being there).
I wish Google Books did the same thing. For that matter, I wish Amazon did the same thing. And lots of other people.
They don’t because they have no interest in doing so. Linked data won’t help, even though this is definitely an issue of, well, linking data.
OpenURL, a standard frozen in time
Oh yeah, so let’s talk about OpenURL. It’s been phenomenally succesful in terms of adoption in the library industry. And it works. It’s better that it exists than if it didn’t. It does help link disparate systems from different vendors.
The main problem is that it’s basically abandoned, I don’t know if there’s technically a maintanance group, but if there is, they aren’t doing much to improve OpenURL for scholarly citation linking, the use case it’s been successful in.
For instance, I wish there was a way to identify a citation as referring to a video or audio piece in OpenURL, but there isn’t.
Now, theoretically the “open for extension” aspect of linked data seems relevant here. If things were linked data and you needed a new data element or value, you could just add one. But really, there’s nothing stopping people from doing that with OpenURL now. Even if technically not allowed, you can just decide to say `&genre=video` in your OpenURL, and it probably won’t disturb anything (or you can figure out a way to do that not using the existing `genre` key that really won’t disturb anything).
The problem is that nothing will recognize it and do anything useful with it, and nobody is generating OpenURLs like that too. It’s not really an ‘open for extension’ problem, it’s a problem of getting the ecosystem to do it, of vocabulary consensus and implementation. That’s not a problem that linked data solves.
Linking from the open web to library copies
One of the biggest challenges always in the background of my work, is how we get people from the “open web” to our library owned and licensed resources and library-provided services. (Umlaut is engaged in this “space”).
This is something I’ve written about before (more times than that), so I won’t say too much more about it here.
How could linked data play a role in solving this problem? To be sure, if every web page everywhere included schema.org-type information fully specifying the nature of the scholarly works it was displaying, citing, or talking about — that would make it a lot easier to find a way to take this information and transfer the user to our systems to look up availability for the item cited. If every web page exposed well-specified machine-accessible data in a way that wasn’t linked-data-based, that would be fine too. But something like schema.org does look like the best bet here — but it’s not a bet I’d wager anything of significance on.
It would not be necessary to rebuild our infrastructure to be “based on linked data” in order to take advantage of structured information on external web pages, whether or not that structured information is “linked data”. (There are a whole bunch of other non-trivial challenges and barriers, but replacing our ILS/OPAC isn’t really a necessary one, neither is replacing our internal data format.). And we ourselves have limited influence over what “every web page everywhere” does.
Okay, so why are people excited about Linked Data?
If it’s not clear it will solve our problems, why is there so much effort being put into it? I’m not sure, but here’s some things I’ve observed or heard.
Most people, and especially library decision-makers, agree at this point that libraries have to change and adapt, in some major ways. But they don’t really know what this means, how to do it, what directions to go on. Once there’s a critical mass of buzz about “linked data”, it becomes the easy answer — do what everyone else is doing, including prestigious institutions, and if it ends up wrong, at least nobody can blame you for doing what everyone else agreed should be done. “No one ever got fired for buying IBM”
So linked data has got good marketing and a critical mass, in an environment where decision-makers want to do something but don’t know what to do. And I think that’s huge, but certainly that’s not everything, there are true believers who created that message in the first place, and unlike IBM they aren’t necessarily trying to get your dollars, they really do believe. (Although there are linked data consultants in the library world who make money by convincing you to go all-in on linked data…)
I think we all do know (and I agree) that we need our data and services to inter-operate better — within the library world, and crossing boundaries to the larger IT and internet industry and world. And linked data seems to hold the promise of making that happen, after all those are the goals of linked data. But as I’ve described above, I’m worried it’s a promise long on fantasy and short on specifics. In my experience, the true barriers to this are about good domain modeling, about the human labor to encode data, and about getting people we want to cooperate with us to use the same domain models.
I think those experienced with library metadata realize that good domain modelling (eg vocabulary control), and getting different actors to use the same standard formats is a challenge. I think they believe that linked data will somehow solve this challenge by being “open to extension” — I think this is a false promise, as I’ve tried to argue above. Software and sources need to agree on vocabulary in linked data too, to be able to use each others data. Or use the analog of a ‘crosswalk’, which we can already do, and which does not becomes appreciably easier with linked data — it becomes somewhat easier mechanically to apply a “cross-walk”, but the hard part in my experience is not mechanical application, but the intellectual labor to develop the “cross-walk” rules in the first place and maintain it as vocabularies change.
I think library decision-makers know that we “need our stuff to be in Google”, and have been told “linked data” is the way to do that, without having a clear picture of what “in Google” means. As I’ve said, I think Google’s investment in or commitment to linked data has been exagerated, but yes schema.org markup can be used by Google for rich snippets or Knowledge Graph fact boxes. And yes, I actually agree, our library web pages should use schema.org markup to expose their information in machine-readable markup. This will right now have more powerful results for library information web pages (rich snippets) than it will for catalog pages. But the good thing is it’s not that hard to do for catalog bib pages either, and does not requires rebuilding our entire infrastructure, our MARC data as it is can fairly easily be “cross-walked” to schema.org, as Dan Scott has usefully shown with VuFind, Evergreen, and Koha. Yes, all our “discovery” web pages should do this. Dan Scott reports that it hasn’t had a huge effect, but says it would if only everybody did it:
We don’t see it happening with libraries running Evergreen, Koha, and VuFind yet, realistically because the open source library systems don’t have enough penetration to make it worth a search engine’s effort to add that to their set of possible sources. However, if we as an industry make a concerted effort to implement this as a standard part of crawlable catalogue or discovery record detail pages, then it wouldn’t surprise me in the least to see such suggestions start to appear.
Maybe. I would not invest in an enormous resource-intensive campaign to rework our entire infrastructure based on what we hope Google (or similar actors) will do if we pull it off right — I wouldn’t count on it. But fortunately it doesn’t require that to include schema.org markup on our pages. It can fairly easily be done now with our data in MARC, and should indeed be done now; whatever barriers are keeping us from doing it more with our existing infrastructure, solving them are actually a way easier problem than rebuilding our entire infrastructure.
I think library metadataticians also realize that limited human labor resources to record data are a problem. I think the idea is that with linked data, we can get other people to create our metadata for us, and use it. It’s a nice vision. The barriers are that in fact not “everybody” is using linked data, let alone willing to share it; the existing business model issues that make them reluctant to share their data don’t go away with linked data; they may have no business interest in creating the data we want anyway (or may be hoping “someone else” does it too); and that common or compatible vocabularies are still needed to integrate data in this way. The hard parts are human labor and promulgating shared vocabulary, not the mechanics of combining data.
I think experienced librarians also realize that business model issues are a barrier to integration and sharing of data presently. Perhaps they think that the Linked Open Data campaign will be enough to pressure our vendors, suppliers, partners, and cooperatives to share their data, because they have to be “Linked Open Data” and we’re going to put the pressure on. Maybe they’re right! I hope so.
One linked data advocate told me, okay, maybe linked data is neither necessary nor sufficient to solve our real world problems. But we do have to come up with better and more inter-operable domain models for our data. And as long as we’re doing that, and we have to recreate all this stuff, we might as well do it based on linked data — it’s a good abstract data model, and it’s the one “everyone else is using” (which I don’t agree is happening, but it might be the one others outside the industry end up using — if they end up caring about data interoperability at all — and there are no better candidates, I agree, so okay).
Maybe. But I worry that rather than “might as well use linked data as long as we’re doing”, linked data becomes a distraction and a resource theft (opportunity cost?) from what we really need to do. We need to figure out what our patrons are up to and how we can serve them; and when it comes to data, we need to figure out what kinds of data we need to do that, and to come up with the domain models that capture what we need, and to get enough people (inside or outside the library world) to use compatible data models, and to get all that data recorded (by whom paid for by whom).
Sure that all can be done with linked data, and maybe there are even benefits to doing so. But in the focus on linked data, I worry we end up focusing on how most elegantly to fit our data into “linked data” (which can certainly be an interesting intellectual challenge, a fun game), rather than on how to model it to be useful for the uses we need (and figuring out what those are). I think it’s unjustified to think the rest will take care of itself if it’s just good linked data. The rest is actually the hard part. And I think it’s dangerous to undertake this endeavor as “throw everything else out and start over”, instead of looking for incremental improvements.
The linked data advocate I was talking to also suggested (or maybe it was my own suggestion in conversation, as I tried to look on the bright side): Okay, we know we need to “fix” all sorts of things about our data and inter-operability. We could be doing a lot of that stuff now, without linked data, but we’re not, our vendors aren’t, our consortiums and collaboratives aren’t. Your catalog does not have enough records OCLC numbers in it, or sync it’s data to OCLC, even though it theoretically could, and without linked data. It hasn’t been a priority. But the very successful marketing campaign of “linked data” will finally get people to pay attention to this stuff and do what they should have been doing.
Maybe. I hope so. It could definitely happen. But it won’t happen because linked data is a magic bullet, and it won’t happen without lots of hard work that isn’t about the fun intellectual game of creating domain models in linked data.
What should you do?
Okay, so maybe “linked data” is an unstoppable juggernaut in the library world, or at your library. (It certainly is not in the wider IT/web world, despite what some would have you believe). I certainly don’t think this tl;dr essay will change that.
And maybe that will work out for the best after all. I am not fundamentally opposed to semantic web/linked data/RDF. It’s an interesting technology although I’m not as in love with it as some, I recognize that it surely should play some part in our research and investigation into metadata evolution — even if we’re not sure how succesful it will be in the long-term.
Maybe it’ll all work out. But for you reading this who’s somehow made it this far, here’s what I think you can do to maximize those chances:
Be skeptical. Sure, of me too. If this essay gets any attention, I’m sure there will be plenty of arguments provided for how I’m missing the point or confused. Don’t simply accept claims from promoters or haters, even if everyone else seems to be accepting that — claims that “everyone is doing it”, or that linked data will solve all our problems. Work to understand what’s really going on so you can evaluate benefits and potentials yourself, and understand what it would take to get there. To that end…
Educate yourself about the technology of metadata. About linked data, sure. And about entity-relational modeling and other forms of data modeling, about relational databases, about XML, about what “everyone else” is really doing. Learn a little programming too, not to become a programmer, but to understand better how software and computation work, because all of our work in libraries is so intimately connected to that. Educating yourself on these things is the only way to evaluate claims made by various boosters or haters.
Treat the library as an IT organization. I think libraries already are IT organizations (at least academic libraries) — every single service we provide to our users now has a fundamental core IT component, and most of our services are actually mediated by software between us and our users. But libraries aren’t run recognizing them as IT organizations. This would involve staffing and other resource allocation. It would involve having sufficient leadership and decision-makers that are competent to make IT decisions, or know how to get advice from those who are. It’s about how the library thinks of itself, at all levels, and how decisions are made, and who is consulted when making them. That’s what will give our organizations the competence to make decisions like this, not just follow what everyone else seems to be doing.
Stay user centered. “Linked data” can’t be your goal. You are using linked data to accomplish something to add value to your patrons. We must understand what our patrons are doing, and how to intervene to improve their lives. We must figure out what services and systems we need to do that. Some work to that end, even incomplete and undeveloped if still serious and engaged, comes before figuring out what data we need to create those services. To the extent it’s about data, make sure your data modeling work and choices are about creating the data we need to serve our users, not just fitting into the linked data model. Be careful of “dumbing down” your data to fit more easily into a linked data model, but maybe losing what we actually need in the data to provide the services we need to provide.
Yes, include schema.org markup on your web pages and catalog/discovery pages. To expose it to Google, or to anyone. We don’t need to rework our entire infrastructure to do that, it can be done now, as Dan Scott has awesomely shown. As Google or anyone else significant recognizes more or different vocabularies, make use of them too by including them in your web pages, for sure. And, sure, make all your data (in any format, linked data or not) available on the open web, under an open license. If your vendor agreements prevent you from doing that, complain. Ask everyone else with useful data to do so too. Absolutely.
Avoid “Does it support linked data” as an evaluative question. I think that’s just not the right question to be asking when evaluating adoption or purchase of software. To the extent the question has meaning at all (and it’s not always clear what it means), it is dangerous for the library organization if it takes primacy over the specifics of how it will allow us to provide better services or provide services better.
Of course, put identifiers in your data. I don’t care if it’s as a URI or not, but yeah, make sure every record has an OCLC number. Yeah, every bib should record the LCCN or other identifier of it’s related creators authority records, not just a heading. This is “linked data” advice that I support without reservation, it is what our data needs with or without linked data. Put identifiers everywhere. I don’t care if they are in the form of URLs. Get your vendors to do this too. That your vendors want to give you bibs without OCLC numbers in them isn’t acceptable. Make them work with OCLC, make them see it’s in their business interests to do so, because the customers demand it. If you can get the records from OCLC even if it costs more — it might be worth it. I don’t mean to be an OCLC booster exactly, but shared authority control is what we need (for linked data to live up to it’s promise or for us to accomplish what we need without linked data), and OCLC is currently where it lives. Make OCLC share it’s data too, which it’s been doing already (in contrast to ~5 years ago) — keep them going — they should make it as easy and cheap as possible for even “competitors” to put OCLC numbers, VIAF numbers, any identifiers in their data, regardless of whether OCLC thinks it threatens their own business model, because it’s what we need as a community and OCLC is a non-profit cooperative that represents us.
Who should you trust? Trust nobody, heh. But if you want my personal advice, pay attention to Diane Hillmann. Hillmann is one of the people working in and advocating for linked data that I respect the most, who I think has a clear vision of what it will or won’t or only might do, and how to tie work to actual service goals not just theoretical models. Read what Hillmann writes, invite her to speak at your conferences, and if you need a consultant on your own linked data plans I think you could do a lot worse. If Hillmann had increased influence over our communal linked data efforts, I’d be a lot less worried about them.
Require linked data plans to produce iterative incremental value. I think the biggest threat of “linked data” is that it’s implemented as a campaign that won’t bear fruit until some fairly far distant point, and even then only if everything works out, and in ways many decision-makers don’t fully understand but just have a kind of faith in. That’s a very risky way to undertake major resource-intensive changes. Don’t accept an enormous investment whose value will only be shown in the distant future. As we’re “doing linked data”, figure out ways to get improvements that effect our users positively incrementally, at each stage, iteratively. Plan your steps so each one bears fruit one at a time, not just at the end. (Which incidentally, is good advice for any technology project, or maybe any project at all). Because we need to start improving things for our users now to stay alive. And because that’s the only way to evaluate how well it’s going, and even more importantly to adjust course based on what we learn, as we go. And it’s how we get out of assuming linked data will be a magic bullet if only we can do enough of it, and develop the capacity to understand exactly how it can help us, can’t help us, and will help us only if we do certain other things too. When people who have been working on linked data for literally years advocate for it, ask them to show you their successes, and ask for success in terms of actually improving our library services. If they don’t have much to show, or if they have exciting successes to demonstrate, that’s information to guide you in decision-making, resource allocation, and further question-asking.