Career change

Today is my last day here at Johns Hopkins University Libraries.

After Thanksgiving, I’ll be working, still here in Baltimore, at Friends of the Web, a small software design, development, and consulting company.

I’m excited to be working collaboratively with a small group of other accomplished designers and developers, with a focus on quality. I’m excited by Friends of the Web’s collaborative and egalitarian values, which show in how they do their work and treat each other, how decisions are made, and even in the compensation structure.

Friends of the Web has technical expertise in Rails and Ember (and other MVC JavaScript frameworks), as well as iOS development. They also have significant in-house professional design expertise.

Their clientele is intentionally diverse; a lot of e-commerce, but also educational and cultural institutions, among others.

They haven’t done work for libraries before, but are always interested in approaching new business and technological domains, and are open to accepting work from libraries. I’m hoping that it will work out to keep a hand in the library domain at my new position, although any individual project may or may not work out for contracting with us, depending on whether it’s a good fit for everyone’s needs. But if you’re interested in contracting an experienced team of designers and developers (including an engineer with an MLIS and 9 years of experience in the library industry: me!) to work on your library web (or iOS) development needs, please feel free to get in touch to talk about it. You could hypothetically hire just me to work on a project, or have access to a wider team with diverse experience, including design expertise.

Libraries, I love you, but I had to leave you, maybe, at least for now

I actually really love libraries, and have enjoyed working in the industry.

It may or may not be surprising to you that I really love books — the kind printed on dead trees. I haven’t gotten into ebooks, and it’s a bit embarrassing how many boxes of books I moved when I moved houses last month.

I love giant rooms full of books. I feel good being in them.

Even if libraries are moving away from being giant rooms full of books, they’ve still got a lot to like. In a society in which information technology and data are increasingly central, public and academic libraries are “civil society” organizations which can serve users’ information needs and advocate for users, with libraries’ interests aligned with their users’, because libraries are not (mainly) trying to make money off their patrons or their data. This is pretty neat, and important.

In 2004, already a computer programmer, I enrolled in an MLIS program because I wanted to be a “librarian”, not thinking I would still be a software engineer. But I realized that with software so central to libraries, if I were working in a non-IT role I could be working with software I knew could be better but couldn’t do much about — or I could work on making that software better for patrons and staff and the mission of the library.

And I’ve found the problems I work on as a software engineer in an academic library rewarding. Information organization and information retrieval are very interesting areas to be working on. In an academic library specifically, I’ve found the mission of creating services that help our patrons with their research, teaching, and learning to be personally rewarding as well.  And I’ve enjoyed being able to do this work in the open, with most of my software open source, working and collaborating with a community of other library technologists across institutions.  I like working as a part of a community with shared goals, not just at my desk crunching out code.

So why am I leaving?

I guess I could say that at my previous position I no longer saw a path to make the kind of contributions to developing and improving libraries’ technological infrastructures and capacities that I wanted to make. We could leave it at that.  Or you could say I was burned out. I wasn’t blogging as much. I wasn’t collaborating as much or producing as much code. I had stopped religiously going to Code4Lib conferences. I dropped out of the Code4Lib Journal without a proper resignation or goodbye (sorry editors, and you’re doing a great job).

9 years ago when, with a fresh MLIS, I entered the library industry, it seemed like a really exciting time in libraries, full of potential.  I quickly found the Code4Lib community, which gave me a cohort of peers and an orientation to the problems we faced. We knew that libraries were behind in catching up to the internet age, we knew (or thought we knew) that we had limited time to do something about it before it was “too late”, and we (the code4libbers in this case) thought we could do something about it, making critical interventions from below. I’m not sure how well we (the library industry in general or we upstart code4libbers) have fared in the past decade, or how far we’ve gotten. Many of the Code4Lib cohort I started out with have dropped out of the community too, one way or another, and the IRC channel seems a dispiriting place to me lately (but maybe that’s just me).  Libraries aren’t necessarily focusing on the areas I think most productive, and I now know how hard it is to have an impact on that. (No, I’m not leaving because of linked data, but you can take that essay as my parting gift, or parting shot.) I know I’ve made some mistakes in personal interactions, and hadn’t succeeded at building collaboration instead of conflict in some projects I had been involved in, with lasting consequences. I wasn’t engaging in the kinds of discussions and collaborations I wanted to be at my present job, and had run out of ideas for how to change that.

So I needed a change of perspective and circumstance. And I wanted to stay in Baltimore (where I just bought a house!). And now here I am at Friends of the Web!  I’m excited to be making a fresh start in a different sort of organization, working with a great collaborative team.

I am also excited by the potential to keep working in the library industry from a completely different frame of reference, as a consultant/contractor.  Maybe that’ll end up happening, maybe it won’t, but if you have library web development or consulting work you’d like to discuss, please do ring me up.

What will become of Umlaut?

There is no cause for alarm! Kevin Reiss and his team at Princeton have been working on an Umlaut rollout there (I’m not sure if they are yet in production).  They plan to move forward with their implementation, and Kevin has agreed to be a (co-?)maintainer/owner of the Umlaut project.

Also, Umlaut has been pretty stable code lately; it hasn’t gotten a whole lot of commits, but it just keeps on trucking and working well. While there were a variety of architectural improvements I would have liked to make, I fully expect Umlaut to remain solid software for a while with or without major changes.

This actually reminds me of how I came to be the Umlaut lead developer in the first place. Umlaut was originally developed by Ross Singer, who was working at Georgia Tech at the time. Seeing improving our “link resolver” experience as a priority, and with Umlaut already existing and supported, I decided, after talking to Ross about it, to work on adopting Umlaut here. But before we actually went live in production — Ross had left Georgia Tech, they had decided to stop using Umlaut, and I found myself lead developer! (The more things change… but as far as I know, Hopkins plans to continue using Umlaut.)  It threw me for a bit of a loop to suddenly be deploying open source software as a community of one institution, but I haven’t regretted it; I think Umlaut has been very successful for our ability to serve patrons with what they need here, and at other libraries.

I am quite proud of Umlaut, and feel kind of parental towards it. I think intervening in the “last mile” of access, delivery, and other specific-item services is exactly the right place to be, to have the biggest impact on our users. For both long-term strategic concerns — we don’t know where our users will be doing ‘discovery’, but there’s a greater chance we’ll still be in the “last mile” business no matter what. And for immediate patron benefits — our user interviews consistently show that our “Find It” link resolver service is both one of the most used services by our patrons, and one of the services with the highest satisfaction.  And Umlaut’s design as a “just in time” aggregator of foreign services is just right for addressing needs as they come up — the architecture worked very well for integrating BorrowDirect consortial disintermediated borrowing into our link resolver and discovery, despite the very slow response times of the remote API.

I think this intervention in “last mile” delivery and access, with a welcome mat to any discovery wherever it happens, is exactly where we need to be to maximize our value to our patrons and “save the time of the reader”/patron, in the context of the affordances we have in our actually existing infrastructures — and I think it has been quite successful.

So why hasn’t Umlaut seen more adoption? I have been gratified by, and grateful for, the adoption it has gotten at a handful of other libraries (including NYU, Princeton, and the Royal Library of Denmark), but I think its potential goes further. Is it a failure of marketing? Is it different priorities — are academic libraries simply not interested in intervening to improve research and learning for our patrons, preferring to invest in less concrete directions?  Are in-house technological capacity requirements simply too intimidating? (I’ve never tried to sugarcoat or underestimate the need for some local IT capacity to run Umlaut, although I’ve tried to make the TCO as low as I can, I think fairly successfully.) Is Umlaut simply too technically challenging for the capacity of actual libraries, even if they think the investment is worth it?

I don’t know, but if it’s the latter, I wonder if access to contractor/vendor support would help, and whether any libraries would be interested in paying a vendor/contractor for Umlaut implementation, maintenance, or even cloud hosting as a service. Well, as you know, I’m available now. I would be delighted to keep working on Umlaut for interested libraries. The business details would have to be worked out, but I could see contracting to set up Umlaut for a library, or providing a fully managed cloud service offering of Umlaut. Both are hypothetically things I could do at my new position, if the business details can be worked out satisfactorily for all involved. If you’re interested, definitely get in touch.

Other open source contributions?

I have a few other library-focused open source projects I’ve authored that I’m quite proud of, but will probably not be spending much time on in the near future. These include traject, bento_search, and borrow_direct.

I wrote Traject with Bill Dueber, and it will remain in his very capable hands.

The others I’m pretty much sole developer on. But I’m still around on the internet to answer questions, provide advice, or, most importantly, accept pull requests for needed changes.  bento_search and borrow_direct are both, in my not so humble opinion, really well-architected and well-written code, which I think should have legs, and which others should find fairly easy to pick up. If you are using one of these projects, are interested, and send a good pull request or two, odds are I’d give you commit/release rights.

What will happen to this blog?

I’m not sure! The focus of this blog has been library technology and technology as implemented in libraries.  I haven’t been blogging as much as I used to lately anyway. And I don’t anticipate spending as much (any?) time on libraries in the immediate future, although I suspect I’ll keep following what’s going on for at least a bit.

Will I have much to say on libraries and technology anyway? Will the focus change? We will see!

So long and thanks for all the… fiche?

Hopefully not actually a “so long”, I hope to still be around one way or another. I am thinking of going to the Code4Lib conference in (conveniently for me) Philadelphia in the spring.

Much respect to everyone who’s still in the trenches, often in difficult organizational/political environments, trying to make libraries the best they can be.


Linked Data Caution

I have been seeing an enormous amount of momentum in the library industry toward “linked data”, often in the form of a fairly ambitious collective project to rebuild much of our infrastructure around data formats built on linked data.

I think linked data technology is interesting and can be useful. But I have some concerns about how it appears to me it’s being approached. I worry that “linked data” is being approached as a goal in and of itself, and that what it is meant to accomplish (and how it will or could accomplish those things) is being approached somewhat vaguely.  I worry that this linked data campaign is being approached in a risky way from a “project management” point of view, where there’s no way to know if it’s “working” to accomplish its goals until the end of a long resource-intensive process.  I worry that there’s an “opportunity cost” to focusing on linked data in itself as a goal, instead of focusing on understanding our patrons’ needs, and how we can add maximal value for our patrons.

I am particularly wary of approaches to linked data that seem to assume from the start that we need to rebuild much or all of our local and collective infrastructure to be “based” on linked data, as an end in itself.  And I’m wary of “does it support linked data” as the main question you ask when evaluating software to purchase or adopt.  “Does it support linked data” or “is it based on linked data” can be too vague to even be useful as questions.

I also think some of those advocating for linked data in libraries are promoting an inflated sense of how widespread or successful linked data has been in the wider IT world.  And that this is playing into the existing tendency for “magic bullet” thinking when it comes to information technology decision-making in libraries.

This long essay is an attempt to explain my concerns, based on my own experiences developing software and using metadata in the library industry. As is my nature, it turned into a far too long thought dump, hopefully not too grumpy.  Feel free to skip around, I hope at least some parts end up valuable.

What is linked data?

The term “linked data” as used in these discussions basically refers to what I’ll call an “abstract data model” for data — a model of how you model data.

The model says that all metadata will be listed as a “triple” of  (1) “subject”,  (2) “predicate” (or relationship), and (3) “object”.

1. Object A [subject] 
2. Is a [predicate] 
3. book [object]

1. Object A [subject] 
2. Has the ISBN [predicate] 
3. "0853453535" [object]

1. Object A 
2. has the title 
3. "Revolution and evolution in the twentieth century"

1. Object A
2. has the author
3. Author N

1. Author N
2. has the first name
3. James

1. Author N
2. has the last name
3. Boggs

Our data is encoded as triples, statements of three parts: subject, predicate, object.

Linked data prefers to use identifiers for as many of these data elements as possible, and in particular identifiers in the form of URIs.

“Object A” in my example above is basically an identifier, but similar to the “x” or “y” in an algebra problem, it has meaning only in the context of my example; someone else’s “Object A” or “x” or “y” in another example might mean something different, and trying to throw them all together you’re going to get conflicts.  URIs are nice as identifiers in that, being based on domain names, they have a nice way of “namespacing” and avoiding conflicts; they are global identifiers.

# The identifiers I'm using are made up by me, on a made-up domain
# (mydomain.tld), to get across that I'm not using standard/conventional
# identifiers used by others.
1. http://mydomain.tld/objects/a [subject]
2. http://mydomain.tld/relationship/has_author [predicate]
3. http://mydomain.tld/authors/n [object]

# We can see sometimes we still need string literals, not URIs
1. http://mydomain.tld/objects/a
2. http://mydomain.tld/relationship/has_title
3. "Revolution and evolution in the twentieth century"

1. http://mydomain.tld/authors/n
2. http://mydomain.tld/relationship/name
3. "Boggs, James"

I call the linked data model an “abstract data model”, because it is a model for how you model data: as triples.

You still, as with any kind of data modeling, need what I’ll call a “domain model” — a formal listing of the entities you care about (books, people), and what attributes, properties, and relationships with each other those entities have.

In the library world, we’ve always created these formal domain models, even before there were computers. We’ve called it “vocabulary control” and “authority control”.  In linked data, that domain model takes the form of standard shared URI identifiers for entities, properties, and relationships.  Establishing standard shared URIs with certain meanings for properties or relationships (eg `http://mydomain.tld/relationship/has_title` will be used to refer to the title, possibly with special technical specification of what we mean exactly by ‘title’) is basically “vocabulary control”, while establishing standard shared URIs for entities (eg `http://mydomain.tld/authors/n` for a particular author) is basically “authority control”.

You still need common vocabularies for your linked data to be inter-operable; there’s no magic in linked data otherwise, linked data just says the data will be encoded in the form of triples, with the vocabularies being encoded in the form of URIs.  (Or, you need what we’ve historically called a “cross-walk” to make data from different vocabularies inter-operable; linked data has certain standard ways to encode cross-walks so software can use them, but no special magic ways to automatically create them.)

For an example of a vocabulary (or “schema”) built on linked data technology, see schema.org.

You can see that through aggregating and combining multiple simple “triple” statements, we can build up a complex knowledge graph.  Through basically one simple rule of “all data statements are triples”, we can build up remarkably complex data, and model just about any domain model we’d want.  The library world is full of analytical and theoretically minded people who will find this theoretical elegance very satisfying: the ability to model any data at all as a bunch of triples.  I think it’s kind of neat myself.

You really can model just about any data — any domain model — as linked data triples. We could take AACR2-MARC21 as a domain model, and express it as linked data by establishing a URI to be used as a predicate for every tag/subfield. There would be some tricky parts and edge cases, but once figured out, translation would be a purely mechanical task — and our data would contain no more information or utility expressed as linked data than it did originally, nor be any more inter-operable than it was originally, as is true of the output of any automated transformation process.
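To make “purely mechanical” concrete, here’s a minimal sketch in Python using the rdflib library. It assumes we’ve already parsed a MARC record into (tag, subfield, value) tuples somehow; the predicate URIs are made up by me, in the same spirit as the examples above — this is an illustration, not a real converter.

from rdflib import Graph, Literal, Namespace, URIRef

MARC = Namespace("http://mydomain.tld/marc21/")         # made-up vocabulary
record_uri = URIRef("http://mydomain.tld/records/123")  # made-up record identifier

# Pretend we've already parsed a MARC record into (tag, subfield, value) tuples.
parsed_fields = [
    ("100", "a", "Boggs, James"),
    ("245", "a", "Revolution and evolution in the twentieth century"),
    ("020", "a", "0853453535"),
]

g = Graph()
for tag, subfield, value in parsed_fields:
    # One made-up predicate per tag/subfield, eg http://mydomain.tld/marc21/245-a
    g.add((record_uri, MARC[tag + "-" + subfield], Literal(value)))

print(g.serialize(format="turtle"))

The output is perfectly valid linked data, and it knows exactly as much about the book as the MARC record did.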

You can model anything as linked data, but some things are more convenient and some things less convenient. The nature of linked data, building complex information graphs out of simple triples, can actually make the data more difficult to deal with practically, as you can see by looking at our made-up examples above and trying to understand what they mean. By being so abstract and formally simple, it can get confusing.

Some things that might surprise you are kind of inconvenient to model as linked data. It can take some contortions to model an ordered sequence using linked data triples, or to figure out how to model alternate language representations (say of a title) in triples. There are potentially multiple ways to meet these goals, with certain patterns established as standards for inter-operability, but they can be somewhat confusing to work with.  Domain modeling is difficult already — having to fit your domain model into the linked data abstract model can be a fun intellectual exercise, but the need to undertake that exercise can make the task more difficult.
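For instance, here’s a hedged sketch of one standard way to record an ordered sequence (say, author order) as triples, using rdflib’s Collection helper to build the rdf:first/rdf:rest “linked list” pattern; the URIs are again made up.

from rdflib import BNode, Graph, Namespace
from rdflib.collection import Collection

EX = Namespace("http://mydomain.tld/")
g = Graph()

book = EX["objects/a"]
authors = [EX["authors/n"], EX["authors/m"]]   # made-up author URIs, in order

list_head = BNode()
Collection(g, list_head, authors)              # expands into rdf:first / rdf:rest triples
g.add((book, EX["relationship/author_list"], list_head))

print(g.serialize(format="turtle"))

The plain “has the author” triples earlier said nothing about order; recording order is certainly possible, but the price is this noticeably more roundabout structure.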

Other things are more convenient with linked data. You might have been wondering when the “linked” would come in.

Modeling all our data as individual “triples” makes it easier to merge data from multiple sources. You just throw all the triples together (you are still going to need to deal with any conflicts or inconsistencies that come about).   Using URIs as vocabulary identifiers means that you can throw all this data together from multiple sources without identifier conflicts; you won’t find one source using MARC tag 100 to mean “main entry” and another source using the 100 tag to mean all sorts of other things (see UNIMARC!).

Linked data vocabularies are always “open for extension”. Let’s say we established that there’s a sort of thing called a `http://mydomain.tld/types/book` and it has a number of properties and relationships including `http://mydomain.tld/relationship/has_title`.  But someone realizes, gee, we really want to record the color of the book too. No problem, they just start using `http://mydomain.tld/relationship/color`, or whatever they want. It won’t conflict with any existing data (no need to find an unused MARC tag!), but of course it won’t be useful outside the originator’s own system unless other people adopt this convention, and software is written to recognize and do something with it (open for extension, but we still need to adopt common vocabularies).
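A quick sketch of both of these points, again with rdflib and made-up URIs: merging is just throwing the triple sets together, and extension is just adding a triple with a brand-new predicate.

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://mydomain.tld/")
library_a, library_b = Graph(), Graph()

library_a.add((EX["objects/a"], EX["relationship/has_title"],
               Literal("Revolution and evolution in the twentieth century")))
library_b.add((EX["objects/a"], EX["relationship/has_title"],
               Literal("Revolution & evolution in the 20th century")))   # an inconsistency

merged = library_a + library_b   # graph union: just throw all the triples together

# "Open for extension": a brand-new predicate, no unused MARC tag required.
merged.add((EX["objects/a"], EX["relationship/color"], Literal("red")))

for title in merged.objects(EX["objects/a"], EX["relationship/has_title"]):
    print(title)   # both conflicting title statements survive; reconciling them is up to us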

And using URIs is meant to make it more straightforward to combine data from multiple sources in another way: an http URI actually points to a network location, which could be used to deliver more information about something, say `http://mydomain.tld/authors/n`, in the form of more triples. These are mechanics meant to make it easier to assemble (meta)data from multiple sources.
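And the “pointing to a network location” part, sketched: rdflib can try to dereference an http URI and parse whatever triples come back. (The URI here is made up, so this would only actually work against a server that really serves RDF at that address.)

from rdflib import Graph

g = Graph()
g.parse("http://mydomain.tld/authors/n")   # fetch the URI and parse any triples it serves
print(len(g), "triples retrieved about this author")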

There are mechanics meant to support aggregating, combining, and sharing data built into the linked data design — but the fundamental problems of vocabulary and authority control, of using the same or overlapping vocabularies (or creating cross-walks), of creating software that recognizes and does something useful with vocabulary elements actually in use, etc. — all still exist. So do business model challenges with entities that don’t want to share their data, or human labor power challenges with getting data recorded. I think it’s worth asking if the mechanical difficulties with, say, merging MARC records from different sources are actually the major barriers to more information sharing/coordination in the present environment, versus these other factors.

“Semantic web” vs. “linked data” vs. “RDF”?

The “semantic web” is an older term than “linked data”, but you can consider it to refer to basically the same thing.  Some people cynically suggest “linked data” was meant to rebrand the “semantic web” technology after it failed to get much adoption or live up to its hype.  The relationship between the two terms according to Tim Berners-Lee (who invented the web, and is either the inventor or at least a strong proponent of semantic web/linked data) seems to be that “linked data” is the specific technology or implementations of individual buckets of data, while the “semantic web” is the ecosystem that results from lots of people using it.

RDF stands for “Resource Description Framework”, and is actually the official name of the abstract data model of “triples”.  “Linked data”, then, could be understood as data using RDF and URIs, and the “semantic web” as the ecosystem that results from plenty of people doing it. So “RDF”, too, can be roughly understood as a synonym.

Technicalities aside, “semantic web”, “linked data”, and “RDF” can generally be understood as rough synonyms when you see people discussing them — whatever term they use, they are talking about (meta)data modeled as “triples”, and the systems that are created by lots of such data integrated together over the web.

So. What do you actually want to do? Where are the users?

At a recent NISO forum on The Future of Library Resource Discovery, there was a session where representatives from 4(?) major library software vendors took Q&A from a moderator and the audience.  There was a question about the vendors’ commitment to linked data. The first respondent (who I think was from EBSCO?) said something like

[paraphrased] Linked data is a tool. First you need to decide what you want to do, then linked data may or may not be useful to doing that.

I think that’s exactly right.

Some of the other respondents, perhaps prompted by the first answer, gave similar answers. Others (especially OCLC) remarked on their commitment to linked data and the various places they are using it.  Of these, though, I’m not sure any have actually resulted in currently useful outcomes attributable to linked data.

Four or five years ago, talk of “user-centered design” was big in libraries — and in the software development world in general.  For libraries (and other service organizations), user-centered design isn’t just about software — but software plays a key role in almost any service a contemporary library offers, quite often mediating the service through software, such that user-centered design in libraries probably always involves software.

For academic libraries, with a mission to help our patrons in research, teaching, and learning — user-centered design begins with understanding our patrons’ research and learning processes.  And figuring out the most significant interventions we can make to improve things for our patrons. What are their biggest pain points? Where can we make the biggest difference? To maximize our effectiveness when there’s an unlimited number of approaches we could take, we want to start with areas where we can make a big improvement for the least resource investment.

Even if your institution lacks the resources to do much local research into user behavior, over the past few years a lot of interesting and useful multi-institutional research has been done by various national and international library organizations, such as reports from OCLC [a] [b], JISC [a], and Ithaka [a], [b], as well as various studies done by practitioners and published in journals.

To what extent is the linked data campaign informed by, motivated by, or based on what we know about our users’ behavior and needs?  To what extent are the goals of the linked data campaign explicit and specific, and are those goals connected back to what our users need from us?  Do we even know what we’re supposed to get out of it at all, beyond “data that’s linked better”, or “data that works well with the systems of entities outside the library industry”? (And for the latter, do we actually understand in what ways we want it to “work well”, for what reasons, and what it takes to accomplish that?)  Are we asking for specific success stories from the pilot projects that have already been done? And connecting them to what we need to provide our users?

To be clear, I do think goals to increase our own internal staff efficiency, or to improve the quality of our metadata that powers most of our services are legitimate as well. But they still need to be tied back to user needs (for instance, to know the metadata you are improving is actually the metadata you need and the improvements really will help us serve our users better), and be made explicit (so you can evaluate how well efforts at improvement are working).

I think the motivations for the linked data campaign can be somewhat unclear and implicit; when they are made explicit, they are sometimes very ambitious goals which require a lot of pieces falling into place (including third-party cooperation and investment that is hardly assured) for realization only in the long term — and with unclear or not-made-explicit benefits for our patrons even if realized.  For a major multi-institution, multi-year, resource-intensive campaign — this seems to me not sufficiently grounded in our users’ needs.

Is everyone else really doing it? Maybe not.

At another linked data presentation I attended recently, a linked data promoter said something along the lines of:

[paraphrased] Don’t do linked data because I say so, or because LC says so. Do it because it’s what’s necessary to keep us relevant in the larger information world, because it’s what everyone else is doing. Linked data is what lets Google give you good search results so quickly. Linked data is used by all the major e-commerce sites, this is how they can accomplish what they can.

The thing is, from my observation and understanding of the industry and environment, I just don’t think it’s true that “everyone is doing it”.

Google does use data formats based on the linked data model for its “rich snippets” (link to a 2010 paper).  This feature, which gives you a list of links next to a search result, is basically peripheral to the actual Google search.

Google also uses linked data to a somewhat more central extent in its Knowledge Graph feature, which provides “facts” in sidebars on search results. But most of the sources Google harvests from for its Knowledge Graph aren’t actually linked data; rather, Google harvests them and turns them into linked data internally — and then doesn’t actually expose the linked-data-ified data to the wider world.  In fact, Google has several times announced initiatives to expose the collected and triple-ified data to the wider world, but they have not actually turned into supported products.  This doesn’t necessarily say what advocates might want about the purported central role of linked data to Google, or what it means for linked data’s wider adoption.  As far as I know or can find out, linked data does not play a role in the actual primary Google search results, just in the Knowledge Graph “fact boxes”, and the “rich snippets” associated with results.

In a 2013 blog post, Andreas Blumauer, arguing for the increased use of linked data, still acknowledges: “Internet companies like Google and Facebook make use of linked data quite hesitantly.”

My sense is that the general industry understanding is that linked data has not caught on like people thought it would in the 2007-2012 heyday, and adoption has in fact slowed and reversed. (Google trend of linked data/semantic web)

An October 2014 post on Hacker News asks: ” A few years ago, it seemed as if everyone was talking about the semantic web as the next big thing. What happened? Are there still startups working in that space? Are people still interested?”

In the ensuing discussion on that thread (which I encourage you to read), you can find many opinions, including:

  • “The way I see it that technology has been on the cusp of being successful for a long time” [but has stayed on the cusp]
  • “A bit of background, I’ve been working in environments next to, and sometimes with, large scale Semantic Graph projects for much of my career — I usually try to avoid working near a semantic graph program due to my long histories of poor outcomes with them.  I’ve seen uncountably large chunks of money put into KM projects that go absolutely nowhere and I’ve come to understand and appreciate many of the foundational problems the field continues to suffer from. Despite a long period of time, progress in solving these fundamental problems seem hopelessly delayed.”
  • “For what it’s worth, I spent last month trying to use RDF tooling (Python bindings, triple stores) for a project recently, and the experience has left me convinced that none of it is workable for an average-size, client-server web application. There may well be a number of good points to the model of graph data, but in practice, 16 years of development have not lead to production-ready tools; so my guess is that another year will not fix it.”
  • But also, to be fair: “There’s really no debate any more. We use the the technology borne by the ‘Semantic Web’ every day.” [Personally I think this claim was short on specifics, and gets disputed a bit in the comments]

At the very least, the discussion reveals that linked data/semantic web is still controversial in the industry at large; it is not an accepted consensus that it is “the future”, and it has not “taken over.” And linked data is probably less “trendy” now in the industry at large than it was 4-6 years ago.

Talis was a major UK vendor of ILS/LMS library software; the company’s history begins in 1969 as a library cooperative, similar to OCLC’s beginnings. In the mid-2000s, they started shifting to a strategic focus on semantic web/linked data. In 2011, they actually sold off their library management division to focus primarily on semantic web technology. But quickly thereafter, in 2012, they announced “that investment in the semantic web and data marketplace areas would cease. All efforts are now concentrated on the education business.”  They are now in the business of producing an “enterprise teaching and learning platform” (compare to Blackboard, if I understand correctly), and apparently fairly successful at it — but the semantic web focus didn’t pan out. (Wikipedia, Talis Group)

In 2009, The New York Times, to much excitement, announced a project to expose their internal subject vocabulary as linked data. While the data is still up, it looks to me like it was abandoned in 2010; there has been no further discussion or expansion of the service, and the data looks not to have been updated.  Subject terms have a “latest use” field which seems to be stuck in May or June 2010 for every term I looked at (see Obama, Barack for instance), and no terms seem to be available for subjects that have become newsworthy since 2010 (no Carson, Ben, for instance).

In the semantic web/linked data heyday, a couple of attempts to create large linked data databases were announced and generated a lot of interest. Freebase was started in 2007, acquired by Google in 2010… and shut down in 2014. DBpedia began much earlier and still exists… but it doesn’t generate the excitement or buzz that it used to. The newer Wikidata (2012) still exists, and is considered a successor to Freebase by some.  It is generally acknowledged that none of these projects have lived up to initial hopes with regard to resulting in actual useful user-facing products or services; they remain experiments. A 2013 article, “There’s No Money in Linked Data“, suggests:

….[W]e started exploring the use of notable LD datasets such as DBpedia, Freebase, Geonames and others for a commercial application. However, it turns out that using these datasets in realistic settings is not always easy. Surprisingly, in many cases the underlying issues are not technical but legal barriers erected by the LD data publishers.

In January 2014, Paul Houle in “The trouble with DBpedia” argues that the problems are actually about data quality in DBpedia — specifically about vocabulary control, and how automatic creation of terms from use in Wikipedia leads to inconsistent vocabularies. Houle thinks there are in fact technical solutions — but he, too, begins from the acknowledgement that DBpedia has not lived up to its expected promise.  In a very lengthy slide deck from February 2015, “DBpedia Ontology and Mapping Problems”, vladimiralexiev has a perhaps different diagnosis of the problem, about ontology and vocabulary design, and he thinks he has solutions. Note that he too is coming from an experience of finding DBpedia not working out for his uses.

There’s disagreement about why these experiments haven’t panned out to be more than experiments, or what can be done, or what promise they (and linked data in general) still have — but there is pretty widespread agreement in the industry at large that they have not lived up to their initial expected promise or hype, and have as yet delivered few if any significant user-facing products based upon them.

It is interesting that many diagnoses of the problems there are about the challenges of vocabulary control and developing shared vocabularies, the challenges of producing/extracting sufficient data that is fit to these vocabularies, as well as business model issues — sorts of barriers we are well familiar with in the library industry. Linked data is not a magic bullet that solves these problems, they will remain for us as barriers and challenges to our metadata dreams.

Semantic web and linked data are still being talked about, and worked on in some commercial quarters, to be sure. I have no doubt that there are people and units at Google who are interested in linked data, who are doing research and experimentation in that area, who are hoping to find wider uses for linked data at Google, although I do not think it is true that linked data is currently fundamentally core to Google’s services or products or how they work. But linked data has not taken over the web, or become a widely accepted fact in the industry.  It is simply not true that “every major ecommerce site” has an architecture built on linked data.  It is certainly true that some commercial sector actors continue to experiment with and explore uses of linked data.

But in fact, I would say that libraries and the allied cultural heritage sector, along with limited involvement from governmental agencies (especially in the UK, although not to the extent some would like, given the 2010 cancellation of a program) and scholarly publishing (mainly I think of Nature Publishing), are the primary drivers of linked data research and implementation currently. We are some of the leaders in linked data research; we are not following “where everyone else is going” in the private sector.

There’s nothing necessarily wrong with libraries being the drivers in researching and implementing interesting and useful technology in the “information retrieval” domain — our industry was a leader in information retrieval technology 40-80 years ago, and it would be nice to be so again, sure!

But what we don’t have is “everyone else is doing it” as a motivation or justification for our campaign — not that it must be a good idea because the major players on the web are investing heavily in it (they aren’t), and not that we will be able to inter-operate with everyone else the way we want if we just transition all of our infrastructure to linked data because that’s where everyone else will be too (they won’t necessarily be, and everyone using linked data isn’t alone sufficient for inter-operability anyway; there needs to be coordination on vocabularies as well, just to start).

My Experiences in Data and Service Interoperability Challenges

For the past 7+ years, my primary work has involved integrating services and data from disparate systems, vendors, and sources, in the library environment. I have run into many challenges and barriers to my aspired integrations. They often have to do with difficulties in data interoperability/integration; or in the utility of our data, difficulties in getting what I actually need out of data.  These are the sorts of issues linked data is meant to be at home in.

However, seldom in my experience do I run into a problem where simply transitioning infrastructure to linked data would provide a solution or fundamental advancement. The barriers often have at their roots business models (entities that have data you want to interoperate with, but don’t want their data to be shared because keeping it close is of business value to them; or that simply have no business interest in investing in the technology needed to share data better);  or lack of common shared domain models (vocabulary control); or lack of person power to create/record the ‘facts’ needed in machine-readable format.

Linked data would be neither necessary nor sufficient for solving most of the actual barriers I run into.  Simply transitioning to a linked data-based infrastructure without dealing with the business or domain model issues would not help at all; and linked data is not needed to solve the business or domain model issues, and is of unclear aid in addressing them. A major linked data campaign may not be the most efficient, cost effective, or quickest way to solve those problems.

Here are some examples.

What Serial Holdings Do We Have?

In our link resolver, powered by Umlaut, a request might come in for a particular journal article, say the made up article “Doing Things in Libraries”, by Melville Dewey, on page 22 of Volume 50 Issue 2 (1912) of the Journal of Doing Things.

I would really like my software to tell the user if we have this specific article in a bound print volume of the Journal of Doing Things, exactly which of our locations that bound volume is at, and whether it’s currently checked out (for the limited collections, such as off-site storage, from which we allow bound journal checkout).

My software can’t answer this question, because our records are insufficient. Why? Not all of our bound volumes are recorded at all, because when we transitioned to a new ILS over a decade ago, bound volume item records somehow didn’t make it. Even for bound volumes we have — or for summary of holdings information on bib/copy records — the holdings information (what volumes/issues are contained) is entered in one big string by human catalogers. This results in output that is understandable to a human reading it (at least one who can figure out what “v.251(1984:Jan./June)-v.255:no.8(1986)” means). But while the information is theoretically input according to cataloging standards — changes in practice over the years, varying practice between libraries, human variation and error, lack of validation from the ILS to enforce the standards, and lack of clear guidance from standards in some areas mean that the information is not recorded in a way that software can clearly and unambiguously understand.
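To illustrate why that free-text holdings string defeats software, here’s a deliberately toy Python sketch of the kind of brittle parsing it forces; it handles one common pattern and silently gives up on the many real-world variations, which is exactly the problem.

import re

HOLDINGS_RANGE = re.compile(
    r"v\.(?P<start_vol>\d+).*?-\s*v\.(?P<end_vol>\d+)(:no\.(?P<end_issue>\d+))?"
)

def covers_volume(holdings_string, volume):
    """Guess whether a summary holdings string covers the requested volume."""
    m = HOLDINGS_RANGE.search(holdings_string)
    if not m:
        return None   # can't tell; a human will have to read the string
    return int(m.group("start_vol")) <= volume <= int(m.group("end_vol"))

print(covers_volume("v.251(1984:Jan./June)-v.255:no.8(1986)", 253))   # True
print(covers_volume("Vol. 50 (1912)-", 50))                           # None: pattern not recognized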

This is a problem to varying degrees at other libraries too, including for digitized copies, presumably for similar reasons.  In addition to my own library, I’d like my software to be able to figure out if, say, HathiTrust has a digitized copy of this exact article (a digitized copy of that volume and issue of that journal).  Or if nearby libraries in WorldCat have a physical bound journal copy, if we don’t here.  I can’t really reliably do that either.

We theoretically have a shared data format and domain model for serial holdings, MARC Format for Holdings Data (MFHD). A problem is that not all ILSes actually implement MFHD, but more than that, MFHD was designed in a world of printed catalog cards, and doesn’t actually specify the data in the right way to be machine actionable, to answer the questions we want answered. MFHD also allows for a lot of variability in how holdings are recorded, with some patterns simply not recording sufficient information.

In 2007 (!) I advocated more attention to ONIX for Serials Coverage as a domain model, because it does specify the recording of holdings data in a way that could actually serve the purposes I need. That certainly hasn’t happened; I’m not sure there’s been much adoption of the standard at all.  It probably wouldn’t be that hard to convert ONIX for Serials Coverage to a linked data vocabulary; that would be fine, if not necessarily advancing its power any. It would be powerful, if it were used, because it captures the data actually needed for the services we need in a way software can use, whether or not it’s represented as linked data.  Actually implementing ONIX for Serials Coverage — with or without linked data — in more systems would have been a huge aid to me. Hasn’t happened.

Likewise, we could probably, without too much trouble, create a “linked data” translated version of MFHD. This would solve nothing, neither the problems with MFHD’s expressiveness nor adoption. Neither would having an ILS whose vendor advertises it as “linked data compatible” or whatever make MFHD work any better. The problems that keep me from being able to do what I want have to do with domain modeling, with adoption of common models throughout the ecosystem, and with human labor to record data.  They are not problems the right abstract data model can fix; they are not fundamentally problems of the mechanics of sharing data, but of the common recording of data in common formats with sufficient utility.

Lack of OCLC number or other identifiers in records

Even in a pre-linked data world, we have a bunch of already existing useful identifiers, which serve to, well, link our data.  OCLC numbers as identifiers in the library world are prominent for their widespread adoption and (consequent) usefulness.

If several different library catalogs all use OCLC numbers on all their records, we can do a bunch of useful things, because we can easily know when a record in one catalog represents the same thing as a record in another. We can do collection overlap analysis. We can link from one catalog to another — oh, it’s checked out here, but this other library we have a reciprocal borrowing relationship with has a copy. We can easily create union catalogs that merge holdings from multiple libraries onto de-duplicated bibs. We can even “merge records” from different libraries — maybe a bib from one library has 505 contents but the bib from another library doesn’t; the one that doesn’t can borrow the data and know which bib it applies to. (Unless it’s licensed data they don’t have the right to share, a very real problem, which is not a technical one, and which linked data can’t solve either.)
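A minimal sketch of how easy this becomes once the shared identifier is actually present; all records and numbers below are made up, and the partner URL is hypothetical.

our_catalog = {
    "10000001": {"title": "Doing Things in Libraries"},
    "10000002": {"title": "Journal of Doing Things"},
}
partner_catalog = {
    "10000002": {"title": "Journal of Doing Things", "status": "available"},
}

# Collection overlap analysis: just intersect the OCLC numbers.
overlap = our_catalog.keys() & partner_catalog.keys()
print(len(overlap), "titles held by both libraries")

# Linking from one catalog to another: construct a link per shared number.
for oclc_num in overlap:
    print("Also held nearby:", "https://partner.example.edu/catalog/oclc/" + oclc_num)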

We can do all of these things today, even without linked data. Except I can’t, because in my local catalog a great many (I think a majority) of records lack OCLC numbers.

Why?  Many of them are legacy records from decades ago, before OCLC was the last library cooperative standing, from before we cared.  Not all of the records missing OCLC numbers are legacy, though. Many of them are contemporary records supplied by vendors (book jobbers for print, or e-book vendors), which come to us without OCLC numbers. (Why do we get records from there instead of OCLC? Convenience? Price?  No easy way to figure out how to bulk download all records for a given purchased ebook package from OCLC? Why don’t the vendors cooperate with OCLC enough to have OCLC numbers on their records? I’m not sure. Linked data solves none of these issues.)

Even better, I’d love to be able to figure out if the book represented by a record in my catalog exists in Google Books, with limited excerpts and searchability or even downloadable fulltext. Google Books actually has a pretty good API, and if Google Books data had OCLC numbers in it, I could easily do this. But even though Google Books got a lot of its data from OCLC WorldCat, Google Books data only rarely includes OCLC numbers, and does so in entirely undocumented ways.
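For what it’s worth, the lookup itself would be easy if the numbers were there. The Google Books API does accept an “oclc:” keyword in its search query; a hedged sketch (the OCLC number below is made up):

import json
import urllib.parse
import urllib.request

def google_books_by_oclc(oclc_number):
    query = urllib.parse.urlencode({"q": "oclc:" + str(oclc_number)})
    url = "https://www.googleapis.com/books/v1/volumes?" + query
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    return data.get("items", [])   # empty if Google Books has no match

for item in google_books_by_oclc("10000001"):   # made-up OCLC number
    print(item["volumeInfo"].get("title"),
          item["accessInfo"].get("viewability"))   # eg PARTIAL, ALL_PAGES, NO_PAGES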

Lack of OCLC numbers in data is a problem very much about linking data, but it’s not a problem linked data can solve. We have the technology now; the barriers are about human labor power, business models, priorities, costs.  Whether the OCLC numbers that are there sit in a MARC record in field 035, or are expressed as a URI (say, `http://www.worldcat.org/oclc/12345`) and included in linked data, is entirely irrelevant to me; my barriers are about the lack of OCLC numbers in the data. I could deal with them in just about any format at all, and linked data formats won’t help appreciably, but I can’t deal with the data being absent.

And in fact, if you convert your catalog to “linked data” but still lack OCLC numbers — you’re still going to have to solve that problem to do anything useful as far as “linking data”.  The problem isn’t about whether the data is “linked data”, it’s about whether the data has useful identifiers that can be used to actually link to other data sets.

Data Staleness/Correctness

As you might guess from the fact that so many records in our local catalog don’t have OCLC numbers — most of the records in our local catalog also haven’t been updated since they were added years, decades ago. They might have typos that have since been corrected in WorldCat. They might represent ages-ago cataloging practices (now inconsistent with present data) that have since been updated in WorldCat.  The WorldCat records might have been expanded to have more useful data (better subjects, updated controlled author names, useful 5xx notes).

Our catalog doesn’t get these changes, because we don’t generally update our records from WorldCat, even for the records that do have OCLC numbers.  (Also, naturally, not all of our holdings are actually listed with WorldCat, although this isn’t exactly the same set as those that lack OCLC numbers in our local catalog.) We could be doing that. Some libraries do, some libraries don’t. Why not, for the libraries that don’t?  Some combination of cost (to vendors), local human labor, legacy workflows difficult to change, priorities, lack of support from our ILS software for automating this in an easy way, not wanting to overwrite legacy locally created data specific to the local community, maybe some other things.

Getting our local data to update when someone else has improved it is again the kind of problem linked data is targeted at, but linked data won’t necessarily solve it; the biggest barriers are not about data format.  After all, some libraries sync their records to updated WorldCat copy now; it’s possible with the technology we have now, for some. It’s not fundamentally a problem of mechanics with our data formats.

I wish our ILS software was better architected to support “sync with WorldCat” workflow with as little human intervention as possible. It doesn’t take linked data to do this — some are doing it already, but our vendor hasn’t chosen to prioritize it.  And just because software “supports linked data” doesn’t guarantee it will do this. I’d want our vendors focusing on this actual problem (whether solved with or without linked data), not the abstract theoretical goal of “linked data”.

Difficulty of getting format/form info from our data, representing what users care about

One of the things my patrons care most about, when running across a record in the catalog for say, “Pride and Prejudice”, is format/genre issues.

Is a given record the book, or a film? A film on VHS, or DVD (you better believe that matters a lot to a patron!)? Or streaming online video? Or an ebook? Or some weird copy we have on microfiche? Or a script for a theatrical version?  Or the recording of a theatrical performance? On CD, or LP, or an old cassette?

And I similarly want to answer this question when interrogating data at remote sources, say, WorldCat, or a neighboring library’s catalog.

It is actually astonishingly difficult to get this information out of MARC — the form/format/genre of a given record, in terms that match our users’ tasks or desires.  Why? Well, because the actual world we are modeling is complicated and has been constantly changing over the decades, and it’s unclear how to formally specify this stuff, especially when it’s changing all the time (oh, it’s a Blu-ray, which is kind of a DVD, but actually different).  (I can easily tell you the record you’re looking at represents something that is 4.75″ wide though, in case you cared about that…)
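To give a flavor of what answering “DVD, Blu-ray, or VHS?” from MARC actually involves, here’s a hedged and deliberately simplified sketch, poking at positional codes in the leader and 007 field; real records frequently lack or mangle these bytes, which is much of the problem.

def video_carrier(leader, fields_007):
    """Rough guess at the video carrier from MARC fixed fields; None if not a video."""
    if len(leader) < 7 or leader[6] != "g":    # leader/06 'g' = projected medium
        return None
    for f007 in fields_007:
        if f007[:1] == "v" and len(f007) > 4:  # 007/00 'v' = videorecording
            # 007/04 is the videorecording format code (simplified mapping here)
            return {"b": "VHS", "v": "DVD", "s": "Blu-ray"}.get(f007[4], "other video")
    return "video, carrier unknown"

# Made-up leader and 007 values, for illustration only:
print(video_carrier("02359cgm a2200481 i 4500", ["vd cvaizs"]))   # DVD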

It’s a difficult domain modeling problem. RDA actually tried to address this with better, more formal, theoretically/intellectually consistent modeling of what form/genre/format is all about. But even in the minority of records we have with RDA tags for this, it doesn’t quite work; I still can’t easily get my software to figure out if the item represented by a record is a CD or a DVD or a Blu-ray or what.

Well, it’s a hard problem of domain modeling, harder than it might seem at first glance. A problem that negatively impacts a wide swath of library users across library types. Representing data as linked data won’t solve it; it’s an issue of vocabulary control. Is anyone trying to solve it?

Workset Grouping and Relationships

Related to form/format/genre issues, but a distinct one, is all the different versions of a given work in my catalog.

There might be dozens of Pride and Prejudices. For the ones that are books, do they actually all have the same text in them?  I don’t think Austen ever revised it in a new edition, so probably they all do even if published a hundred years apart — but that’s very much not true of textbooks, or even general contemporary non-fiction, which often exists in several editions with different text. Still, different editions of Pride and Prejudice might have different forewords or prefaces or notes, which might matter in some contexts.  Or maybe different pagination, which matters for citation lookup.

And then there’s the movies, the audiobooks, the musical (?).  Is the audiobook the exact text of the standard Pride and Prejudice just read aloud? Or an abridged version? Or an entirely new script with the same story?  Are two videos the exact same movie one on VHS and one on DVD, or two entirely different dramatizations with different scripts and actors? Or a director’s cut?

These are the kinds of things our patrons care about, to find and identify an item that will meet their needs. But in search results, all I can do is give them a list of dozens of Pride and Prejudices, and let them try to figure it out — or maybe at least segment by video vs print vs audio.  Maybe we’re not talking search results, maybe my software knows someone wants a particular edition (say, based on an input citation) and wants to tell the user if we have it, but good luck to my software in trying to figure out if we have that exact edition (or if someone else does, in worldcat or a neighboring library, or Amazon or Google Books).

This is a really hard problem too. And again it’s a problem of domain modeling, and equally of human labor in recording information (we don’t really know if two editions have the exact same text and pagination; someone has to figure it out and record it).  Switching to the abstract data model of linked data doesn’t really address the barriers.

The library world made a really valiant effort at creating a domain model to capture these aspects of edition relationships that our users care about: FRBR.  It’s seen limited adoption or influence in the 15+ years since it was released, which means it’s also seen limited (if any) additional development or fine-tuning, which anything trying to solve this difficult domain modeling problem will probably need (see RDA’s efforts at form/format/genre!).  Linked data won’t solve this problem without good domain modeling, but ironically it’s some of the strongest advocates for “linked data” that I’ve seen arguing most strongly against doing anything more with adoption or development of FRBR; as far as I am aware, the needed effort to develop common domain modeling is not being done in the library linked data efforts. Instead, the belief seems to be that if you just have linked data and let everyone describe things however they want, somehow it will all come together into something useful that answers the questions our patrons need answered, with no need for any common domain model vocabulary.  I don’t believe existing industry experience with linked data, or software engineers’ experience with data modeling in general, supports this fantasy.

Multiple sources of holdings/licensing information

For the packages of electronic content we license/purchase (ebooks, serials), we have so many “systems of record”.  The catalog’s got bib records for items from these packages, the ERM has licensing information, the link resolver has coverage and linking information, oh yeah and then they all need to be in EZProxy too, maybe a few more.

There’s no good way for software to figure out when a record from one system represents the same platform/package/license as in another system. Which means lots of manual work synchronizing things (EZProxy configuration, the SFX kb). And things my software can do only with difficulty or simply can’t do at all — like, when presenting URLs to users, figuring out whether a URL in the catalog really points to the same destination as a URL offered by SFX, even though the two URLs look entirely different.

So one solution would be “why don’t you buy all these systems from the same vendor, and then they’ll just work together”, which I don’t really like as a suggested solution, and at any rate as a suggestion is kind of antithetical to the aims of the “linked data” movement, amirite?

So the solution would obviously be common identifiers used in all these systems, for platforms, packages and licenses, so software can know that a bib record in the catalog that’s identified as coming from package X for ISSN Y is representing the same access route as an entry in the SFX KB also identified as package X, and hey maybe we can automatically fetch the vendor suggested EZProxy config listed under identifier X too to make sure it’s activated, etc.
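Sketching what that would buy us: if the catalog, the ERM, the link resolver knowledge base, and the EZProxy configuration all carried the same package identifier, keeping them in sync becomes a simple join. Everything below (identifiers, records, fields) is made up for illustration.

catalog_bibs = [
    {"issn": "1234-5678", "package_id": "PKG:doing-things-online"},
]
erm_licenses = {"PKG:doing-things-online": {"ill_allowed": True}}
sfx_targets = {"PKG:doing-things-online": {"coverage": "1990-present"}}
ezproxy_stanzas = {"PKG:doing-things-online": "Title Doing Things Online\nURL https://journals.example.org"}

for bib in catalog_bibs:
    pid = bib["package_id"]
    print(pid)
    print("  license allows ILL:", erm_licenses.get(pid, {}).get("ill_allowed", "unknown"))
    print("  coverage:", sfx_targets.get(pid, {}).get("coverage", "unknown"))
    print("  proxied:", pid in ezproxy_stanzas)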

Why isn’t this happening already? Lack of cooperation between vendors, lack of labor power to create and maintain common identifiers, lack of resources or competence from our vendors (who can’t always even give us a reliable list in any format at all of what titles with what coverage dates are included in our license) or from our community at large (how well has DLF-ERMI worked out as far as actually being useful?).

In fact, if I imagined an ideal technical infrastructure for addressing this, linked data actually would be a really good fit here! But it could be solved without linked data too, and coming up with a really good linked data implementation won’t solve it, the problems are not mainly technical.  We primarily need common identifiers in use between systems, and the barriers to that happening are not that the systems are not using “linked data”.

Google Books won’t link out to me

Google Scholar links back to my systems using OpenURL links. This is great for getting a user who chooses to use Google Scholar for discovery back to me, to provide access through a licensed or owned copy of what they want. (There are problems with Google Scholar knowing what institution they belong to so it can link back to the right place, but let’s leave that aside for now; it’s still way better than not being there.)

I wish Google Books did the same thing. For that matter, I wish Amazon did the same thing. And lots of other people.

They don’t because they have no interest in doing so. Linked data won’t help, even though this is definitely an issue of, well, linking data.

OpenURL, a standard frozen in time

Oh yeah, so let’s talk about OpenURL. It’s been phenomenally successful in terms of adoption in the library industry. And it works. It’s better that it exists than if it didn’t. It does help link disparate systems from different vendors.

The main problem is that it’s basically abandoned. I don’t know if there’s technically a maintenance group, but if there is, they aren’t doing much to improve OpenURL for scholarly citation linking, the use case it’s been successful in.

For instance, I wish there was a way to identify a citation as referring to a video or audio piece in OpenURL, but there isn’t.

Now, theoretically the “open for extension” aspect of linked data seems relevant here. If things were linked data and you needed a new data element or value, you could just add one. But really, there’s nothing stopping people from doing that with OpenURL now. Even if technically not allowed, you can just decide to say `&genre=video` in your OpenURL, and it probably won’t disturb anything (or you can figure out a way to do that not using the existing `genre` key that really won’t disturb anything).
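For illustration only, a hypothetical OpenURL with a non-standard genre value might be built something like this. The resolver base URL and citation values are made up, and “video” is not a registered genre value, which is exactly the point:

require 'cgi'

# Hypothetical KEV (key-encoded-value) OpenURL parameters with a made-up genre value.
params = {
  "url_ver"     => "Z39.88-2004",
  "rft_val_fmt" => "info:ofi/fmt:kev:mtx:journal",
  "rft.title"   => "Some Documentary",
  "rft.genre"   => "video"
}

query   = params.map { |k, v| "#{CGI.escape(k)}=#{CGI.escape(v)}" }.join("&")
openurl = "https://resolver.example.edu/openurl?" + query
# => "https://resolver.example.edu/openurl?url_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&..."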

The problem is that nothing will recognize it and do anything useful with it, and nobody is generating OpenURLs like that either.  It’s not really an ‘open for extension’ problem, it’s a problem of getting the ecosystem to do it, of vocabulary consensus and implementation. That’s not a problem that linked data solves.

Linking from the open web to library copies

One of the biggest challenges always in the background of my work is how we get people from the “open web” to our library-owned and licensed resources and library-provided services. (Umlaut is engaged in this “space”.)

This is something I’ve written about before (more than once), so I won’t say too much more about it here.

How could linked data play a role in solving this problem? To be sure, if every web page everywhere included information fully specifying the nature of the scholarly works it was displaying, citing, or talking about — that would make it a lot easier to find a way to take this information and transfer the user to our systems to look up availability for the item cited.  If every web page exposed well-specified machine-accessible data in a way that wasn’t linked-data-based, that would be fine too. But something like schema.org does look like the best bet here — though it’s not a bet I’d wager anything of significance on.

It would not be necessary to rebuild our infrastructure to be “based on linked data” in order to take advantage of structured information on external web pages, whether or not that structured information is “linked data”.  (There are a whole bunch of other non-trivial challenges and barriers, but replacing our ILS/OPAC isn’t really a necessary one, and neither is replacing our internal data format.) And we ourselves have limited influence over what “every web page everywhere” does.

Okay, so why are people excited about Linked Data?

If it’s not clear it will solve our problems, why is there so much effort being put into it?  I’m not sure, but here are some things I’ve observed or heard.

Most people, and especially library decision-makers, agree at this point that libraries have to change and adapt, in some major ways. But they don’t really know what this means, how to do it, or what directions to go in.  Once there’s a critical mass of buzz about “linked data”, it becomes the easy answer — do what everyone else is doing, including prestigious institutions, and if it ends up wrong, at least nobody can blame you for doing what everyone else agreed should be done.   “No one ever got fired for buying IBM.”

So linked data has got good marketing and a critical mass, in an environment where decision-makers want to do something but don’t know what to do. And I think that’s huge, but certainly that’s not everything, there are true believers who created that message in the first place, and unlike IBM they aren’t necessarily trying to get your dollars, they really do believe. (Although there are linked data consultants in the library world who make money by convincing you to go all-in on linked data…)

I think we all do know (and I agree) that we need our data and services to inter-operate better — within the library world, and crossing boundaries to the larger IT and internet industry and world. And linked data seems to hold the promise of making that happen, after all those are the goals of linked data.  But as I’ve described above, I’m worried it’s a promise long on fantasy and short on specifics.  In my experience, the true barriers to this are about good domain modeling,  about the human labor to encode data, and about getting people we want to cooperate with us to use the same domain models. 

I think those experienced with library metadata realize that good domain modelling (e.g. vocabulary control), and getting different actors to use the same standard formats, is a challenge. I think they believe that linked data will somehow solve this challenge by being “open to extension” — I think this is a false promise, as I’ve tried to argue above. Software and sources need to agree on vocabulary in linked data too, to be able to use each other’s data. Or use the analog of a ‘crosswalk’, which we can already do, and which does not become appreciably easier with linked data — it becomes somewhat easier mechanically to apply a “cross-walk”, but the hard part in my experience is not mechanical application, it’s the intellectual labor to develop the “cross-walk” rules in the first place and maintain them as vocabularies change.

I think library decision-makers know that we “need our stuff to be in Google”, and have been told “linked data” is the way to do that, without having a clear picture of what “in Google” means. As I’ve said, I think Google’s investment in or commitment to linked data has been exaggerated, but yes, schema.org markup can be used by Google for rich snippets or Knowledge Graph fact boxes. And yes, I actually agree, our library web pages should use schema.org markup to expose their information in machine-readable form.  Right now this will have more powerful results for library information web pages (rich snippets) than it will for catalog pages. But the good thing is it’s not that hard to do for catalog bib pages either, and does not require rebuilding our entire infrastructure; our MARC data as it is can fairly easily be “cross-walked” to schema.org, as Dan Scott has usefully shown with VuFind, Evergreen, and Koha.  Yes, all our “discovery” web pages should do this. Dan Scott reports that it hasn’t had a huge effect, but says it would if only everybody did it:

We don’t see it happening with libraries running Evergreen, Koha, and VuFind yet, realistically because the open source library systems don’t have enough penetration to make it worth a search engine’s effort to add that to their set of possible sources. However, if we as an industry make a concerted effort to implement this as a standard part of crawlable catalogue or discovery record detail pages, then it wouldn’t surprise me in the least to see such suggestions start to appear.

Maybe. I would not invest in an enormous resource-intensive campaign to rework our entire infrastructure based on what we hope Google (or similar actors) will do if we pull it off right — I wouldn’t count on it.  But fortunately it doesn’t require that to include schema.org markup on our pages. It can fairly easily be done now with our data in MARC, and should indeed be done now; whatever barriers are keeping us from doing it more with our existing infrastructure, solving them is actually a way easier problem than rebuilding our entire infrastructure.
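To give a flavor of what that markup might look like on a catalog bib page, here’s a rough sketch of a schema.org description built in Ruby and suitable for embedding as JSON-LD. The specific fields, values, and institution name are invented for illustration; the actual MARC-to-schema.org mapping is the part Dan Scott’s work addresses:

require 'json'

# An invented, minimal schema.org description of a bib record, as a Ruby hash.
bib_markup = {
  "@context" => "http://schema.org",
  "@type"    => "Book",
  "name"     => "An Example Title",
  "author"   => { "@type" => "Person", "name" => "An Example Author" },
  "isbn"     => "9780000000000",
  "offers"   => {
    "@type"        => "Offer",
    "availability" => "http://schema.org/InStock",
    "seller"       => { "@type" => "Library", "name" => "Example University Library" }
  }
}

# Embed in the page, e.g. inside a <script type="application/ld+json"> tag.
puts JSON.pretty_generate(bib_markup)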

I think library metadataticians also realize that limited human labor resources to record data are a problem. I think the idea is that with linked data, we can get other people to create our metadata for us, and use it.  It’s a nice vision. The barriers are that in fact not “everybody” is using linked data, let alone willing to share it; the existing business model issues that make them reluctant to share their data don’t go away with linked data; they may have no business interest in creating the data we want anyway (or may be hoping “someone else” does it too); and that common or compatible vocabularies are still needed to integrate data in this way. The hard parts are human labor and promulgating shared vocabulary, not the mechanics of combining data.

I think experienced librarians also realize that business model issues are a barrier to integration and sharing of data presently. Perhaps they think that the Linked Open Data campaign will be enough to pressure our vendors, suppliers, partners, and cooperatives to share their data, because they have to be “Linked Open Data” and we’re going to put the pressure on. Maybe they’re right! I hope so.

One linked data advocate told me, okay, maybe linked data is neither necessary nor sufficient to solve our real world problems. But we do have to come up with better and more inter-operable domain models for our data. And as long as we’re doing that, and we have to recreate all this stuff, we might as well do it based on linked data — it’s a good abstract data model, and it’s the one “everyone else is using” (which I don’t agree is happening, but it might be the one others outside the industry end up using — if they end up caring about data interoperability at all — and there are no better candidates, I agree, so okay).

Maybe. But I worry that rather than “might as well use linked data as long as we’re doing it”, linked data becomes a distraction and a resource theft (opportunity cost?) from what we really need to do.  We need to figure out what our patrons are up to and how we can serve them; and when it comes to data, we need to figure out what kinds of data we need to do that, to come up with the domain models that capture what we need, to get enough people (inside or outside the library world) to use compatible data models, and to get all that data recorded (by whom, and paid for by whom).

Sure that all can be done with linked data, and maybe there are even benefits to doing so. But in the focus on linked data, I worry we end up focusing on how most elegantly to fit our data into “linked data” (which can certainly be an interesting intellectual challenge, a fun game), rather than on how to model it to be useful for the uses we need (and figuring out what those are).  I think it’s unjustified to think the rest will take care of itself if it’s just good linked data. The rest is actually the hard part. And I think it’s dangerous to undertake this endeavor as “throw everything else out and start over”, instead of looking for incremental improvements.

The linked data advocate I was talking to also suggested (or maybe it was my own suggestion in conversation, as I tried to look on the bright side): Okay, we know we need to “fix” all sorts of things about our data and inter-operability. We could be doing a lot of that stuff now, without linked data, but we’re not, our vendors aren’t, our consortiums and collaboratives aren’t.  Your catalog does not have OCLC numbers in enough of its records, or sync its data to OCLC, even though it theoretically could, even without linked data.  It hasn’t been a priority. But the very successful marketing campaign of “linked data” will finally get people to pay attention to this stuff and do what they should have been doing.

Maybe. I hope so. It could definitely happen. But it won’t happen because linked data is a magic bullet, and it won’t happen without lots of hard work that isn’t about the fun intellectual game of creating domain models in linked data.

What should you do?

Okay, so maybe “linked data” is an unstoppable juggernaut in the library world, or at your library. (It certainly is not in the wider IT/web world, despite what some would have you believe).  I certainly don’t think this tl;dr essay will change that.

And maybe that will work out for the best after all. I am not fundamentally opposed to semantic web/linked data/RDF. It’s an interesting technology, and although I’m not as in love with it as some, I recognize that it surely should play some part in our research and investigation into metadata evolution — even if we’re not sure how successful it will be in the long term.

Maybe it’ll all work out. But for you reading this who’s somehow made it this far, here’s what I think you can do to maximize those chances:

Be skeptical. Sure, of me too. If this essay gets any attention, I’m sure there will be plenty of arguments provided for how I’m missing the point or confused.  Don’t simply accept claims from promoters or haters, even if everyone else seems to be accepting that — claims that “everyone is doing it”, or that linked data will solve all our problems.  Work to understand what’s really going on so you can evaluate benefits and potentials yourself, and understand what it would take to get there. To that end…

Educate yourself about the technology of metadata. About linked data, sure. And about entity-relational modeling and other forms of data modeling, about relational databases, about XML, about what “everyone else” is really doing. Learn a little programming too, not to become a programmer, but to understand better how software and computation work, because all of our work in libraries is so intimately connected to that. Educating yourself on these things is the only way to evaluate claims made by various boosters or haters.

Treat the library as an IT organization. I think libraries already are IT organizations (at least academic libraries) — every single service we provide to our users now has a fundamental core IT component, and most of our services are actually mediated by software between us and our users. But libraries aren’t run in a way that recognizes them as IT organizations. This would involve staffing and other resource allocation. It would involve having sufficient leadership and decision-makers who are competent to make IT decisions, or know how to get advice from those who are. It’s about how the library thinks of itself, at all levels, how decisions are made, and who is consulted when making them. That’s what will give our organizations the competence to make decisions like this, not just follow what everyone else seems to be doing.

Stay user centered. “Linked data” can’t be your goal. You are using linked data to accomplish something that adds value for your patrons. We must understand what our patrons are doing, and how to intervene to improve their lives. We must figure out what services and systems we need to do that. Some work to that end, even if incomplete and undeveloped, so long as it’s serious and engaged, comes before figuring out what data we need to create those services.  To the extent it’s about data, make sure your data modeling work and choices are about creating the data we need to serve our users, not just fitting into the linked data model.  Be careful of “dumbing down” your data to fit more easily into a linked data model at the cost of losing what we actually need in the data to provide the services we need to provide.

Yes, include schema.org markup on your web pages and catalog/discovery pages. Expose it to Google, or to anyone.  We don’t need to rework our entire infrastructure to do that; it can be done now, as Dan Scott has awesomely shown. As Google or anyone else significant recognizes more or different vocabularies, make use of them too by including them in your web pages, for sure. And, sure, make all your data (in any format, linked data or not) available on the open web, under an open license. If your vendor agreements prevent you from doing that, complain. Ask everyone else with useful data to do so too. Absolutely.

Avoid “Does it support linked data” as an evaluative question. I think that’s just not the right question to be asking when evaluating adoption or purchase of software. To the extent the question has meaning at all (and it’s not always clear what it means), it is dangerous for the library organization if it takes primacy over the specifics of how it will allow us to provide better services or provide services better.

Of course, put identifiers in your data. I don’t care if it’s as a URI or not, but yeah, make sure every record has an OCLC number. Yeah, every bib should record the LCCN or other identifier of its related creators’ authority records, not just a heading.  This is “linked data” advice that I support without reservation; it is what our data needs with or without linked data.  Put identifiers everywhere. I don’t care if they are in the form of URLs.  Get your vendors to do this too. That your vendors want to give you bibs without OCLC numbers in them isn’t acceptable. Make them work with OCLC, make them see it’s in their business interests to do so, because the customers demand it.  If you can get the records from OCLC even if it costs more — it might be worth it. I don’t mean to be an OCLC booster exactly, but shared authority control is what we need (for linked data to live up to its promise, or for us to accomplish what we need without linked data), and OCLC is currently where it lives. Make OCLC share its data too, which it’s been doing already (in contrast to ~5 years ago) —  keep them going — they should make it as easy and cheap as possible for even “competitors” to put OCLC numbers, VIAF numbers, any identifiers in their data, regardless of whether OCLC thinks it threatens their own business model, because it’s what we need as a community and OCLC is a non-profit cooperative that represents us.

Who should you trust? Trust nobody, heh. But if you want my personal advice, pay attention to Diane Hillmann. Hillmann is one of the people working in and advocating for linked data that I respect the most, who I think has a clear vision of what it will or won’t or only might do, and how to tie work to actual service goals not just theoretical models.  Read what Hillmann writes, invite her to speak at your conferences, and if you need a consultant on your own linked data plans I think you could do a lot worse. If Hillmann had increased influence over our communal linked data efforts, I’d be a lot less worried about them.

Require linked data plans to produce iterative incremental value. I think the biggest threat of “linked data” is that it’s implemented as a campaign that won’t bear fruit until some fairly far distant point, and even then only if everything works out, and in ways many decision-makers don’t fully understand but just have a kind of faith in.  That’s a very risky way to undertake major resource-intensive changes.  Don’t accept an enormous investment whose value will only be shown in the distant future. As we’re “doing linked data”, figure out ways to get improvements that affect our users positively at each stage, incrementally and iteratively.  Plan your steps so each one bears fruit one at a time, not just at the end. (Which, incidentally, is good advice for any technology project, or maybe any project at all.) Because we need to start improving things for our users now to stay alive. And because that’s the only way to evaluate how well it’s going, and even more importantly to adjust course based on what we learn, as we go. And it’s how we get out of assuming linked data will be a magic bullet if only we can do enough of it, and develop the capacity to understand exactly how it can help us, can’t help us, and will help us only if we do certain other things too.  When people who have been working on linked data for literally years advocate for it, ask them to show you their successes, and ask for success in terms of actually improving our library services. Whether they have little to show or exciting successes to demonstrate, that’s information to guide you in decision-making, resource allocation, and further question-asking.

Posted in General | 13 Comments

Blacklight community survey: What would you like to see added to Blacklight?

Here’s the complete list of answers to “What would you like to see added to Blacklight?” from the Blacklight community survey I distributed last month. 

Some of these are features, some of these are more organizational. I simply paste them all here, with no evaluation on my own part as to desirability or feasibility.

  • Keep up the great work!
  • Less. Simplicity instead of more indirection and magic. While the easy things have stayed easy anything more has seemed to be getting harder and more complicated.

    Search inside indexing patterns and plugin.
    Better, updated, maintained analytics plugin.

  • Support for Elasticsearch
  • Blacklight-maps seems fantastic if you don’t need the geoblacklight features.
  • (1) I’ve had lots of requests for an “OR” option on the main facet limits–like SearchWorks has. The advanced search has this feature. We have a facet for ‘Record Type’ (e.g. publication, object, oral history, film, photograph, etc) and we have users who would like to search across e.g. film or photograph. That could be implemented with a checkbox. Unfortunately it’s a little above my Rails chops & time at this point to implement.
    (2) We do geographic name expansion and language stemming. It would be sweet to be able to let users turn those features off. Jonathan Rochkind wrote an article awhile back on how to do that–again, I unfortunately lack Rails chops & time to implement that.
  • To reduce upgrade/compatibility churn, I wonder if it might be helpful to avoid landing changes in master/release until they are fully baked. For major refactorings/ruby API changes, do all dev in master until the feature is done churning and everyone relevant is satisfied with it being complete. As opposed to right now it seems as if iterative development on new features sometimes happens in masters and even in releases, before a full picture of what the final API will look like exists. Eg SearchBuilder refactorings.
  • A more active and transparent Blacklight development process. We would be happy to contribute more, but it’s difficult to know a longer-term vision of the community.
  • Integration with Elasticsearch
Posted in General | Leave a comment

Blacklight Strengths, Weaknesses, Health, and Future

My Own Personal Opinion Analysis of Blacklight Strengths, Weaknesses, Health, and Future

My reflections on the Blacklight Community Survey results, and my own experiences with BL.

What people like about BL is its flexibility; what people don’t like is its complexity and backwards/forwards compatibility issues.

In developing any software, especially shared library/gem software, it is difficult to create a package which is on the one hand very flexible, extensible, and customizable, and on the other maintains a simple and consistent codebase, backwards compatibility with easy upgrades, and simple installation with a shallow learning curve for common use cases.

In my software engineering career, I see these tensions as one of the fundamental challenges in developing shared software. It’s not unique to Blacklight, but I think Blacklight is having some difficulties in weathering that challenge.

I think the diversity of Blacklight versions in use is a negative indicator for community health. People on old unsupported versions of BL (or Rails) can run into bugs which nobody can fix for them; and even if they put in work on debugging and fixing them themselves, it’s less likely to lead to a patch that can be of use to the larger BL community, since they’re working on an old version. It reduces the potential size of our collaborative development community. And it puts those running old versions of BL (or Rails) in a difficult spot eventually after much deferred upgrading, when they find themselves on unmaintained software with a very challenging upgrade path across many versions.

Also, if, when a new BL release is dropped, it’s not actually put into production by anyone (not even core committers?) for many months, that increases the chances that severe bugs are present but not yet found even in months-old releases (we have seen this happen), which can be a vicious circle that makes people even more reluctant to upgrade.

And we have some idea why BL applications aren’t being upgraded: even though only a bare minority of respondents reported going through a major BL upgrade, difficulty of upgrading is a major theme in the reported biggest challenges with Blacklight.  I know I personally have found that maintaining a BL app responsibly (which to me means keeping up with Rails and BL releases without too much lag) has had a much higher “total cost of ownership” than I expected or desire; you can maybe guess that part of my motivation in releasing this survey was to see if I was alone, and I see I am not.

I think these pain points are likely to get worse: many existing BL deployments may have been originally written for BL 5.x and not yet had to deal with them but will; and many people currently using a “release and forget and never upgrade” practice may come to realize this is untenable. (“Software is a growing organism”, Ranganathan’s fifth law. Wait, Ranganathan wasn’t talking about software?)

To be fair, Blacklight core developers have gotten much better at backwards compatibility — in BL 4.x and especially 5.x — in the sense that backwards-incompatible changes within a major BL version are, with much success, kept minimal to non-existent (in keeping with semver‘s requirements for release labelling).  This is a pretty major accomplishment.

But the backwards compatibility is not accomplished by minimizing code or architectural churn or change. Rather the changes are still pretty fast and furious, but the old behavior is left in and marked deprecated. Ironically, this has the effect of making the BL codebase even more complicated and hard to understand, with multiple duplicative or incompatible architectural elements co-existing and sometimes never fully disappearing. (More tensions between different software quality values, inherent challenges to any large software project.)  In BL 5.x, the focus on maintaining backwards compat was fierce — but we sometimes got deprecated behavior in one 5.x release, with suggested new behavior, where that suggested new behavior was sometimes itself deprecated in a future 5.x release in favor of yet newer behavior.  Backwards compatibility is strictly enforced, but the developer’s burden of keeping up with churn may not be as lightened as one would expect.

Don’t get me wrong, I think some of the 5.x changes are great designs. I like the new SearchBuilder architecture. But it was dropped in bits and pieces over multiple 5.x releases, without much documentation for much of it, making it hard to keep up with as a non-core developer not participating in writing it.  And the current implementation still has, to my mind, some inconsistencies or non-optimal choices (like `!` on the end of a method, or the lack thereof, being inconsistently used to signal whether a method mutates the receiver or returns a dup) — which, now that they are in a release, need to be maintained for backwards compatibility (or, if changed in a major version drop, still cause backwards-compat challenges for existing app maintainers; just labeling it a major version doesn’t reduce these challenges, only reducing the velocity of such changes does).
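For readers less steeped in Ruby, the convention in question is the usual one sketched below. This is a generic illustration I’ve made up, not Blacklight’s actual API:

class SearchParams
  attr_reader :params

  def initialize(params = {})
    @params = params
  end

  # Conventionally bang-less: returns a new, modified copy; receiver untouched.
  def merge_facet(field, value)
    self.class.new(params.merge(field => value))
  end

  # Conventionally bang-ed: mutates the receiver in place.
  def merge_facet!(field, value)
    params[field] = value
    self
  end
end

When the two spellings don’t reliably track that distinction, callers have to read the implementation to know which behavior they’re getting.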

In my own personal opinion, Blacklight’s biggest weakness and biggest challenge for continued and increased success is figuring out ways to maintain the flexibility, while significantly reducing code complexity, architectural complexity, code churn, and backwards incompatibility/deprecation velocity.

What can be done (in my own personal opinion)?

These challenges are not unique to Blacklight; in my observation/opinion/experience they are tensions and challenges with nearly any shared non-trivial codebase.  But Blacklight can, perhaps, choose to take a different tack in approaching them, focus on different priorities in code evolution, and think of practices to adopt to strike a better balance.

The first step is consensus on the nature of the problem (which we may not have, this is just my own opinion on the nature of the problem; I’m hoping this survey can help people think about BL’s strengths and weaknesses and build consensus).

In my own brainstorming about possible approaches, I come up with a few, tentative, brainstorm-quality possibilities:

  • Require documentation for major architectural components. We’ve built a culture in BL (and much of the open source world) that a feature isn’t done and ready to merge until it’s covered by tests; I think we should have a similar culture around documentation, where a feature isn’t done and ready to merge until documented, which we lack in BL, and in much of the open source world. But this can add a challenge: in a codebase with high churn, you now have to make sure to update the docs lest they become out of date and inaccurate too (something BL also hasn’t always kept up with).
  • Accept less refactoring of internal architecture to make the code cleaner and more elegant.  Sometimes you’ve just got to stick with what you’ve got, for longer, even if the change would improve code architecture, as many of them have.  There’s an irony here. Often the motivation for an internal architectural refactoring is to better support something one wants to do. You can do that in the current codebase, but in a hacky not really supported way, that’s likely to break in future BL versions. You want to introduce the architecture that will let you do what you want in a ‘safer’ way, for forwards compatibility. But the irony is that the constant refactorings to introduce these better architectures actually have a net reduction on forwards compatibility, as they are always breaking some existing code.
  • Be cautious of the desire to expel functionality to external plugins. External BL plugins generally receive less attention, they are likely to not be up to date with current BL, and it has been difficult to figure out what version of an external plugin actually is compatible with what version of BL. If you’re always on the bleeding edge, you don’t notice, but if you have an older version of BL and are maybe trying to upgrade to a new one, figuring out plugin compatibility in BL can be a major nightmare. Expelling code to plugins makes core BL easier to maintain, but at the cost of making the plugins harder to maintain, less likely to receive maintenance, and harder to use for BL installers.  If the plugin code is an edge case not used by many people, that may make sense. But I continue to worry about the expulsion of MARC support to a plugin. MARC is not used by as many BL implementers as it used to be, but “library catalog/discovery” is still the BL use for a third of survey respondents, and MARC is still used by nearly half.
  • Do major refactorings in segregated branches, only merging into master (and including in releases) when they are “fully baked”. What does fully baked mean? I guess maybe it means understanding the use cases that will need to be supported; having a ‘story’ about how to use the architecture for those use cases; and having a few people actually look over the code, try it out, and give feedback. In BL 5.x, there were a couple major architectural refactorings, but they were released in dribs and drabs over multiple BL releases, sometimes reversing themselves, sometimes after realizing there were important use cases which couldn’t be supported. This adds TCO/maintenance burden to BL implementers, and adds backwards-compat-maintaining burden to BL core developers when they realize something already released should have been done differently. If I understand right, the primary motivation for some of the major 5.x-6.0 architectural refactorings was to support ElasticSearch as an alternate back-end. But, while these refactorings have already been released, there has actually been no demonstration of using ElasticSearch as a back-end; it’s not done yet. Without such demonstrations trying and testing the architecture, how confident can we be that these refactorings will actually be sufficient or the right direction for the goal? Yet more redesigns may be needed before we get there.

When I brought up this last point with a core BL developer, he said that it was unrealistic to expect this could be possible, because of the limited developer resources available to BL.

It’s true that there are very few developers making non-trivial commits to BL, and that BL does function in an environment of limited developer resources, which is a challenge. However, in fact, studies have shown that most successful open source projects have the vast majority of commits contributed by only 1-3 developers. (Darnit, I can’t find the cite now.)

I wonder if, beyond developer resources as a ‘quantity’, the nature of the developers and their external constraints matters. Are many of the core BL developers working for vendors, where most hours need to be billable to clients on particular projects which need to be completed as quickly as practical?  Or working for universities that have a similar ‘entrepreneurial’ approach, where most developer hours are spent on ‘sprints’ for particular features on particular projects?

Is anyone given time to steward the overall direction and architectural soundness of Blacklight?  If nobody really has such time, it’s actually a significant accomplishment that BL’s architecture has continued to evolve and improve regardless. But it’s not a surprise that it’s done so in a fairly chaotic and high-churn way, where people need to get just enough to accomplish the project in front of them into BL, and into a release asap.

I suspect that BL may, at this point in its development, need a bit more formality and transparency in who makes major decisions. (E.g., who decided that supporting ElasticSearch was a top priority, and how?) (And I say this as someone who, five years ago at the beginning of BL, didn’t think we needed any more formality there than a bunch of involved developers reaching consensus on a weekly phone call (which I don’t think happens anymore?). But I’ve learned from experience, and BL is at a different point in its life cycle now.)

In software projects where I do have some say (I haven’t made major commits to BL in at least 2-3 years), where they are projects expected to have long lives, I’ve come to push for a sort of “slow programming” (compare to ‘slow food’ etc.) approach. Consider changes carefully, even at the cost of reducing the velocity of improvements; release nothing into master before its time; prioritize backwards compatibility over time (not just over major releases, but actual calendar time). Treat your code like a bonsai tree, not a last-minute term paper.  But sometimes you can get away with this and sometimes you can’t, sometimes your stakeholders will let you and sometimes they won’t, and sometimes it isn’t really the right decision.

Software design is hard!

Posted in General | 2 Comments

Blacklight Community Survey Results

On August 20th I announced a Blacklight Community Survey to the blacklight and code4lib listservs, and it was also forwarded on to the hydra listserv by a member.

Between August 20th and September 2nd, I received 18 responses. After another week of no responses, I shut off the survey. It’s taken me until now to report the results, sorry!

The Survey was implemented using Google Docs. You can see the survey instrument here, access the summary of results from Google Docs here, and the complete spreadsheet of responses here.  The survey was intentionally anonymous.

My own summary with limited discussion follows below. 

Note: The summary of results incorrectly reports 24 responses rather than 18; I accidentally didn’t delete some test data before releasing the survey, and had no way to update the summary count. However, the spreadsheet is accurate; and the summaries for individual questions are accurate (you’ll see they each add up to 18 responses or fewer), except for the Blacklight version questions which have a couple test answers in the summary version. Sorry!

I am not sure if 18 responses should be considered a lot or a little, or what percentage of Blacklight implementations it represents. It should definitely not be considered a valid statistical sample; I think of it more like getting together people who happen to be at a conference to talk about their experiences with Blacklight, but I think such a view into Blacklight experiences is still useful.

I do suspect that Hydra gets more use than these results would indicate, and Hydra users of Blacklight are under-represented. I’m not sure why, but some guesses might be that Hydra implementations of blacklight are disproportionately done by vendors/contractors, or are more likely to be “release and forget about it” implementations — in either case meaning the host institutions are less likely to maintain a relationship to the Blacklight community, and find out about or care to respond to the survey.

Institutional Demographics and Applications

The majority of respondents (12 out of 18) are academic libraries, along with one public library, one museum, one vendor/contractor, two national libraries or consortiums, and one ‘other’.

I was unsurprised to see that the majority of use of Blacklight is for “special collection” or “institutional repository” type use. Only 1/3rd of respondents use Blacklight for a “Library catalog/discovery” application, with the rest “A Single special-purpose collection” (5 of 18), “Institutional/Digital collections repository (multiple collections)” (11, the majority of 18 respondents), or “Other” (4).

At my place of work, when we first adopted Blacklight, the primary use case for existing implementations and developers was library catalog/discovery, but I had seen the development efforts mostly focusing on other use cases lately, and it makes sense to see a shift in uses to majority “repository” or “special-purpose collection” uses along with that.

A majority (10 of 18) of respondents run more than 1 Blacklight application, which I did find a bit surprising, but may go along with “repository” type use, where each repo or collection gets its own BL app?  6 respondents run only one BL app, and 2 respondents are working on BL app(s) in development not yet in production.

Only 3 respondents (including myself) use Blacklight to host “No digital content, just metadata records”; 3 more just digital content, and the remaining 12 (the majority) some of each.

A full 8 of 18 include at least some MARC-origin metadata in their apps, 2 more than the number reporting using their app for “Library catalog/discovery”. Not quite a majority, but it seems MARC is definitely not dead in BL-land. “Dublin Core” and “Content from a Fedora Repository”, at 9 respondents each, only barely beat out MARC.

With 9 respondents reporting using “Content from a Fedora Repo”, and 11 reporting “Institutional/Digital collections repository”, I expected this would mean lots of Hydra use. But in a question we’ll examine in more detail later, only 4 respondents reported using “hydra-head (hydra framework)” in their app, which I find surprising. I don’t know if this is accurate, or if respondents missed or didn’t understand the checkbox at that later question.

Versions of Blacklight in Use, and Experience with Upgrading

Two respondents are actually still deploying an app with Blacklight 3.x.

Two more are still on Blacklight 4.x — one of those runs multiple apps with some of them already on 5.x but at least one not yet upgraded; the other runs only one app on BL 4.30.

The rest of the respondents are all on Blacklight 5.x, but on diverse 5.x releases, from 5.5 to 5.14.  At the time the survey data was collected, only four of 18 respondents had crossed the BL 5.12 boundary, where lots of deprecations and refactorings were introduced; 5.12 had been released for about 5 months at that point.  That is, many months after a given BL version was released, most BL implementations (at least in this sample) still had not upgraded to it.

Just over half of respondents, 10 of 18, have never actually upgraded a Blacklight app across a major version (e.g. 3.x to 4.x, or 4.x to 5.x); the other 8 have.

Oddly, the two respondents reporting themselves to be still running at least one BL 3.x app also said they did have experience upgrading a BL app across major versions. Makes me wonder why some of their apps are still on 3.x. None of the respondents still deploying 4.x said they had experience upgrading a BL app across a major version.

It seems that BL apps are in general not being quickly upgraded to keep up with BL releases. Live production BL deployments in the wild use a variety of BL versions, even across major versions, and some may have never been upgraded since install.

Solr Versions In Use

Only 16 of 18 respondents reported the version of Solr they are using (actually we asked for the lowest version of Solr they were using, if they had multiple Solrs used with BL).

A full 14 of these 16 are using some variety of Solr 4.x, with a large variety of 4.x Solrs in use from 4.0 to 4.10.

No respondents were still running Solr 3.x, but one poor soul is still running Solr 1.4. And only one respondent was running a Solr 5.x. It sounds like it may be possible for BL to drop support for Solr 3.x (or has that already happened), but requiring Solr 5.x would probably be premature.

I’m curious how many people have upgraded their Solr, and how often; it may be that the preponderance of Solr 4.x indicates that most installations were first deployed when Solr was in 4.x.

Rails Versions in Use

Four of 18 respondents are still using Rails 3.x, the rest have upgraded to 4.x — although not all to 4.2.

Those using Rails 3.x also tended to be the ones still reporting old BL versions in use, including BL 3.x.  I suspect this means that a lot of installations get deployed and never have any dependencies upgraded. Recall 10 of 18 respondents have never upgraded BL across a major version.  Although many of the people reporting running old Rails and old BL have upgraded BL across a major version (I don’t know if this means they used to be running even older versions of BL, or that they’ve upgraded some apps but not others).

“If it isn’t broke, don’t fix it” might sometimes work, for a “deploy and done” project that never receives any new features or development. But I suspect a lot of these institutions are going to find themselves in trouble when they realize they are eventually running old unsupported versions of Rails, ruby, or BL, especially if a security vulnerability is discovered.  Even if a backport security patch is released for an old unsupported Rails or ruby version they are using (no guarantee), they may lack local expertise to actually apply those upgrades; or upgrading Rails may require upgrading BL as well to work with later Rails, which can be a very challenging task.

Local Blacklight Development Practices and Dependencies

A full 16 of 18 respondents report apps that include locally-developed custom features. One more respondent didn’t answer, and only one said their app(s) did not.

I was surprised to see that only 2 respondents said they had hired a third-party vendor or contractor to install, configure, or develop a BL app. 2 more had hired a contractor; and 2 more said they were vendors/contractors for others.

I know there are people doing a healthy business in Blacklight consulting, especially Hydra; I am guessing that most of their clients are not enough involved in the BL community to see and/or want to answer this survey. (And I’m guessing many of those installations, unless the vendor/contractor has a maintenance contract, were also “deploy and ignore” installations which have not been upgraded since release).

So almost everyone is doing local implementation of features, but mostly not by hiring a vendor/contractor; they’re actually doing it in-house.

I tried to list every Blacklight plugin gem I could find distributed, and ask respondents which they used. The leaders were blacklight_advanced_search (53%) and blacklight_range_limit (39%).  Next were geoblacklight and hydra-head, each with 4 respondents (31%) claiming use. Again, I’m mystified how so few respondents can be using hydra-head when so many report IR/fedora uses. No other plugin got more than 3 respondents claiming use. I was surprised that only one respondent claimed sufia use.

Blacklight Satisfaction and Evaluation

Asked how satisfied they are with Blacklight, on a scale of 1 (least) to 5 (most), respondents gave a median score of 4, which is pretty respectable.

Now let’s look at the free-form answers for what people like, don’t like, or want from Blacklight.

A major trend in what people like is Blacklight’s flexibility, customizability, and extensibility:

  • “The easily extendable and overridable features make developing on top of Blacklight a pleasure.”
  • “…Easy to configure faceting and fields.”
  • “…the ability to reuse other community plugins.”
  • “The large number of plugins that enhance the search experience…”
  • “We have MARC plus lots of completely randomly-organized bespoke cataloging systems. Blacklight committed from the start to be agnostic as to the source of records, and that was exactly what we needed. The ability to grow with Blacklight’s feature set from back when I started using it, that was great…”
  • “Easily configurable, Easily customizable, Ability to tap into the search params logic, Format specific partial rendering”

A major trend in what people don’t like or find most challenging about Blacklight is difficulty of upgrading BL:

  • “When we have heavily customized Blacklight applications, upgrading across major versions is a significant stumbling block.”
  • “Being bound together with a specific Bootstrap causes enormous headaches with updating”
  • “Upgrades and breaking of backwards compatibility. Porting changes back into overridden partials because much customization relies on overriding partials. Building custom, complicated, special purpose searches using Blacklight-provided methods [is a challenge].”
  • “Upgrading is obviously a pain-point; although many of the features in newer versions of Blacklight are desirable, we haven’t prioritized upgrading our internal applications to use the latest and greatest.”
  • “Varied support for plugins over versions [is a challenge].”
  • “And doing blacklight upgrades, which usually means rewriting everything.”
  • “Rapid pace of development. New versions are released very quickly, and staying up to date with the latest version is challenging at times. Also, sometimes it seems that major changes to Blacklight (for example, move from Bootstrap 2 to Bootstrap 3) are quasi-dictated by needs of one (or a handful) of particular institutions, rather than by consensus of a wider group of adopters/implementors. Also, certain Blacklight plugins get neglected and start to become less and less compatible with newer versions of Blacklight, or don’t use the latest methods/patterns, which makes it more of a challenge to maintain one’s app.”
  • “Getting ready for the upgrade to 6.0. We’ve done a lot of local customizations and overrides to Blacklight and some plugins that are deprecated.”

As well as difficulty in understanding the Blacklight codebase:

  • “Steep learning curve coming from vanilla rails MVC. Issues well expressed by B Armintor here:”
  • “Code churn in technical approach (often I knew how something was done but find out it has changed since the last time I looked). Can sometimes be difficult to debug the source given the layers of abstraction (probably a necessary evil however).”
  • “Too much dinking around and mucking through lengthy instructions and config files is required to do simple things. BL requires someone with substantial systems skills to spend a lot of time to use — a luxury most organizations don’t have. Skinning BL is much more painful than it needs to be as is making modifications to BL behaviors. BL requires far more time to get running and has more technical/skill dependencies than other things we maintain. In all honesty, what people here seem to like best about BL is actually functionality delivered by solr.”
  • “Figuring out how to alter blacklight to do our custom development.”
  • “Understanding and comprehension of how it fits together and how to customise initially.”
  • “Less. Simplicity instead of more indirection and magic. While the easy things have stayed easy anything more has seemed to be getting harder and more complicated. Search inside indexing patterns and plugin. Better, updated, maintained analytics plugin.”
  • “A more active and transparent Blacklight development process. We would be happy to contribute more, but it’s difficult to know a longer-term vision of the community.”

What does it mean?

I’ve separated my own lengthy interpretation, analysis, and evaluation based on my own personal judgement into a subsequent blog post. 

Posted in General | 2 Comments

“Agile Failure Patterns In Organizations”

An interesting essay showed up on Hacker News, called “Agile Failure Patterns In Organizations”.

Where I am, we’ve made some efforts to move to a more small-a agile iterative and incremental development approach in different ways, and I think it’s been successful in some ways and less successful in others. (Really, I would say we’ve been trying to do this before we’d even heard the word “agile”).

Parts of the essay seem a bit too scrum-focused to me (I’m sold on the general principle of agile development, I’m less sold on Scrum(tm)), and I’m not sure about the list of “Agile Failures at a Team Level”, but the list of “Agile Failures at Organizational Level”… ring some bells for me, loudly.

Agile Failure At Organizational Level:

  • Not having a (product) vision in the first place: If you don’t know, where you are going, any road will take you there.
  • The fallacy of “We know what we need to build”. There is no need for product discovery or hypotheses testing, the senior management can define what is relevant for the product backlog.
  • A perceived loss of control at management level leads to micro-management.
  • The organization is not transparent with regard to vision and strategy hence the teams are hindered to become self-organizing.
  • There is no culture of failure: Teams therefore do not move out of their comfort zones, but instead play safe.
  • The organization is not optimized for a rapid build-test-learn culture and thus departments are moving at different speed levels. The resulting friction caused is likely to equalize previous Agile gains.
  • Senior management is not participating in Agile processes, e.g. sprint demos, despite being a role model. But they do expect a different form of (push) reporting.
  • Not making organizational flaws visible: The good thing about Agile is that it will identify all organizational problems sooner or later. „When you put problem in a computer, box hide answer. Problem must be visible!“ Hideshi Yokoi, former President of the Toyota Production System Support Center in Erlanger, Kentucky, USA
  • Product management is not perceived as the “problem solver and domain expert” within the organization, but as the guys who turn requirements into deliverables, aka “Jira monkeys”.
  • Other departments fail to involve product management from the start. A typical behavior in larger organizations is a kind of silo thinking, featured by local optimization efforts without regard to the overall company strategy, often driven by individual incentives, e.g. bonuses. (Personal agendas are not always aligned with the company strategy.)
  • Core responsibilities of product management are covered by other departments, e.g. tracking, thus leaving product dependent on others for data-driven decisions.
  • Product managers w/o a dedicated team can be problem, if the product management team is oversized by comparison to the size of the engineering team.

How about you, do some of those make you wonder if the author has been studying your organization, they ring so true?

Posted in General | Leave a comment

Carl Grant on the Proquest acquisition of Ex Libris

Worth reading, links to a couple other posts worth reading. I don’t have any comments of my own to add at present, but may in a future blog post.

Posted in General | Leave a comment

Just curious: Do you think there is a market for additional Rails contractors for libraries?

Fellow library tech people and other library people who read this blog, what do you think?

Are there libraries who would be interested in hiring a Rails contractor/consultant to do work for them, of any kind?

I know Data Curation Experts does a great job with what they do — do you think there is work for more than just them, whether on Blacklight/Hydra or other Rails?

Any sense of it, from where you work or what you’ve heard?

I’m just curious, thinking about some things.

Posted in General | 4 Comments

DOAJ API in bento_search 1.5

bento_search is a gem I wrote that lets you search third-party search engine APIs with a standardized, simple, natural ruby API. It’s focused on ‘scholarly’ sources and use cases.

In the just-released version 1.5, a search engine adapter is included for the Directory of Open Access Journals (DOAJ) article search api.

While there certainly might be circumstances where you want to provide end-users with interactive DOAJ searches, embedded in your application, my main thoughts of use cases are different, and involve back-end known-item lookup in DOAJ.

It’s not a coincidence that bento_search introduced multi-field querying in this same 1.5 release.

The SFX link resolver  is particularly bad at getting users to direct article-level links for open access articles. (Are products from competitors like SerSol or OCLC any better here?). At best, you are usually directed to a journal-level URL for the journal title the article appears in.

But what if the link resolver knew it was probably an open access journal, based on ISSN (or at the Umlaut level, based on SFX returning a DOAJ_DIRECTORY_OPEN_ACCESS_JOURNALS_FREE target as valid)?  You could take the citation details, look them up in DOAJ to see if you get a match, and if so take the URL returned by DOAJ and return it to the user, knowing it’s going to be open access and not paywalled.

require 'bento_search'

searcher = BentoSearch::DoajArticlesEngine.new
results  = searcher.search(:query => {
    :issn       => "0102-3772",
    :volume     => "17",
    :issue      => "3",
    :start_page => "207"
})

if results.count > 0
   url = results.first.link
   # => the article-level URL DOAJ has on record
   # hey, maybe we got a doi too.
   doi = results.first.doi
   # => "10.1590/S0102-37722001000300002"
end

Or if an application isn’t sure whether an article citation is available open access or not, it could check DOAJ to see if the article is listed there.

Perhaps such a feature will be added to Umlaut at some point.

As more and more is published open access, DOAJ might also be useful as a general large aggregator for metadata enhancement or DOI reverse lookup, for citations in its database.

Another known-item-lookup use of DOAJ might be to fetch an abstract for an article in its database.



For anyone interested in using the DOAJ Article Search API (some of whom might arrive here from Google), I found the DOAJ API to be pretty easy to work with and straightforward, but I did encounter a couple tricky parts that are worth sharing.

URI Escaping in a Path component

The DOAJ Search API has the query in the path component of the URL, not a query param: /api/v1/search/articles/{search_query}

In the path component of a URI, spaces are not escaped as "+"; a "+" just means "+", and will indeed be interpreted that way by the DOAJ servers. (Thanks, DOAJ API designers, for echoing back the query in the response, which made my bug there a bit more discoverable!) Spaces are escaped as "%20". (Really, escaping spaces as "+" even in a query param is an odd legacy practice of unclear standards compliance, though most systems accept it in the query params after the "?" in a URL.)

At first I just reached for my trusty Ruby stdlib method `CGI.escape`, but that escapes spaces as `+`, resulting in faulty input to the API. Then I figured maybe I should be using Ruby’s `URI.escape`, which does turn spaces into "%20", but leaves some things like "/" alone entirely. True, "/" is legal in a URI, but as a path component separator! If I actually wanted one inside the last path component as part of the query, it should be escaped as "%2F". (I don’t know if a "/" would ever be useful in a query to this API, but I strive for completeness.)

So I settled for ruby `CGI.escape(input).gsub("+", "%20")` — ugly, but okay.
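
A quick sketch of the difference, just illustrating the stdlib behavior rather than anything bento_search-specific:

require 'cgi'

query = 'strawberry banana'

CGI.escape(query)                   # => "strawberry+banana"   (wrong inside a path component)
CGI.escape(query).gsub("+", "%20")  # => "strawberry%20banana" (what I settled on)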

Really, for designing APIs like this, I’d suggest always putting a query like this in a URI query param, where it belongs. It might initially seem nice to have search-results URLs with the query right in the path, but once you start having multi-word input, or worse, complex expressions (see next section), it gets less nice quickly.

Escaping is confusing enough already; stick to the convention. There’s a reason the query component of the URI (after the question mark) is called the query component of the URI!

ElasticSearch as used by DOAJ API defaults to OR operator

At first I was confused by the results I was getting from the API, which seemed very low precision, including results I couldn’t see any reason for.

The DOAJ Search API docs helpfully tell us that it’s backed by ElasticSearch, and the query string can be most any ElasticSearch query string. 

I realized that for multi-word queries, it was sending them to ElasticSearch with the default `default_operator` of "OR", meaning all terms were ‘optional’, and apparently with a very low (1?) `minimum_should_match`.

Meaning results included documents with just any one of the search terms. Which didn’t generally produce intuitive or useful results for this corpus and use case — note that DOAJ’s own end-user-facing search uses an “AND” default_operator producing much more precise results.

Well, okay, I can send it any ElasticSearch query, so I’ve just got to prepend a “+” operator to all terms, to make them mandatory. Which gets a bit trickier when you want to support phrases too, as I do; you need to do a bit of tokenization of your own. But doable.

Instead of sending the query as the user may have entered it:  apple orange “strawberry banana”

Send query: +apple +orange +”strawberry banana”

Or for a fielded search:  bibjson.title:(+apple +orange +”strawberry banana”)

Or for a multi-fielded search where everything is still supposed to be mandatory/AND-ed together, the somewhat imposing:  +bibjson.title:(+apple +orange +”strawberry banana”) +rochkind)

Convoluted, but it works out.
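
Here’s a minimal sketch of that mandatory-term munging; it’s not the exact bento_search implementation, just the idea, and the helper name is made up:

# Split the input into double-quoted phrases and bare words, then prefix each
# token with "+" so ElasticSearch treats it as required rather than optional.
def make_terms_mandatory(query)
  tokens = query.scan(/"[^"]*"|\S+/)
  tokens.map { |t| t.start_with?("+", "-") ? t : "+#{t}" }.join(" ")
end

make_terms_mandatory(%q{apple orange "strawberry banana"})
# => '+apple +orange +"strawberry banana"'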

I really like that they allow the API client to send a complete ElasticSearch query; it let me do what I wanted even if it wasn’t what they had anticipated. I’d encourage this pattern for other query APIs. But if you are allowing the client to send in an ElasticSearch (or Solr) query, it would be much more convenient if you also let the client choose the `default_operator` (Solr `q.op`) and `minimum_should_match` (Solr `mm`).
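
For illustration, here’s roughly the difference in ElasticSearch terms, sketched as Ruby hashes of the JSON search bodies. The first is my guess at what the DOAJ backend is effectively doing with the query string; the second is what I’d rather be able to ask for, if the knobs were exposed:

# Apparently roughly what happens to the query string now: terms are optional.
doaj_style = {
  query: {
    query_string: {
      query:                'apple orange "strawberry banana"',
      default_operator:     "OR",
      minimum_should_match: 1
    }
  }
}

# What a client might prefer for known-item precision: all terms required.
preferred = {
  query: {
    query_string: {
      query:            'apple orange "strawberry banana"',
      default_operator: "AND"
    }
  }
}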

So, yeah, bento_search

The beauty of bento_search is that one developer figures out these confusing idiosyncrasies once (and most of the bento_search targets have such things) and encodes them in the bento_search logic. You, the bento_search client, can be blissfully ignorant of them: you just call methods on a BentoSearch::DoajArticlesEngine the same as on any other bento_search engine, e.g. `'apple orange "strawberry banana"')`, and it takes care of the under-the-hood, API-specific idiosyncrasies, workarounds, and weirdness.
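
As a sketch of what that uniformity looks like in practice (assuming a Scopus API key is configured; the DOAJ engine needs none):

# Same calling convention for every engine; each adapter handles its own
# API-specific quirks (like DOAJ's mandatory-term munging) internally.
engines = [
  BentoSearch::DoajArticlesEngine.new,
  BentoSearch::ScopusEngine.new(:api_key => ENV['SCOPUS_API_KEY'])
]

engines.each do |engine|
  results = engine.search('apple orange "strawberry banana"')
  results.each { |item| puts "#{item.title} (#{item.link})" }
end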

Notes on ElasticSearch

I haven’t looked much at ElasticSearch before, although I’m pretty familiar with its cousin Solr.

I started looking at the ElasticSearch docs since the DOAJ API told me I could send it any valid ElasticSearch query. I found it familiar from my Solr work; they are both based on Lucene, after all.

Out of curiosity, I started checking out documentation beyond what I needed (or could make use of) for the DOAJ work too. I was quite impressed with ElasticSearch’s feature set, and its straightforward and consistent API.

One thing to note is ElasticSearch’s really neat query DSL, which lets you specify queries as a JSON representation of a query abstract syntax tree, rather than just trying to specify what you mean in a textual string query. For machine-generated queries this is a great feature: it can make it easier to specify complicated queries than a textual string query does, and it can make certain things possible that aren’t possible at all in the textual string query language.
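
For example, here’s a rough sketch of a query expressed as that kind of JSON tree, built as a Ruby hash. This is ElasticSearch’s own search body, not something the DOAJ article API accepts, and the field names are just borrowed from the DOAJ examples above:

require 'json'

# "Title must match all of these terms AND author must match rochkind,"
# expressed as a structured tree rather than a textual query string.
es_query = {
  query: {
    bool: {
      must: [
        { match: { "bibjson.title"        => { query: 'strawberry banana', operator: "and" } } },
        { match: { "" => "rochkind" } }
      ]
    }
  }
}

puts JSON.pretty_generate(es_query)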

I recall Erik Hatcher telling me several years ago (possibly before ElasticSearch even existed) that a similar feature was being contemplated for Solr, but taking XML input instead of JSON, naturally. I’m sure the hypothetical Solr feature would be more powerful than the one in ElasticSearch, but years later it still hasn’t landed in Solr so far as I know, while there it is in ElasticSearch…

I’m going to try to keep my eye on ElasticSearch.


bento_search 1.5, with multi-field queries

bento_search is a gem I wrote that lets you search third-party search engine APIs with a standardized, simple, natural Ruby API. It’s focused on ‘scholarly’ sources and use cases.

Version 1.5, just released, includes support for multi-field searching:

searcher = ENV['SCOPUS_API_KEY'])
results = => {
    :title  => '"Mystical Anarchism"',
    :author => "Critchley",
    :issn   => "14409917"
})

Multi-field searches are always AND’d together (title=X AND author=Y), because that was the only use case I had, and it seems like mostly what you’d want. (On our existing Blacklight-powered Catalog, we eliminated “All” or “Any” choices for multi-field searches, because our research showed nobody ever wanted “Any”.)

As with everything in bento_search, you can use the same API across search engines: whether you are searching Scopus or Google Books or Summon or EBSCOHost, you use the same Ruby code to query and get back results of the same classes.

Except, well, multi-field search is not yet supported for Summon or Primo, because I do not have access to those proprietary products or their documentation to make sure I have the implementation right and to test it. I’m pretty sure the feature could be added pretty easily to both, by someone who has access (or who wants to share access with me, as an unpaid ‘contractor’, so I can add it for you).

What’s multi-field querying for?

You certainly could expose this feature to end-users in an application using a bento_search powered interactive search. And I have gotten some requests for supporting multi-field search in our bento_search powered ‘articles’ search in our discovery layer; it might be implemented at some point based on this feature.

(I confess I’m still confused about why users want to enter text in separate ‘author’ and ‘title’ fields, instead of just entering the author’s name and title in one ‘all fields’ search box, Google-style. As far as I can tell, all bento_search engines perform pretty well with author and title words entered in the general search box. Are users finding differently? Or do they just assume it won’t work, and want the security, along with the extra work, of entering them in multiple fields? I dunno.)

But I’m actually more interested in this feature for other uses than directly exposed interactive search.

It opens up a bunch of possibilities for under-the-hood known-item identification in various external databases.

Let’s say you have an institutional repository with pre-prints of articles, but it’s only got author and title metadata, and maybe the name of the publication it was eventually published in, but not volume/issue/start-page, which you really want for better citation display and export, analytics, or generation of a more useful OpenURL.

So you take the metadata you do have, and search a large aggregating database to see if you can find a good match, and enhance the metadata with what that external database knows about the article.

Similarly, citations sometimes come into my OpenURL resolver (powered by Umlaut) that lack sufficient metadata for good coverage analysis and outgoing link generation, for which we generally need year/volume/issue/start-page too. Same deal.
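
A rough sketch of that enhancement flow, using Scopus as the hypothetical aggregator; the merge step is schematic, and the fields read off the match are bento_search’s standard ResultItem attributes:

searcher = ENV['SCOPUS_API_KEY'])
results = => {
    :title  => '"Mystical Anarchism"',
    :author => "Critchley"
})

if (match = results.first)
  # Fold the aggregator's fuller citation details back into our sparse record.
  enhanced_fields = {
    :volume     => match.volume,
    :issue      => match.issue,
    :start_page => match.start_page,
    :doi        => match.doi
  }
end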

Or in the other direction, maybe you have an ISSN/volume/issue/start-page, but don’t have an author and title. Which happens occasionally at the OpenURL link resolver, maybe other places. Again, search a large aggregating database to enhance the metadata, no problem:

results = => {
    :issn       => "14409917",
    :volume     => "10",
    :issue      => "2",
    :start_page => "272"
})

Or maybe you have a bunch of metadata, but not a DOI — you could use a large citation aggregating database that has DOI information as a reverse-DOI lookup. (Which makes me wonder if CrossRef or another part of the DOI infrastructure might have an API I should write a BentoSearch engine for…)

Or you want to look up an abstract. Or you want to see if a particular citation exists in a particular database for value-added services that database might offer (look inside from Google Books; citation chaining from Scopus, etc).

With multi-field search in bento_search 1.5, you can do a known-item ‘reverse’ lookup in any database supported by bento_search, for these sorts of enhancements and more.

In my next post, I’ll discuss this in terms of DOAJ, a new search engine added to bento_search in 1.5.
