Really slow rspec suite? Use the fuubar formatter!

I am working on a ‘legacy’-ish app that unfortunately has a pretty slow test suite (10 minutes+).

I am working on some major upgrades to some dependencies that require running the full test suite, or a major portion of it, over and over. I'm starting with a bunch of broken tests and whittling them down.

It was painful. I was getting really frustrated with the built-in rspec formatters: I'd see an 'F' in the output, but wouldn't know which test had failed until the whole suite finished. Alternatively, I could control-c, or run with --fail-fast, to see the first failure (or some subset of failures) as they happened, but that interrupts the suite, so I'd never see the later failures.

Then I found the fuubar rspec formatter.  Perfect!

  • A progress bar makes the suite seem faster psychologically, even though it isn't. There are reasons a progress bar is considered good UI for a long-running task!
  • It outputs failed specs as they happen, but keeps running the whole suite. For a long-running suite, this lets me start investigating a failure the moment it happens, without waiting for the suite to finish, while still letting the suite complete so I can see the total picture of how I'm doing and what other sorts of failures I'm getting.
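Setup is minimal. Roughly (a sketch; check the fuubar README for the current instructions):

# Gemfile
group :test do
  gem 'fuubar'
end

# .rspec -- or pass --format Fuubar on the rspec command line
--format Fuubar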

I recommend fuubar; it's especially helpful for slow suites. I had been wanting something like this for a couple of months, and wondering why it wasn't a built-in formatter in rspec. I just ran across it now in a reddit thread (started by someone else considering writing such a formatter, who didn't know fuubar already existed!). So I wrote this blog post to hopefully increase exposure!


Commercial gmail plugin to turn gmail into a help desk

This looks like an interesting product; I didn't even know gmail supported plugins at this level.

http://www.keeping.com/

Help desk ticketing, with assignment, priorities, notes, and built-in response-time metrics, all within your gmail inbox (support emails are in a separate tab from your regular email).

The cost is $49/month for the 'unlimited' plan; a cheaper $29/month plan is capped at 5 users.

I think this product could be a good fit for libraries dealing with patron reference/help questions; many libraries don't have very user-friendly interfaces for this at present. The price seems pretty reasonable too: at under $1,000/year it's probably cheaper than most alternatives and within the budgets of many libraries.


Sequential JQuery AJAX using recursive creation of Promises

So I’m in JQuery-land.

I've got an array of 100K IDs on the client-side, and I want to POST them to a back-end API which will respond with JSON, in batches of 100 at a time. So that's 1000 individual POSTs.

I don’t want to just loop and create 1000 `$.post`s, because I don’t want the browser trying to do 1000 requests “at once.” So some kind of promise chaining is called for.

But I don't really even want to create all 1000 promises at once; that's a lot of things in memory, doing who knows what. I want to go through the batches in sequence, waiting for each batch to be done, and creating the next promise/AJAX request in the chain only after the previous one finishes.

Here’s one way to do it, using a recursive function to create the AJAX promises.

var bigArrayOfIds; // assume exists
var bigArrayLength = bigArrayOfIds.length;
var batchSize = 100;

function batchPromiseRecursive() {
  // note splice is destructive, removing the first batch off
  // the array
  var batch = bigArrayOfIds.splice(0, batchSize);

  if (batch.length == 0) {
    return $.Deferred().resolve().promise();
  }

  return $.post('/endpoint/post', {ids: batch})
    .done(function(serverData) {
      // Do something after each batch finishes.
      // Updating a progress bar is probably a good idea.
    })
    .fail(function(e) {
      // if a batch fails, say server returns 500,
      // do something here. 
    })
    .then(function() {
      return batchPromiseRecursive();
    });
}
            

batchPromiseRecursive().then(function() {
  // something to do when it's all over. 
});

In this version, if one batch fails, execution stops entirely. To record the failure but keep going with the next batches, I think you'd just have to take the `then` inside the batchPromiseRecursive function and give it a second (error) argument that converts the failed promise to a successful one. I haven't gotten that far. I think the JQuery (ES6?) promise API is a bit more confusing/less concise than it could be for converting a failed state to a resolved one in your promise chain.

Or maybe I just don't understand how to use it effectively/idiomatically; I'm fairly new to this stuff. Any other ways to improve this code?


“Apple Encryption Engineers, if Ordered to Unlock iPhone, Might Resist”

From the NYTimes, “Apple Encryption Engineers, if Ordered to Unlock iPhone, Might Resist”:

SAN FRANCISCO — If the F.B.I. wins its court fight to force Apple’s help in unlocking an iPhone, the agency may run into yet another roadblock: Apple’s engineers.

Apple employees are already discussing what they will do if ordered to help law enforcement authorities. Some say they may balk at the work, while others may even quit their high-paying jobs rather than undermine the security of the software they have already created, according to more than a half-dozen current and former Apple employees.

Do software engineers have professional ethical responsibilities to refuse to do some things even if ordered by their employers?


Followup: Reliable Capybara JS testing with RackRequestBlocker

My post on Struggling Towards Reliable Capybara Javascript Testing attracted a lot of readers, and some discussion on reddit.

I left there thinking I had basically got my Capybara JS tests reliable enough… but after that, things degraded again.

But now I think I really have fixed it, with some block/wait rack middleware based on the original concept by Joel Turkel, which I've released as RackRequestBlocker. This is middleware that keeps track of 'outstanding' requests in your app that were triggered by a feature spec that has already finished, and lets the main test thread wait until they are complete before database-cleaning and moving on to the next spec.

My RackRequestBlocker implementation is based on the new hotness concurrent-ruby (a Rails 5 dependency, and a great collection of ruby concurrency primitives) instead of the older `atomic` gem Turkel used; it uses actual signal/wait logic instead of polling; and it's refactored into what is, IMO, a more conveniently packaged API. It was also influenced by Dan Dorman's unfinished attempts to gemify Turkel's design.

It's only a few dozen lines of code; check it out for an example of using concurrent-ruby's primitives to build something concurrent.
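If you want a feel for the shape of the idea before reading the real code, here is a rough conceptual sketch. To be clear, this is not the actual RackRequestBlocker implementation (which uses concurrent-ruby primitives); it just illustrates the approach with plain stdlib Mutex/ConditionVariable:

# Conceptual sketch only; see the RackRequestBlocker gem for the real thing.
# The middleware counts in-flight requests; the test suite waits for the count
# to reach zero before cleaning the database and moving on to the next example.
class RequestCounterMiddleware
  def initialize(app)
    @app   = app
    @count = 0
    @mutex = Mutex.new
    @zero  = ConditionVariable.new
  end

  def call(env)
    @mutex.synchronize { @count += 1 }
    @app.call(env)
  ensure
    @mutex.synchronize do
      @count -= 1
      @zero.broadcast if @count == 0
    end
  end

  # Called from a suite-wide after(:each) hook, before DatabaseCleaner.clean.
  def wait_for_no_outstanding_requests
    @mutex.synchronize do
      @zero.wait(@mutex) until @count == 0
    end
  end
end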

And my Capybara JS feature tests now appear to be very very reliable, and I expect them to stay that way. Woot.

To be clear, I also had to turn off the DatabaseCleaner transactional strategy entirely, even for non-JS tests. Just RackRequestBlocker wasn't enough, and neither was just turning off the transactional strategy. With either one by itself I still had crazy race conditions — including pg deadlocks… and actual segfaults!

Why? I honestly am not sure. There's no reason the transactional fixture strategy shouldn't work when used only for non-JS tests, even with RackRequestBlocker. The segfaults suggest a bug in something written in C: MRI, pg, poltergeist? (poltergeist was very unpopular in the reddit thread on my original post, but I still think it's less bad than other options for my situation.) A bug of some kind in the test_after_commit gem we were using to make things work even with the transactional fixture strategy? Honestly, I have no idea — I just accepted it, and was happy to have tests that were working.

Try out RackRequestBlocker, see if it helps with your JS Capybara race condition problems, and let me know in the comments if you want; I'm curious. I can't support this super well, I just provide the code as a public service, because I fantasize about the day nobody has to go through as many hours as I have fighting with JS feature tests.


Struggling Towards Reliable Capybara Javascript Testing

You may have reached this blog post because you’re having terribly frustrating problems with Capybara Javascript-driver feature tests that are unreliable in intermittent and hard to reproduce ways.

You may think this is a rare problem, since Capybara/JS is such a popular tool suite, and you didn’t find too much on it on Google, and what you did find mostly suggested you’d only have problems like this if you were “doing it wrong”, so maybe you just weren’t an experienced enough coder to figure it out.

After researching these problems, my belief is that intermittent test failures are actually fairly endemic for those using Capybara JS drivers, at least/especially with a complex JS front-end environment (Angular, React/Flux, Ember, etc.). It's not just you; I believe these problems plague many experienced developers.

This blog post summarizes what I learned trying to make my own JS feature tests reliable — I think there is no magic bullet, but to begin with you can understand the basic architecture and nature of race condition problems in this context; there are a bucket of configuration issues you can double-check to reduce your chances of problems somewhat; turning off rspec random ordering may be surprisingly effective at decreasing intermittent failures; but ultimately, building reliable JS feature tests with Capybara is a huge challenge.

My situation: I had no previous experience with new-generation front-end JS frameworks. Relatedly, I had previously avoided Capybara JS features, being scared of them (in retrospect my intuition was somewhat justified), and mostly not testing my (limited) JS. But at the new gig, I was confronted with supporting a project which: had relatively intensive front-end JS, for 'legacy' reasons using a combination of React and Angular; was somewhat under-tested, with JS feature tests somewhat over-represented in the test suite (things that maybe could have been tested with functional/controller or unit tests were instead being tested with UI feature tests); and had such intermittent unreliability in the test suite that it was difficult to use the suite for its intended purpose.

I have not ultimately solved the intermittent failures, but I have significantly decreased their frequency, making the test suite more usable.

I also learned a whole lot in the process. If you are a "tl;dr" type, this post might not be for you; it has become large. My goal is to provide the post I wish I had found before embarking on many, many hours of research and debugging; it may take you a while to read and assimilate, but if you're as frustrated as I was, hopefully it will save you many more hours of independent research and experimentation to put it all together.

The relevant testing stack in the app I was investigating is: Rails 4.1.x, Rspec 2.x (with rspec-rails), Capybara, DatabaseCleaner, Poltergeist. So that’s what I focused on. Changing any of these components (say, MiniTest for Rspec) could make things come out different, although the general picture is probably the same with any Capybara JS driver.

No blame

To get it out of the way, I’m not blaming Capybara as ‘bad software’ either.

The inherent concurrency involved in the way JS feature tests are done makes things very challenging.

Making things more challenging is that the 'platform' that gives us JS feature testing is composed of a variety of components with separate maintainers, all of which are intended to work "mix and match" with different choices of platform components: Rails itself in multiple versions; Capybara, with rspec or even minitest; probably, but not necessarily, DatabaseCleaner; and your choice of JS browser simulator driver.

All of these components need to work together to try and avoid race conditions; all of them keep changing and releasing new versions relatively independently and unsynchronized; and all of them are maintained by people who are deeply committed to making sure their part does its job or contract adequately, but there's not necessarily anyone with the big-picture understanding, authority, and self-assigned responsibility to make the whole integration work.

Such is often the ruby/rails open source environment. It can make it confusing to figure out what’s really going on.

Of Concurrency and Race Conditions in Capybara JS Feature tests

“Concurrency” means a situation where two or more threads or processes are operating “at once”. A “race condition” is when a different outcome can happen each time the same code involving concurrency is run, depending on exactly the order or timing of each concurrent actor (depending on how the OS ends up scheduling the threads/processes, which will not be exactly the same each time).

You necessarily have concurrency in a Capybara Javascript feature test — there are in fact three different concurrent actors (two threads and a separate process) going on, even before you’ve tried to do something fancy with parallel rspec execution (and I would not recommend trying, especially if you are already experiencing intermittent failure problems; there’s enough concurrency in the test stack already), and even if you weren’t using any concurrency in your app itself.

  1. The main thread in the main process that is executing your tests in order.
  2. Another thread in that same process is started to run your Rails app, for the simulated browser actions to run against — this behavior depends on your Capybara driver, but I think all the drivers that support Javascript do this; that is, all drivers but the default :rack_test.
  3. The actual or simulated browser (for Poltergeist, a headless Webkit process; for selenium-webdriver, an actual Firefox) that Capybara (via a driver) is controlling, loading pages from the Rails app (2 above) and interacting with them.

There are two main categories of race condition that arise in the Capybara JS feature test stack. Your unreliable tests are probably because of one or both of these. To understand why your tests are failing unreliably and what you can do about it, you need to understand the concurrent architecture of a Capybara JS feature test as above, and these areas of potential race conditions.

The first category — race conditions internal to a single feature test caused by insufficient waiting for an AJAX response — is relatively well discussed in "the literature" (what you find googling: blogs, docs, issues, SOs, etc.). The second category — race conditions in the entire test suite caused by app activity that persists after the test example completes — is actually much more difficult to diagnose and deal with, and is under-treated in 'the literature'.

1. Race condition WITHIN a single test example: Waiting on Javascript

An acceptance/feature/integration test (I will use those terms interchangeably; we're talking about tests of the UI) for a web app consists of: simulate a click (or other interaction) on a page, then see if what results is what you expect. Likely a series of those.

Without Javascript, a click (almost always) results in an HTTP request to the server. The test framework waits for the ensuing response, and sees if it contains what was expected.

With Javascript, the result of an interaction likewise isn’t absolutely instantaneous, but there’s no clear signal (like HTTP request made and then response returned) to see if you’ve waited ‘long enough’ for the expected consequence to happen.

So Capybara ends up waiting some amount of time, periodically re-checking the expectation to see if it's met yet, up to a maximum amount of time.

The amount of time the Javascript takes to produce the expected change on the page can slightly (or significantly) differ each time you run the tests, with no code changes. If sometimes Capybara is waiting long enough, but other times it isn’t — you get a race condition where sometimes the test passes but others it doesn’t.

This will exhibit straightforwardly as a specific test that sometimes passes and other times doesn’t.  To fix it, you need to make sure Capybara is waiting for results, and willing to wait long enough.

Use the right Capybara API to ensure waits

In older Capybara, developers would often explicitly tell Capybara exactly when to wait and what to wait for with the `wait_until` method.

In Capybara 2.0, author @jnicklas removed the `wait_until` method, explaining that Capybara has sophisticated waiting built into many of its methods, and wait_until was not necessary — if you use the Capybara API properly: "For the most part, this behaviour is completely transparent, and you don't even really have to think about it, because Capybara just does it for you."

In practice, I think this can end up less transparent than @jnicklas would like, and it can be easier than he hopes to do it wrong. In addition to the post linked above, additional discussions of using Capybara 'correctly' to ensure its auto-waiting is in play are here, here and here, plus the Capybara docs.

If you are using the Capybara API incorrectly so it’s not waiting at all, that could result in tests that always fail, but can also result in unreliable intermittently failing race conditions. After all, even if the Javascript involves no AJAX, it does not happen instantaneously. And similarly, in the main thread, moving from one rspec instruction (do a click) to another (see if page body has content) does not happen instantaneously.  The two small amounts of time will vary from run to run, and sometimes the JS may finish before the next ruby statement in your example happens, other times not. Welcome to concurrency.
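As a concrete illustration of the kind of difference that matters (illustrative only; exact waiting behavior varies by Capybara version and driver):

# Does NOT wait: page.text is evaluated immediately, possibly before
# the JS has updated the page.
expect(page.text).to include("Saved!")

# DOES wait: the have_content matcher keeps retrying until it matches
# (or Capybara.default_wait_time runs out).
expect(page).to have_content("Saved!")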

I ran into just a few feature examples in the app I was working on that had obvious problems in this area.

Making Capybara wait long enough

When Capybara is waiting, how long is it willing to wait before giving up? `Capybara.default_wait_time`, by default 2 seconds. If there are actions that sometimes or always take longer than this, you can increase the `Capybara.default_wait_time` — but do it in a suite-wide `before(:each)` hook, because I think Capybara may reset this value on every run, in at least some versions.

You can also run specific examples or sections of code with a longer wait value by wrapping in a `using_wait_time N do` block. 
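For example (a sketch; note that in later Capybara versions `default_wait_time` is renamed `default_max_wait_time`):

# spec/spec_helper.rb
RSpec.configure do |config|
  config.before(:each) do
    Capybara.default_wait_time = 5 # seconds
  end
end

# Or, for one particularly slow interaction:
using_wait_time 10 do
  expect(page).to have_content("Report finished")
end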

At first I spent quite a bit of time playing with this, because it’s fairly understandable and seemed like it could be causing problems. But I don’t think I ended up finding any examples in my app-at-hand that actually needed a longer wait time, that was not the problem.

I do not recommend trying to patch in `wait_until` again, or to patch in the various "wait_for_jquery_ajax", "wait_for_angular", etc. methods you can find googling. You introduce another component that could have bugs (or could become buggy with a future version of JQuery/Ajax/Capybara/poltergeist/whatever), you're fighting against the intention of Capybara, you're making things even more complicated and harder to debug, and even if it works you're tying yourself even further to your existing implementation, as there is no reliable way to wait on an AJAX request with the underlying actual browser API. My app-in-hand had some attempts in these directions, but even figuring out whether they were working (especially for Angular) was non-trivial. Better to just fix your test to wait properly on the expected UI, if you can at all.

In fact, while this stuff is confusing at first, it’s a lot less confusing — and has a lot more written about it on the web — than the other category of Capybara race condition…

2. Race condition BETWEEN test examples: feature test leaving unfinished business, controller actions still not done processing when the test ends.

So your JS feature test is simulating interactions with a web page, making all sorts of Javascript happen.

Some of this Javascript is AJAX that triggers requests against Rails controllers — running in the Rails app launched in another thread by the Capybara driver.

At some point the test example gets to the end, and has tested everything it’s going to test.

What if, at this point, there is still code running in the Rails app? Maybe an AJAX request was made and the Capybara test didn’t bother waiting for the response.

Anatomy of a Race Condition

RSpec will go on to the next test, but the Rails code is still running. The main thread running the tests will now run DatabaseCleaner.clean, and clear out the database — and the Rails code that was still in progress (in another thread) finds the database cleaned out from under it. Depending on the Rails config, maybe the Rails app now even tries to reload all the classes for dev-mode class reloading, and the code that was still in progress finds class constants undefined and redefined from under it. These things are all likely to cause exceptions to be raised by the in-progress-in-background-thread code.

Or maybe the code unintentionally in progress in the background isn’t interrupted, but it continues to make changes to the database that mess with the new test example that rspec has moved on to, causing that example to fail.

It's a mess. Rspec assumes each test is run in isolation; when there's something else running and potentially making changes to the test database concurrently, all bets are off. The presence and the nature of the problem caused depends on exactly how long the unintentional 'background' processing takes to complete, and how it lines up on the timeline against the new test, which will vary from run to run; that variation is what makes this a race condition.

This does happen. I’m pretty sure it’s what was happening to the app I was working on — and still is, I wasn’t able to fully resolve it, although I ameliorated the symptoms with the config I’ll describe below.

The presence and nature of the problem can also depend on which test is 'next', which will be different from run to run under random rspec ordering — but I found that even re-running the suite with the same seed, the presence and nature of the failures would vary.

What does it look like?

Tests that fail only when run as part of the entire test suite, but not when run individually. Which sure makes them hard to debug.

One thing you'll see when this is happening is different tests failing each time. The test that shows up as failing or erroring isn't actually the one that has the problematic implementation — it's the previously run JS feature test (or maybe even a JS feature test before that?) that sinned by ending while stuff was still going on in the Rails app. Which test was the previously run test will vary every run with a different seed, under rspec random ordering. RSpec's default output doesn't tell you which test was the previous one on a given run; and the RSpec 'text' formatter doesn't really give us the info in the format we want either (you have to translate from human-readable label to test file and line number yourself, which is kind of infeasible sometimes). I've thought about writing an RSpec custom formatter that just prints out file/line information for each example as it goes, to give me some hope of figuring out which test is really leaving its business unfinished, but haven't done so.
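For what it's worth, such a formatter would only be a few lines. Here's a hypothetical sketch against the RSpec 3 formatter API (the app discussed here was still on RSpec 2, whose formatter API differs):

class LocationFormatter
  RSpec::Core::Formatters.register self, :example_started

  def initialize(output)
    @output = output
  end

  # Print each example's file:line as it starts, so the last location printed
  # before a mysterious failure tells you which spec was actually running.
  def example_started(notification)
    @output.puts notification.example.location
  end
end

# e.g.: rspec --require ./spec/support/location_formatter.rb \
#             --format LocationFormatter --format progress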

It can be very hard to recognize when you are suffering from this problem, although when you can't figure out what the heck else could possibly be going on, that's a clue. It took me a bunch of hours to realize this was a possible thing, and the thing that was happening to me. Hopefully this very long blog post will save you more time than it costs you to read.

Different tests, especially but not exclusively feature tests, failing/erroring each time you run the very same codebase is a clue.

Another clue is when you see errors reported by RSpec as

     Failure/Error: Unable to find matching line from backtrace

I think that one is always an exception raised as a result of `Capybara.raise_server_errors = true` (the default) in the context of a feature test that left unfinished business. You might make those go away with `Capybara.raise_server_errors = false`, but I really didn’t want to go there, the last thing I want is even less information about what’s going on.

With Postgres, I also believe that `PG::TRDeadlockDetected: ERROR: deadlock detected` exceptions are symptomatic of this problem, although I can’t completely explain it and they may be unrelated (may be DatabaseCleaner-related, more on that later).

And I also still sometimes get my phantomjs processes dying unexpectedly; related? I dunno.

But I think it can also show up as an ordinary unreliable test failure, especially in feature tests.

So just don’t do that?

If I understand right, current Capybara maintainer Thomas Walpole understands this risk, and thinks the answer is: Just don’t do that. You need to understand what your app is doing under-the-hood, and make sure the Capybara test waits for everything to really be done before completing. Fair enough, it’s true that there’s no way to have reliable tests when the ‘unfinished business’ is going on. But it’s easier said than done, especially with complicated front-end JS (Angular, React/Flux, etc), which often actually try to abstract away whether/when an AJAX request is happening, whereas following this advice means we need to know exactly whether, when, and what AJAX requests are happening in an integration test, and deal with them accordingly.

I couldn’t completely get rid of problems that I now strongly suspect are caused by this kind of race condition between test examples, couldn’t completely get rid of the “unfinished business”.

But I managed to make the test suite a lot more reliable — and almost completely reliable once I turned off rspec random test order (doh), by dotting all my i’s in configuration…

Get your configuration right

There are a lot of interacting components in a Capybara JS Feature test, including: Rails itself, rspec, Capybara, DatabaseCleaner, Poltergeist. (Or equivalents or swap-outs for many of these).

They each need to be set up and configured right to avoid edge case concurrency bugs. You’d think this would maybe just happen by installing the gems, but you’d be wrong. There are a number of mis-configurations that can hypothetically result in concurrency race conditions in edge cases (even with all your tests being perfect).

They probably aren't affecting you; they're edge cases. But when faced with terribly confusing, hard-to-reproduce, race-condition-unreliable tests, don't you want to eliminate any known issues? And when I did all of these things, I did improve my test reliability, even with (presumably) remaining feature tests that don't wait on everything (race condition category #2 above).

Update your dependencies

When googling, I found many concurrency-related issues filed for the various dependencies. I'm afraid I didn't keep a record of them. But Rspec, Capybara, DatabaseCleaner, and Poltergeist have all had at least some known concurrency issues (generally with how they all relate to each other) in the past.

Update to the latest versions of all of them, to at least not be using a version with a known concurrency-related bug that’s been fixed.

I'm still on Rspec 2.x, but at least I updated to the last Rspec 2.x (2.14.1). And I updated DatabaseCleaner, Capybara, and Poltergeist to the latest versions I could.

Be careful configuring DatabaseCleaner — do not use the shared connection monkey-patch

DatabaseCleaner is used to give all your tests a fresh-clean database to reduce unintentional dependencies.

For non-JS-feature tests, you probably have DatabaseCleaner configured with the :transaction method — this is pretty cool, it makes each test example happen in an uncommitted transaction, and then just rolls back the transaction after every example. Very fast, very isolated!

But this doesn’t work with feature tests, because of the concurrency. Since JS feature tests boot a Rails app in another thread from your actual tests, using a different database connection, the running app wouldn’t be able to see any of the fixture/factory setup done in your main test thread in an uncommitted transaction.

So you probably have some config in spec/spec_helper.rb or spec/rails_helper.rb to try and do your JS feature tests using a different DatabaseCleaner mode.

Go back and look at the DatabaseCleaner docs and see if you are set up as currently recommended. Recently the DatabaseCleaner README made a couple of improvements to the recommended setup, making it more complicated but more reliable. Do what it says.

My previous setup wasn’t always properly identifying the right tests that really needed the non-:transaction method, the improved suggestion does it with a `Capybara.current_driver == :rack_test` test, which should always work. 

Do make sure to set `config.use_transactional_fixtures = false`, as the current suggestion will warn you about if you don’t. 

Do use `append_after` instead of a plain `after` hook to add your `DatabaseCleaner.clean`, to make sure database cleaning happens after Capybara is fully finished with its own cleanup. (It probably doesn't matter, but why take the risk.)
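Put together, the recommended setup looks roughly like this (paraphrased from the DatabaseCleaner README of that era; treat the current README as canonical):

# spec/rails_helper.rb (or spec_helper.rb)
RSpec.configure do |config|
  config.use_transactional_fixtures = false

  config.before(:suite) do
    DatabaseCleaner.clean_with(:truncation)
  end

  config.before(:each) do
    # Fast transaction strategy by default...
    DatabaseCleaner.strategy = :transaction
  end

  config.before(:each, type: :feature) do
    # ...but JS-driven feature specs run the app in another thread with its own
    # database connection, so they need truncation (or deletion) instead.
    unless Capybara.current_driver == :rack_test
      DatabaseCleaner.strategy = :truncation
    end
  end

  config.before(:each) do
    DatabaseCleaner.start
  end

  config.append_after(:each) do
    DatabaseCleaner.clean
  end
end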

It shouldn’t matter if you use :truncation or :deletion strategy; everyone uses “:truncation” because “it’s faster”, but the DatabaseCleaner documentation actually says: “So what is fastest out of :deletion and :truncation? Well, it depends on your table structure and what percentage of tables you populate in an average test.” I don’t believe the choice matters for the concurrency-related problems we’re talking about.

Googling, you'll find various places on the web advising (or copying advice from other places that advise) monkey-patching the Rails ConnectionPool with a "shared_connection" implementation originated by José Valim, to make the :transaction strategy work even with Capybara JS feature tests. Do not do this. ActiveRecord has had a difficult enough time with concurrency without intentionally breaking it or violating its contract — ActiveRecord's ConnectionPool intends to give each thread its own database connection, and this hack intentionally breaks that. If you have any tests exhibiting "race conditions between examples" (a spec ending while activity is still going on in the Rails app), this hack WILL make things a lot WORSE. Hacking the tricky concurrency-related parts of ActiveRecord ConnectionPool is not the answer. Not even if lots of blog posts from years ago tell you to do it, not even if the README or wiki page for one of the components tells you to (I know one does, but now I can't find it to cite it on a hall of shame); they are wrong. (This guy agrees with me; so do others if you google.) It was a clever idea José had, but it did not work out, and should not still be passed around the web.

Configure Rails under test to reduce concurrency and reduce concurrency-related problems

In a newly generated Rails 4.x app, if you look in `./config/environments/test.rb`, you’ll find this little hint, which you probably haven’t noticed before:

# Do not eager load code on boot. This avoids loading your whole application
# just for the purpose of running a single test. If you are using a tool that
# preloads Rails for running tests, you may have to set it to true.
config.eager_load = false

If that sounds suggestive, it’s because by saying “a tool that preloads Rails for running tests”, this comment is indeed trying to talk about Capybara with a JS driver, which loads a Rails app in an extra thread. It’s telling you to set eager_load to true if you’re doing that.

Except in at least some (maybe all?) versions of Rails 4.x, setting `config.eager_load = true` will change the default value of `config.allow_concurrency` from false to true. So by changing the one, you may have silently turned on `config.allow_concurrency` as well.

You don't want that, at least not if you're already dealing with a horrible race-condition-ridden test suite. Why not, you may ask; our whole problem is concurrency, shouldn't we be better off telling Rails to allow it? Well, what this config actually does (in Rails 4.x; in 5.x I dunno) is control whether the Rails app itself will force every request to wait in line and be served one at a time (allow_concurrency false), or create multiple threads (more threads, even more concurrency!) to handle multiple overlapping requests.

This configuration might make your JS feature tests even slower, but when I’m already dealing with a nightmare of unreliable race condition feature tests, the last thing I want is even more concurrency.

I’d set:

config.allow_concurrency = false
config.eager_load = true

Here in this Rails issue you can find a very confusing back and forth about whether `config.allow_concurrency = false` is really necessary for Capybara-style JS feature tests, or if maybe only the allow_concurrency setting is necessary and you don’t really need to change `eager_load` at all, or if the reason you need to set one or another is actually a bug in Rails, which was fixed in a Rails patch release, so what you need to do may depend on what version you are using… at the end of it I still wasn’t sure what the Rails experts were recommending or what was going on. I just set them both. Slower tests are better than terribly terribly unreliable tests, and I’m positive this is the safest configuration.

All this stuff has been seriously refactored in Rails 5.0.  In the best case, it will make it all just work, they’re doing some very clever stuff in Rails 5 to try and allow class-reloading even in the presence of concurrency. In the worst case, it’ll just be a new set of weirdness, bugs, and mis-documentation for us to figure out. I haven’t looked at it seriously yet. (As I write this, 5.0.0.beta.2 is just released).

Why not make sure Warden test helpers are set up right

It's quite unlikely to be related to this sort of problem, but if you're using the Warden test helpers for devise, as recommended on the devise wiki for use with Capybara, you may not have noticed the part about cleaning up with `Warden.test_reset!` in an `after` hook.

This app had the Warden test helpers in it, but wasn't doing the clean-up properly. When scouring the web for anything related to Capybara, I found this, and fixed it up to do as recommended. It's really probably not related to the failures you're having, but you might as well set things up as documented while you're at it.
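For reference, the setup as I understand the devise wiki to recommend it (a sketch; double-check the wiki for current advice):

# spec/rails_helper.rb (or spec_helper.rb)
RSpec.configure do |config|
  config.include Warden::Test::Helpers, type: :feature
  Warden.test_mode!

  config.after(:each, type: :feature) do
    Warden.test_reset!
  end
end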

I wouldn’t bother with custom Capybara cleanup

While trying to get things working, I tried various custom `after` hooks doing Capybara cleanup: `Capybara.reset_session!`, `driver.reset!`, and others. I went down a rabbit hole trying to figure out exactly what these methods do (which varies from driver to driver), what they should do, and whether there was a bug in a driver's implementation.

None of it helped ultimately. Capybara does its own cleanup for itself, and it's probably good enough (especially if `DatabaseCleaner.clean` is properly set up with `append_after` to run after Capybara's cleanup, as it should be). Spending a bunch of hours trying to debug or customize this didn't get me much enlightenment or test reliability improvements.

The Nuclear Option: Rack Request Blocker

Joel Turkel noticed the “unfinished business race condition” problem (his blog post helped me realize I was on the right track), and came up with some fairly tricky rack middleware attempting to deal with it by preventing the Rails app from accepting more requests if an outstanding thing is still going on from a feature test that didn’t wait on it.

Dan Dorman turned it into a gem. 

I experimented with this, and it seemed to make my tests much slower (not unexpected) and also not cure my problem; I was still getting race condition failures for some reason. So I abandoned it.

But you could try it, I include it for completeness — it is theoretically the only path to actually guaranteeing against feature test “unfinished business”.

At first I thought it was really doing nothing different than what `config.allow_concurrency = false`, already built into Rails, was doing (allow_concurrency false puts in place the Rack::Lock middleware already included with Rails).

But it actually is a bit more powerful — it will allow a unit test (or any test, including those not using a Capybara JS driver) to wait, at the beginning of the example, for the absolute completion of any unfinished business left by a feature test. Theoretically. I'm not sure why it didn't work for me; it's something you could try.

Sadly, maybe disable RSpec config.order = “random”

I did all of these things.  Things did get better. (I think? The trick with non-reproducible failures is you never know if you are just having a run of luck, but I’m pretty sure I improved it).  But they weren’t fixed. I still had unreliable tests.

Somewhere towards the end of this, after many hours, I realized my problem was really about the feature tests not waiting on 'unfinished business' (I didn't discover these things in the same order this post is written in!), and it would obviously be best to fix that. But I had some pretty complex semi-'legacy' front-end JS using a combination of Angular and React (neither of which I had experience with); fixing it all just wasn't feasible, and I just wanted it to be over.

You know what did it?

Commenting out `config.order = “random”` from rspec configuration.

At first I had no idea why this would matter — sure, some random orders might be more likely to trigger race conditions than others, but it's not just a magic seed, it's turning off random test ordering altogether.

Aha. Because when a JS feature test follows another JS feature test, `config.allow_concurrency = false` is decent (although far from perfect) at holding up the second feature test until the ‘unfinished business’ is complete — it won’t eliminate overlap, but it’ll reduce it.

But when one (or several, or a dozen) ordinary tests follow the JS feature test with 'unfinished business', they don't have `allow_concurrency = false` to protect them, since they aren't using the full Rails stack with the middleware affected by this setting.

If you turn off random test ordering, all your feature tests end up running in sequence together, and all your other tests end up running in sequence together, without intermingling.

That was the magic that got me to, if not 100% reliable without race conditions, pretty darn reliable: enough that I only occasionally see a race condition failure now.

I don’t feel great about turning off test order randomization, but I also remember when we all wrote tests before rspec even invented the feature, and we did fine. There’s probably also a way to get Rspec to randomize order _within_ types/directories, but still run all feature tests in a block, which should be just as good.
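In RSpec 3 you could probably do that with a custom global ordering; something like this hypothetical sketch (untested, and RSpec 2, which this app was on, doesn't have `register_ordering`):

RSpec.configure do |config|
  config.register_ordering(:global) do |groups|
    # Randomize within each bucket, but keep all the feature specs together
    # at the end, so ordinary specs never run in between JS feature specs.
    features, rest = groups.partition { |g| g.metadata[:type] == :feature }
    rest.shuffle + features.shuffle
  end
end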

Postscript: Aggressively Minimize JS Feature Tests

I have come to the conclusion that it is extremely challenging and time-consuming to get Capybara JS feature tests to work reliably, and that this is a necessary consequence of the architecture involved. As a result, your best bet is to avoid or ruthlessly minimize the number of feature tests you write.

The problem is that what is necessary to avoid feature test “unfinished business” is counter to the very reason I write tests.

I want and need my tests to test the interface (in the case of a feature test this really is the user interface; in other cases, an API), independent of implementation. If I refactor or rewrite the internals of an implementation, but intend for the interface to remain the same — I need to be able to count on my tests passing if and only if the interface indeed remains the same. That's one of the main reasons I have tests. That's the assumption behind the "red-green-refactor" cycle of TDD (not that I really do strict TDD myself, but I think that workflow does capture the point of tests).

@twalpole, the current maintainer of Capybara, is aware of the “unfinished business” problem, and says that you basically just need to write your tests to make sure they wait:

So you either need to have an expectation in your flaky tests that checks for a change that occurs when all ajax requests are completed or with enough knowledge about your app it may be possible to write code to make sure there are no ajax requests ongoing (if ALL ajax requests are made via jquery then you could have code that keeps checking until the request count is 0 for instance) and run that in an after hook that you need to define so it runs before the capybara added after hook that resets sessions….

….you still need to understand exactly what your app is doing on a given page you’re testing.

The problem with this advice is it means the way a test is written is tightly coupled to the implementation, and may need to be changed every time the implementation (especially JS code) is changed. Which kind of ruins the purpose of tests for me.

It's also very challenging to do if you have a complex JS front-end (Angular, React, Ember, etc.), which often intentionally abstracts away exactly when AJAX requests are occurring. You've got to go spelunking through abstraction layers to do it — to write the test right in the first place, and again every time there's any implementation change which might affect things.

Maybe even worse, fancy new JS front-end techniques often result in AJAX requests which result in no visible UI change (to transparently sync state on the back end, maybe only producing UI change in error cases, “optimistic update” style), which means to write a test that properly waits for “unfinished business” you’d need to violate another piece of Capybara advice, as original Capybara author @jnicklas wrote, “I am firmly convinced that asserting on the state of the interface is in every way superior to asserting on the state of your model objects in a full-stack test” — Capybara is written to best support use cases where you only test the UI, not the back-end.

It’s unfortunate, because there are lots of things that make UI-level integration/feature tests attractive:

  • You’re testing what ultimately matters, what the user actually experiences. Lower-level tests can pass with the actual user-facing app still broken if based on wrong assumptions, but feature tests can’t.
  • You haven’t figured out a great way to test your JS front-end in pure JS, and integrate into your CI, but you already know how to write ruby/rails feature tests.
  • You are confronting an under-tested “legacy” app, whose internals you don’t fully understand, and you need better testing to be confident in your refactoring — it makes a lot of sense to start with UI feature tests, and is sometimes even recommended for approaching an under-tested legacy codebase.

There are two big reasons to try and avoid feature tests with a JS-heavy front-end though: 1) They’re slow (inconvenient), and 2) They are nearly infeasible to make work reliably (damning, especially on a legacy codebase).

Until/unless there's a robust, well-maintained (ideally by Capybara itself, to avoid yet another has-to-coordinate component) lower-level solution along the lines of rack_request_blocker, I think all we can do is avoid Capybara JS feature tests as much as possible — stick to the bare minimum of 'happy path' scenarios you can get away with (also common feature-test advice); it's less painful than the alternative.


 

If you’re looking for consulting or product development with Rails or iOS, I work at Friends of the Web, a small company that does that.


A tiny gem: #dig backfill for older rubies

Excited about #dig in MRI 2.3.0?  Want to use it in your gem code, but don’t want your gem to require MRI 2.3.0 yet?

I got you covered:

https://github.com/jrochkind/dig_rb

It'll add in a pure-ruby #dig implementation if Hash/Array/Struct don't have #dig yet. If they already have #dig defined, it'll do nothing to them, so you can use dig_rb in gem code meant to run on any ruby. When run on MRI 2.3.0 you'll be using the native implementation; on other rubies, dig_rb's pure-ruby implementation.

Note: JRuby 9k doesn’t support #dig yet either, and dig_rb will work fine there too.
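Usage is just what you'd expect from the MRI 2.3 #dig; a quick illustration (assuming the gem's require name is dig_rb):

require 'dig_rb'

h = { "person" => { "name" => "Jonathan", "langs" => ["ruby", "js"] } }
h.dig("person", "langs", 1)        # => "js"
h.dig("person", "address", "city") # => nil (no exception for missing keys)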


American Libraries adds Gale quotes in without authors' knowledge

From a blog post by Patricia Hswe and Stewart Varner.

TL;DR: Patricia Hswe and I wrote an article for American Libraries and the editors added some quotes from a vendor talking about their products without telling us. We asked them to fix it and they said no.

I guess Gale Cengage paid the ALA for placement or something? I can’t think of any other reason the ALA would commission an article which has the hard requirement of including quotes from Gale PR staff in it?

Sounds to me like one can't trust the ALA to be an objective representative of our profession, if they're accepting payment in return for quoting vendor PR staff in ostensibly editorial articles in their publication.

What they'll do differently next time is make sure the authors are on the same page, or just use their own in-house authors instead of librarians. They'd rather use librarians because it makes the article look better; hey, that's what Gale is (presumably) paying them for. Heck, they can probably get some librarians to go along with it too, alas.

I actually started to Google wondering “Hey, is American Libraries actually published by Gale Cengage? Because that would explain things…” before remembering “Wait a second, American Libraries is published by the ALA… aren’t they supposed to represent their members, not vendors?”


Career change

Today is my last day here at Johns Hopkins University Libraries.

After Thanksgiving, I’ll be working, still here in Baltimore, at Friends of the Web, a small software design, development, and consulting company.

I'm excited to be working collaboratively with a small group of other accomplished designers and developers, with a focus on quality. I'm excited by Friends of the Web's collaborative and egalitarian values, which show in how they do work and treat each other, how decisions are made, and even in the compensation structure.

Friends of the Web has technical expertise in Rails, Ember (and other MVC Javascript frameworks), and iOS development, as well as significant in-house professional design expertise.

Their clientele is intentionally diverse; a lot of e-commerce, but also educational and cultural institutions, among others.

They haven’t done work for libraries before, but are always interested in approaching new business and technological domains, and are open to accepting work from libraries. I’m hoping that it will work out to keep a hand in the library domain at my new position, although any individual project may or may not work out for contracting with us, depending on if it’s a good fit for everyone’s needs. But if you’re interested in contracting an experienced team of designers and developers (including an engineer with an MLIS and 9 years of experience in the library industry: me!) to work on your library web (or iOS) development needs, please feel free to get in touch to talk about it. You could hypothetically hire just me to work on a project, or have access to a wider team of diverse experience, including design expertise.

Libraries, I love you, but I had to leave you, maybe, at least for now

I actually really love libraries, and have enjoyed working in the industry.

It may or may not be surprising to you that I really love books — the kind printed on dead trees. I haven’t gotten into ebooks, and it’s a bit embarrassing how many boxes of books I moved when I moved houses last month.

I love giant rooms full of books. I feel good being in them.

Even if libraries are moving away from being giant rooms full of books, they've still got a lot to like. In a society in which information technology and data are increasingly central, public and academic libraries are "civil society" organizations which can serve users' information needs and advocate for users, with libraries' interests aligned with their users', because libraries are not (mainly) trying to make money off their patrons or their data. This is pretty neat, and important.

In 2004, already a computer programmer, I enrolled in an MLIS program because I wanted to be a “librarian”, not thinking I would still be a software engineer. But I realized that with software so central to libraries, if I were working in a non-IT role I could be working with software I knew could be better but couldn’t do much about — or I could be working making that software better for patrons and staff and the mission of the library.

And I’ve found the problems I work on as a software engineer in an academic library rewarding. Information organization and information retrieval are very interesting areas to be working on. In an academic library specifically, I’ve found the mission of creating services that help our patrons with their research, teaching, and learning to be personally rewarding as well.  And I’ve enjoyed being able to do this work in the open, with most of my software open source, working and collaborating with a community of other library technologists across institutions.  I like working as a part of a community with shared goals, not just at my desk crunching out code.

So why am I leaving?

I guess I could say that at my previous position I no longer saw a path to make the kind of contributions to developing and improving libraries' technological infrastructure and capacity that I wanted to make. We could leave it at that. Or you could say I was burned out. I wasn't blogging as much. I wasn't collaborating as much or producing as much code. I had stopped religiously going to Code4Lib conferences. I dropped out of the Code4Lib Journal without a proper resignation or goodbye (sorry, editors, and you're doing a great job).

9 years ago when, with a fresh MLIS, I entered the library industry, it seemed like a really exciting time in libraries, full of potential.  I quickly found the Code4Lib community, which gave me a cohort of peers and an orientation to the problems we faced. We knew that libraries were behind in catching up to the internet age, we knew (or thought we knew) that we had limited time to do something about it before it was “too late”, and we (the code4libbers in this case) thought we could do something about it, making critical interventions from below. I’m not sure how well we (the library industry in general or we upstart code4libbers) have fared in the past decade, or how far we’ve gotten. Many of the Code4Lib cohort I started up with have dropped out of the community too one way or another, the IRC channel seems a dispiriting place to me lately (but maybe that’s just me).  Libraries aren’t necessarily focusing on the areas I think most productive, and now I knew how hard it was to have an impact on that. (But no, I’m not leaving because of linked data, but you can take that essay as my parting gift, or parting shot). I know I’ve made some mistakes in personal interactions, and hadn’t succeeded at building collaboration instead of conflict in some projects I had been involved in, with lasting consequences. I wasn’t engaging in the kinds of discussions and collaborations I wanted to be at my present job, and had run out of ideas of how to change that.

So I needed a change of perspective and circumstance. And wanted to stay in Baltimore (where I just bought a house!). And now here I am at Friends of the Web!  I’m excited to be taking a fresh start in a different sort of organization working with a great collaborative team.

I am also excited by the potential to keep working in the library industry from a completely different frame of reference, as a consultant/contractor. Maybe that'll end up happening, maybe it won't, but if you have library web development or consulting work you'd like to discuss, please do ring me up.

What will become of Umlaut?

There is no cause for alarm! Kevin Reiss and his team at Princeton have been working on an Umlaut rollout there (I’m not sure if they are yet in production).  They plan to move forward with their implementation, and Kevin has agreed to be a (co-?)maintainer/owner of the Umlaut project.

Also, Umlaut has been pretty stable code lately, it hasn’t gotten a whole lot of commits but just keeps on trucking and working well. While there were a variety of architectural improvements I would have liked to make, I fully expect Umlaut to remain solid software for a while with or without major changes.

This actually reminds me of how I came to be the Umlaut lead developer in the first place. Umlaut was originally developed by Ross Singer who was working at Georgia Tech at the time. Seeing a priority for improving our “link resolver” experience, and the already existing and supported Umlaut software, after talking to Ross about it, I decided to work on adopting Umlaut here. But before we actually went live in production — Ross had left Georgia Tech, they had decided to stop using Umlaut, and I found myself lead developer! (The more things change… but as far as I know, Hopkins plans to continue using Umlaut).  It threw me for a bit of a loop to suddenly be deploying open source software as a community of one institution, but I haven’t regretted it, I think Umlaut has been very successful for our ability to serve patrons with what they need here, and at other libraries.

I am quite proud of Umlaut, and feel kind of parental towards it. I think intervening in the “last mile” of access, delivery, and other specific-item services is exactly the right place to be, to have the biggest impact on our users. For both long-term strategic concerns — we don’t know where our users will be doing ‘discovery’, but there’s a greater chance we’ll still be in the “last mile” business no matter what. And for immediate patron benefits — our user interviews consistently show that our “Find It” link resolver service is both one of the most used services by our patrons, and one of the services with the highest satisfaction.  And Umlaut’s design as “just in time” aggregator of foreign services is just right for addressing needs as they come up — the architecture worked very well for integrating BorrowDirect consortial disintermediated borrowing into our link resolver and discovery, despite the very slow response times of the remote API.

I think this intervention in “last mile” delivery and access, with a welcome mat to any discovery wherever it happens, is exactly where we need to be to maximize our value to our patrons and “save the time of the reader”/patron, in the context of the affordances we have in our actually existing infrastructures — and I think it has been quite successful.

So why hasn't Umlaut seen more adoption? I have been gratified by and grateful for the adoption it has gotten at a handful of other libraries (including NYU, Princeton, and the Royal Library of Denmark), but I think its potential goes further. Is it a failure of marketing? Is it different priorities: are academic libraries simply not interested in intervening to improve research and learning for our patrons, preferring to invest in less concrete directions? Are the in-house technological capacity requirements simply too intimidating? (I've never tried to sugar-coat or under-estimate the need for some local IT capacity to run Umlaut, although I've tried to make the TCO as low as I can, I think fairly successfully.) Is Umlaut simply too technically challenging for the capacity of actual libraries, even if they think the investment is worth it?

I don’t know, but if it’s from the latter points, I wonder if any access to contractor/vendor support would help, and if any libraries would be interested in paying a vendor/contractor for Umlaut implementation, maintenance, or even cloud hosting as a service. Well, as you know, I’m available now. I would be delighted to keep working on Umlaut for interested libraries. The business details would have to be worked out, but I could see contracting to set up Umlaut for a library, or providing a fully managed cloud service offering of Umlaut. Both are hypothetically things I could do at my new position, if the business details can be worked out satisfactorily for all involved. If you’re interested, definitely get in touch.

Other open source contributions?

I have a few other library-focused open source projects I've authored that I'm quite proud of. I will probably not be spending much time on them in the near future. This includes traject, bento_search, and borrow_direct.

I wrote Traject with Bill Dueber, and it will remain in his very capable hands.

The others I’m pretty much sole developer on. But I’m still around on the internet to answer questions, provide advice, or most importantly, accept pull requests for changes needed.  bento_search and borrow_direct are both, in my not so humble opinion, really well-architected and well-written code, which I think should have legs, and which others should find fairly easy to pick up. If you are using one of these projects, send a good pull request or two, and are interested, odds are I’d give you commit/release rights.

What will happen to this blog?

I’m not sure! The focus of this blog has been library technology and technology as implemented in libraries. I haven’t been blogging as much as I used to lately anyway. But I don’t anticipate spending as much (any?) time on libraries in the immediate future, although I suspect I’ll keep following what’s going on for at least a bit.

Will I have much to say on libraries and technology anyway? Will the focus change? We will see!

So long and thanks for all the… fiche?

Hopefully not actually a “so long”, I hope to still be around one way or another. I am thinking of going to the Code4Lib conference in (conveniently for me) Philadelphia in the spring.

Much respect to everyone who’s still in the trenches, often in difficult organizational/political environments, trying to make libraries the best they can be.

Posted in General | 10 Comments

Linked Data Caution

I have been seeing an enormous amount of momentum in the library industry toward “linked data”, often in the form of a fairly ambitious collective project to rebuild much of our infrastructure around data formats built on linked data.

I think linked data technology is interesting and can be useful. But I have some concerns about how it appears to me it’s being approached. I worry that “linked data” is being approached as a goal in and of itself, and that what it is meant to accomplish (and how it will or could accomplish those things) is being approached somewhat vaguely.  I worry that this linked data campaign is being approached in a risky way from a “project management” point of view, where there’s no way to know if it’s “working” to accomplish its goals until the end of a long resource-intensive process.  I worry that there’s an “opportunity cost” to focusing on linked data in itself as a goal, instead of focusing on understanding our patrons’ needs, and how we can add maximal value for our patrons.

I am particularly wary of approaches to linked data that seem to assume from the start that we need to rebuild much or all of our local and collective infrastructure to be “based” on linked data, as an end in itself.  And I’m wary of “does it support linked data” as the main question you ask when evaluating software to purchase or adopt.  “Does it support linked data” or “is it based on linked data” can be too vague to even be useful as questions.

I also think some of those advocating for linked data in libraries are promoting an inflated sense of how widespread or successful linked data has been in the wider IT world.  And that this is playing into the existing tendency for “magic bullet” thinking when it comes to information technology decision-making in libraries.

This long essay is an attempt to explain my concerns, based on my own experiences developing software and using metadata in the library industry. As is my nature, it turned into a far too long thought dump, hopefully not too grumpy.  Feel free to skip around, I hope at least some parts end up valuable.

What is linked data?

The term “linked data” as used in these discussions basically refers to what I’ll call an “abstract data model” for data — a model of how you model data.

The model says that all metadata will be listed as a “triple” of  (1) “subject”,  (2) “predicate” (or relationship), and (3) “object”.

1. Object A [subject] 
2. Is a [predicate] 
3. book [object]

1. Object A [subject] 
2. Has the ISBN [predicate] 
3. "0853453535" [object]

1. Object A 
2. has the title 
3. "Revolution and evolution in the twentieth century"

1. Object A 
2. has the author 
3. Author N

1. Author N 
2. has the first name 
3. James

1. Author N 
2. has the last name 
3. Boggs

Our data is encoded as triples, statements of three parts: subject, predicate, object.

Linked data prefers to use identifiers for as many of these data elements as possible, and in particular identifiers in the form of URI’s.

“Object A” in my example above is basically an identifier, but similar to the “x” or “y” in an algebra problem, it has meaning only in the context of my example; someone else’s “Object A” or “x” or “y” in another example might mean something different, and trying to throw them all together you’re going to get conflicts.  URI’s are nice as identifiers in that, being based on domain names, they have a built-in way of “namespacing” and avoiding conflicts; they are global identifiers.

# The identifiers I'm using are made up by me, and I use 
# example.org to get across I'm not using standard/conventional
# identifiers used by others. 
1. http://example.org/book/oclcnum/828033 [subject]
2. http://example.org/relationship/is_member_of_class [predicate]
3. http://example.org/type/Book [object]

# We can see sometimes we still need string literals, not URIs
1. http://example.org/book/oclcnum/828033 
2. http://example.org/relationship/has_title 
3. "Revolution and evolution in the twentieth century"

1. http://example.org/book/oclcnum/828033 
2. http://example.org/relationship/has_author 
3. http://example.org/lccn/79128112

1. http://example.org/lccn/79128112 
2. http://example.org/relationship/is_member_of_class 
3. http://example.org/type/Person

1. http://example.org/lccn/79128112 
2. http://example.org/relationship/has_name 
3. "Boggs, James"

I call the linked data model an “abstract data model“, because it is a model for how you model data: As triples.

You still, as with any kind of data modeling, need what I’ll call a “domain model” — a formal listing of the entities you care about (books, people), and what attributes, properties, and relationships with each other those entities have.

In the library world, we’ve always created these formal domain models, even before there were computers. We’ve called it “vocabulary control” and “authority control”.  In linked data, that domain model takes the form of standard shared URI identifiers for entities, properties, and relationships.  Establishing standard shared URI’s with certain meanings for properties or relationships (eg `http://example.org/relationship/has_title` will be used to refer to the title, possibly with special technical specification of what we mean exactly by ‘title’) is basically “vocabulary control”, while establishing standard shared URI’s for entities (eg `http://example.org/lccn/79128112`) is basically “authority control”.

You still need common vocabularies for your linked data to be inter-operable; there’s no magic in linked data otherwise. Linked data just says the data will be encoded in the form of triples, with the vocabularies being encoded in the form of URIs.  (Or, you need what we’ve historically called a “cross-walk” to make data from different vocabularies inter-operable; linked data has certain standard ways to encode cross-walks so software can use them, but no special magic ways to automatically create them.)

For an example of vocabulary (or “schema”) built on linked data technology, see schema.org.
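
For instance, here is a minimal sketch (in Python with the rdflib library; the schema.org terms are real, while the book and author identifiers are the same made-up example.org ones from above) of describing my example book with the schema.org vocabulary instead of my invented one:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("https://schema.org/")

book = URIRef("http://example.org/book/oclcnum/828033")
author = URIRef("http://example.org/lccn/79128112")

g = Graph()
g.add((book, RDF.type, SCHEMA.Book))
g.add((book, SCHEMA.name, Literal("Revolution and evolution in the twentieth century")))
g.add((book, SCHEMA.author, author))
g.add((author, RDF.type, SCHEMA.Person))
g.add((author, SCHEMA.name, Literal("Boggs, James")))

# Serializing to Turtle shows the same triples, now using a vocabulary
# other software might actually recognize.
print(g.serialize(format="turtle"))

The point of the sketch is not the mechanics but the agreement: the triples are only useful to the extent other systems also use schema.org terms for books and authors.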

You can see that through aggregating and combining multiple simple “triple” statements, we can build up a complex knowledge graph.  Through basically one simple rule of “all data statements are triples”, we can build up remarkably complex data, and model just about any domain model we’d want.  The library world is full of analytical and theoretically minded people who will find this theoretical elegance very satisfying, the ability to model any data at all as a bunch of triples.  I think it’s kind of neat myself.

You really can model just about any data — any domain model — as linked data triples. We could take AACR2-MARC21 as a domain model, and express it as linked data by establishing a URI to be used as a predicate for every tag-subtag. There would be some tricky parts and edge cases, but once those were figured out, translation would be a purely mechanical task — and our data would contain no more information or utility output as linked data than it did originally, nor be any more inter-operable than it was originally, as is true of the output of any automated transformation process.

You can model anything as linked data, but some things are more convenient and some things less convenient. The nature of linked data, building complex information graphs out of simple triples, can actually make the data more difficult to deal with in practice, as you can see by looking at our made-up examples above and trying to understand what they mean. By being so abstract and formally simple, it can get confusing.

Some things that might surprise you are kind of inconvenient to model as linked data. It can take some contortions to model an ordered sequence using linked data triples, or to figure out how to model alternate language representations (say of a title) in triples. There are potentially multiple ways to meet these goals, with certain patterns established as standards for inter-operability, but they can be somewhat confusing to work with.  Domain modeling is difficult already — having to fit your domain model into the linked data abstract model can be a fun intellectual exercise, but the need to undertake that exercise can make the task more difficult.
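
To make the “contortions” concrete, here is a minimal sketch (Python with rdflib; the example.org identifiers are made up, and the French title is my own invented rendering purely for illustration) of the standard RDF patterns for an ordered list and for language-tagged alternate titles:

from rdflib import Graph

# Turtle's ( ... ) syntax hides the rdf:first / rdf:rest / rdf:nil
# linked-list structure RDF uses for ordered sequences.
ttl = """
@prefix ex: <http://example.org/> .

ex:someBook ex:authorList ( ex:authorA ex:authorB ex:authorC ) .

# Alternate-language titles are just literals with language tags;
# nothing in the data model marks one of them as "the" title.
ex:someBook ex:title "Revolution and evolution in the twentieth century"@en ,
                     "Revolution et evolution au vingtieme siecle"@fr .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# The tidy-looking list above expands into a chain of blank-node triples,
# which is what your software actually has to traverse.
for subject, predicate, obj in g:
    print(subject, predicate, obj)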

Other things are more convenient with linked data. You might have been wondering when the “linked” would come in.

Modeling all our data as individual “triples” makes it easier to merge data from multiple sources: you just throw all the triples together (you are still going to need to deal with any conflicts or inconsistencies that come about). Using URI’s as vocabulary identifiers means that you can throw all this data together from multiple sources without collisions: you won’t find one source using MARC tag 100 to mean “main entry” and another source using the 100 tag to mean all sorts of other things (see UNIMARC!).

Linked data vocabularies are always “open for extension”. Let’s say we established that there’s a sort of thing as a `http://example.org/type/Book` and it has a number of properties and relationships including `http://example.org/relationship/has_title`.  But someone realizes, gee, we really want to record the color of the book too. No problem, they just start using `http://mydomain.tld/relationship/color`, or whatever they want. It won’t conflict with any existing data (no need to find an unused MARC tag!), but of course it won’t be useful outside the originator’s own system unless other people adopt this convention, and software is written to recognize and do something with it (open for extension, but we still need to adopt common vocabularies).
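
Here is a minimal sketch of both points, again in Python with rdflib and my made-up identifiers: two sources describe the same book, one of them with its own extension predicate (the color value is invented for illustration), and merging is nothing more than taking the union of the triples:

from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")
book = URIRef("http://example.org/book/oclcnum/828033")

# One source knows the title...
source_one = Graph()
source_one.add((book, EX["relationship/has_title"],
                Literal("Revolution and evolution in the twentieth century")))

# ...another knows the author, and also uses its own made-up extension
# predicate for color, which conflicts with nothing.
source_two = Graph()
source_two.add((book, EX["relationship/has_author"],
                URIRef("http://example.org/lccn/79128112")))
source_two.add((book, URIRef("http://mydomain.tld/relationship/color"),
                Literal("red")))

# "Merging" is just the set union of the triples.
merged = Graph()
for triple in source_one:
    merged.add(triple)
for triple in source_two:
    merged.add(triple)

for triple in merged:
    print(triple)

# Nothing here guarantees any other software will understand the color
# predicate, or that the two sources meant the same thing by "has_author".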

And using URI’s is meant to make it more straightforward to combine data from multiple sources in another way: an http URI actually points to a network location, which could be used to deliver more information about something, say `http://example.org/book/oclcnum/828033`, in the form of more triples. Mechanics to make it easier to assemble (meta)data from multiple sources together.

There are mechanics meant to support aggregating, combining, and sharing data built into the linked data design — but the fundamental problems of vocabulary and authority control, of using the same or overlapping vocabularies (or creating cross-walks), of creating software that recognizes and does something useful with vocabulary elements actually in use, etc,  —  all still exist. So do business model challenges with entities that don’t want to share their data, or human labor power challenges with getting data recorded. I think it’s worth asking if the mechanical difficulties with, say, merging MARC records from different sources, are actually the major barriers to more information sharing/coordination in the present environment, vs these other factors.

“Semantic web” vs “linked data”? vs “RDF”?

The “semantic web” is an older term than “linked data”, but you can consider it to refer to basically the same thing.  Some people cynically suggest “linked data” was meant to rebrand the “semantic web” technology after it failed to get much adoption or live up to its hype.  The relationship between the two terms according to Tim Berners-Lee (who invented the web, and is either the inventor or at least a strong proponent of semantic web/linked data) seems to be that “linked data” is the specific technology or implementations of individual buckets of data, while the “semantic web” is the ecosystem that results from lots of people using it.

RDF stands for “Resource Description Framework”, and is actually the official name of the abstract data model of “triples”. “Linked data”, then, could be understood as data using RDF and URI’s, and the “semantic web” as the ecosystem that results from plenty of people doing it; in practice “RDF” gets used as a rough synonym for both.

Technicalities aside, “semantic web”, “linked data”, and “RDF” can generally be understood as rough synonyms when you see people discussing them — whatever term they use, they are talking about (meta)data modeled as “triples”, and the systems that are created by lots of such data integrated together over the web.

So. What do you actually want to do? Where are the users?

At a recent NISO forum on The Future of Library Resource Discovery, there was a session where representatives from 4(?) major library software vendors took Q&A from a moderator and the audience.  There was a question about the vendor’s commitment to linked data. The first respondent (who I think was from EBSCO?) said something like

[paraphrased] Linked data is a tool. First you need to decide what you want to do, then linked data may or may not be useful to doing that.

I think that’s exactly right.

Some of the other respondents, perhaps prompted by the first answer, gave similar answers. While others (especially OCLC) remarked on their commitment to linked data and the various places they are using it.  Of these, though, I’m not sure any have actually resulted in currently useful outcomes attributable to linked data usage.

Four or five years ago, talk of “user-centered design” was big in libraries — and in the software development world in general.  For libraries (and other service organizations), user-centered design isn’t just about software — but software plays a key role in almost any service a contemporary library offers, quite often mediating the service through software, such that user-centered design in libraries probably always involves software.

For academic libraries, with a mission to help our patrons in research, teaching, and learning — user-centered design begins with understanding our patrons’ research and learning processes.  And figuring out the most significant interventions we can make to improve things for our patrons. What are their biggest pain points? Where can we make the biggest difference? To maximize our effectiveness when there’s an unlimited number of approaches we could take, we want to start with areas where we can make a big improvement for the least resource investment.

Even if your institution lacks the resources to do much local research into user behavior, over the past few years a lot of interesting and useful multi-institutional research has been done by various national and international library organizations, such as reports from OCLC [a] [b], JISC [a], and Ithaka [a], [b], as well as various studies done by practitioners and published in journals.

To what extent is the linked data campaign informed by, motivated by, or based on what we know about our users’ behavior and needs?  To what extent are the goals of the linked data campaign explicit and specific, and are those goals connected back to what our users need from us?  Do we even know what we’re supposed to get out of it at all, beyond “data that’s linked better”, or “data that works well with the systems of entities outside the library industry”? (And for the latter, do we actually understand in what ways we want it to “work well”, for what reasons, and what it takes to accomplish that?)  Are we asking for specific success stories from the pilot projects that have already been done? And connecting them to what we need to provide our users?

To be clear, I do think goals to increase our own internal staff efficiency, or to improve the quality of our metadata that powers most of our services are legitimate as well. But they still need to be tied back to user needs (for instance, to know the metadata you are improving is actually the metadata you need and the improvements really will help us serve our users better), and be made explicit (so you can evaluate how well efforts at improvement are working).

I think the motivations for the linked data campaign can be somewhat unclear and implicit; when they are made explicit, they are sometimes very ambitious goals which require a lot of pieces falling into place (including third-party cooperation and investment that is hardly assured) for realization only in the long term — and with unclear or not-made-explicit benefits for our patrons even if realized.  For a major multi-institution, multi-year, resource-intensive campaign — this seems to me not sufficiently grounded in our users’ needs.

Is everyone else really doing it? Maybe not.

At another linked data presentation I attended recently, a linked data promoter said something along the lines of:

[paraphrased] Don’t do linked data because I say so, or because LC says so. Do it because it’s what’s necessary to keep us relevant in the larger information world, because it’s what everyone else is doing. Linked data is what lets Google give you good search results so quickly. Linked data is used by all the major e-commerce sites; this is how they can accomplish what they can.

The thing is, from my observation and understanding of the industry and environment, I just don’t think it’s true that “everyone is doing it”.

Google does use data formats based on the linked data model for its “rich snippets” (link to a 2010 paper).  This feature, which gives you a list of links next to a search result, is basically peripheral to the actual Google search.

Google also uses linked data to a somewhat more central extent in its Knowledge Graph feature, which provides “facts” in sidebars on search results. But most of the sources Google harvests from for its Knowledge Graph aren’t actually linked data; rather, Google harvests them and turns them into linked data internally — and then doesn’t actually expose the linked-data-ified data to the wider world.  In fact, Google has several times announced initiatives to expose the collected and triple-ified data to the wider world, but they have not actually turned into supported products.  This doesn’t necessarily say what advocates might want about the purported central role of linked data to Google, or what it means for linked data’s wider adoption.  As far as I know or can find out, linked data does not play a role in the actual primary Google search results, just in the Knowledge Graph “fact boxes”, and the “rich snippets” associated with results.

In a 2013 blog post, Andreas Blumaeur, arguing for the increased use of linked data, still acknowledges: “Internet companies like Google and Facebook make use of linked data quite hesitantly.”

My sense is that the general industry understanding is that linked data has not caught on like people thought it would in the 2007-2012 heyday, and adoption has in fact slowed and reversed. (Google trend of linked data/semantic web)

An October 2014 post on Hacker News asks: ” A few years ago, it seemed as if everyone was talking about the semantic web as the next big thing. What happened? Are there still startups working in that space? Are people still interested?”

In the ensuing discussion on that thread (which I encourage you to read), you can find many opinions, including:

  • “The way I see it that technology has been on the cusp of being successful for a long time” [but has stayed on the cusp]
  • “A bit of background, I’ve been working in environments next to, and sometimes with, large scale Semantic Graph projects for much of my career — I usually try to avoid working near a semantic graph program due to my long histories of poor outcomes with them.  I’ve seen uncountably large chunks of money put into KM projects that go absolutely nowhere and I’ve come to understand and appreciate many of the foundational problems the field continues to suffer from. Despite a long period of time, progress in solving these fundamental problems seem hopelessly delayed.”
  • “For what it’s worth, I spent last month trying to use RDF tooling (Python bindings, triple stores) for a project recently, and the experience has left me convinced that none of it is workable for an average-size, client-server web application. There may well be a number of good points to the model of graph data, but in practice, 16 years of development have not lead to production-ready tools; so my guess is that another year will not fix it.”
  • But also, to be fair: “There’s really no debate any more. We use the technology borne by the ‘Semantic Web’ every day.” [Personally I think this claim was short on specifics, and gets disputed a bit in the comments]

At the very least, the discussion reveals that linked data/semantic web is still controversial in the industry at large, it is not an accepted consensus that it is “the future”, it has not “taken over.” And linked data is probably less “trendy” now in the industry at large than it was 4-6 years ago.

Talis was a major UK vendor of ILS/LMS library software; the company’s history begins in 1969 as a library cooperative, similar to OCLC’s beginnings. In the mid-2000’s, they started shifting to a strategic focus on semantic web/linked data. In 2011, they actually sold off their library management division to focus primarily on semantic web technology. But quickly thereafter, in 2012, they announced “that investment in the semantic web and data marketplace areas would cease. All efforts are now concentrated on the education business.” They are now in the business of producing an “enterprise teaching and learning platform” (compare to Blackboard, if I understand correctly), and apparently fairly successful at it — but the semantic web focus didn’t pan out. (Wikipedia, Talis Group)

In 2009, The New York Times, to much excitement, announced a project to expose their internal subject vocabulary as linked data. While the data is still up, it looks to me like it was abandoned in 2010; there has been no further discussion or expansion of the service, and the data looks not to have been updated.  Subject terms have a “latest use” field which seems to be stuck in May or June 2010 for every term I looked at (see Obama, Barack for instance), and no terms seem to be available for subjects that have become newsworthy since 2010 (no Carson, Ben, for instance).

In the semantic web/linked data heyday, a couple of attempts to create large linked data databases were announced and generated a lot of interest. Freebase was started in 2007,  acquired by Google in 2010… and shut down in 2014. DBPedia began much earlier and still exists… but it doesn’t generate the excitement or buzz that it used to. The newer WikiData (2012) still exists, and is considered a successor to Freebase by some.  It is generally acknowledged that none of these projects have lived up to initial hopes with regard to resulting in actual useful user-facing products or services; they remain experiments. A 2013 article, “There’s No Money in Linked Data“, suggests:

….[W]e started exploring the use of notable LD datasets such as DBpedia, Freebase, Geonames and others for a commercial application. However, it turns out that using these datasets in realistic settings is not always easy. Surprisingly, in many cases the underlying issues are not technical but legal barriers erected by the LD data publishers.

In Jan 2014, Paul Houle in “The trouble with DBpedia” argues that the problems are actually about data quality in DBPedia — specifically about vocabulary control, and how automatic creation of terms from use in Wikipedia leads to inconsistent vocabularies. Houle thinks there are in fact technical solutions — but he, too, begins from the acknowledgement that DBPedia has not lived up to its expected promise.  In a very lengthy slide deck from February 2015, “DBpedia Ontology and Mapping Problems”, vladimiralexiev has a perhaps different diagnosis of the problem, about ontology and vocabulary design, and he thinks he has solutions. Note that he too is coming from an experience of finding DBPedia not working out for his uses.

There’s disagreement about why these experiments haven’t panned out to be more than experiments or what can be done or what promise they (and linked data in general) still have — but pretty widespread agreement in the industry at large that they have not lived up to their initial expected promise or hype, and have as of yet delivered few if any significant user-facing products based upon them.

It is interesting that many diagnoses of the problems there are about the challenges of vocabulary control and developing shared vocabularies, the challenges of producing/extracting sufficient data that is fit to these vocabularies, as well as business model issues — sorts of barriers we are well familiar with in the library industry. Linked data is not a magic bullet that solves these problems, they will remain for us as barriers and challenges to our metadata dreams.

Semantic web and linked data are still being talked about, and worked on in some commercial quarters, to be sure. I have no doubt that there are people and units at Google who are interested in linked data, who are doing research and experimentation in that area, who are hoping to find wider uses for linked data at Google, although I do not think it is true that linked data is currently fundamentally core to Google’s services or products or how they work. What linked data has not done is take over the web, or become a widely accepted fact in the industry.  It is simply not true that “every major ecommerce site” has an architecture built on linked data.  It is certainly true that some commercial sector actors continue to experiment with and explore uses of linked data.

But in fact, I would say that libraries and the allied cultural heritage sector, along with limited involvement from governmental agencies (especially in the UK, although not to the extent some would like, given the 2010 cancellation of a program) and scholarly publishing (mainly, I think, Nature Publishing), are primary drivers of linked data research and implementation currently. We are some of the leaders in linked data research; we are not following “where everyone else is going” in the private sector.

There’s nothing necessarily wrong with libraries being the drivers in researching and implementing interesting and useful technology in the “information retrieval” domain — our industry was a leader in information retrieval technology 40-80 years ago, it would be nice to be so again, sure!

But what we don’t have is “everyone else is doing it” as a motivation or justification for our campaign — not that it must be a good idea because the major players on the web are investing heavily in it (they aren’t), and not that we will be able to inter-operate with everyone else the way we want if we just transition all of our infrastructure to linked data because that’s where everyone else will be too (they won’t necessarily, and everyone using linked data isn’t alone sufficient for inter-operability anyway; there needs to be coordination on vocabularies as well, just to start).

My Experiences in Data and Service Interoperability Challenges

For the past 7+ years, my primary work has involved integrating services and data from disparate systems, vendors, and sources, in the library environment. I have run into many challenges and barriers to my aspired integrations. They often have to do with difficulties in data interoperability/integration; or in the utility of our data, difficulties in getting what I actually need out of data.  These are the sorts of issues linked data is meant to be at home in.

However, seldom in my experience do I run into a problem where simply transitioning infrastructure to linked data would provide a solution or fundamental advancement. The barriers often have at their roots business models (entities that have data you want to interoperate with, but don’t want their data to be shared because it keeping it close is of business value to them; or that simply have no business interest in investing in the technology needed to share data better);  or lack of common shared domain models (vocabulary control); or lack of person power to create/record the ‘facts’ needed in machine-readable format.

Linked data would be neither necessary nor sufficient to solving most of the actual barriers I run into.  Simply transitioning to a linked data-based infrastructure without dealing with the business or domain model issues would not help at all; and linked data is not needed to solve the business or domain model issues, and of unclear aid in addressing them: A major linked data campaign may not be the most efficient, cost effective, or quickest way to solve those problems.

Here are some examples.

What Serial Holdings Do We Have?

In our link resolver, powered by Umlaut, a request might come in for a particular journal article, say the made up article “Doing Things in Libraries”, by Melville Dewey, on page 22 of Volume 50 Issue 2 (1912) of the Journal of Doing Things.

I would really like my software to tell the user if we have this specific article in a bound print volume of the Journal of Doing Things, exactly which of our location(s) that bound volume is located at, and if it’s currently checked out (from the limited collections, such as off-site storage, we allow bound journal checkout).

My software can’t answer this question, because our records are insufficient. Why? Not all of our bound volumes are recorded at all, because when we transitioned to a new ILS over a decade ago, bound volume item records somehow didn’t make it. Even for bound volumes we do have — or for summary-of-holdings information on bib/copy records — the holdings information (what volumes/issues are contained) is entered as one big string by human catalogers. This results in output that is understandable to a human reading it (at least one who can figure out what “v.251(1984:Jan./June)-v.255:no.8(1986)” means). But while the information is theoretically input according to cataloging standards — changes in practice over the years, varying practice between libraries, human variation and error, lack of validation from the ILS to enforce the standards, and lack of clear guidance from standards in some areas mean that the information is not recorded in a way that software can clearly and unambiguously understand.
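
To make that concrete, here is a minimal sketch (in Python, with an invented regex of my own rather than anything standard) of what trying to machine-read that kind of holdings string looks like; it handles the one example above and quietly fails on the endless real-world variations:

import re

holdings = "v.251(1984:Jan./June)-v.255:no.8(1986)"

# A naive pattern for "v.NNN(YYYY...)-v.NNN:no.N(YYYY)" style statements.
# Real records vary wildly: different punctuation, "Bd." instead of "v.",
# open-ended ranges, supplements, gaps noted in free text, and so on.
pattern = re.compile(
    r"v\.(?P<start_vol>\d+)\((?P<start_year>\d{4})[^)]*\)"
    r"-v\.(?P<end_vol>\d+)(?::no\.(?P<end_issue>\d+))?\((?P<end_year>\d{4})\)"
)

match = pattern.search(holdings)
if match:
    print(match.groupdict())
    # {'start_vol': '251', 'start_year': '1984', 'end_vol': '255',
    #  'end_issue': '8', 'end_year': '1986'}
else:
    print("Could not interpret holdings statement:", holdings)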

This is a problem of varying degrees at other libraries too, including for digitized copies, for presumably similar reasons.  In addition to at my own library, I’d like my software to be able to figure out if, say, HathiTrust has a digitized copy of this exact article (digitized copy of that volume and issue of that journal).  Or if nearby libraries in WorldCat have a physical bound journal copy, if we don’t here.  I can’t really reliably do that either.

We theoretically have a shared data format and domain model for serial holdings, MARC Format for Holdings Data (MFHD). A problem is that not all ILS’s actually implement MFHD; but more than that, MFHD was designed for a world of printed catalog cards, and doesn’t actually specify the data in the right way to be machine-actionable, to answer the questions we want answered. MFHD also allows for a lot of variability in how holdings are recorded, with some patterns simply not recording sufficient information.

In 2007 (!) I advocated more attention to ONIX for Serials Coverage as a domain model, because it does specify the recording of holdings data in a way that could actually serve the purposes I need. That certainly hasn’t happened; I’m not sure there’s been much adoption of the standard at all.  It probably wouldn’t be that hard to convert ONIX for Serials Coverage to a linked data vocabulary; that would be fine, if not necessarily advancing its power any.  It’s powerful, if it were used, because it captures the data actually needed for the services we need in a way software can use, whether or not it’s represented as linked data.  Actually implementing ONIX for Serials Coverage — with or without linked data — in more systems would have been a huge aid to me. Hasn’t happened.

Likewise, we could probably, without too much trouble, create a “linked data” translated version of MFHD. This would solve nothing, neither the problems with MFHD’s expressiveness nor adoption. Neither would having an ILS whose vendor advertises it as “linked data compatible” or whatever, make MFHD work any better. The problems that keep me from being able to do what I want have to do with domain modeling, with adoption of common models throughout the ecosystem, and with human labor to record data.  They are not problems the right abstract data model can fix, they are not fundamentally problems of the mechanics of sharing data, but of the common recording of data in common formats with sufficient utility.

Lack of OCLC number or other identifiers in records

Even in a pre-linked data world, we have a bunch of already existing useful identifiers, which serve to, well, link our data.  OCLC numbers as identifiers in the library world are prominent for their widespread adoption and (consequent) usefulness.

If several different library catalogs all use OCLC numbers on all their records, we can do a bunch of useful things, because we can easily know when a record in one catalog represents the same thing as a record in another. We can do collection overlap analysis. We can link from one catalog to another — oh, it’s checked out here, but this other library we have a reciprocal borrowing relationship with has a copy. We can easily create union catalogs that merge holdings from multiple libraries onto de-duplicated bibs. We can even “merge records” from different libraries — maybe a bib from one library has 505 contents but the bib from another doesn’t, and the one that doesn’t can borrow the data and know which bib it applies to. (Unless it’s licensed data they don’t have the right to share, a very real problem, which is not a technical one, and which linked data can’t solve either.)
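
The mechanics of that matching are trivial once the identifiers exist; here’s a minimal sketch, with entirely made-up records, of the kind of join a shared OCLC number enables (and how a record without one simply drops out):

# Hypothetical simplified exports from two catalogs; the OCLC number is the
# only shared hook between them.
our_catalog = [
    {"local_id": "bib123", "oclc": "828033", "title": "Revolution and evolution..."},
    {"local_id": "bib456", "oclc": None, "title": "A record with no OCLC number"},
]

partner_holdings_by_oclc = {
    "828033": {"status": "available", "location": "Partner main stacks"},
}

for record in our_catalog:
    match = partner_holdings_by_oclc.get(record["oclc"]) if record["oclc"] else None
    if match:
        print(record["local_id"], "is also held by our partner:", match)
    else:
        print(record["local_id"], "can't be matched without a shared identifier")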

We can do all of these things today, even without linked data. Except I can’t, because in my local catalog a great many (I think a majority) of records lack OCLC numbers.

Why?  Many of them are legacy records from decades ago, before OCLC was the last library cooperative standing, from before we cared.  Not all the records missing OCLC numbers are legacy, though. Many of them are contemporary records supplied by vendors (book jobbers for print, or e-book vendors), which come to us without OCLC numbers. (Why do we get them from there instead of OCLC? Convenience? Price?  No easy way to figure out how to bulk download all records for a given purchased ebook package from OCLC? Why don’t the vendors cooperate with OCLC enough to have OCLC numbers on their records — I’m not sure. Linked data solves none of these issues.)

Even better, I’d love to be able to figure out if the book represented by a record in my catalog exists in Google Books, with limited excerpts and searchability or even downloadable fulltext. Google Books actually has a pretty good API, and if Google Books data had OCLC numbers in it, I could easily do this. But even though Google Books got a lot of its data from OCLC Worldcat, Google Books data only rarely includes OCLC numbers, and does so in entirely undocumented ways.
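
If the identifiers were reliably there, the lookup itself would be easy. Here’s a minimal sketch using the public Google Books volumes API, which, if I recall its documentation correctly, accepts special query keywords like `isbn:` and `oclc:` in the q parameter; treat that, and the example OCLC number, as assumptions to verify:

import requests

def google_books_by_oclc(oclc_number):
    """Ask the Google Books volumes API for items matching an OCLC number."""
    response = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": "oclc:" + oclc_number},
        timeout=10,
    )
    response.raise_for_status()
    data = response.json()
    # totalItems is 0 more often than you'd hope, because the OCLC number
    # usually isn't in Google's data to be matched against.
    return data.get("totalItems", 0), data.get("items", [])

total, items = google_books_by_oclc("828033")  # example number from above
print(total, "matches")
for item in items:
    print(item["volumeInfo"].get("title"), item["volumeInfo"].get("previewLink"))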

Lack of OCLC numbers in data is a problem very much about linking data, but it’s not a problem linked data can solve. We have the technology now; the barriers are about human labor power, business models, priorities, costs.  Whether the OCLC numbers that are there sit in MARC field 035, or are expressed as a URI (say, `http://www.worldcat.org/oclc/828033`) and included in linked data, is entirely irrelevant to me. My barriers are about the lack of OCLC numbers in the data: I could deal with them in just about any format at all, and linked data formats won’t help appreciably, but I can’t deal with the data being absent.

And in fact, if you convert your catalog to “linked data” but still lack OCLC numbers — you’re still going to have to solve that problem to do anything useful as far as “linking data”.  The problem isn’t about whether the data is “linked data”, it’s about whether the data has useful identifiers that can be used to actually link to other data sets.

Data Staleness/Correctness

As you might guess from the fact that so many records in our local catalog don’t have OCLC numbers — most of the records in our local catalog also haven’t been updated since they were added years, decades ago. They might have typos that have since been corrected in WorldCat. They might represent ages ago cataloging practices (now inconsistent with present data) that have since been updated in WorldCat.  The WorldCat records might have been expanded to have more useful data (better subjects, updated controlled author names, useful 5xx notes).

Our catalog doesn’t get these changes, because we don’t generally update our records from WorldCat, even for the records that do have OCLC numbers.  (Also, naturally, not all of our holdings are actually listed with WorldCat, although this isn’t exactly the same set as those that lack OCLCnums in our local catalog). We could be doing that. Some libraries do, some libraries don’t. Why don’t the libraries that don’t?  Some combination of cost (to vendors), local human labor, legacy workflows difficult to change, priorities, lack of support from our ILS software for automating this in an easy way, not wanting to overwrite legacy locally created data specific to the local community, maybe some other things.

Getting our local data to update when someone else has improved it is again the kind of problem linked data is targeted at, but linked data won’t necessarily solve it; the biggest barriers are not about data format.  After all, some libraries sync their records to updated WorldCat copy now; it’s possible with the technology we have now, for some. It’s not fundamentally a problem of mechanics with our data formats.

I wish our ILS software was better architected to support “sync with WorldCat” workflow with as little human intervention as possible. It doesn’t take linked data to do this — some are doing it already, but our vendor hasn’t chosen to prioritize it.  And just because software “supports linked data” doesn’t guarantee it will do this. I’d want our vendors focusing on this actual problem (whether solved with or without linked data), not the abstract theoretical goal of “linked data”.

Difficulty of getting format/form info from our data, representing what users care about

One of the things my patrons care most about, when running across a record in the catalog for say, “Pride and Prejudice”, is format/genre issues.

Is a given record the book, or a film? A film on VHS, or DVD (you better believe that matters a lot to a patron!)? Or streaming online video? Or an ebook? Or some weird copy we have on microfiche? Or a script for a theatrical version?  Or the recording of a theatrical performance? On CD, or LP, or an old cassette?

And I similarly want to answer this question when interrogating data at remote sources, say, WorldCat, or a neighboring library’s catalog.

It is actually astonishingly difficult to get this information out of MARC — the form/format/genre of a given record, in terms that match our users tasks or desires.  Why? Well, because the actual world we are modeling is complicated and constantly changing over the decades, it’s unclear how to formally specify this stuff, especially when it’s changing all the time (Oh, it’s a blu-ray, which is kind of a DVD, but actually different).  (I can easily tell you the record you’re looking at represents something that is 4.75″ wide though, in case you cared about that…)

It’s a difficult domain modeling problem. RDA actually tried to address this with better, more formal, theoretically/intellectually consistent modeling of what form/genre/format is all about. But even in the minority of records we have with RDA tags for this, it doesn’t quite work; I still can’t easily get my software to figure out if the item represented by a record is a CD or a DVD or a blu-ray DVD or what.
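
For a taste of what the status quo looks like, here is a minimal sketch (Python) of the kind of byte-position sniffing format detection involves today. The 007 positions and codes are from my memory of the MARC documentation and the sample field values are invented, so verify both before relying on them; the deeper problem is that many records simply lack an 007, or fill it inconsistently.

# Guess a video carrier from a raw MARC 007 control field string.
# Codes assumed here (verify against current MARC 007 documentation):
#   position 0 = 'v' for videorecording, position 4 = videorecording format,
#   where 'b' = VHS, 'v' = DVD, 's' = Blu-ray.
VIDEO_FORMAT_CODES = {"b": "VHS", "v": "DVD", "s": "Blu-ray"}

def guess_video_format(field_007):
    if not field_007 or field_007[0] != "v":
        return None  # not coded as a videorecording at all
    if len(field_007) > 4:
        return VIDEO_FORMAT_CODES.get(field_007[4], "video, format code not recognized")
    return "video, format not coded"

# Invented sample 007 values, for illustration only.
print(guess_video_format("vd cvaizq"))  # hopefully "DVD"
print(guess_video_format("vf cbahou"))  # hopefully "VHS"
print(guess_video_format(""))           # record with no usable 007: None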

Well, it’s a hard problem of domain modeling, harder than it might seem at first glance. A problem that negatively impacts a wide swath of library users across library types. Representing data as linked data won’t solve it, it’s an issue of vocabulary control. Is anyone trying to solve it?

Workset Grouping and Relationships

Related to form/format/genre issues, but a distinct issue, is all the different versions of a given work in my catalog.

There might be dozens of Pride and Prejudices. For the ones that are books, do they actually all have the same text in them?  I don’t think Austen ever revised it in a new edition, so probably they all do even if published a hundred years apart — but that’s very not true of textbooks, or even general contemporary non-fiction which often exists in several editions with different text. Still, different editions of Pride and Prejudice might have different forwards or prefaces or notes, which might matter in some contexts.  Or maybe different pagination, which matters for citation lookup.

And then there’s the movies, the audiobooks, the musical (?).  Is the audiobook the exact text of the standard Pride and Prejudice just read aloud? Or an abridged version? Or an entirely new script with the same story?  Are two videos the exact same movie one on VHS and one on DVD, or two entirely different dramatizations with different scripts and actors? Or a director’s cut?

These are the kinds of things our patrons care about, to find and identify an item that will meet their needs. But in search results, all I can do is give them a list of dozens of Pride and Prejudices, and let them try to figure it out — or maybe at least segment by video vs print vs audio.  Maybe we’re not talking search results, maybe my software knows someone wants a particular edition (say, based on an input citation) and wants to tell the user if we have it, but good luck to my software in trying to figure out if we have that exact edition (or if someone else does, in worldcat or a neighboring library, or Amazon or Google Books).

This is a really hard problem too. And again it’s a problem of domain modeling, and equally of human labor in recording information (we don’t really know if two editions have the exact same text and pagination, someone has to figure it out and record it).  Switching to the abstract data model of linked data doesn’t really address the barriers.

The library world made a really valiant effort at creating a domain model to capture these aspects of edition relationships that our users care about: FRBR.  It’s seen limited adoption or influence in the 15+ years since it was released, which means it’s also seen limited (if any) additional development or fine-tuning, which anything trying to solve this difficult domain modeling problem will probably need (see RDA’s efforts at form/format/genre!).  Linked data won’t solve this problem without good domain modeling, but ironically it’s some of the strongest advocates for “linked data” that I’ve seen arguing most strongly against doing anything more with adoption or development of FRBR; as far as I am aware, the needed efforts to develop common domain modeling are not being done in the library linked data efforts. Instead, the belief seems to be that if you just have linked data and let everyone describe things however they want, somehow it will all come together into something useful that answers the questions our patrons need, with no need for any common domain model vocabulary.  I don’t believe existing industry experience with linked data, or software engineers’ experience with data modeling in general, supports this fantasy.

Multiple sources of holdings/licensing information

For the packages of electronic content we license/purchase (ebooks, serials), we have so many “systems of record”.  The catalog’s got bib records for items from these packages, the ERM has licensing information, the link resolver has coverage and linking information, oh yeah and then they all need to be in EZProxy too, maybe a few more.

There’s no good way for software to figure out when a record from one system represents the same platform/package/license as in another system. Which means lots of manual work synchronizing things (EZProxy configuration, the SFX kb). And things my software can do only with difficulty or simply can’t do at all — like, when presenting URLs to users, figuring out if a URL in the catalog really points to the same destination as a URL offered by SFX, even though they’re different URLs (epnet.com vs ebscohost.com?).

So one solution would be “why don’t you buy all these systems from the same vendor, and then they’ll just work together”, which I don’t really like as a suggested solution, and at any rate as a suggestion is kind of antithetical to the aims of the “linked data” movement, amirite?

So the solution would obviously be common identifiers used in all these systems, for platforms, packages and licenses, so software can know that a bib record in the catalog that’s identified as coming from package X for ISSN Y is representing the same access route as an entry in the SFX KB also identified as package X, and hey maybe we can automatically fetch the vendor suggested EZProxy config listed under identifier X too to make sure it’s activated, etc.

Why isn’t this happening already? Lack of cooperation between vendors, lack of labor power to create and maintain common identifiers, lack of resources or competence from our vendors (who can’t always even give us a reliable list in any format at all of what titles with what coverage dates are included in our license) or from our community at large (how well has DLF-ERMI worked out as far as actually being useful?).

In fact, if I imagined an ideal technical infrastructure for addressing this, linked data actually would be a really good fit here! But it could be solved without linked data too, and coming up with a really good linked data implementation won’t solve it; the problems are not mainly technical.  We primarily need common identifiers in use between systems, and the barriers to that happening are not that the systems are not using “linked data”.

Google Books won’t link out to me

Google Scholar links back to my systems using OpenURL links. This is great for getting a user who chooses to use Google Scholar for discovery back to me, to provide access through a licensed or owned copy of what they want. (There are problems with Google Scholar knowing what institution they belong to so it can link back to the right place, but let’s leave that aside for now; it’s still way better than not being there.)

I wish Google Books did the same thing. For that matter, I wish Amazon did the same thing. And lots of other people.

They don’t because they have no interest in doing so. Linked data won’t help, even though this is definitely an issue of, well, linking data.

OpenURL, a standard frozen in time

Oh yeah, so let’s talk about OpenURL. It’s been phenomenally successful in terms of adoption in the library industry. And it works. It’s better that it exists than if it didn’t. It does help link disparate systems from different vendors.

The main problem is that it’s basically abandoned. I don’t know if there’s technically a maintenance group, but if there is, they aren’t doing much to improve OpenURL for scholarly citation linking, the use case it’s been successful in.

For instance, I wish there was a way to identify a citation as referring to a video or audio piece in OpenURL, but there isn’t.

Now, theoretically the “open for extension” aspect of linked data seems relevant here. If things were linked data and you needed a new data element or value, you could just add one. But really, there’s nothing stopping people from doing that with OpenURL now. Even if technically not allowed, you can just decide to say `&genre=video` in your OpenURL, and it probably won’t disturb anything (or you can figure out a way to do that not using the existing `genre` key that really won’t disturb anything).
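
For example, here’s a minimal sketch (Python, loosely in the KEV query-string style, with a made-up resolver hostname) of tacking a non-standard `genre=video` key onto an otherwise ordinary OpenURL; mechanically, nothing breaks:

from urllib.parse import urlencode

params = {
    "url_ver": "Z39.88-2004",                # standard OpenURL 1.0 KEV version key
    "rft.title": "Doing Things in Libraries",
    "rft.date": "1912",
    "genre": "video",                        # non-standard extension, for illustration
}

# Any link resolver will happily receive this URL; whether it does anything
# useful with the extra key is another matter entirely.
print("https://resolver.example.edu/openurl?" + urlencode(params))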

The problem is that nothing will recognize it and do anything useful with it, and nobody is generating OpenURLs like that anyway.  It’s not really an ‘open for extension’ problem; it’s a problem of getting the ecosystem to do it, of vocabulary consensus and implementation. That’s not a problem that linked data solves.

Linking from the open web to library copies

One of the biggest challenges always in the background of my work, is how we get people from the “open web” to our library owned and licensed resources and library-provided services. (Umlaut is engaged in this “space”).

This is something I’ve written about before  (more times than that), so I won’t say too much more about it here.

How could linked data play a role in solving this problem? To be sure, if every web page everywhere included schema.org-type information fully specifying the nature of the scholarly works it was displaying, citing, or talking about — that would make it a lot easier to find a way to take this information and transfer the user to our systems to look up availability for the item cited.  If every web page exposed well-specified machine-accessible data in a way that wasn’t linked-data-based, that would be fine too. But something like schema.org does look like the best bet here — but it’s not a bet I’d wager anything of significance on.

It would not be necessary to rebuild our infrastructure to be “based on linked data” in order to take advantage of structured information on external web pages, whether or not that structured information is “linked data”.  (There are a whole bunch of other non-trivial challenges and barriers, but replacing our ILS/OPAC isn’t really a necessary one, and neither is replacing our internal data format.) And we ourselves have limited influence over what “every web page everywhere” does.

Okay, so why are people excited about Linked Data?

If it’s not clear it will solve our problems, why is there so much effort being put into it?  I’m not sure, but here are some things I’ve observed or heard.

Most people, and especially library decision-makers, agree at this point that libraries have to change and adapt, in some major ways. But they don’t really know what this means, how to do it, what directions to go in.  Once there’s a critical mass of buzz about “linked data”, it becomes the easy answer — do what everyone else is doing, including prestigious institutions, and if it ends up wrong, at least nobody can blame you for doing what everyone else agreed should be done. “No one ever got fired for buying IBM.”

So linked data has got good marketing and a critical mass, in an environment where decision-makers want to do something but don’t know what to do. And I think that’s huge, but certainly that’s not everything, there are true believers who created that message in the first place, and unlike IBM they aren’t necessarily trying to get your dollars, they really do believe. (Although there are linked data consultants in the library world who make money by convincing you to go all-in on linked data…)

I think we all do know (and I agree) that we need our data and services to inter-operate better — within the library world, and crossing boundaries to the larger IT and internet industry and world. And linked data seems to hold the promise of making that happen, after all those are the goals of linked data.  But as I’ve described above, I’m worried it’s a promise long on fantasy and short on specifics.  In my experience, the true barriers to this are about good domain modeling,  about the human labor to encode data, and about getting people we want to cooperate with us to use the same domain models. 

I think those experienced with library metadata realize that good domain modeling (eg vocabulary control), and getting different actors to use the same standard formats, is a challenge. I think they believe that linked data will somehow solve this challenge by being “open to extension” — I think this is a false promise, as I’ve tried to argue above. Software and sources need to agree on vocabulary in linked data too, to be able to use each other’s data. Or use the analog of a ‘crosswalk’, which we can already do, and which does not become appreciably easier with linked data — it becomes somewhat easier mechanically to apply a “cross-walk”, but the hard part in my experience is not mechanical application, but the intellectual labor to develop the “cross-walk” rules in the first place and maintain them as vocabularies change.

I think library decision-makers know that we “need our stuff to be in Google”, and have been told “linked data” is the way to do that, without having a clear picture of what “in Google” means. As I’ve said, I think Google’s investment in or commitment to linked data has been exaggerated, but yes, schema.org markup can be used by Google for rich snippets or Knowledge Graph fact boxes. And yes, I actually agree, our library web pages should use schema.org markup to expose their information in machine-readable markup.  This will right now have more powerful results for library information web pages (rich snippets) than it will for catalog pages. But the good thing is it’s not that hard to do for catalog bib pages either, and does not require rebuilding our entire infrastructure; our MARC data as it is can fairly easily be “cross-walked” to schema.org, as Dan Scott has usefully shown with VuFind, Evergreen, and Koha.  Yes, all our “discovery” web pages should do this. Dan Scott reports that it hasn’t had a huge effect, but says it would if only everybody did it:

We don’t see it happening with libraries running Evergreen, Koha, and VuFind yet, realistically because the open source library systems don’t have enough penetration to make it worth a search engine’s effort to add that to their set of possible sources. However, if we as an industry make a concerted effort to implement this as a standard part of crawlable catalogue or discovery record detail pages, then it wouldn’t surprise me in the least to see such suggestions start to appear.

Maybe. I would not invest in an enormous resource-intensive campaign to rework our entire infrastructure based on what we hope Google (or similar actors) will do if we pull it off right — I wouldn’t count on it.  But fortunately it doesn’t require that to include schema.org markup on our pages. It can fairly easily be done now with our data in MARC, and should indeed be done now; whatever barriers are keeping us from doing it more with our existing infrastructure, solving them is actually a way easier problem than rebuilding our entire infrastructure.

I think library metadataticians also realize that limited human labor resources to record data are a problem. I think the idea is that with linked data, we can get other people to create our metadata for us, and use it.  It’s a nice vision. The barriers are that in fact not “everybody” is using linked data, let alone willing to share it; the existing business model issues that make them reluctant to share their data don’t go away with linked data; they may have no business interest in creating the data we want anyway (or may be hoping “someone else” does it too); and that common or compatible vocabularies are still needed to integrate data in this way. The hard parts are human labor and promulgating shared vocabulary, not the mechanics of combining data.

I think experienced librarians also realize that business model issues are a barrier to integration and sharing of data presently. Perhaps they think that the Linked Open Data campaign will be enough to pressure our vendors, suppliers, partners, and cooperatives to share their data, because they have to be “Linked Open Data” and we’re going to put the pressure on. Maybe they’re right! I hope so.

One linked data advocate told me, okay, maybe linked data is neither necessary nor sufficient to solve our real world problems. But we do have to come up with better and more inter-operable domain models for our data. And as long as we’re doing that, and we have to recreate all this stuff, we might as well do it based on linked data — it’s a good abstract data model, and it’s the one “everyone else is using” (which I don’t agree is happening, but it might be the one others outside the industry end up using — if they end up caring about data interoperability at all — and there are no better candidates, I agree, so okay).

Maybe. But I worry that rather than "might as well use linked data as long as we're doing it", linked data becomes a distraction and a theft of resources (an opportunity cost?) from what we really need to do. We need to figure out what our patrons are up to and how we can serve them; and when it comes to data, we need to figure out what kinds of data we need to do that, come up with the domain models that capture what we need, get enough people (inside or outside the library world) to use compatible data models, and get all that data recorded (by whom, and paid for by whom?).

Sure, all of that can be done with linked data, and maybe there are even benefits to doing so. But with the focus on linked data, I worry we end up concentrating on how most elegantly to fit our data into "linked data" (which can certainly be an interesting intellectual challenge, a fun game), rather than on how to model it to be useful for the uses we need (and figuring out what those are). I think it's unjustified to assume the rest will take care of itself if it's just good linked data. The rest is actually the hard part. And I think it's dangerous to undertake this endeavor as "throw everything else out and start over", instead of looking for incremental improvements.

The linked data advocate I was talking to also suggested (or maybe it was my own suggestion in conversation, as I tried to look on the bright side): okay, we know we need to "fix" all sorts of things about our data and interoperability. We could be doing a lot of that now, without linked data, but we're not; our vendors aren't; our consortiums and collaboratives aren't. Your catalog doesn't have OCLC numbers in enough of its records, and doesn't sync its data to OCLC, even though it theoretically could, without linked data. It hasn't been a priority. But the very successful marketing campaign of "linked data" will finally get people to pay attention to this stuff and do what they should have been doing.

Maybe. I hope so. It could definitely happen. But it won’t happen because linked data is a magic bullet, and it won’t happen without lots of hard work that isn’t about the fun intellectual game of creating domain models in linked data.

What should you do?

Okay, so maybe “linked data” is an unstoppable juggernaut in the library world, or at your library. (It certainly is not in the wider IT/web world, despite what some would have you believe).  I certainly don’t think this tl;dr essay will change that.

And maybe that will work out for the best after all. I am not fundamentally opposed to semantic web/linked data/RDF. It's an interesting technology, and although I'm not as in love with it as some, I recognize that it surely should play some part in our research and investigation into metadata evolution — even if we're not sure how successful it will be in the long term.

Maybe it’ll all work out. But for those of you who’ve somehow made it this far, here’s what I think you can do to maximize those chances:

Be skeptical. Sure, of me too. If this essay gets any attention, I’m sure there will be plenty of arguments for how I’m missing the point or confused. Don’t simply accept claims from promoters or haters, even if everyone else seems to be accepting them — claims that “everyone is doing it”, or that linked data will solve all our problems. Work to understand what’s really going on, so you can evaluate the benefits and potentials yourself and understand what it would take to get there. To that end…

Educate yourself about the technology of metadata. About linked data, sure. And about entity-relational modeling and other forms of data modeling, about relational databases, about XML, about what “everyone else” is really doing. Learn a little programming too, not to become a programmer, but to understand better how software and computation work, because all of our work in libraries is so intimately connected to that. Educating yourself on these things is the only way to evaluate claims made by various boosters or haters.

Treat the library as an IT organization. I think libraries already are IT organizations (at least academic libraries) — every single service we provide to our users now has a fundamental IT component at its core, and most of our services are actually mediated by software between us and our users. But libraries aren’t run with that recognition. Recognizing it would involve staffing and other resource allocation. It would involve having leadership and decision-makers who are competent to make IT decisions, or who know how to get advice from those who are. It’s about how the library thinks of itself, at all levels, how decisions are made, and who is consulted when making them. That’s what will give our organizations the competence to make decisions like this one, not just follow what everyone else seems to be doing.

Stay user centered. “Linked data” can’t be your goal. You are using linked data to accomplish something that adds value for your patrons. We must understand what our patrons are doing, and how to intervene to improve their lives. We must figure out what services and systems we need to do that. Some work to that end (even if incomplete and undeveloped, so long as it’s serious and engaged) comes before figuring out what data we need to create those services. To the extent it’s about data, make sure your data modeling work and choices are about creating the data we need to serve our users, not just about fitting into the linked data model. Be careful of “dumbing down” your data to fit more easily into a linked data model, and in the process losing what we actually need in the data to provide the services we need to provide.

Yes, include schema.org markup on your web pages and catalog/discovery pages. To expose it to Google, or to anyone.  We don’t need to rework our entire infrastructure to do that, it can be done now, as Dan Scott has awesomely shown. As Google or anyone else significant recognizes more or different vocabularies, make use of them too by including them in your web pages, for sure. And, sure, make all your data (in any format, linked data or not) available on the open web, under an open license. If your vendor agreements prevent you from doing that, complain. Ask everyone else with useful data to do so too. Absolutely.

Avoid “Does it support linked data” as an evaluative question. I think that’s just not the right question to be asking when evaluating adoption or purchase of software. To the extent the question has meaning at all (and it’s not always clear what it means), it is dangerous for the library organization if it takes primacy over the specifics of how it will allow us to provide better services or provide services better.

Of course, put identifiers in your data. I don’t care if they’re URIs or not, but yeah, make sure every record has an OCLC number. Yeah, every bib should record the LCCN or other identifier of its related creators’ authority records, not just a heading. This is “linked data” advice that I support without reservation; it is what our data needs with or without linked data. Put identifiers everywhere. I don’t care if they are in the form of URLs. Get your vendors to do this too. That your vendors want to give you bibs without OCLC numbers in them isn’t acceptable. Make them work with OCLC, make them see it’s in their business interests to do so, because the customers demand it. If you can get the records from OCLC, even if it costs more, it might be worth it. I don’t mean to be an OCLC booster exactly, but shared authority control is what we need (for linked data to live up to its promise, or for us to accomplish what we need without linked data), and OCLC is currently where it lives. Make OCLC share its data too, which it has been doing already (in contrast to ~5 years ago) — keep them going. They should make it as easy and cheap as possible for even “competitors” to put OCLC numbers, VIAF numbers, and other identifiers in their data, regardless of whether OCLC thinks it threatens their own business model, because it’s what we need as a community, and OCLC is a non-profit cooperative that represents us.
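
Concretely, in plain old MARC this is nothing exotic. A hypothetical sketch (all the numbers and the URI are placeholders, not real identifiers) of the kind of thing every bib record should carry:

035 __ $a (OCoLC)000000001
100 1_ $a Doe, Jane, $d 1950- $0 http://id.loc.gov/authorities/names/n00000000

The 035 carries an OCLC number for the bib itself; the $0 on the 100 carries an identifier for the creator’s authority record alongside the heading text. Whether those land in your data as URIs or as plain control numbers matters far less than that they’re there at all.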

Who should you trust? Trust nobody, heh. But if you want my personal advice, pay attention to Diane Hillmann. Hillmann is one of the people working in and advocating for linked data whom I respect the most, who I think has a clear vision of what it will, won’t, or only might do, and how to tie work to actual service goals, not just theoretical models. Read what Hillmann writes, invite her to speak at your conferences, and if you need a consultant on your own linked data plans, I think you could do a lot worse. If Hillmann had increased influence over our communal linked data efforts, I’d be a lot less worried about them.

Require linked data plans to produce iterative, incremental value. I think the biggest threat of “linked data” is that it’s implemented as a campaign that won’t bear fruit until some fairly distant point, and even then only if everything works out, and in ways many decision-makers don’t fully understand but just have a kind of faith in. That’s a very risky way to undertake major, resource-intensive changes. Don’t accept an enormous investment whose value will only be shown in the distant future. As we’re “doing linked data”, figure out ways to get improvements that affect our users positively at each stage, incrementally, iteratively. Plan your steps so each one bears fruit on its own, not just at the end. (Which, incidentally, is good advice for any technology project, or maybe any project at all.) Because we need to start improving things for our users now to stay alive. And because that’s the only way to evaluate how well it’s going, and even more importantly to adjust course based on what we learn as we go. And it’s how we get out of assuming linked data will be a magic bullet if only we can do enough of it, and develop the capacity to understand exactly how it can help us, can’t help us, and will help us only if we do certain other things too. When people who have been working on linked data for literally years advocate for it, ask them to show you their successes, and ask for success in terms of actually improving our library services. If they don’t have much to show, or if they have exciting successes to demonstrate, that’s information to guide you in decision-making, resource allocation, and further question-asking.

Posted in General | 21 Comments