Virtual Shelf Browse

We know that some patrons like to walk the physical stacks, finding books on a topic of interest by browsing adjacently shelved items.

I like wandering stacks full of books too, and hope we can all continue to do so.

But in an effort to see if we can provide an online experience that fulfills some of the utility of this kind of browsing, we’ve introduced a Virtual Shelf Browse that lets you page through books online, in the order of their call numbers.

An online shelf browse can do a number of things you can’t do physically walking around the stacks:

  • You can do it from home, or anywhere you have a computer (or mobile device!)
  • It brings together books from separate physical locations in one virtual stack, including multiple libraries, locations within libraries, and our off-site storage.
  • It includes checked-out books, and in some cases even ebooks (if we have a call number on record for them).
  • It can place one item at multiple locations in the virtual shelf, if we have more than one call number on record for it. There's always more than one way to classify or characterize a work; a physical item can only be in one place at a time, but a virtual display has no such limit.

The UI is based on the open source stackview code released by the Harvard Library Innovation Lab. Thanks to Harvard for sharing their code, and to @anniejocaine for helping me understand the code, and accepting my pull requests with some bug fixes and tweaks.

This is to some extent an experiment, but we hope it opens up new avenues for browsing and serendipitous discovery for our patrons.

You can drop into one example place in the virtual shelf browse here, or drop into our catalog to do your own searches — the Virtual Shelf Browse is accessed by navigating to an individual item detail page, and then clicking the Virtual Shelf Browse button in the right sidebar.  It seemed like the best way to enter the Virtual Shelf was from an item of interest to you, to see what other items are shelved nearby.

[Screenshot of the Virtual Shelf Browse interface]

Our Shelf Browse is based on ordering by Library of Congress Call Numbers. Not all of our items have LC call numbers, so not every item appears in the virtual shelf, or has a “Virtual Shelf Browse” button to provide an entry point to it. Some of our local collections are shelved locally with LC call numbers, and these are entirely present. For other collections —  which might be shelved under other systems or in closed stacks and not assigned local shelving call numbers — we can still place them in the virtual shelf if we can find a cataloger-suggested call number in the MARC bib 050 or similar fields. So for those collections, some items might appear in the Virtual Shelf, others not.

On Call Numbers, and Sorting

Library call number systems — from LC, to Dewey, to Sudocs, or even UDC — are a rather ingenious 19th century technology for organizing books in a constantly growing collection such that similar items are shelved nearby. Rather ingenious for the 19th century anyway.

It was fun to try bringing this technology — and the many hours of cataloger work that have gone into constructing call numbers — into the 21st century, to continue providing value in an online display.

It was also challenging in some ways. It turns out that the ordering of Library of Congress call numbers in particular is difficult to implement in software. There are a bunch of odd cases where the proper ordering might be clear to a human (at least to a properly trained human? and different libraries might even order them differently!), but it's hard to encode all of those cases in software.

The newly released Lcsort ruby gem does a pretty marvelous job of sorting LC call numbers properly — I won't say it gets every valid call number right, let alone every local practice variation, but it gets a lot of stuff right, including such crowd-pleasing oddities as the following (see the usage sketch after this list):

  • `KF 4558 15th .G6` sorts after `KF 4558 2nd .I6`
  • `Q11 .P6 vol. 12 no. 1` sorts after `Q11 .P6 vol. 4 no. 4`
  • It can handle suffixes after cutters, as in popular local practice (and NLM call numbers), e.g. `R 179 .C79ab`
  • It handles variations in spacing or punctuation that should not matter for sorting: `R 169.B59.C39` vs. `R169 B59C39 1990` vs. `R169 .B59 .C39 1990`, etc.
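Here's roughly what using it looks like: a minimal sketch that assumes Lcsort's `Lcsort.normalize` method, which (as I recall) returns a padded, byte-sortable key, or nil for a call number it can't parse. Check the gem's README for the exact API:

require 'lcsort'

call_numbers = [
  "KF 4558 15th .G6",
  "KF 4558 2nd .I6",
  "Q11 .P6 vol. 12 no. 1",
  "Q11 .P6 vol. 4 no. 4"
]

# Sort by the normalized key; unparseable call numbers sort to the front here.
sorted = call_numbers.sort_by { |cn| Lcsort.normalize(cn) || "" }
# => ["KF 4558 2nd .I6", "KF 4558 15th .G6",
#     "Q11 .P6 vol. 4 no. 4", "Q11 .P6 vol. 12 no. 1"]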

Lcsort is based on the cumulative knowledge of years of library programmer attempts to sort LC call numbers, including an original implementation based on much trial and error by Bill Dueber of the University of Michigan, a port to ruby by Nikitas Tampakis of Princeton University Library, advice and test cases based on much trial and error from Naomi Dushay of Stanford, and a bunch more code wrangling by me.

I do encourage you to check out Lcsort for any LC call number ordering needs, if you can do it in ruby — or even port it to another language if you can't. I think it works as well as or better than anything our community of library technologists has done in the open so far.

Check out my code — rails_stackview

This project was possible only because of the work of so many that had gone before, and been willing to share their work, from Harvard’s stackview to all the work that went into figuring out how to sort LC call numbers.

So it only makes sense to try to share what I've done too, to integrate a stackview call number shelf browse in a Blacklight Rails app. I have shared some components in a Rails engine at rails_stackview.

In this case, I did not do what I'd have done in the past and try to make a rock-solid, general-purpose, highly flexible and configurable tool that integrated as brainlessly as possible out of the box with a Blacklight app. I've had mixed success trying to do that before, and came to think it might have been over-engineering and YAGNI to try. Additionally, there are just too many ways to do this integration — and too many versions of Blacklight changes to keep track of — so I just wasn't really sure what was best, and didn't have the capacity for it.

So this is just the components I had to write for the way I chose to do it in the end, and for my use cases. I did try to make those components well-designed for reasonable flexibility, or at least future extension to more flexibility.

But it's still just pieces that you'd have to assemble yourself into a solution, and integrate into your Rails app (there are no real Blacklight expectations; they're just tools for a Rails app) with quite a bit of your own code. The hardest part might be indexing your call numbers for retrieval suited to this UI.

I'm curious to see whether this approach of sharing my pieces, instead of a fully designed flexible solution, might still end up being useful to anyone, and perhaps encourage some more virtual shelf browse implementations.

On Indexing

Being a Blacklight app, all of our data was already in Solr. It would have been nice to use the existing Solr index as the back-end for the virtual shelf browse, especially if it allowed us to do things like a virtual shelf browse limited by existing Solr facets. But I did not end up doing so.

To support this kind of call-number-ordered virtual shelf browse, you need your data in a store of some kind that supports some basic retrieval operations: Give me N items in order by some field, starting at value X, either ascending or descending.

This seems simple enough; but the fact that we want a given single item in our existing index to be able to have multiple call numbers makes it a bit tricky. In fact, a Solr index isn’t really easily capable of doing what’s needed. There are various ways to work around it and get what you need from Solr: Naomi Dushay at Stanford has engaged in some truly heroic hacks to do it, involving creating a duplicate mirror indexing field where all the call numbers are reversed to sort backwards. And Naomi’s solution still doesn’t really allow you to limit by existing Solr facets or anything.

That's not the solution I ended up using. Instead, I just de-normalize to another 'index' in a table in our existing application rdbms, with one row per call number instead of one row per item. After talking to the Princeton folks at a library meet-up in New Haven, and hearing that this was their back-end store plan for supporting 'browse' functions, I realized — sure, why not, that'll work.
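The retrieval the UI needs is then easy to express. A sketch, with a hypothetical CallNumberEntry ActiveRecord model and made-up column names (not the actual rails_stackview schema):

# Assumes a Rails/ActiveRecord environment.
# One row per call number; each row points back at the Solr document it came from.
class CallNumberEntry < ActiveRecord::Base
  # columns: shelfkey (normalized sortable call number),
  #          label (display call number), solr_doc_id
end

# "Give me the next `limit` items on the shelf at or after this point."
def shelf_window_after(normalized_callnum, limit = 20)
  CallNumberEntry.where("shelfkey >= ?", normalized_callnum).
    order(shelfkey: :asc).limit(limit)
end

# And the previous `limit` items, for paging backwards.
def shelf_window_before(normalized_callnum, limit = 20)
  CallNumberEntry.where("shelfkey < ?", normalized_callnum).
    order(shelfkey: :desc).limit(limit)
end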

So how do I get them indexed in an rdbms table? We use traject for indexing to Solr here, for Blacklight. Traject is pretty flexible, and it wasn't too hard to modify our indexing configuration so that as the indexer goes through each input record, creating a Solr document for each one — it also, in the same stream, creates zero to many rows in the rdbms, one for each call number encountered.
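A rough sketch of what that can look like in a traject configuration file, simplified, reusing the hypothetical CallNumberEntry model from the sketch above (the actual snapshotted config lives in the rails_stackview docs):

# traject config excerpt (sketch). The usual to_field directives build the
# Solr document; an each_record step also writes one rdbms row per call number.
require 'traject'
require 'lcsort'

to_field 'id',      extract_marc('001', first: true)
to_field 'title_t', extract_marc('245a')

each_record do |record, context|
  raw_call_numbers = Traject::MarcExtractor.cached('050ab:090ab').extract(record)

  raw_call_numbers.each do |callnum|
    shelfkey = Lcsort.normalize(callnum)
    next unless shelfkey # skip anything Lcsort can't parse

    CallNumberEntry.create!(
      solr_doc_id: (context.output_hash['id'] || []).first,
      label:       callnum,
      shelfkey:    shelfkey
    )
  end
end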

We don't do any "incremental" indexing to Solr in the first place; we just do a bulk/mass index every night, recreating everything from the current state of the canonical catalog. So the same strategy applies to building the call numbers table: it's just recreated from scratch nightly. After racking my brain to figure out how to do this without disturbing performance or data integrity in the rdbms table — I realized, hey, no problem: just index to a temporary table first, then when done swap it into place and delete the former one.
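The swap can be as simple as a few statements run through the ActiveRecord connection. This sketch is PostgreSQL-flavored and the table names are made up; adjust for your own rdbms and schema:

conn = ActiveRecord::Base.connection

# Build the new index into a scratch table, so readers never see a half-built one.
conn.execute "CREATE TABLE call_number_entries_new (LIKE call_number_entries INCLUDING ALL)"

# ...point the traject indexing run at call_number_entries_new, and when it finishes:
conn.execute "ALTER TABLE call_number_entries RENAME TO call_number_entries_old"
conn.execute "ALTER TABLE call_number_entries_new RENAME TO call_number_entries"
conn.execute "DROP TABLE call_number_entries_old"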

I included a snapshotted, completely unsupported example of how we do our indexing with traject in the rails_stackview documentation. It ends up a bit hacky, and makes me wish traject let me re-use some of its code a little more concisely for this kind of bifurcated indexing operation — but it still worked out pretty well, and leaves me pretty satisfied with traject as our indexing solution, compared to past tools we had used.

I had hoped that adding the call number indexing to our existing traject mass index process would not slow down the indexing at all. I think this hope was based on some poorly conceived thought process like "Traject is parallel multi-core already, so, you know, magic!" It didn't quite work out that way: the additional call number indexing adds about a 10% penalty to our indexing time, taking our slow mass indexing from a ~10 hour to an ~11 hour process. We run our indexing on a fairly slow VM with 3 cores assigned to it. It's difficult to profile a parallel multi-threaded pipeline process like traject (I can't completely wrap my head around it), but I think it's possible that on a faster machine you'd have bottlenecks in different parts of the pipeline, and see less of a penalty.

On call numbers designed for local adjustment, used universally instead

Another notable feature of the 19th century technology of call numbers that I didn't truly appreciate until this project — call number systems often, and LC certainly, are designed to require a certain amount of manual hand-fitting to a particular local collection. The end of the call number has 'cutter numbers' that are typically based on the author's name, but which are meant to be hand-fitted by local catalogers to put the book in just the right spot in the context of what's already been shelved in a particular local collection.

That ends up requiring a lot more hours of cataloger labor than if a book simply had one true call number, but it's kind of how the system was designed. I wonder if it's tenable in the modern era to put that much work into call number assignment, though, especially as print (unfortunately) gets less attention.

However, this project sort of serves as an experiment of what happens if you don’t do that local easing. To begin with, we’re combining call numbers that were originally assigned in entirely different local collections (different physical library locations), some of which were assigned before these different libraries even shared the same catalog, and were not assigned with regard to each other as context.  On top of that, we take ‘generic’ call numbers without local adjustment from MARC 050 for books that don’t have locally assigned call numbers (including ebooks where available), so these also haven’t been hand-fit into any local collection.

It does result in occasional oddities, such as different authors with similar last names writing on a subject being interfiled together, which offends my sensibilities, since I know the system used as designed doesn't do that. But… I think most people probably won't notice; it works out pretty well after all.

Posted in General | 3 Comments

Long-standing bug in Chrome (WebKit?) on page not being drawn, scroll:auto, retina

In a project I've recently been working on, I ran into a very odd bug in Chrome (it may reproduce in other WebKit browsers; I'm not sure).

My project loads some content via AJAX into a portion of the page. In some cases, the loaded content is not properly displayed: it's not actually painted by the browser. There is space taken up by it on the page, but it's kind of as if it had `display:none` set, although not quite like that, because sometimes _some_ of the content is displayed but not the rest.

Various user interactions will force the content to paint, including resizing the browser window.

Googling around, various people have been talking about this bug, or possibly similar bugs, for literally years, including here; maybe that's the same thing or related, it's hard to say.

I think the conditions that trigger the bug in my case may include:

  • A Mac "retina" screen; the bug may not trigger at ordinary resolutions.
  • Adding/changing content via Javascript in a block on the page that has been set to `overflow: auto` (or just overflow-x or overflow-y auto).

I think both of these conditions are required, and that it's got something to do with Chrome/WebKit getting confused when calculating whether a scrollbar is necessary (and whether space has to be reserved for it) on a high-resolution "retina" screen while dynamically loading content.

It's difficult to google around for this, because nobody seems to quite understand the bug. It's a bit dismaying, though, that this bug — or at least related bugs with retina screens, scrollbar calculation, dynamic content, etc. — seems likely to have existed in Chrome/WebKit for possibly many years. I am not certain whether any tickets are filed in the Chrome/WebKit bug tracker on this (or whether anyone's figured out exactly what causes it from Chrome's point of view). (This ticket is not quite the same thing, but it is also about overflow calculations and retina screens, so it could be caused by a common underlying bug.)

There are a variety of workarounds suggested on Google for bugs with Chrome not properly painting dynamically loaded content. Some of them didn't seem to work for me; others cause a white flash even in browsers that wouldn't otherwise be affected by the bug; others were inconvenient to apply in my context, or required a really unpleasant `timeout` in JS code to tell Chrome to do something a few dozen or hundred milliseconds after the dynamic content was loaded. (I think Chrome/WebKit may be smart enough to ignore changes that you immediately undo in some cases, so they don't trigger any rendering redraw; but here we want to trick Chrome into doing a rendering redraw without actually changing the layout, so, yeah.)

Here's the hacky lesser-evil workaround which seems to work for me. Immediately after dynamically loading the content, do this to its parent div:

$("#parentDiv").css("opacity", 0.99999).css("opacity", 1.0);

It does leave an inline `style` attribute setting opacity to 1.0 sitting around on your parent container after you're done. Oh well.

I haven’t actually tried the solution suggested here, to a problem which may or may not be the same one I have — of simply adding `-webkit-transform: translate3d(0,0,0)` to relevant elements.

One of the most distressing things about this bug is that if you aren't testing on a retina screen (and why/how would you, unless your workstation happens to have one), you may never notice or be able to reproduce the bug, but you may be ruining the interface for users on retina screens. And if they do report it, you may find their bug report completely unintelligible and unreproducible, whether or not they mention they have a retina screen when they file it, which they probably won't; they may not even know what that is, let alone guess it's a pertinent detail.

It's also distressing that the workarounds are so hacky that I am not confident they won't stop working in some future version of Chrome that still exhibits the bug.

Oh well, so it goes. I really wish Chrome/WebKit would notice and fix it, though. That probably won't happen until someone who works on Chrome/WebKit gets a retina screen and happens to run into the bug themselves.

Posted in General | 1 Comment

“Dutch universities start their Elsevier boycott plan”

“We are entering a new era in publications”, said Koen Becking, chairman of the Executive Board of Tilburg University in October. On behalf of the Dutch universities, he and his colleague Gerard Meijer negotiate with scientific publishers about an open access policy. They managed to achieve agreements with some publishers, but not with the biggest one, Elsevier. Today, they start their plan to boycott Elsevier.

Dutch universities start their Elsevier boycott plan

Posted in General | 2 Comments

“First Rule of Usability? Don’t Listen to Users”

An interesting brief column from 15 years ago by noted usability expert Jakob Nielsen, which I saw posted today on reddit: First Rule of Usability? Don't Listen to Users

Summary: To design the best UX, pay attention to what users do, not what they say. Self-reported claims are unreliable, as are user speculations about future behavior. Users do not know what they want.

I’m reposting here, even though it’s 15 years old, because I think many of us haven’t assimilated this message yet, especially in libraries, and it’s worth reviewing.

An even worse version of trusting users' self-reported claims, I think, is trusting user-facing librarians' self-reported claims about what they have generally noticed users self-reporting. It's like taking the first problem and adding a game of 'telephone' to it.

Nielsen’s suggested solution?

To discover which designs work best, watch users as they attempt to perform tasks with the user interface. This method is so simple that many people overlook it, assuming that there must be something more to usability testing. Of course, there are many ways to watch and many tricks to running an optimal user test or field study. But ultimately, the way to get user data boils down to the basic rules of usability:

  • Watch what people actually do.
  • Do not believe what people say they do.
  • Definitely don’t believe what people predict they may do in the future.

Yep. If you’re not doing this, start. If you’re doing it, you probably need to do it more.  Easier said than done in a typical bureaucratic inertial dysfunctional library organization, I realize.

It also means we have a professional obligation to watch what the users do — and determine how to make things better for them. And then watch again to see if it did. That's what makes us professionals. We cannot simply do what the users say; that is an abrogation of our professional responsibility, and does not actually produce good outcomes for our patrons. Again, yes, this means we need library organizations that allow us to exercise our professional responsibilities and give us the resources to do so.

For real, go read the very short article. And consider what it would mean to develop in libraries taking this into account.

Posted in General | 3 Comments

Yahoo YBoss spell suggest API significantly increases pricing

For a year or two, we've been using the Yahoo/YBoss/YDN Spelling Service API to provide spell suggestions for queries in our homegrown discovery layer (which provides a UI to search the catalog via Blacklight/Solr, as well as an article search powered by the EBSCOhost API).

It worked… well enough, despite doing a lot of odd and wrong things. But mainly it was cheap: $0.10 per 1000 spell suggest queries, according to this cached price sheet from April 24, 2015.

However, I got an email today that they are ‘simplifying’ their pricing by charging for all “BOSS Search API” services at $1.80 per 1000 queries, starting June 1.

That’s 18x increase. Previously we paid about $170 a year for spell suggestions from Yahoo, peanuts, worth it even if it didn’t work perfectly. That’s 1.7 million querries for $170, pretty good.  (Honestly, I’m not sure if that’s still making queries it shouldn’t be, in response to something other than user input. For instance, we try to suppress spell check queries on paging through an existing result set, but perhaps don’t do it fully).

But 18x $170 is $3060.  That’s a pretty different value proposition.

Anyone know of any decent cheap spell suggest API’s? It looks like maybe Microsoft Bing has a poorly documented one.  Not sure.

Yeah, we could roll our own in-house spell suggestion based on a local dictionary or corpus of some kind: aspell, or Solr's built-in spell suggest service based on our catalog corpus. But we don't only use this for searching the catalog, and even for the catalog I previously found that these web-search-based APIs provided better results than a local-corpus-based solution. The local solutions seemed to false-positive (provide a suggestion when the original query was 'right') and false-negative (refrain from providing a suggestion when it was needed) more often than the web-based APIs. As well, of course, as being more work for us to set up and maintain.

Posted in General | 5 Comments

On approaches to Bridging the Gap in access to licensed resources

A previous post I made reviewing the Ithaka report “Streamlining access to Scholarly Resources” got a lot of attention. Thanks!

The primary issue I’m interested in there: Getting our patrons from a paywalled scholarly citation on the open unauthenticated web, to an authenticated library-licensed copy, or other library services. “Bridging the gap”.

Here, we use Umlaut to turn our “link resolver” into a full-service landing page offering library services for both books and articles:  Licensed online copies, local print copies, and other library services.

This means we've got the "receiving" end taken care of — here's a book and an article example of an Umlaut landing page — so the problem reduces to getting the user from the open unauthenticated web to an Umlaut page for the citation in question.

That's still a tricky problem. In this post, a brief discussion of two things: 1) the new "Google Scholar Button" browser extension from Google, which is interesting in this area but I think ultimately not enough of a solution to keep me from looking for more, and 2) the possibilities of Zotero's open source code toward our end.

The Google Scholar Button

In late April Google released a browser plugin for Chrome and Firefox called the “Google Scholar Button”.

This plugin will extract the title of an article from a page (either text you’ve selected on the page first, or it will try to scrape a title from HTML markup), and give you search results for that article title from Google Scholar, in a little popup window.

Interestingly, this is essentially the same thing a couple of third-party software packages have done for a while: the LibX "Magic Button", and Lazy Scholar. But now we get it in an official Google release, instead of open source hacky workarounds to Google's lack of an API.

The Google Scholar Button is basically trying to bridge the same gap we are. It provides a condensed version of Google Scholar search results, with a link to an open access PDF if Google knows about one (I am still curious how many of these open access PDFs are not-entirely-licensed copies put up by authors or professors without publisher permission).

And in some cases it provides an OpenURL link to a library link resolver, which is just what we're looking for.

However, it’s got some limitations that keep me from considering it a satisfactory ‘Bridging the Gap’ solution:

  • In order to get the OpenURL link to your local library link resolver while you are off campus, you have to set your Google Scholar preferences in your browser, which is pretty confusing to do.
  • The title has to match in Google Scholar's index, of course, which is definitely extensive enough to still be hugely useful, as evidenced by the open source predecessors to the Google Scholar Button trying to do the same thing.
  • But most problematic of all, Google Scholar Button results will only show the local library link resolver link for some citations: the ones registered as having institutional fulltext access in your institutional holdings with Google. I want to get users to the Umlaut landing page for any citation they want, even if we don't have licensed fulltext (and we might, even if Google doesn't think we do; the holdings registrations are not always entirely accurate); I want to show them local physical copies (especially for books), and ILL and other document delivery services.
    • The full Google Scholar gives a hard-to-find, but at least it's there, OpenURL link for "no local fulltext" under a 'more' link, but the Google Scholar Button version doesn't offer even this.
    • Books/monographs might not be the primary use case, but I really want a solution that works for books too — books are something users may especially want a physical copy of rather than online fulltext, and they're also something our holdings registration with Google pretty much doesn't include, even ebooks. Book titles are also a lot less likely to return hits in Google Scholar at all.

I really want a solution that works all or almost all of the time to get the patron to our library landing page, not just some of the time, and my experiments with Google Scholar Button revealed more of a ‘sometimes’ experience.

I'm not sure if the LibX or Lazy Scholar solutions can provide an OpenURL link in all cases, regardless of Google institutional holdings registration. They are both worth further inquiry for sure. But Lazy Scholar isn't open source, and I find its UI not great for our purposes. And I find LibX a bit too heavyweight for solving this problem, and have some other concerns about it.

So let’s consider another avenue for “Bridging the Gap”….

Zotero’s scraping logic

Instead of taking a title and finding a hit in a mega-corpus of scholarly citations, as the Google Scholar Button approach does, another approach would be to try to extract the full citation details from the source page, and construct an OpenURL to send straight to our landing page.
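For concreteness, "constructing an OpenURL" just means serializing the scraped citation elements into a KEV query string on the resolver's base URL. A minimal sketch in ruby, where the base URL and citation values are made up but the keys are standard OpenURL 1.0 KEV keys:

require 'cgi'

# Hypothetical citation elements as scraped from a source page.
citation = {
  "ctx_ver"     => "Z39.88-2004",
  "rft_val_fmt" => "info:ofi/fmt:kev:mtx:journal",
  "rft.atitle"  => "Some Article Title",
  "rft.jtitle"  => "Journal of Examples",
  "rft.volume"  => "12",
  "rft.issue"   => "3",
  "rft.spage"   => "45",
  "rft.date"    => "2014",
  "rft.issn"    => "1234-5678"
}

resolver_base = "https://resolver.example.edu/umlaut" # hypothetical resolver URL

# Build the OpenURL; sending the user here lands them on the Umlaut page for this citation.
openurl = resolver_base + "?" +
  citation.map { |k, v| "#{CGI.escape(k)}=#{CGI.escape(v)}" }.join("&")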

And, hey, it has occurred to me, there’s some software that already can scrape citation data elements from quite a long list of web sites our patrons might want to start from.  Zotero. (And Mendeley too for that matter).

In fact, you could use Zotero as a method of ‘Bridging the Gap’ right now. Sign up for a Zotero account, install the Zotero extension. When you are on a paywalled citation page on the unauthenticated open web (or a search results page on Google Scholar, Amazon, or other places Zotero can scrape from), first import your citation into Zotero. Then go into your Zotero library, find the citation, and — if you’ve properly set up your OpenURL preferences in Zotero — it’ll give you a link to click on that will take you to your institutional OpenURL resolver. In our case, our Umlaut landing page.

We know from some faculty interviews that some faculty definitely use Zotero; it's hard to say whether a majority do or not. I do not know how many have managed to set up their OpenURL preferences in Zotero, if that is part of their use of it.

Even of those who have, I wonder how many have figured out on their own that they can use Zotero to "bridge the gap" in this way. But even if we undertook an education campaign, it is a somewhat cumbersome process. You might not want to actually import into your Zotero library; you might want to take a look at the article first. And not everyone chooses to use Zotero, and we don't want to require them to for a "bridging the gap" solution.

But that logic is there in Zotero: the pretty tricky work of compiling and maintaining "scraping" rules for a huge list of sites likely to be desirable as "bridging the gap" sources. And Zotero is open source. Hmm.

We could imagine adding a feature to Zotero that let the user choose to go right to an institutional OpenURL link after scraping, instead of having to import and navigate to their Zotero library first. But I'm not sure such a feature would match the goals of the Zotero project, or how to integrate it into the UX in a clear way without distracting from Zotero's core functionality.

But again, it’s open source.  We could imagine ‘forking’ Zotero, or extracting just the parts of Zotero that matter for our goal, into our own product that did exactly what we wanted. I’m not sure I have the local resources to maintain a ‘forked’ version of plugins for several browsers.

But Zotero also offers a bookmarklet. It doesn't have as good a UI as the browser plugins, and it doesn't support all of the scrapers. But unlike a browser plugin, you can install it on iOS and Android mobile browsers (it's a bit confusing to do so, but at least it's possible). And it's probably "less expensive" for a developer to maintain a "fork" of — we really just want to take Zotero's scraping behavior, implemented via bookmarklet, and completely replace what happens after the scrape: send the citation to our institutional OpenURL resolver.

I am very intrigued by this possibility; it seems at least worth some investigatory prototypes to have patrons test. But I haven't yet figured out where to actually find the bookmarklet code, and the related code in Zotero that may be triggered by it, let alone taken the next step of figuring out whether it can be extracted into a "fork". I've tried looking around on the Zotero repo, but I can't figure out what's what. (I think all of Zotero is open source?)

Anyone know the Zotero devs, and want to see if they want to talk to me about it with any advice or suggestions? Or anyone familiar with the Zotero source code themselves and want to talk to me about it?

Posted in General | 23 Comments

Of ISBN reliability, and the importance of metadata

I write a lot of software that tries to match bibliographic records from one system to another, to try and identify availability of a desired item in various systems.

For instance, if I'm in Google Scholar, I might click on an OpenURL link to our university. Then my software wants to figure out if the item I clicked on is available from any of our various systems — primarily the catalog, or the BorrowDirect consortium, or maybe a free digital version from HathiTrust, among a few other places.

You could instead be coming from EBSCOHost, or even WorldCat, or any other third party platform that uses OpenURL to hand-off a citation to a particular institution.

The best way to find these matches is with an identifier, like ISBN, OCLCnum, or LCCN. Trying to search by just author/title, it's a lot harder for software to be sure it's found the right thing, and only the right thing. This is why we use identifiers, right? And ISBN is by far the most popular one, the one most likely to be given to my system by a third-party system like Google Scholar — even though there are many titles that don't have ISBNs, Google Scholar won't send me an OCLCnum or LCCN. We have to work with what we've got.

One problem that can arise is when an ISBN seems “wrong” somewhere.

Recently, I was looking for a copy of W.E.B. DuBois’ Darkwater, after I learned that  DuBois had written a speculative fiction short story that appears in that collection! (You can actually read it for free on Project Gutenberg online, but I wanted the print version).

Coming in on an OpenURL link, WorldCat gave my system this ISBN for that title: 9780743460606. That 13-digit ISBN translates into the 10-digit ISBN 074346060X. (Every 10-digit ISBN has an equivalent 13-digit ISBN which represents the same assigned ISBN, just in 13-digit form.)
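The conversion is mechanical: drop the 978 prefix and recompute the mod-11 check digit. A quick illustration in ruby:

# Convert a 978-prefixed ISBN-13 to its equivalent ISBN-10 by dropping
# the prefix and recomputing the mod-11 check digit ("X" means 10).
def isbn13_to_isbn10(isbn13)
  digits = isbn13.gsub(/\D/, "")
  return nil unless digits.length == 13 && digits.start_with?("978")

  core = digits[3, 9] # the nine digits the two forms share
  sum = core.chars.each_with_index.inject(0) { |s, (d, i)| s + d.to_i * (10 - i) }
  check = (11 - sum % 11) % 11
  core + (check == 10 ? "X" : check.to_s)
end

isbn13_to_isbn10("9780743460606") # => "074346060X"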

When my local system powered by Umlaut searched our local catalog for that ISBN — it came up with an entirely different book!  Asphalt, by Carl Rux, Atria Books, 2004.  Because my system assumes if it gets an ISBN match it’s the right book, it offered that copy to me, and I requested it for delivery at the circ desk — and only when it gave me a request confirmation did I realize, hey, that’s not the book I wanted!

It turns out that not only OCLC, but the LC Catalog itself, lists that same 10-digit version of the ISBN on two different bibliographic records for two entirely different titles. (I don't know if the URLs are persistent, but try this one.) LCCN 2003069067 is for an edition of DuBois' Darkwater, Washington Square Press, 2004; and LCCN 2003069638 is for a 2004 Atria Press edition of Rux's Asphalt. In both records, that same ISBN 074346060X appears in a MARC 020$a as an ISBN.

So what's going on? Is there an error in the LC cataloging, which made it into WorldCat and many, many libraries' cataloging? Or did a publisher illegally re-use the same ISBN twice? (The publisher names appear different, but perhaps they are two different imprints of the same publisher? How else would they wind up with the same ISBN prefix?)

I actually don't know. I did go and get a copy of the 2004 Atria Press Asphalt by Rux from our stacks. It's a hardcover and no longer has its dust jacket, as is typical. But on the verso, in the LC Cataloging-in-Publication data, it lists a different ISBN: "ISBN 0-7434-7400-7 (alk. paper)". It does not list the 074346060X ISBN. I think the 074346060X may really belong to the DuBois 2004 Washington Square Press edition? In some cataloging records for the Rux Asphalt, both ISBNs appear, as repeated 020s.

It took me quite a bit of time to get to the bottom of this (which I still haven't done, actually), a couple hours at least. I did it because I was curious, and I wanted to make sure there wasn't an error in my software. We can't really "afford" to do this with every mistake or odd thing in our data. But this is a reminder that our software systems can only be as good as our data. And data can be very expensive to fix — let's say this is an error in LC, and LC fixes it; I have no idea how long it would take to make it to WorldCat, or to individual libraries. There are many libraries that don't routinely download updates/changes from WorldCat, and the correction would probably never make it to them. (If you have a way to report this to LC and feel like it, feel free to do so and update us in comments!)

It's also a reminder that periodically downloading updates from WorldCat, to sync your catalog with any changes in the central system, is a really good idea. It's time consuming enough for one person to notice an error like this (if it is an error), figure out how to report it, and for someone to fix it. That work should result in updated records for everyone, not just the individual libraries that happen to notice the issue and manually download new copy or fix it.

It may not be a cataloging error — publishers have sometimes assigned the same ISBN to more than one title, due to a software error on their part, or not understanding how the ISBN system works. Sometimes a publisher figures that if the ISBN was previously used on an edition that's been out of print for 20 years, why not re-use it? This is not allowed by the ISBN system, and it causes havoc in computer systems when a publisher does it. But the ISBN registrars could probably be doing a better job of educating publishers about this (it's not mentioned in Bowker's FAQ; maybe they think it's obvious?). Or even applying some kind of financial penalty or punishment if a publisher does this, to make sure there's a disincentive?

At any rate, as the programmers say: Garbage In, Garbage Out. Our systems can only work with the (meta)data they've got — and our catalogers' and metadata professionals' work with our data is crucial to our systems' ultimate performance.

Posted in General | 2 Comments

unicode normalization in ruby 2.2

Ruby 2.2 finally introduces a #unicode_normalize method on strings. Defaults to :nfc, but you can also normalize to other unicode normalization forms such as :nfd, :nfkc, and :nfkd.
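For example, the precomposed and decomposed forms of "é" are different strings until you normalize them:

composed   = "\u00E9"   # "é" as a single precomposed codepoint
decomposed = "e\u0301"  # "e" followed by a combining acute accent

composed == decomposed                          # => false
composed.unicode_normalize(:nfd) == decomposed  # => true
decomposed.unicode_normalize == composed        # => true (default form is :nfc)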


Unicode normalization is something you often have to do when dealing with unicode, whether you knew it or not. Prior to ruby 2.2, you had to install a third-party gem to do this, adding another gem dependency. Of the gems available, some monkey-patched String in ways I wouldn't have preferred, some worked only on MRI and not JRuby, some had unpleasant performance characteristics, etc. Here are some benchmarks I ran a while ago on available gems' unicode normalization and performance, although since I did those benchmarks new options have appeared and performance characteristics have changed. But now we don't need to deal with any of that: just use the stdlib.

One thing I can't explain is that the only ruby stdlib documentation I can find on this suggests the method should be called just `normalize`. But nope, it's actually `unicode_normalize`. Okay. Can anyone explain what's going on here?

`unicode_normalized?` (not just `normalized?`) is also available, also taking a normalization form argument.

The next major release of Rails, Rails 5, is planned to require ruby 2.2. I think a lot of other open source will follow that lead. I'm considering switching some of my projects over to require ruby 2.2 as well, to take advantage of some of the new stdlib like this. Although I'd probably wait until JRuby 9k comes out, which is planned to support the 2.2 stdlib and other changes. Hopefully soon. In the meantime, I might write some code that uses #unicode_normalize when it's present, and otherwise monkey-patches in a #unicode_normalize method implemented with some other gem — although that still requires making the other gem a dependency. I'll admit there are some projects of mine that really should be unicode normalizing in some places, but where I could just barely get away without it, and skipped it because I didn't want to deal with the dependency. Or I could require MRI 2.2 or the latest JRuby, and just monkey-patch a simple pure-Java #unicode_normalize if we're on JRuby and String.instance_methods doesn't include :unicode_normalize.
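Here's a sketch of what that conditional shim might look like, purely as an illustration of the idea (not code I actually use anywhere), with the JVM's built-in java.text.Normalizer as the JRuby fallback:

unless String.instance_methods.include?(:unicode_normalize)
  raise "expected to be running on JRuby here" unless RUBY_PLATFORM == "java"
  require 'java'

  class String
    # Minimal stand-in for Ruby 2.2's String#unicode_normalize,
    # backed by java.text.Normalizer.
    def unicode_normalize(form = :nfc)
      java_form = case form
                  when :nfc  then java.text.Normalizer::Form::NFC
                  when :nfd  then java.text.Normalizer::Form::NFD
                  when :nfkc then java.text.Normalizer::Form::NFKC
                  when :nfkd then java.text.Normalizer::Form::NFKD
                  else raise ArgumentError, "unknown normalization form: #{form}"
                  end
      java.text.Normalizer.normalize(self, java_form).to_s
    end
  end
end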

Posted in General | 3 Comments

simple config for faster travis ruby builds

There are a few simple things you can configure in your .travis.yml to make your travis builds faster for ruby builds. They are oddly under-documented by travis, in my opinion, so I'm noting them here.


Odds are your ruby/rails app uses nokogiri (all Rails 4.2 apps do, as nokogiri has become a rails dependency in 4.2). Some time ago (in the past year, I think?) nokogiri releases switched to building libxml and libxslt from source when you install the gem.

This takes a while. On various machines I’ve seen 30 seconds, two minutes, 5 minutes.  I’m not sure how long it usually takes on travis, as travis logs don’t seem to give timing for this sort of thing, but I know I’ve been looking at the travis live web console and seen it paused on “installing nokogiri” for a while.

But you can tell nokogiri to use already-installed libxml/libxslt system libraries if you know the system already has compatible versions installed — which travis seems to — with the ENV variable `NOKOGIRI_USE_SYSTEM_LIBRARIES=true`.  Although I can’t seem to find that documented anywhere by nokogiri, it’s the word on the street, and seems to be so.

You can set such in your .travis.yml thusly:


Use the new Travis architecture

Travis introduced a new architecture on their end using Docker, which is mostly invisible to you as a travis user.  But the architecture is, at the moment, opt-in, at least for existing projects. 

Travis plans to eventually start moving over even existing projects to the new architecture by default. You will still be able to opt-out, which you’d do mainly if your travis VM setup needed “sudo”, which you don’t have access to in the new architecture.

But in the meantime, what we want is to opt-in to the new architecture, even on an existing project. You can do that simply by adding:

sudo: false

to your .travis.yml.

Why do we care?  Well, travis suggests that the new architecture “promises short to nonexistent queue wait times, as well as offering better performance for most use cases.” But even more importantly for us, it lets you do bundler caching too…

Bundler caching

If you're like me, a significant portion of your travis build time is just installing all those gems. On your personal dev box, you have gems you've already installed, and when they're listed in your Gemfile.lock they just get used; bundler/rubygems doesn't need to go reinstalling them every time.

But the travis environment normally starts with a clean slate on every build, so every build it has to go reinstalling all your gems from your Gemfile.lock.

Aha, but travis has introduced a caching feature that can cache installed gems.  At first this feature was only available for paid private repos, but now it’s available for free open source repos if you are using the new travis architecture (above).

For most cases, simply add this to your .travis.yml:

cache: bundler

There can be complexities in your environment which require a more complex setup to get bundler caching to work; see the travis docs.
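Putting the three settings together, a minimal .travis.yml might look something like this (the ruby versions listed are just examples):

language: ruby
sudo: false       # opt in to the new container-based architecture
cache: bundler    # cache installed gems between builds
rvm:
  - 2.1
  - 2.2
env:
  global:
    - NOKOGIRI_USE_SYSTEM_LIBRARIES=true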

Happy travis’ing

The existence of travis, offering free CI builds to open source software on such a well-designed platform, has seriously helped open source software quality and reliability increase in leaps and bounds. I think it's one of the things that has allowed the ruby community to deal with fairly quickly changing ruby versions: you can run CI on every commit, even on multiple ruby versions.

I love travis.

It's odd to me that they don't highlight some of these settings better in their docs. In general, I think the travis docs have been having trouble keeping up with travis changes — they are quite well written, but sometimes seem to be missing key information, or to include not quite complete or correct information for current travis behavior. I can't even imagine how much AWS CPU time all those libxml/libxslt compilations on every single travis build are costing them! I guess they're working on turning on bundler caching by default, which will significantly reduce the number of times nokogiri gets built from source, once they do.

Posted in General | Leave a comment

subscription libraries, back to the future

Maybe you thought libraries were "the Netflix for books", but in this Wired article, The 'Netflix for Books' Just Invaded Amazon's Turf, it's not libraries they're talking about, and it's not just Amazon's turf being invaded. They're talking about the vendor, Oyster, starting to sell books rather than just offering a subscription lending library; that's what they mean by "Amazon's turf." Still, one might have thought that lending books was libraries' "turf", but libraries don't even get a mention.

Before the existence of public libraries, paid subscription libraries were a thing, both as commercial entities and as private clubs, popular in the 18th and 19th centuries. Books were much more expensive then than they are now.

The United States played a key role in developing public free libraries, democratizing access to published knowledge and cultural production.

It might be instructive to compare the user workflow for actually getting books onto your device of choice between Amazon's and Oyster's systems (for both lending and purchase) and the vendors and solutions typically used by libraries (OverDrive, etc.). I suspect it wouldn't look pretty for libraries' offerings. The ALA has a working group trying to figure out what can be done.

Posted in General | Leave a comment