Of ISBN reliability, and the importance of metadata

I write a lot of software that tries to match bibliographic records from one system to another, to try and identify availability of a desired item in various systems.

For instance, if I’m in Google Scholar, I might click on an OpenURL link to our university. Then my software wants to figure out if the item I clicked on is available from any of our various systems: primarily the catalog, or the BorrowDirect consortium, or maybe a free digital version from HathiTrust, or a few other places.

I could instead be coming from EBSCOHost, or even WorldCat, or any other third-party platform that uses OpenURL to hand off a citation to a particular institution.

The best way to find these matches is with an identifier, like ISBN, OCLCnum, or LCCN. When searching by just author/title, it’s a lot harder for software to be sure it’s found the right thing, and only the right thing. This is why we use identifiers, right? And ISBN is by far the most popular one, the one most likely to be given to my system by a third-party system like Google Scholar. Even though there are many titles that don’t have ISBNs, Google Scholar won’t send me an OCLCnum or LCCN. We have to work with what we’ve got.

One problem that can arise is when an ISBN seems “wrong” somewhere.

Recently, I was looking for a copy of W.E.B. DuBois’ Darkwater, after I learned that  DuBois had written a speculative fiction short story that appears in that collection! (You can actually read it for free on Project Gutenberg online, but I wanted the print version).

When I followed an OpenURL link from WorldCat, WorldCat gave my system this ISBN for that title: 9780743460606. That 13-digit ISBN translates into the 10-digit ISBN 074346060X. (Every 10-digit ISBN has an equivalent 13-digit ISBN, which represents the same assigned ISBN, just in 13-digit form.)
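The translation between the two forms is just check-digit arithmetic. Here is a minimal sketch in ruby; the method names are made up for illustration (they are not Umlaut’s API), and it assumes a 978-prefixed ISBN-13, since 979-prefixed ISBNs have no 10-digit equivalent.

```ruby
# Hypothetical helpers, for illustration only.

# Convert a 978-prefixed ISBN-13 to its ISBN-10 form.
# Returns nil if the input is not a 13-digit ISBN starting with 978.
def isbn13_to_isbn10(isbn13)
  digits = isbn13.to_s.delete('- ')
  return nil unless digits =~ /\A978\d{10}\z/

  core = digits[3, 9] # drop the 978 prefix and the ISBN-13 check digit
  # ISBN-10 check digit: weights 10 down to 2, modulo 11, with 10 written as X
  sum = core.chars.each_with_index.map { |c, i| c.to_i * (10 - i) }.inject(:+)
  check = (11 - sum % 11) % 11
  core + (check == 10 ? 'X' : check.to_s)
end

# Convert an ISBN-10 to its 978-prefixed ISBN-13 form.
# Returns nil if the input does not look like an ISBN-10.
def isbn10_to_isbn13(isbn10)
  digits = isbn10.to_s.delete('- ').upcase
  return nil unless digits =~ /\A\d{9}[\dX]\z/

  core = '978' + digits[0, 9] # drop the ISBN-10 check digit
  # ISBN-13 check digit: alternating weights 1 and 3, modulo 10
  sum = core.chars.each_with_index.map { |c, i| c.to_i * (i.even? ? 1 : 3) }.inject(:+)
  core + ((10 - sum % 10) % 10).to_s
end

puts isbn13_to_isbn10('9780743460606') # prints 074346060X
puts isbn10_to_isbn13('074346060X')    # prints 9780743460606
```

Normalizing both sides of a comparison to one form (say, always 13 digits) before matching means a 10-digit ISBN in the catalog and a 13-digit one in the OpenURL still line up.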

When my local system powered by Umlaut searched our local catalog for that ISBN — it came up with an entirely different book!  Asphalt, by Carl Rux, Atria Books, 2004.  Because my system assumes if it gets an ISBN match it’s the right book, it offered that copy to me, and I requested it for delivery at the circ desk — and only when it gave me a request confirmation did I realize, hey, that’s not the book I wanted!

It turns out not only OCLC, but the LC Catalog itself lists that same 10-digit version of the ISBN on two different bibliographic records for two entirely different titles. (I don’t know if catalog.loc.gov URLs are persistent, but try this one). LCCN 2003069067 is for an edition of DuBois’ Darkwater, Washington Square Press 2004; and LCCN 2003069638 is for a 2004 Atria Press edition of Rux’s Asphalt. In both records in catalog.loc.gov, that same ISBN 074346060X appears in a MARC 020$a as an ISBN.

So what’s going on? Is there an error in the LC cataloging, which made it into WorldCat and many, many libraries’ cataloging? Or did a publisher illegally re-use the same ISBN twice? (The publisher names appear different, but perhaps they are two different imprints of the same publisher? How else would they wind up with the same ISBN prefix?)

I actually don’t know. I did go and get a copy of the 2004 Atria Press Asphalt by Rux from our stacks. It’s a hardcover and no longer has its dust jacket, as is typical. But on the verso in the LC Cataloging-in-publication data, it lists a different ISBN: “ISBN 0-7434-7400-7 (alk. paper)”. It does not list the 074346060X ISBN. I think the 074346060X may really belong to the DuBois 2004 Washington Square Press edition? In some cataloging records for the Asphalt/Rux, both ISBNs appear, as repeated 020s.

It took me quite a bit of time to try to get to the bottom of this (which I still haven’t actually done), a couple hours at least. I did it because I was curious, and I wanted to make sure there wasn’t an error in my software. We can’t really “afford” to do this with every mistake or odd thing in our data. But this is a reminder that our software systems can only be as good as our data. And data can be very expensive to fix. Let’s say this is an error in LC, and LC fixes it: I have no idea how long it would take to make it to WorldCat, or to individual libraries. There are many libraries that don’t routinely download updates/changes from WorldCat, and the correction would probably never make it to them. (If you have a way to report this to LC and feel like it, feel free to do so and update us in comments!)

It’s also a reminder that periodically downloading updates from WorldCat, to sync your catalog to any changes in the central system, is a really good idea. It’s time consuming enough for one person to notice an error like this (if it is an error) and figure out how to report it, and for someone to fix it. That work should result in updated records for everyone, not just for individual libraries that happen to notice the issue and manually download new copy or fix it.

It may not be a cataloging error. Publishers have sometimes assigned the same ISBN to more than one title, due to a software error on their part, or to not understanding how the ISBN system works. Sometimes a publisher figures that if the ISBN was previously used in an edition that’s been out of print for 20 years, why not re-use it? This is not allowed by the ISBN system, and it causes havoc in computer systems if a publisher does so. But the ISBN registrars could probably be doing a better job of educating publishers about this (it’s not mentioned in Bowker’s FAQ, maybe they think it’s obvious?). Or even applying some kind of financial penalty or punishment if a publisher does this, to make sure there’s a disincentive?

At any rate, as the programmers say, Garbage In, Garbage Out. Our systems can only work with the (meta)data they’ve got, and our catalogers’ and metadata professionals’ work with our data is crucial to our systems’ ultimate performance.

Posted in General | 2 Comments

unicode normalization in ruby 2.2

Ruby 2.2 finally introduces a #unicode_normalize method on strings. Defaults to :nfc, but you can also normalize to other unicode normalization forms such as :nfd, :nfkc, and :nfkd.


Unicode normalization is something you often have to do when dealing with unicode, whether you knew it or not. Prior to ruby 2.2, you had to install a third-party gem to do this, adding another gem dependency. Of the gems available, some monkey-patched String in ways I wouldn’t have preferred, some worked only on MRI and not jruby, some had unpleasant performance characteristics, etc. Here are some benchmarks I ran a while ago on the gems available for unicode normalization and their performance, although since I did those benchmarks, new options have appeared and performance characteristics have changed. But now we don’t need to deal with any of it, just use the stdlib.

One thing I can’t explain is that the only ruby stdlib documentation I can find on this, suggests the method should be called just `normalize`.  But nope, it’s actually `unicode_normalize`.  Okay. Can anyone explain what’s going on here?

`unicode_normalized?` (not just `normalized?`) is also available, also taking a normalization form argument.
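A quick illustration of what these two methods do, assuming ruby 2.2 or later and UTF-8 strings:

```ruby
composed   = "\u00e9"  # "é" as a single code point
decomposed = "e\u0301" # "e" followed by a combining acute accent

# The two strings render identically but are not equal byte-for-byte.
puts composed == decomposed                         # false
puts composed == decomposed.unicode_normalize       # true (default form is :nfc)
puts decomposed == composed.unicode_normalize(:nfd) # true

# unicode_normalized? takes the same optional form argument.
puts decomposed.unicode_normalized?       # false (not in NFC)
puts decomposed.unicode_normalized?(:nfd) # true

# The compatibility forms also fold things like ligatures.
puts "\ufb01".unicode_normalize(:nfkc)    # prints "fi" as two plain letters
```

This is why normalizing before comparing, indexing, or de-duplicating strings matters: the same visible text can arrive in either form.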

The next major release of Rails, Rails 5, is planned to require ruby 2.2. I think a lot of other open source will follow that lead. I’m considering switching some of my projects over to require ruby 2.2 as well, to take advantage of some of the new stdlib like this. Although I’d probably wait until JRuby 9k comes out, which is planned to support the 2.2 stdlib and other changes. Hopefully soon. In the meantime, I might write some code that uses #unicode_normalize when it’s present, and otherwise monkey-patches in a #unicode_normalize method implemented with some other gem, although that still requires making the other gem a dependency. I’ll admit there are some projects I have that really should be unicode normalizing in some places, where I could barely get away without it and skipped it because I didn’t want to deal with the dependency. Or I could require MRI 2.2 or the latest jruby, and just monkey-patch a simple pure-java #unicode_normalize if running on JRuby and not String.instance_methods.include? :unicode_normalize.
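A minimal sketch of the “use it when it’s present” idea, written as a helper rather than a monkey-patch. The `normalize_nfc` name and the `fallback` argument are hypothetical; the fallback stands in for whatever third-party gem you would otherwise call.

```ruby
# Hypothetical helper: prefer the stdlib method, otherwise delegate to a
# caller-supplied normalizer (for instance a lambda wrapping a gem).
def normalize_nfc(str, fallback = nil)
  if str.respond_to?(:unicode_normalize)
    str.unicode_normalize(:nfc)
  elsif fallback
    fallback.call(str)
  else
    raise NotImplementedError,
          'String#unicode_normalize is unavailable and no fallback was given'
  end
end

puts normalize_nfc("e\u0301") == "\u00e9" # true on ruby 2.2 or later
```

Keeping the check in one helper means the gem dependency can be dropped later by deleting the fallback branch, without touching callers.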


simple config for faster travis ruby builds

There are a few simple things you can configure in your .travis.yml to make your travis builds faster for ruby builds. They are oddly under-documented by travis in my opinion, so I’m noting them here.


Odds are your ruby/rails app uses nokogiri. (all Rails 4.2 apps do, as nokogiri has become a rails dependency in 4.2)  Some time ago (in the past year I think?) nokogiri releases switched to building libxml and libxslt from source when you install the gem.

This takes a while. On various machines I’ve seen 30 seconds, two minutes, 5 minutes.  I’m not sure how long it usually takes on travis, as travis logs don’t seem to give timing for this sort of thing, but I know I’ve been looking at the travis live web console and seen it paused on “installing nokogiri” for a while.

But you can tell nokogiri to use already-installed libxml/libxslt system libraries if you know the system already has compatible versions installed — which travis seems to — with the ENV variable `NOKOGIRI_USE_SYSTEM_LIBRARIES=true`.  Although I can’t seem to find that documented anywhere by nokogiri, it’s the word on the street, and seems to be so.

You can set such in your .travis.yml thusly:
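A minimal sketch of that setting, assuming Travis’s standard `env: global:` key for environment variables that apply to every job in the build:

```yaml
# .travis.yml
env:
  global:
    - NOKOGIRI_USE_SYSTEM_LIBRARIES=true
```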


Use the new Travis architecture

Travis introduced a new architecture on their end using Docker, which is mostly invisible to you as a travis user.  But the architecture is, at the moment, opt-in, at least for existing projects. 

Travis plans to eventually start moving over even existing projects to the new architecture by default. You will still be able to opt-out, which you’d do mainly if your travis VM setup needed “sudo”, which you don’t have access to in the new architecture.

But in the meantime, what we want is to opt-in to the new architecture, even on an existing project. You can do that simply by adding:

sudo: false

To your .travis.yml.

Why do we care?  Well, travis suggests that the new architecture “promises short to nonexistent queue wait times, as well as offering better performance for most use cases.” But even more importantly for us, it lets you do bundler caching too…

Bundler caching

If you’re like me, a significant portion of your travis build time is just installing all those gems. On your personal dev box, you have gems you already installed, and when they’re listed in your Gemfile.lock they just get used; bundler/rubygems doesn’t need to go reinstalling them every time.

But the travis environment normally starts with a clean slate on every build, so every build it has to go reinstalling all your gems from your Gemfile.lock.

Aha, but travis has introduced a caching feature that can cache installed gems.  At first this feature was only available for paid private repos, but now it’s available for free open source repos if you are using the new travis architecture (above).

For most cases, simply add this to your .travis.yml:

cache: bundler

There can be complexities in your environment which require more complex setup to get bundler caching to work, see the travis docs.

Happy travis’ing

The existence of travis offering free CI builds to open source software, and with such a well-designed platform, has seriously helped open source software quality/reliability increase in leaps and bounds. I think it’s one of the things that has allowed the ruby community to deal with fairly quickly changing ruby versions, that you can CI on every commit, on multiple ruby versions even.

I love travis.

It’s odd to me that they don’t highlight some of these settings in their docs better. In general, I think travis docs have been having trouble keeping up with travis changes — travis docs are quite good as far as being written well, but seem to sometimes be missing key information, or including not quite complete or right information for current travis behavior. I can’t even imagine how much AWS CPU time all those libxml/libxslt compilations on every single travis build are costing them!  I guess they’re working on turning on bundler caching by default, which will significantly reduce the number of times nokogiri gets built, once they do.


subscription libraries, back to the future

Maybe you thought libraries were “the netflix for books”, but in this Wired article, The ‘Netflix for Books’ Just Invaded Amazon’s Turf, it’s not libraries they’re talking about, and it’s not just Amazon’s turf they’re invading. What they mean by “Amazon’s turf” is that the vendor, Oyster, is starting to sell books, not just offer a subscription lending library. Still, one might have thought that lending books was the “turf” of libraries, but they don’t even get a mention.

Before the existence of public libraries, paid subscription libraries were a thing, both as commercial entities and private clubs, popular in the 18th and 19th centuries. Books were expensive then compared to now.

The United States played a key role in developing public free libraries, democratizing access to published knowledge and cultural production.

It might be instructive to compare the user workflow in actually getting books onto your device of choice between Amazon and Oyster’s systems (for both lending and purchase), and the vendors and solutions typically used by libraries (OverDrive, etc). I suspect it wouldn’t look pretty for libraries’ offerings. The ALA has a working group trying to figure out what can be done.


“Streamlining access to Scholarly Resources”

A new Ithaka report, Meeting Researchers Where They Start: Streamlining Access to Scholarly Resources [thanks to Robin Sinn for the pointer], makes some observations about researcher behavior that many of us probably know, but that most of our organizations haven’t successfully responded to yet:

  • Most researchers work from off campus.
  • Most researchers do not start from library web pages, but from google, the open web, and occasionally licensed platform search pages.
  • More and more of researcher use is on smaller screens, mobile/tablet/touch.

The problem posed by the first two points is the difficulty in getting access to licensed resources. If you start from the open web, from off campus, and wind up at a paywalled licensed platform, you will not be recognized as a licensed user. Because you started from the open web, you won’t be going through EZProxy. As the Ithaka report says, “The proxy is not the answer… the researcher must click through the proxy server before arriving at the licensed content resource. When a researcher arrives at a content platform in another way, as in the example above, it is therefore a dead-end.”

Shibboleth and UI problems

Theoretically, Shibboleth federated login is an answer to some of that. You get to a licensed platform from the open web, you click on a ‘login’ link, and you have the choice to login via your university (or other host organization), using your institutional login at your home organization, which can authenticate you via Shibboleth to the third party licensed platform.

The problem here that the Ithaka report notes is that these Shibboleth federated login interfaces at our  licensed content providers — are terrible.

Most of them even use the word “Shibboleth” as if our patrons have any idea what this means. As the Ithaka report notes, “This login page is a mystery to most researchers. They can be excused for wondering “what is Shibboleth?” even if their institution is part of a Shibboleth federation that is working with the vendor, which can be determined on a case by case basis by pulling down the “Choose your institution” menu.”

Ironically, this exact same issue was pointed out in the NISO “Establishing Suggested Practices Regarding Single Sign-on” (ESPReSSO) report from 2011. The ESPReSSO report goes on to not only identify the problem but suggest some specific UI practices that licensed content providers could take to improve things.

Four years later, almost none have. (One exception is JStor, which actually acted on the ESPReSSO report, and as a result actually has an intelligible federated sign-on UI, which I suspect our users manage to figure out. It would have been nice if the Ithaka report had pointed out good examples, not just bad ones. edit: I just discovered JStor is actually currently owned by Ithaka, perhaps they didn’t want to toot their own horn.).

Four years from now, will the Ithaka report have had any more impact?  What would make it so?

There is one more especially frustrating thing to me regarding Shibboleth, that isn’t about UI.  It’s that even vendors that say they support Shibboleth, support it very unreliably. Here at my place of work we’ve been very aggressive at configuring Shibboleth with any vendor that supports it. And we’ve found that Shibboleth often simply stops working at various vendors. They don’t notice until we report it — Shibboleth is not widely used, apparently.  Then maybe they’ll fix it, maybe they won’t. In another example, Proquest’s shibboleth login requires the browser to access a web page on a four-digit non-standard port, and even though we told them several years ago that a significant portion of our patrons are behind a firewall that does not allow access to such ports, they’ve been uninterested in fixing/changing it. After all, what are we going to do, cancel our license?  As the several years since we first complained about this issue show, obviously not.  Which brings us to the next issue…

Relying on Vendors

As the Ithaka report notes, library systems have been effectively disintermediated in our researchers’ workflows. Our researchers go directly to third-party licensed platforms. We pay for these platforms, but we have very little control of them.

If a platform does not work well on a small screen/mobile device, there’s nothing we can do but plead. If a platform’s authentication system UI is incomprehensible to our patrons, likewise.

The Ithaka report recognizes this, and basically recommends that… we get serious when we tell our vendors to improve their UI’s:

Libraries need to develop a completely different approach to acquiring and licensing digital content, platforms, and services. They simply must move beyond the false choice that sees only the solutions currently available and instead push for a vision that is right for their researchers. They cannot celebrate content over interface and experience, when interface and experience are baseline requirements for a content platform just as much as a binding is for a book. Libraries need to build entirely new acquisitions processes for content and infrastructure alike that foreground these principles.

Sure. The problem is, this is completely, entirely, incredibly unrealistic.

If we were for real to stop “celebrating content over interface and experience”, and have that effected in our acquisitions process, what would that look like?

It might look like us refusing to license something with a terrible UX, even if it’s content our faculty need electronically. Can you imagine us telling faculty that? It’s not going to fly. The faculty wants the content even if it has a bad interface. And they want their pet database even if 90% of our patrons find it incomprehensible. And we are unable to tell them “no”.

Let’s imagine a situation that should be even easier. Let’s say we’re lucky enough to be able to get the same package of content from two different vendors with two different platforms. Let’s ignore the fact that “big deal” licensing makes this almost impossible (a problem which has only gotten worse since a D-Lib article pointed it out 14 years ago). Even in this fantasy land, where we say we could get the same content from two different platforms, let’s say one platform costs more but has a much better UX. In this continued time of library austerity budgets (which nobody sees ending anytime soon), could we possibly pick the more expensive one with the better UX? Will our stakeholders, funders, faculty, deans, ever let us do that? Again, we can’t say “no”.

edit: Is it any surprise, then, that our vendors find business success in not spending any resources on improving their UX?  One exception again is JStor, which really has a pretty decent and sometimes outstanding UI.  Is the fact that they are a non-profit endeavor relevant? But there are other non-profit content platform vendors which have UX’s at the bottom of the heap.

Somehow we’ve gotten ourselves in a situation where we are completely unable to do anything to give our patrons what we know they need.  Increasingly, to researchers, we are just a bank account for licensing electronic platforms. We perform the “valuable service” of being the entity you can blame for how much the vendors are charging, the entity you require to somehow keep licensing all this stuff on smaller budgets.

I don’t think the future of academic libraries is bright, and I don’t even see a way out. Any way out would take strategic leadership and risk-taking from library and university administrators… that, frankly, institutional pressures seem to make it impossible for us to ever get.

Is there anything we can do?

First, let’s make it even worse — there’s a ‘technical’ problem that the Ithaka report doesn’t even mention that makes it even worse. If the user arrives at a paywall from the open web, even if they can figure out how to authenticate, they may find that our institution does not have a license from that particular vendor, but may very well have access to the same article on another platform. And we have no good way to get them to it.

Theoretically, the OpenURL standard is meant to address exactly this “appropriate copy” problem. OpenURL has been a very successful standard in some ways, but the ways it’s deployed simply stop working when users don’t start from library web pages, when they start from the open web, and every place they end up has no idea what institution they belong to or what their appropriate institutional OpenURL link resolver is.

I think the only technical path we have (until/unless we can get vendors to improve their UI’s, and I’m not holding my breath) is to intervene in the UI.  What do I mean by intervene?

The LibX toolbar is one example: a toolbar you install in your browser that adds institutionally specific content and links to web pages, links that can help the user authenticate against a platform arrived at via the open web, even links that can scrape the citation details from a page and help the user get to another ‘appropriate copy’ with authentication.

The problem with LibX specifically is that browser toolbars seem to be a technical dead-end. It has proven pretty challenging to get a browser toolbar to keep working across browser versions. The LibX project seems more and more moribund. It may still be developed, but its documentation hasn’t kept pace, it’s unclear what it can do or how to configure it, and fewer browsers are supported. And especially as our users turn more and more to mobile (as the Ithaka report notes), they more and more often are using browsers in which plugins can’t be installed.

A “bookmarklet” approach might be worth considering, for targeting a wider range of browsers with less technical investment. Bookmarklets aren’t completely closed off in mobile browsers, although they are a pain in the neck for the user to add in many.

Zotero is another interesting example. Zotero, as well as its competitors including Mendeley, can successfully scrape citation details from many licensed platform pages. We’re used to thinking of Zotero as ‘bibliographic management’, but once it’s scraped those citation details, it can also send the user to the institutionally-appropriate link resolver with those citation details, which is what can get the user to the appropriate licensed copy, in an authenticated way. Here at my place of work we don’t officially support Zotero or Mendeley, and haven’t spent much time figuring out how to get the most out of even the bibliographic management packages we do officially support.
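To make “send the user to the link resolver with those citation details” concrete, here is a rough sketch of building an OpenURL 1.0 (Z39.88-2004) link for a journal article in ruby. The resolver base URL, the method name, the citation hash keys, and the sample citation are all made up for illustration; the `rft.*` keys come from the standard’s key/encoded-value journal format.

```ruby
require 'uri'

# Hypothetical link resolver base URL; substitute your institution's.
RESOLVER_BASE = 'https://resolver.example.edu/openurl'

# Build an OpenURL 1.0 key/encoded-value link for a journal article
# from a hash of scraped citation details. Missing details are skipped.
def openurl_for_article(citation, base = RESOLVER_BASE)
  params = {
    'url_ver'     => 'Z39.88-2004',
    'ctx_ver'     => 'Z39.88-2004',
    'rft_val_fmt' => 'info:ofi/fmt:kev:mtx:journal',
    'rft.genre'   => 'article',
    'rft.atitle'  => citation[:article_title],
    'rft.jtitle'  => citation[:journal_title],
    'rft.issn'    => citation[:issn],
    'rft.date'    => citation[:year],
    'rft.volume'  => citation[:volume],
    'rft.spage'   => citation[:start_page]
  }.reject { |_key, value| value.nil? }

  "#{base}?#{URI.encode_www_form(params)}"
end

# Invented sample citation, purely to show the shape of the output.
puts openurl_for_article(
  article_title: 'An Example Article',
  journal_title: 'Journal of Examples',
  issn: '1234-5678',
  year: 2015,
  volume: 12,
  start_page: 34
)
```

The hard part in practice is not building the link but knowing which resolver base URL applies to the user, which is exactly the information a page on the open web doesn’t have.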

Perhaps we should spend more time with these, not just to support ‘bibliographic management’ needs, but as a method to get users from the open web to authenticated access to an appropriate copy. And perhaps we should do other R&D in ‘bookmarklets’; in machine learning for citation parsing so users can just paste a citation into a box (perhaps via bookmarklet) to get authenticated access to appropriate copy; in anything else we can think of to:

Get the user from the open web to licensed copies.  To be able to provide some useful help for accessing scholarly resources to our patrons, instead of just serving as a checkbook. With some library branding, so they recognize us as doing something useful after all.


Preservation in a war zone

On the cover of today’s NYTimes (print Washington edition)

Race in Iraq and Syria to Record and Shield Art Falling to ISIS

BAGHDAD — In those areas of Iraq and Syria controlled by the Islamic State, residents are furtively recording on their cellphones damage done to antiquities by the extremist group. In northern Syria, museum curators have covered precious mosaics with sealant and sandbags….

…There was also the United States invasion in 2003, when American troops stood by as looters ransacked the Baghdad museum, a scenario that, Mr. Shirshab suggested, is being repeated today….

…The Babylon preservation plan also includes new documentation of the site, including brick-by-brick scale drawings of the ruins. In the event the site is destroyed, Mr. Allen said, the drawings can be used to rebuild it….

…The American invasion alerted archaeologists to what needed protecting. After damage and looting at many sites, documentation and preservation accelerated. One result was that the Mosul Museum, attacked by the Islamic State, had been digitally cataloged…

…He oversees an informal team of Syrians he has nicknamed the Monuments Men, many of them his former students. They document damage and looting by the Islamic State, pushing for crackdowns on the black market. Recently, the United Nations banned all trade in Syrian artifacts….

…Now, Iraqi colleagues teach conservators and concerned residents simple techniques to use in areas controlled by the Islamic State, such as turning on a cellphone’s GPS function when photographing objects, to help trace damage or theft, or to add sites to the “no-strike” list for warplanes….


Factors to prioritize (IT?) projects in an academic library

  • Most important: Impact vs. Cost
    • Impact is how many (what portion) of your patrons will be affected; and how profound the benefit may be to their research, teaching, learning.
    • Cost may include hardware or software costs, but for most projects we do the primary cost is staff time.
    • You are looking for the projects with the greatest impact at the lowest cost.
    • If you want to try and quantify, it may be useful to simply estimate three qualities:
      • Portion of userbase impacted (1-10 for 10% to 100% of userbase impacted)
      • Profundity of impact (estimate on a simple scale, say 1 to 3 with 3 being the highest)
      • “Cost” in terms of time. Estimate with only rough granularity knowing estimates are not accurate. 2 weeks, 2 months, 6 months, 1 year. Maybe assign those on a scale from 1-4.
      • You could then simply compute (portion * profundity) / cost, and look for the largest values. Or you could plot on a graph with (benefit = portion * profundity) on the x-axis, and cost on the y-axis. You are looking for projects near the lower right of the graph — high benefit, low cost.
  • Demographics impacted. Will the impact be evenly distributed, or will it be greater for certain demographics? Discipline/school/department? Researcher vs grad student vs undergrad?
    • Are there particular demographics which should be prioritized, because they are currently under-served or because focusing on them aligns with strategic priorities?
  • Types of services or materials addressed.  Print items vs digital items? Books vs journal articles? Other categories?  Again, are there service areas that have been neglected and need to be brought to par? Or service areas that are strategic priorities, and others that will be intentionally neglected?
  • Strategic plans. Are there existing library or university strategic plans? Will some projects address specific identified strategic focuses? Can also be used to determine prioritized demographics or service areas from above.
    • Ideally all of this is informed by strategic vision, where the library organization wants to be in X years, and what steps will get you there. And ideally that vision is already captured in a strategic plan. Few libraries may have this luxury of a clear strategic vision, however.
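The rough (portion * profundity) / cost calculation from the first factor is simple enough to sketch in a few lines of ruby. The project names and numbers here are invented purely to show the arithmetic, and the scales are the ones suggested above (portion 1-10, profundity 1-3, cost 1-4).

```ruby
# Hypothetical projects with made-up estimates, for illustration only.
Project = Struct.new(:name, :portion, :profundity, :cost) do
  # Benefit combines how many patrons are affected with how deeply.
  def benefit
    portion * profundity
  end

  # Higher is better: more benefit per unit of staff time.
  def score
    benefit.to_f / cost
  end
end

projects = [
  Project.new('Fix link resolver menu', 8, 2, 1),
  Project.new('New discovery layer',    10, 3, 4),
  Project.new('Digitize special maps',  3, 3, 2)
]

# Rank from best to worst score.
ranked = projects.sort_by { |project| -project.score }
ranked.each do |project|
  puts format('%-24s benefit=%2d cost=%d score=%.1f',
              project.name, project.benefit, project.cost, project.score)
end
```

The numbers are only as good as the estimates behind them, so this is a conversation starter for prioritization rather than a decision procedure.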

ethical code for software engineering professionals?

Medical professionals have professional ethical codes. For instance, the psychologists who (it is argued) helped devise improved torture methods for the U.S. government are accused of violating the ethical code of their profession.

Do software engineers and others who write software have professional ethical duties?

Might one of them be to do one’s best to create secure software (rather than intentionally releasing software with vulnerabilities for the purposes of allowing people in the know to exploit), and responsibly disclosing any security vulnerabilities found in third party software (rather than keeping them close so they can be used for exploits)?

If so, are the software developers at the NSA (and, more likely, government contractors working for the NSA) guilty of unethical behavior?

Of course, the APA policy didn’t keep the psychologists from doing what they did, and there is some suggestion that the APA even intentionally made sure to leave enough of a loophole, which they potentially regret. And there have been similar controversies within Anthropology. There’s no magic bullet to ethical behavior from simply writing rules, but I still think it’s a useful point for inquiry, at least acknowledging that there is such a thing as professional ethics for the profession, and providing official recognition that these discussions are part of the profession.

Are there ethical duties of software engineers and others who create software?  As software becomes more and more socially powerful, is it important to society that this be recognized? Are these discussions happening?  What professional bodies might they take place in? (IEEE? ACM?).  The ACM has a code of ethics, but it’s pretty vague, it seems easy to justify just about any profit-making activity.

Will the extensive Department of Defense funding of Computer Science (theoretical and applied) in the U.S. make it hard to have these discussions? (When I googled, the discussion that came up of how DoD funding affects computer science research was from 1989. There may be self-interested reasons people aren’t that interested in talking about this.)


Be careful of regexes in a unicode world

Check out the following, which I wrote some time ago:

    # remove non-alphanumeric, excluding apostrophe; replace with space
    title.gsub!(/[^\w\s\']/, ' ') 

See any problem with that? What is \w and \s again? The ruby docs helpfully explain:

/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - A whitespace character: /[ \t\r\n\f]/

See the problem yet?

"el revolución".gsub(/[^\w\s\']/, ' ')
# => "el revoluci n"

Oops. ó is not in the class [a-zA-Z0-9_]. \w doesn’t actually mean “a word character” at all, unless your input is only ascii. The docs probably really should warn you about this, describing the class as “an ascii word character”, and warning you to use other metacharacters if you aren’t just dealing with ascii.

Fortunately, ruby also provides some unicode-aware regex character classes, but they’re a lot harder to remember and longer to type. Here is the corrected version, which also uses a unicode-aware whitespace class instead of `\s`:

"el: revolución".gsub(/[^[[:alnum:]][[:space:]]\']/, ' ')
#=> "el  revolución"

Yep, that’s what we wanted. There are several other unicode-aware character classes, apparently defined by POSIX. The docs also say there are a couple of non-POSIX ones, including:

/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation

I wasn’t able to make that work, it didn’t seem to be recognized in my ruby. I am not sure why, and didn’t bother finding out. What works is good enough for me.

But in a non-ascii world, it turns out, you almost never actually want to use those traditional regex character class metacharacters that many of us have been using for decades. \w and \s, no way. \d is less risky since you probably really do mean 0-9 and not digits from some other script, but that better be what you mean.
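
To make the difference concrete, here is a rough sketch that puts the ascii-only and unicode-aware classes side by side. It assumes ruby 1.9 or later and UTF-8 strings; the `clean_title` helper name is made up for illustration.

```ruby
# Hypothetical helper: replace anything that is not a letter, digit,
# whitespace, or apostrophe with a space, using unicode-aware classes.
def clean_title(title)
  title.gsub(/[^[:alnum:][:space:]']/, ' ')
end

# Same clean-up with the traditional ascii-only metacharacters.
ascii_cleaned   = "el: revolución".gsub(/[^\w\s']/, ' ')
unicode_cleaned = clean_title("el: revolución")

# \d only knows about 0-9; [[:digit:]] also matches digits from other scripts.
arabic_three        = "\u0663" # ARABIC-INDIC DIGIT THREE
ascii_digit_match   = !!(arabic_three =~ /\d/)
unicode_digit_match = !!(arabic_three =~ /[[:digit:]]/)

puts ascii_cleaned       # => "el  revoluci n"
puts unicode_cleaned     # => "el  revolución"
puts ascii_digit_match   # => false
puts unicode_digit_match # => true
```

If you really do mean only 0-9, `\d` is still the right tool; the point is just to pick the class deliberately.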


Ruby threads, gotcha with local vars and shared state

I end up doing a fair amount of work with multi-threading in ruby. (There is some multi-threaded concurrency in Umlaut, bento_search, and traject).  Contrary to some belief, multi-threaded concurrency can be useful even in MRI ruby (which can’t do true parallelism due to the GIL), for tasks that spend a lot of time waiting on I/O, which is the purpose in Umlaut and bento_search (in both cases waiting on external HTTP apis). Traject uses multi-threaded concurrency for true parallelism in jruby (or soon rbx) for high performance.

There’s a gotcha with ruby threads that I haven’t seen covered much. What do you think this code will output from the ‘puts’?

value = 'original'

t = Thread.new do
  sleep 1
  puts value
end

value = 'changed'

t.join

It outputs “changed”. The local var `value` is shared between both threads, so changes made in the primary thread affect the value of `value` in the created thread too. This issue is not unique to threads; it is a result of how closures work in ruby. The local variables used in a closure don’t capture the fixed value at the time of closure creation, they refer to the original local variables themselves. (I’m not entirely sure if this is traditional for closures, or if some other languages do it differently, or what the correct CS terminology is for talking about this stuff.) It confuses people in other contexts too, but can especially lead to problems with threads.
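
The same sharing can be seen without any threads at all. This is a small sketch using plain lambdas; the variable names are only for illustration.

```ruby
greeting = 'original'

# The lambda closes over the variable `greeting` itself, not the
# string it referred to when the lambda was created.
reader = lambda { greeting }

greeting = 'changed'
seen_by_closure = reader.call

# It works in the other direction too: an assignment made inside
# the closure is visible outside of it.
writer = lambda { greeting = 'changed again' }
writer.call

puts seen_by_closure # => changed
puts greeting        # => changed again
```

With threads, the only extra ingredient is that you no longer control when the closure body runs relative to the reassignment.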

Consider a loop which in each iteration prepares some work to be done, then dispatches to a thread to actually do the work.  We’ll do a very simple fake version of that, watch:

threads = []
i = 0
10.times do
  # pretend to prepare a 'work order', which ends up in local
  # var i
  i += 1
  # now do some stuff with 'i' in the thread
  threads << Thread.new do
    sleep 1 # pretend this is a time consuming computation
    # now we do something else with our work order...
    puts i
  end
end

threads.each {|t| t.join}

Do you think you’ll get “1”, “2”, … “10” printed out? You won’t. You’ll get 10 10’s. (With newlines in random places because of interleaving of ‘puts’, but that’s not what we’re talking about here.) You thought you dispatched 10 threads each with different values for ‘i’, but the threads are actually all sharing the same ‘i’; when it changes, it changes for all of them.


Ruby stdlib Thread.new has a mechanism to deal with this, although like much in ruby stdlib (and much about multi-threaded concurrency in ruby), it’s under-documented. But you can pass args to Thread.new, which will be passed to the block too, and allow you to avoid this local var linkage:

require 'thread'

value = 'original'

t = Thread.new(value) do |t_value|
  sleep 1
  puts t_value
end

value = 'changed'

t.join

Now that prints out “original”. That’s the point of passing one or more args to Thread.new.
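
Applied to the work-order loop from earlier, a minimal sketch might look like the following. The block parameter name `work_order` is just illustrative, and each thread returns its value instead of printing it so the result is easy to inspect.

```ruby
threads = []
i = 0
10.times do
  i += 1
  # Pass the current value of i as an argument, so each thread gets
  # its own block-local `work_order` instead of sharing `i`.
  threads << Thread.new(i) do |work_order|
    sleep 0.1 # pretend this is a time consuming computation
    work_order
  end
end

# Thread#value joins the thread and returns the block's result.
results = threads.map(&:value)
puts results.inspect # => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Each thread now sees the value `i` had when that thread was created, no matter how far the loop has advanced by the time the thread wakes up.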

You might think you could get away with this instead:

require 'thread'

value = 'original'

t = Thread.new do
  # nope, not a safe way to capture the value, there's
  # still a race condition
  t_value = value
  sleep 1
  puts t_value
end

value = 'changed'

t.join

While that will seem to work for this particular example, there’s still a race condition there: the value could change before the first line of the thread block is executed. Part of dealing with concurrency is giving up any expectations of what gets executed when, until you wait on a `join`.

So, yeah, the arguments to Thread.new. Which other libraries involving threading sometimes propagate. With a concurrent-ruby ThreadPoolExecutor:

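require 'concurrent'

work = 'original'
pool = Concurrent::FixedThreadPool.new(5)
pool.post(work) do |t_work|
  sleep 1
  puts t_work # is safe
end

work = 'new'

# let the queued task finish before the program exits
pool.shutdown
pool.wait_for_termination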

And it can even be a problem with Futures from concurrent-ruby. Futures seem so simple and idiot-proof, right? Oops.

require 'concurrent'

value = 100

future = Concurrent::Future.execute do
  sleep 1
  # DANGER will robinson!
  value + 1
end

value = 200

puts future.value # you get 201, not 101!

I’m honestly not even sure how you get around this problem with Concurrent::Future; unlike Concurrent::ThreadPoolExecutor, it does not seem to copy stdlib Thread.new in its method of being able to pass block arguments. There might be something I’m missing (or a way to use Futures that avoids this problem?), or maybe the authors of concurrent-ruby haven’t considered it yet either? I’ve asked the question of them. (PS: The concurrent-ruby package is super awesome; it’s still building to 1.0 but usable now. I am hoping that its existence will do great things for practical use of multi-threaded concurrency in the ruby community.)
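
One workaround that doesn’t depend on any particular library’s API is to create the block inside a method, so that it closes over the method’s own parameter rather than a local var the caller can reassign. This is only a sketch with plain stdlib threads, and I have not tried it against Concurrent::Future itself; the `start_worker` name is made up for illustration.

```ruby
# Hypothetical helper: `work` is a fresh local variable on every call,
# and nothing outside this method can reassign it.
def start_worker(work)
  Thread.new do
    sleep 0.1
    work + 1
  end
end

value = 100
worker = start_worker(value)
value = 200 # reassigning the caller's variable no longer matters

result = worker.value # joins the thread and returns the block's result
puts result # => 101
```

Like the arguments to Thread.new, this only protects against the variable being reassigned. If the object itself is mutated in place (say, a string that gets appended to), both sides still see the same object.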

This is, for me, one of the biggest, most dangerous, most confusing gotchas with ruby concurrency. It can easily lead to hard-to-notice, hard-to-reproduce, and hard-to-debug race condition bugs.
