Upgrading a Blacklight app from Solr 1.4 to Solr 4.3

updated version of post previously published, then unpublished,  more content added

Our public catalog is a Blacklight-based app. It went live using a Solr 1.4 index, and it still does.

Meanwhile, the latest Solr is now 4.3. It’s time to upgrade our Solr. I’ve spent a few days getting things working with Solr 4.3; haven’t deployed it to production yet, but it seems to be working out. I have a bit more testing to do, then just figure out how to schedule the switchover.

Here are some notes on things I had to change or do for a switch all the way from 1.4 to 4.3.  I think some of you are in the midst of just such a long-awaited jump too, or soon will be (’cause it’s summer, maybe?), so maybe this will be helpful.

I can’t necessarily cover everything I changed here exhaustively. I was basically guided by:

  • comparing my existing solrconfig.xml and schema.xml to the new example ones from the Solr 4.3 example app;
  • as well as comparing to the solrconfig.xml and schema.xml from the current Solr 4.x-ready Blacklight example solr app, although I am not sure any Blacklight committers are actually using the out-of-the-box example Solr in production, so I don’t necessarily trust it to be optimal or error-free;
  • Also, very useful, check the “Logging” section from left-hand sidebar in the (amazingly improved) Solr admin console, for any deprecation notices or other warnings.

Update SolrMarc

If you use SolrMarc for indexing, you need to be using a version that’s compatible with Solr 4.3.

Blacklight comes with a SolrMarc.jar, but it can be confusing to figure out which version you have built-in, depending on what version of Blacklight you’re using, etc.

However, the built-in rake tasks in Blacklight that use solrmarc (such as `rake solr:marc:index`) will also use a local SolrMarc.jar if one is present.

I downloaded the most recent pre-built SolrMarc.jar, and put it in my local app at ./lib/SolrMarc.jar.  The built-in rake tasks will find it there, and use it.

Logging .jars

In Solr 4.3, for the first time, you need to supply your own logging-related jars; they aren’t bundled in the Solr war.  At the time I was doing this, the Solr docs weren’t necessarily clear (to me anyway) on this, and I spent a bunch of time dealing with it.

If you just used the example jetty bundled with Solr 4.3, you wouldn’t run into this issue, because it’s already got the required jars. But we’ve been using tomcat here (not necessarily for any good reason, and I don’t necessarily recommend it), and were trying not to change all the moving parts at once as part of this switch.

So you’ve got to copy some jars from the example jetty to the relevant place in your tomcat (or other).  Or you get some not very explanatory error messages.

We have a tomcat “split install” where $CATALINA_HOME and $CATALINA_BASE are not the same.  Ideally, it made sense to me to put the custom jars in the (instance-specific) $CATALINA_BASE lib directory. But I could not get that to work (and could not figure out why it wasn’t working through debugging of tomcat’s configuration files; it really looked like it ought to have), so I ended up giving up and putting them in the cross-instance $CATALINA_HOME.
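The jar-copying step can be scripted. Here is a minimal sketch in Ruby, assuming the logging jars sit under `example/lib/ext` in the unpacked Solr distribution (check your own download) and that the destination is your container's `lib` directory. The helper name and both paths are hypothetical.

```ruby
require "fileutils"

# Copy every .jar from the Solr example's logging directory into the
# servlet container's lib directory. Both paths are assumptions;
# adjust them for your own install.
def copy_logging_jars(source_dir, dest_dir)
  FileUtils.mkdir_p(dest_dir)
  jars = Dir.glob(File.join(source_dir, "*.jar")).sort
  jars.each { |jar| FileUtils.cp(jar, dest_dir) }
  jars.map { |jar| File.basename(jar) }
end

# Hypothetical usage:
#   copy_logging_jars("/opt/solr-4.3.0/example/lib/ext",
#                     File.join(ENV.fetch("CATALINA_HOME"), "lib"))
```

The method returns the names of the jars it copied, which is handy for logging what ended up on the classpath.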

New built-in ICU analyzers for unicode

Previously, with Solr 1.4, I was using Bob Haschart’s custom unicode normalization jars for handling unicode.

(A number of us have been using these, but I am honestly not certain where, if anywhere, source and/or compiled jars for this code live on the web, or even what its official name is for googling. Sorry!)

These did a couple things:

  • Normalize all unicode to Normalization Form KC. This is absolutely required for any indexing of unicode, really, to avoid unexpected behavior.
  • Additionally, normalize to an “English ascii” character set, for instance normalizing an e with an accent to just plain “e”.  This is something that you absolutely would not want to do in some non-English contexts, and even in English may or may not want to. But for our use case it was essential: searching for “Simón Bolívar” should find “simon bolivar” and vice versa.

In Solr 1.4, there was no built-in way to do these things, thus Bob Haschart’s custom code came to our rescue.  Some time between then and Solr 4.3, a built-in way arrived: include the ICUFoldingFilter in your analyzer chain to do both of these things.

The ICUFoldingFilter, unlike Bob’s, also normalizes up/downcase, so I could eliminate the LowerCaseFilterFactory I was using before.
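To make concrete what this kind of folding buys you at search time, here is a rough approximation in plain Ruby. It is only a stand-in under stated assumptions: `fold_for_search` is a hypothetical helper, and the real ICUFoldingFilter handles far more cases than decompose, strip accents, lowercase.

```ruby
# Rough stand-in for the kind of folding ICUFoldingFilter performs:
# compatibility-decompose (NFKD), drop combining accent marks, then
# lowercase. The real filter covers many more cases than this.
def fold_for_search(text)
  text.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "").downcase
end

# Indexed text and query text fold to the same token, so they match.
indexed = fold_for_search("Simón Bolívar")
queried = fold_for_search("simon bolivar")
puts indexed == queried
```

The point is that the same folding runs at both index and query time, so accented and unaccented forms land on the same term.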

As long as I was at it, and noticing it was there, I also changed from using a WhitespaceTokenizerFactory to the new multi-language-aware ICUTokenizerFactory.   In addition to splitting on whitespace, for letters from non-Roman scripts it will sometimes tokenize in other language-appropriate places.

(Note, some, such as Tom Burton-West at umich and Naomi Dushay at Stanford, have found that the basic approach I’m using is not sufficient for effective Chinese/Japanese/Korean searching. But my Solr 1.4-based index wasn’t doing the things they are trying to do anyway, and at this point I’m not trying to improve that aspect of my index).

To use these ICU filters, you need to copy a number of optional jars from the Solr example distro to your Solr core `lib` folder (or other place on the classpath for your webapp).

  • icu4j-49.1.jar
  • lucene-analyzers-icu-4.3.0.jar
  • solr-analysis-extras-4.3.0.jar (Note this one is not mentioned by some of the applicable Solr documentation that mentions the other two, but is needed too).

Here’s my standard text field definition for Solr 4.3:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>

        <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>

        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

        <!-- folding needs to be after WordDelimiter, so WordDelimiter
             can do its thing with full cases and such -->
        <filter class="solr.ICUFoldingFilterFactory" />

        <!-- ICUFolding already includes lowercasing, no
             need for separate lowercasing step
        <filter class="solr.LowerCaseFilterFactory"/>
        -->

        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

It is potentially an unusual choice to combine a WordDelimiterFilterFactory (which generally only does sensible things with English) with an ICUTokenizerFactory (which is intentionally language-agnostic, and may not be what you’d choose if you knew your input was English), but I decided it made sense for me. I want to optimize for English, and my corpus and users are mostly (but not entirely) English, but I still want to do reasonable things with the non-English (which is generally mixed in with the English), where it can be done without causing any significant problems for English. I think I’ve done so here.

solr.UpdateRequestHandler replaces XML and Binary subclasses

This is just one example of things pointed out as warnings or deprecations when I checked the logs after starting up. I forget if this was just deprecated and would still work or not, but when doing a switch like this, I’d just as soon update anything deprecated that I notice.

Previously found in my 1.4 solrconfig

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
<requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler" />

Now the standard solr.UpdateRequestHandler does both XML and binary, deciding which based on HTTP headers:

 <requestHandler name="/update" class="solr.UpdateRequestHandler" />
 <requestHandler name="/update/javabin" class="solr.UpdateRequestHandler" />

Solr schema 1.3 => 1.5

My schema.xml in Solr 1.4 had a version="1.3" attribute in the root <schema> tag.

This actually identifies the version of the Solr schema language itself, for backwards compatibility maintenance. I could have left it 1.3 for compatibility with my existing schema, but I find it makes for less confusing maintenance in the long run to try and keep everything up to date when you do a migration like this.

So I changed this to the “1.5” version that is the latest for Solr 4.3, and then made one change to my actual schema.xml to keep it (for now) semantically the same as it was under schema version 1.3, after consulting the docs on changes between versions:

  • Added `autoGeneratePhraseQueries="true"` to the field type definitions of my text field types.

Update ISBN/LCCN normalization

I had been using some custom Solr analyzers from Bill Dueber to normalize ISBNs and LCCNs.  The idea is that ISBN-10 and ISBN-13 should be interchangeable, and so should various alternate forms of LCCN that mean the same thing according to LCCN normalization rules.

These custom analyzers had to be rejiggered a bit to work with Solr 4.3, which I somehow convinced Jay Luker to do for us out of pure kindness (my Java chops and familiarity with Solr’s under-the-hood source code are… not great).

The source is now available for you too on GitHub, if you want it: https://github.com/billdueber/solr-libstdnum-normalize
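For a sense of what “interchangeable” means for ISBNs: an ISBN-10 maps to an ISBN-13 by prefixing 978 and recomputing the check digit. Below is a minimal Ruby sketch of that arithmetic. It is not the code from that project, and the method names are hypothetical; it only illustrates the normalization idea.

```ruby
# Convert an ISBN-10 to its ISBN-13 equivalent: prefix "978", keep the
# first nine digits, and recompute the check digit with the ISBN-13
# weights (1, 3, 1, 3, ...). Hyphens and spaces are ignored.
def isbn10_to_isbn13(isbn10)
  digits = isbn10.gsub(/[\s-]/, "").upcase
  raise ArgumentError, "not an ISBN-10: #{isbn10}" unless digits =~ /\A\d{9}[\dX]\z/

  body = "978" + digits[0, 9]
  sum = body.chars.each_with_index.sum { |ch, i| ch.to_i * (i.even? ? 1 : 3) }
  body + ((10 - sum % 10) % 10).to_s
end

# Normalize either form to a bare ISBN-13 so both index to the same token.
def normalize_isbn(isbn)
  cleaned = isbn.gsub(/[\s-]/, "").upcase
  cleaned.length == 10 ? isbn10_to_isbn13(cleaned) : cleaned
end
```

With both forms reduced to the same 13-digit string at index and query time, a search on either form finds the record.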

Blacklight: Deal with weird escaping of Marc21 binary

We store Marc21 binary records in a Solr text field.

With Solr 1.4 and previous versions of SolrMarc, somehow this just worked — although it’s not clear to me if it should have, what SolrMarc may have had to do (possibly using techniques not documented to be supported by Solr) to get the binary data (including control characters) into a Solr text field, in a way that it somehow came out the other end as intact string literals including control characters.  I honestly don’t really understand what was going on.

But I can say that with Solr 4.3 and the latest SolrMarc… it stopped working quite like that. The binary data still made it into the Solr text field, but when my Blacklight app on the other end made requests to Solr and received back responses with the stored field with the marc21 in it, the control characters came back… oddly.  For instance, the control character represented by hex `1D` comes back as the literal ascii string “#29;”.  Which is some kind of escaping, yeah, but I have no idea how it happens as an interaction of SolrMarc, Solr, and whatever else.

Long-term, the solution is to stop trying to use Solr in this weird undocumented way. Either store MarcXML or Marc-in-json (non-binary, ordinary text stream), or store Marc21 binary in a Solr ‘binary’ field (requiring Base64 encoding and decoding on the client end), or stop storing it in Solr at all, etc.
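The Base64 option is simple on the client end. Here is a minimal sketch in Ruby, assuming the client receives the stored field as a Base64 string; the two helper names are hypothetical and this is not anything Blacklight or SolrMarc provides.

```ruby
# Binary MARC21 contains control characters (0x1D, 0x1E, 0x1F), so
# Base64-encode it before storing it as text, and decode it after fetching.
def encode_marc_for_storage(marc_binary)
  [marc_binary].pack("m0")   # "m0" = Base64 with no line breaks
end

def decode_marc_from_storage(stored)
  stored.unpack1("m0").force_encoding(Encoding::BINARY)
end
```

The encoded form contains only printable ASCII, so no control characters ever pass through Solr, and the decoded bytes are identical to what went in.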

In the near term, I just want to get things working as closely to how they did in my Solr 1.4 version as possible, with as little work as possible.

No problem, just add some code to my local Blacklight-based application to over-ride the Marc fetching code to properly unescape.

In my local `./app/models/solr_document.rb` (which was generated into the app by Blacklight), I added this (my local Solr stored field where the marc is, is called ‘marc_display’):

  # Custom hack to unescape Marc21 control characters that
  # SolrMarc escapes weirdly, and that somehow were automatically unescaped in
  # Solr 1.4, but no longer are with Solr 4.3. I don't understand it. This is a mess.
  # This is no longer needed when we stop using SolrMarc, or stop storing in binary Marc21,
  # or both.
  module LoadMarcEscapeFix
    def load_marc
      if _marc_format_type.to_s == "marc21"
        value = fetch(_marc_source_field)

        # SolrMarc escapes binary marc control chars like this, we need to
        # unescape. Yes, we might theoretically improperly unescape literals too.
        # It's a hell of a system.
        value.gsub!("#29;", "\x1D")
        value.gsub!("#30;", "\x1E")
        value.gsub!("#31;", "\x1F")

        return MARC::Record.new_from_marc( value )
      else
        return super
      end
    end
  end
  use_extension(LoadMarcEscapeFix) do |document|
    document.key?( :marc_display )
  end

It might be nice to somehow contribute this fix back to Blacklight… but I can no longer comprehend what’s going on, all the various interacting software, what’s actually supposed to be happening at each point, what’s intended by the code designers/maintainers (if anything), what the backwards-compatible choice is, etc. I’m overwhelmed. (And still using Blacklight 3.5.0 here, not the latest BL).

That’s about it

Doesn’t sound too painful, does it? It still took me, oh, probably in aggregate nearly a week (although spread out over a couple) to work through all this, plus some other details here and there, testing, etc.   So it goes; it’s why many of us have been running Solr 1.4 for so long. Hopefully these notes will reduce the time for someone else.

So now I mostly just need to figure out how to transition in production.

I already operate with a Solr master/slave replication, where I index to the master, and periodically replicate to the slave — only the slave serves queries from the production app.

This has definitely ended up a convenient setup for any kind of maintenance, to have these two Solr servers.

So to update with no downtime I’ll probably:

  • turn off the replication
  • Upgrade my master to Solr 4.3, including switching to the core ‘conf’ directory I have in a “solr4.3” branch in my git project for my solr configurations.
  • Re-index my entire catalog to master. (overnight).
  • Temporarily point the production apps to be served by master instead of slave.
  • Upgrade my slave to Solr 4.3, replicate master to slave, point production apps back to slave.
  • Non-Profit!
Posted in General | Leave a comment

Umlaut 3.1.0 released, with new Bootstrap-based visual design

I like to be confident that open source code I wrote is pretty stable and robust before recommending that others use it.

So I usually try to run any new code, or new versions of existing code, in production myself for a couple weeks before actually releasing it as a stable release.

I’ve been running Umlaut 3.1.0 in production for a couple weeks now. Some minor problems were found by reviewing the logs for uncaught exceptions, and fixed. It’s ready for a release.

Umlaut is an open source aggregator of “last mile” services, working with your link resolver and other services to provide consolidated and efficient discovery/delivery service provision.

Umlaut 3.1.0 has now been released. Please see the release notes, especially if upgrading from a previous version of umlaut.

The major change is a complete overhaul of the visual design, based on bootstrap, and small-screen friendly. Thanks again to Scot Dalton from NYU for the initiative to make the Bootstrap-based redesign finally happen.



on the internet, and power

Bruce Schneier writes on how our internet lives are frequently dominated by a few huge internet companies with immense power over those internet doings:

There are a lot of good reasons why we’re all flocking to these cloud services and vendor-controlled platforms. The benefits are enormous, from cost to convenience to reliability to security itself. But it is inherently a feudal relationship. We cede control of our data and computing platforms to these companies and trust that they will treat us well and protect us from harm. And if we pledge complete allegiance to them — if we let them control our email and calendar and address book and photos and everything — we get even more benefits. We become their vassals; or, on a bad day, their serfs….

…So how do we survive? Increasingly, we have little alternative but to trust someone, so we need to decide who we trust – and who we don’t – and then act accordingly. This isn’t easy; our feudal lords go out of their way not to be transparent…

In the longer term, we all need to work to reduce the power imbalance…   We need to balance this relationship, and government intervention is the only way we’re going to get it.

I’ve also been thinking a lot about trying to create more cooperatively controlled internet infrastructure as a way to balance this power and bring (economic) democracy to the internet.

And, with regard to libraries, in many of our fantasies the institution of libraries, collectively, would be a force in internet life, a civic, public sector, decentralized but massive in aggregate counter-balance to the ‘feudal’ internet companies.  If libraries can find, keep, and expand a sustainable role as internet actors.


Scientific publishing has some problems beyond business models

From an open letter in the Guardian:

 Early in their training, students learn that the quest for truth needs to be balanced against the more immediate pressure to “publish or perish”….

…This publishing culture is toxic to science. Recent studies have shown how intense career pressures encourage life scientists to engage in a range of questionable practices to generate publications….

…At the same time, journals incentivise bad practice by favouring the publication of results that are considered to be positive, novel, neat and eye-catching. In many life sciences, negative results, complicated results, or attempts to replicate previous studies never make it into the scientific record. Instead they occupy a vast unpublished file drawer….

As academic librarians, our role is to be experts, not in any specific field, but in the phenomenon of academic publishing in general.   In our educational role with students, we ought to be helping students understand and think about these issues, to problematize and complexify the world of research publication.  Despite our patrons’ desire to have blacks and whites that let them complete their assignments with as little thinking as possible (yeah, I said it), it’s our professional duty not only to help them complete their assignments as conveniently as possible but also to help them understand problems and current issues in academic publishing in general.

And to make the case to administrators and faculty that this is our rightful role.  Not all faculty will welcome critique of the scholarly publishing enterprise that is essentially their livelihood either, of course (go read that letter in the Guardian we began with, again).

This reminds me again of Karen Coyle’s excellent points about the phenomenon of “predatory publishers”: we over-simplify if we suggest that publications can easily be split into problem-free ‘good’ and untrustworthy ‘predatory’ categories.

We do disservice to our patrons to imply that as long as they steer clear of identified ‘bad’ publishers from some librarian-endorsed list, then of course anything that’s “peer reviewed” becomes absolutely trustworthy gospel through the magical transubstantiation of ‘peer review’.

And we do disservice to ourselves and our professional capacities to avoid critical engagement with our domain of expertise — academic publishing.  We are — or ought to be — academic professionals, not just clerks and secretaries for the university community or salespeople for scholarly publishers.    Could it help restore professional credibility and respect to librarians if we participated at the front of research into research, of the history and critical analysis of the enterprise of scholarly publishing?


Take control of delivery and access with Umlaut

In a recently published editorial in ITAL, “Services and User Context in the Era of Webscale Discovery,” Mark Dehmlow writes:

A major issue that continues to confound me is the lack of fully integrated request and delivery services that many discovery systems lack. Of course, all of them implement full text linking to every online article that they can create a link to, but as the sphere of scholarly data stretches beyond just articles, library print collections and delivery services have continued to be neglected primarily because implementing those services in an intuitively integrated way, beyond the “link to your old OPAC” methodology, remains a complex task. My main concern with this deficit is that there is a significant amount of scholarly material only available in print and to focus primarily on electronic access limits the ability of our users to perform comprehensive research and reduces access to significant resources and services that libraries provide.

The open source Umlaut software (for which I am principal developer) has been trying to fill this gap for over 7 years now, aiming to provide an aggregated and integrated path to delivery and access that cuts across library departments, systems and services, across the entire library business.

To be sure, Umlaut is not a magic bullet.  It’s more a platform to design the best solution you can in your actually existing infrastructure.  To make the most of Umlaut requires local developer time and creativity to figure out how you can use it to tie together your various systems and services as seamlessly as possible.   And a typical lack of good integration API in much of our existing (proprietary) infrastructure is an added challenge, generally increasing cost/time of developing a good solution.  But Umlaut is designed to be a platform supporting local solutions to integrating delivery, access, and specific item services — giving you the common skeleton on which you can hang your custom local functionality.

I agree with Dehmlow (and others I know I’ve read essays from but can’t find now) that the ‘last mile’ of access and delivery ought to be a priority for libraries — among other reasons, because access and delivery of the mountains of content we still have that is not both online and freely available, is something that we uniquely provide to our patrons, with much less ‘competition’ than for search and discovery services.  If our services aren’t good, our patrons don’t have other options (such as Google) to get (eg) printed monographs for their research (without just buying them).

And, at this stage in the development of our technological infrastructures, this is not something that a proprietary vendor-provided open-the-box-and-turn-it-on solution is going to be able to do well. Integrated access/delivery necessarily involves cross-cutting multiple pieces of local enterprise software (catalog, ILL, local identity/SSO, and that’s just the start) and policies (can you request locally held books to be delivered to your office? Does it depend on who you are and where the book is?).  It requires custom local policy and integration logic. It’s not going to be feasible/economical for a vendor to provide one-size-fits-all software that actually works well in this arena.  So I’m not as

At least, until your entire library enterprise infrastructure comes from one vendor and consists of an actually integrated single-business cloud platform.  This does seem to be where the industry is heading and what, for instance, OCLC, Ex Libris, and Serials Solutions are trying to provide.  I know some of these vendors are trying to provide integrated ‘last mile’ services taking advantage of the consolidated integrated cloud infrastructure they provide, although it’s seldom highlighted as an advantage in their marketing, perhaps because most library customers aren’t yet seeing what an advantage it is, or what a stumbling point this is for our patrons. Where we should be uniquely distinguishing ourselves as able to provide seamless delivery/access, we’re instead just again showing our patrons our ability to provide them with a disjointed, inefficient, frustrating, confusing experience.

In the meantime, there’s Umlaut, to help you try to stitch together a pleasant and fast delivery/access experience.   It hasn’t received quite as much attention in the academic library world as I would hope. I think that’s in part because administrative decision makers have not realized the importance and benefits of improving our ‘last mile’ services, and certainly standing up an Umlaut at your institution does take some local development resources.  However, in addition to my place of work, NYU and Vanderbilt have been using Umlaut for a while.  Recently, I’ve heard of potential interest from several other large research university libraries.  I am hoping that at some point there will be sufficient critical mass of library developers using Umlaut that we can use the platform to take the ‘last mile’ to even greater levels of convenience and integration for our users than I’ve had the resources to do with Umlaut so far.


how to make apache fake a 500 http response

For experimentation or testing (manual or automated, if automated usually captured by vcr), I sometimes need a URL guaranteed to always return an HTTP 500 error response.

Here’s some configuration you can drop in an apache conf to generate a simple default 500:

Redirect 500 /error500

Accessing http://yourserver/error500, or /error500/more/path, or /error500/more/path?with=query, all will return a 500 response with apache’s default 500 body.

The command is ‘Redirect’ because this is normally used to generate a redirect with a “Location” header, so you can use it for mocking any 3xx with Location too (the third argument, if present, is the value of the Location header). But it also works fine for mocking up other status codes like 500 or anything else, so long as you don’t care too much about what the body looks like.


More affordable cloud hosting options

A few years ago when AWS was getting a lot of attention in the library world (and everywhere else!), it immediately seemed to me, based on some back of the envelope calculations, to be likely unaffordable to libraries. It did not look like a very good value proposition compared to our standard self-hosting, especially for academic libraries, which can benefit from their host universities’ IT infrastructure; but in general, it wasn’t cheap.

Here’s a blog post arguing that in general, EC2 indeed isn’t a great value proposition, if you mostly need 24/7 instances. Where EC2 shines, of course, is its ability to “elastically” (the “E” in EC2) spin up and down instances on the fly, and pay hourly, to quickly adjust your provisioning to the demand of the moment.

That’s for architectures that are horizontally scaled out a lot, and need to scale up a lot: not most of our library things, but perhaps increasingly more of them (if we successfully get more competent and successful!), sure. Although the author notes, and provides some calculations showing, that even this can be a dicey value proposition.

In that blog post, he mentions other providers with better prices, but not by name.

In the Hacker News thread, someone mentioned Digital Ocean. They seem to offer a service roughly analogous to EC2, but with far better prices, including hourly charging and instantaneous provisioning, so in theory supporting on-demand load balancing for highly horizontally scaled services, just like EC2.  It might still be hard for a university library to beat in-house hosting, due to having our existing university IT infrastructures, where we probably don’t need to pay (or pay seriously reduced prices) for bandwidth, electricity, server room facilities, maybe even operations staff, etc. But if you do have a context where cloud hosting makes sense, it’s worth remembering that Amazon is not the only reliable/competent player in town, and is definitely not the cheapest, although it may be the most feature-complete.


ActiveRecord: Atomic check-and-update through optimistic locking

In Umlaut, there is a database column that basically corresponds to “service_status”.  This gets set, in sequence, to “queued”, “in_progress”, and then “complete”. (Also some possibility of error statuses etc).

There is a point in Umlaut logic where it first checks to make sure a row is “queued”, then only if it is sets it to “in_progress” and executes the service.

The idea is that if it was already set to “in_progress” by someone else (another thread or another process entirely), we leave it alone, we don’t execute the logic, it’s already in progress by someone else.

And the problem with the original implementation was the race condition. First we fetch the model instance (an SQL query), and check its service_status. Then, only if its service_status is ‘queued’, do we set its service_status to `in_progress` and proceed to execute. But the race condition is clear here: in between the first SQL query to fetch, and the second to update, some other thread or process may have already updated it to `in_progress` and begun execution, and now we have double execution.

The solution? Some form of “optimistic locking” using the atomic facilities of any rdbms.  Now, I’m not actually talking about the ‘optimistic locking’ feature built into ActiveRecord. You probably could use that feature here, but it requires adding a special column to your db, rescuing `ActiveRecord::StaleObjectError`, etc. The optimistic locking feature built into AR is potentially a powerful general purpose tool when you need to avoid concurrent updates to any column at all in many different scenarios.

But for this particular use case, there’s a simpler way to do it ourselves. We basically want to have a single SQL line that updates the column to `in_progress` if and only if it was already `queued`, in a single atomic rdbms operation, and then lets us know if the update happened or not. We can generate such an SQL using ActiveRecord 3 update_all. 

my_active_record_model = ModelClass.where( however_we_fetched_it ).first

num_updated =
  ModelClass.where(:id             => my_active_record_model.id,
                   :service_status => "queued").
             update_all(:service_status => "in_progress")

if num_updated > 0
  # we updated, execute
else
  # it did not have a service_status of queued, someone
  # else beat us to it
end

  • Haven’t actually updated Umlaut yet, there are some annoying legacy issues in Umlaut that make this a bit harder to fix.
  • Note that update_all does not automatically update ActiveRecord `updated_at` columns, you can include those yourself if you want them in the hash of columns to update. `:updated_at => Time.now`.
  • This is also an example of why you need to know and understand SQL and rdbms even if you are using a good ORM like ActiveRecord. I love ORMs, but if you didn’t know SQL/rdbms, you wouldn’t be able to come up with a clear way to solve this race condition in terms of SQL, and then figure out the cleanest way to do that with AR.
  • Some answers on the web to analogous problems to this one suggest using db transactions here. I suspect that transactions may not even be able to solve this problem, but even if there’s a way to do it somehow with transactions, it’s going to be messier. Transactions aren’t in fact the right tool here. A simple optimistic locking `update…where` is.
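The check-and-update semantics can be tried out without a database. The following is only a sketch under stated assumptions: `FakeStatusTable` is a hypothetical in-memory stand-in, with a Mutex playing the role of the rdbms’s atomic `UPDATE … WHERE`. It is not Umlaut or ActiveRecord code.

```ruby
# In-memory stand-in for a table with an atomic "UPDATE ... WHERE".
# The Mutex plays the role of the database's row-level atomicity.
class FakeStatusTable
  def initialize
    @rows = {}
    @lock = Mutex.new
  end

  def insert(id, status)
    @lock.synchronize { @rows[id] = status }
  end

  # Set the status to +to+ only if it is currently +from+.
  # Returns the number of rows changed (0 or 1), like update_all.
  def update_where(id, from:, to:)
    @lock.synchronize do
      return 0 unless @rows[id] == from
      @rows[id] = to
      1
    end
  end
end

table = FakeStatusTable.new
table.insert(1, "queued")

# Ten workers race to claim the same row; only one update can match,
# because the check and the write happen as one atomic step.
winners = Array.new(10) {
  Thread.new { table.update_where(1, from: "queued", to: "in_progress") }
}.sum(&:value)
```

Each worker gets back a row count, and only the one that got `1` goes on to execute the service; the others see `0` and leave the row alone.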

A method to map from query to broad topic, and associated resources

Short answer: Take advantage of a facetted response on a search against a corpus that has controlled classification data.

From user query to topic, to resources on that topic

Andrew Nagy tells me in direct email that one of the new features in upcoming Summon 2.0 release is:

Topic Explorer – Over 50,000 english topics will be mapped to user queries on the fly and the API will deliver a “topic” that has an encyclopedia entry, recommended librarian, recommended subject guide, related topics, etc.

(This is described on the Summon 2.0 brochure webpage, although it wasn’t completely clear to me from the webpage that it took user queries as input to arrive at a topic).

This is along the lines of a feature that I’ve been thinking about for years: the ability to recommend appropriate subject resources (subject specialist librarians, subject guides, databases recommended on a particular subject) in response to a user-entered query in a catalog, articles, or other discovery search.

I’ve been thinking about it for years, as a way to get users to our librarian-recommended resources, but it’s become even more desired by some local librarians in response to our recent move towards offering an integrated article search function in our local discovery UI (currently based on the EBSCOHost API), as an alternative to going directly to individual licensed database platforms, and as a replacement for Metalib broadcast federated search.

So I think Summon is right on track here in their new feature development, which is nice to see, and less usual than it should be in the library proprietary software sector.  This is in some sense an expansion of the existing Summon feature to recommend subject-relevant licensed database platforms based on user-entered queries, expanding it to additional topic-specific resources.

While they say 50,000 topics, I assume there must be some hierarchy to their topic list, with things like institutionally specific librarians and subject pages assigned only to top-level broad topics; it would not be feasible to manually make specialist librarian assignments to 50,000 topics, of course.

So it’s really mapping to fairly broad high-level topics that matters for locally-assigned subject resources like librarians, subject pages, or subject-specific licensed database platforms.  (I’m guessing the SerSol feature may automatically map things like encyclopedia entries at the narrower, more specific elements of the 50k list, which might be neat, but is not what I’m choosing to focus on in this discussion).

The hard part about implementing a feature like this is mapping from arbitrary user query to topic (broad or otherwise).  Once you’ve done that, it’s of course an easy software problem to record URLs or other content that corresponds with each broad topic, and provide them to the user once a topic has been identified.

If you have Summon, and like its new “topic explorer” feature, great. We won’t know exactly how it’s implemented, but those with Summon licenses will be able to test it and see how effective they find it, once it’s released.

But what additional options might you have for implementing such a feature yourself, for institutions who do their own development in some cases?

Spider the web, use text mining techniques? — NCSU

Way back in 2007, Tito Sierra then at NCSU presented at the Code4Lib conference on an NCSU project called Smart Subjects. 

As you can see in the slide show there, Smart Subjects was also an attempt to map from a user-entered search query to one or more library subjects.

It did so (and possibly still does so) in a creative way: take existing bodies of text that can be easily classified by department (course catalogs, departmental lists of published articles), harvest all that text, and index it in a text indexing engine. Information retrieval relevance ranking techniques can then take an arbitrary phrase (a user-entered query) and see which academic department’s harvested text corpus best matches it.

I believe some years after this, Tito told me he wasn’t, in the end, necessarily super enthused with the quality of results attained by this method. In any event, it is a fairly heavy-weight method, with lots of moving parts to develop and maintain and fine tune.

Tito has since left NCSU, but I believe it’s what is still powering the subject recommendations at the bottom left of their “QuickSearch” results, although it’s unclear if the corpus ever gets updated. I don’t know if they ever thought to use it to power recommendations of actual library staff too, although there is a library staff member highlighted on the QuickSearch results page.

(Looks like the NCSU SmartSearch tool began in 2005, 8 years ago!)

Using your catalog corpus, with classification data, as a classifier?

I’ve been thinking for a while of another approach, lighter weight and taking advantage of the extensive person-hours of work that goes into our cataloging metadata. Although I haven’t had a chance to prototype it yet, I’m going to tell you about it anyway.

We have these extensive library catalogs. What if each record in the catalog had broad subjects assigned to it? (They don’t, really, but bear with me, let’s start here).

And let’s say we exposed these broad subjects in a facet. Then for a given query, you’d get a count of how many items in your result set (matching that query) were posted to each broad subject.

Say you search for “project management techniques”, and get back, in the facet based off these broad subjects:

  • Engineering (56801)
  • Business  (47920)
  • Computer Science (34000)
  • Health Science (24000)

That would potentially be a pretty good list of recommended subjects corresponding to the query entered, no?  Then, if you have subject pages, database lists, specialist librarians, etc., already categorized into these same subjects, you could recommend them to the user based on her query.
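
As a rough sketch of that last step, assuming you already have the facet counts in hand: the `SUBJECT_RESOURCES` table, its paths, and the `recommend_resources` method are all hypothetical names invented for illustration, and the counts are just the example numbers from the list above.

```ruby
# Hypothetical local table: broad subject => resources already assigned
# to it. The paths here are made up for illustration.
SUBJECT_RESOURCES = {
  "Engineering"      => { :guide => "/guides/engineering", :databases => "/databases/engineering" },
  "Business"         => { :guide => "/guides/business",    :databases => "/databases/business" },
  "Computer Science" => { :guide => "/guides/compsci",     :databases => "/databases/compsci" }
}

# facet_counts: hash of broad subject => number of matching records.
# Returns [subject, resources] pairs for the top `limit` subjects,
# highest count first, skipping subjects with no local resources.
def recommend_resources(facet_counts, limit = 2)
  facet_counts.sort_by { |_subject, count| -count }.
    first(limit).
    map { |subject, _count| [subject, SUBJECT_RESOURCES[subject]] }.
    reject { |_subject, resources| resources.nil? }
end

facets = {
  "Engineering"      => 56801,
  "Business"         => 47920,
  "Computer Science" => 34000,
  "Health Science"   => 24000
}

recommendations = recommend_resources(facets)
recommendations.each { |subject, resources| puts "#{subject}: #{resources[:guide]}" }
```

How many subjects to recommend, and whether to require some minimum count before recommending anything, are exactly the kind of things that would need tuning against real queries.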

Now, our library catalogs do have classification data in them assigned to individual records, using vocabularies created and assigned through the hard work of many catalogers over many years. Is there a way to use this data for this purpose?

The common classification systems of Dewey and LCC both classify rather too finely for this use — we need to map to a vocabulary of dozens of topics/subjects/disciplines, so we can assign local resources to each one. Hundreds or thousands is too many.

But is there hierarchy in DDC or LCC that would let you “post up” from finer-grained specific classifications to broader classifications useful for our purpose here? DDC might have it, but I don’t have many DDC records in my local corpus, and haven’t spent much time with DDC. LCC is known to be less hierarchical than DDC, but there are still ways to get some broad classifications out of it, using the top-level schedules. But it’s tricky to make this work, and the broad categories you end up with aren’t necessarily as useful as we’d like. (See the “Discipline” facet in our own Solr-based catalog, which is constructed from LCC “posted up” into broad classifications. For “project management techniques”, the top Discipline facets are “Technology”, “Science”, and “Social Science”, which aren’t necessarily wrong, but also aren’t as useful as we might like; they are too broad and somewhat archaic.)

The University of Michigan High Level Browse classification

The University of Michigan has developed their own High-Level Browse (HLB) classification. One of the main uses for this classification is indeed a broad classification facet in their catalog search.

The U of M HLB is conveniently based on LCC, and U of M maintains mappings from LCC call numbers to their own HLB classes. Which is what makes the faceting work in the first place, for any corpus with LCC classifications on items.

They’ve developed their HLB based on their own schools, departments, and programs at U of M.  You could try to develop the same locally. But it’d be a lot of work. And U of M awesomely shares their classification, with LCC mappings, in XML form too. So you could just write software to download theirs, use it in indexing into your own Solr catalog index, and get U of M HLB facets in your catalog too — and use them to power a subject recommender too.
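
Here’s a minimal sketch of what the indexing-side lookup might look like once such a mapping is loaded. I have not reproduced the structure of U of M’s XML here; the `LCC_TO_BROAD` rows, the topic names attached to them, and the `broad_topics_for` method are hypothetical, and the simple letters-plus-number parsing ignores plenty of real-world call number messiness.

```ruby
# Hypothetical mapping rows: LCC class letters plus a numeric range,
# each pointing at a broad topic. The real U of M file is XML with its
# own structure and topic names; this table only illustrates the idea.
LCC_TO_BROAD = [
  { :letters => "HD", :from => 28.0, :to => 70.0,  :topic => "Management" },
  { :letters => "QA", :from => 75.5, :to => 76.95, :topic => "Computer Science" },
  { :letters => "T",  :from => 55.4, :to => 60.8,  :topic => "Engineering" }
]

# Returns the broad topics for an LCC call number such as "HD69.P75 K47",
# or an empty array when the call number cannot be parsed or is unmapped.
def broad_topics_for(call_number)
  match = call_number.to_s.strip.match(/\A([A-Z]{1,3})\s*(\d+(?:\.\d+)?)/)
  return [] unless match

  letters = match[1]
  number  = match[2].to_f
  LCC_TO_BROAD.select { |row|
    row[:letters] == letters && number >= row[:from] && number <= row[:to]
  }.map { |row| row[:topic] }
end

puts broad_topics_for("HD69.P75 K47").inspect
puts broad_topics_for("QA76.76.O63").inspect
```

At indexing time, the returned topics would go into whatever multi-valued Solr field backs the broad subject facet.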

Any large research university probably has academic classification needs roughly similar to U of M’s. There will certainly be special programs you wish were represented that aren’t (or that are in U of M’s, unnecessarily for you), but it will likely be good enough if you don’t have the resources/organization to develop and maintain your own local classification. (I’m amazed U of M even pulls it off, honestly.)

Let’s give it a try: go to U of M’s catalog and do a search, and check out the top categories represented in the “Academic Discipline” facet in the sidebar. For “project management techniques”, it’s Business, Management, Business (General), Social Sciences, and Engineering.

If a system made recommendations for subject guides, specialist librarians, subject-relevant databases, and other subject resources, based on those classifications… they’d be fairly relevant to the query, right?

Do some of your own queries: how well does it work?

(In the U of M HLB, there are still a few levels of hierarchy, all of which may be represented in the faceted result. For instance, “Business (General)” is a sub-category of the more general “Business”. Inter-mixing them both is probably appropriate for a facet response, but for making subject recommendations some experimentation is called for as to when to use more-specific and when to use more-general, and when to de-duplicate when a super- and sub-class are both represented in the potential ‘best subjects’.)
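
One policy you could experiment with, sketched under assumptions: roll every sub-category up to its parent and keep the largest count seen for that parent. The `PARENT_TOPIC` table holds only the one parent link mentioned above, `collapse_to_parents` is a hypothetical name, and the counts are made up.

```ruby
# Parent links for a tiny slice of the hierarchy. Only the
# "Business (General)" => "Business" link comes from the example above;
# a real table would be built from the full classification.
PARENT_TOPIC = { "Business (General)" => "Business" }

# Rolls each sub-category up to its parent and keeps the largest count
# seen for that parent. Using max rather than sum avoids double counting
# when the parent's count already includes the child's records.
# Returns [topic, count] pairs, highest count first.
def collapse_to_parents(facet_counts)
  collapsed = Hash.new(0)
  facet_counts.each do |topic, count|
    key = PARENT_TOPIC.fetch(topic, topic)
    collapsed[key] = [collapsed[key], count].max
  end
  collapsed.sort_by { |_topic, count| -count }
end

# Made-up counts, for illustration only.
sample = { "Business" => 900, "Business (General)" => 400, "Engineering" => 300 }
collapsed = collapse_to_parents(sample)
collapsed.each { |topic, count| puts "#{topic} (#{count})" }
```

The opposite policy (prefer the more specific sub-category when it accounts for most of the parent’s count) is just as easy to write, which is why this needs trying against real queries.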

Not just for catalog searches

The idea is to use the catalog as a classifier, but that doesn’t mean you can only use it for catalog searches.

For any search in any system you control enough to add custom features to, you could add a feature based on the catalog as a classifier. Even if users are searching in a non-catalog article discovery system, the software could still, behind the scenes, take the user’s query, execute its own under-the-hood query against the catalog, look at the faceted broad subject results, and use them to make subject recommendations.
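
If the catalog is backed by Solr, the under-the-hood query could be sketched roughly as below. The base URL, the `broad_subject_facet` field name, and both method names are hypothetical; the request parameters (`rows`, `facet`, `facet.field`, `facet.limit`, `facet.mincount`, `wt`) are standard Solr ones. No HTTP request is made here; the sample response body is hand-written with made-up counts.

```ruby
require "json"
require "uri"

# Builds a Solr select URL that returns no documents, only facet counts.
# The base URL and the facet field name are hypothetical.
def facet_query_url(user_query,
                    base = "http://localhost:8983/solr/catalog/select",
                    field = "broad_subject_facet")
  params = {
    "q" => user_query, "rows" => 0, "wt" => "json",
    "facet" => "true", "facet.field" => field,
    "facet.limit" => 5, "facet.mincount" => 1
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

# Solr's default JSON output lists facet values as one flat array,
# e.g. ["Business", 120, "Engineering", 80]. Turn that into pairs.
def parse_facet_counts(json_body, field = "broad_subject_facet")
  flat = JSON.parse(json_body)["facet_counts"]["facet_fields"][field]
  flat.each_slice(2).to_a
end

url = facet_query_url("project management techniques")
puts url

# Hand-written sample response with made-up counts.
sample_body = '{"facet_counts":{"facet_fields":{"broad_subject_facet":["Business",120,"Engineering",80]}}}'
pairs = parse_facet_counts(sample_body)
puts pairs.inspect
```

Asking for zero rows keeps the request cheap, since only the facet counts matter for making recommendations.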

Not necessarily just with your own catalog

Likewise, there’s no reason you need to use your own local catalog as the classifier. Any catalog will do — if it can provide a facetted response of broad subject classification, has an API such that you can use it in this way, and the operators don’t mind you using their catalog in your service.

WorldCat would be great, if OCLC added a broad subject classification facet, and an API to retrieve such. Umich’s catalog, already using their own in-house HLB classification, might be convenient too.

Of course, if you do add umich’s HLB broad subjects as a facet in your own local catalog, your users get the advantage of using that facet directly for their catalog searches too. (Assuming you have enough control of your local catalog to add such a thing, for instance because your catalog is based on Blacklight, VuFind, or another tool using a local Solr you control.)

Idea worth exploring?

I’m not sure when/if I’ll have time to investigate this idea, although I probably will eventually. But I absolutely don’t mind if someone else runs with it and beats me to it — as long as you share back your findings, how well it worked, etc.


One scenario for the death of the academic library

My last post has attracted some interesting discussion. Eric Hellman, in the comment thread, recommended this very interesting recent article, Open Access, library and publisher competition, and the evolution of general commerce, by Andrew Odlyzko, 2013. 

I recommend the entire article heartily, but provide some extensive pertinent excerpts here, with some commentary.

The ARL has statistics showing library budgets as fractions of total university budgets for a sizable collection of their members [10]. The chart for the 40 members that have reported since 1982 shows an inexorable decline in this ratio, from about 3.7% to a bit under 2.0%…

…The share of library budgets that goes out in purchases of books, journals, and databases has grown substantially, from 33% in 1990 to 42.5% in 2010…  Further, all of this growth is accounted for by serials. Books and other materials have just about held their own (with books shrinking at the expense of the rest).

In my last post, I asked what percentage of faculty or other university community members would have to think the library is not worth what’s being spent on it before library budgets started decreasing as a percentage of host institution budget.

It turns out library budgets have already been declining as a portion of total university budgets, and within the library’s budget, collections have been rising as a portion of the total. That means spending on library staff, professional and otherwise, has decreased even further as a proportion of university spending than library budgets as a whole.

Perhaps, in fact, that point I asked about has already been reached.

There are many interesting statistics at [9] demonstrating decline of the traditional functions of libraries. Thus between 1995 and 2010, the number of students at ARL institutions grew by 33% (with the ranks of teaching faculty and graduate students climbing 15% and 43%, respectively). The only category of library services involving physical material that showed growth was interlibrary loans, which climbed 92%. This reflects libraries concentrating their budgets on serials, and giving up on trying to keep up with the growth in the number of new books being published. In other categories, initial circulation (i.e., excluding renewals) of physical volumes dropped by 42%. Thus it is a gross exaggeration that “nobody uses the library anymore,” as one sometimes heard from faculty or students. But the decline in borrowings per student by more than half is telling. What is perhaps most surprising is that the number of requests for reference assistance dropped by 66% in absolute terms, as is shown in Fig. 6, and thus by about 75% on a per-student basis. This is certainly a core competency of librarians, and they are great at navigating the torrents of electronic information, as well as providing guidance to the use of traditional printed sources. However, it appears that Google, Wikipedia, publisher databases, and the like are “good enough” for most scholars, and that the convenience of around the clock access from anyplace outweighs the higher quality that librarians provide…

…The basic and very promising approach open to publishers is to continue marginalizing libraries by extending the reach and scope of “Big Deals.” The consortium model, in which groups of libraries cooperate to get access to a “Big Deal” is already common, and can be pushed further. The ultimate situation might be national “Big Deals,” where some toplevel bodies pay for access for everyone from a nation. Enlarging the “Big Deal,” especially through further mergers, but also by including additional information sources, can serve to create packages that simply could not be dispensed with. The most obvious move in that direction (which is already taking place to a small extent) is to make books, both current and old ones, a part of the “Big Deal.” (Recall that the process of digitizing old printed materials is extremely inexpensive.)…

I think an extremely likely scenario for the death of the academic library will be our hosting institutions simply paying vendors (whether publishers, aggregators, or other newer ‘disruptive’ businesses) directly for services: the content itself, the platforms that host and organize it, and the ‘discovery’ services to search it. That would need only a local skeleton staff to handle licensing. If the bulk of a library’s budget goes simply to passing money on to vendors for large ‘big deal’ bulk packages, then minimal professional staff, and not much of a library organization at all, is required simply to do the bookkeeping and ordering.

Perhaps in the future, it will be clear that by 2013 this was already more or less a foregone conclusion to the story of the academic library. Odlyzko’s article shows some of the indicators pointing that way.

What about libraries? They are handicapped in the competition with publishers by several factors, see [62]. One of them, that they have the bulk of the resources, and are thus a fat target, is a strength as well. At least in principle it makes possible revolutionary changes. In particular, as was shown earlier, just the external journal purchases of the ARL libraries alone could provide Open Access publishing for the world’s entire scholarly literature. Had libraries thrown their resources enthusiastically behind new, low-cost Open Access journals, perhaps the current scene and the unfolding future sketched here would have been different. But that would have required many research partners willing to put their energy into the enterprise (certainly a very doubtful proposition, given the inertia in the academic system), and the willingness of librarians to cannibalize their bread-and-butter operations. Certainly librarians present a classic case of Christensen’s “innovator’s dilemma,” pressed to maintain traditional services, and therefore slow to embrace new ones. As an example, digital libraries have been discussed in the library literature for decades. Further, the amount that ARL libraries spend in a single year on acquisition of serials would have sufficed, with plenty left over, to digitize all their standard books and journals that are out of copyright. Yet it was outside efforts, in particular the Gutenberg Project (the early pioneer, almost forgotten), Google Books, and the Internet Archive, that led the way…

…We also see libraries moving into other services, such as providing long-term storage for publications, data sets, and so on. However, there they are competing not just with publishers, who also see the opportunities, but also other organizations, such as campus information technology units, high performance computer centers, and a variety of new commercial startups. The opportunities are many, but so are the competitors….

Of course, people have been talking for over a decade about how the internet and other changes in the information environment will/may spell the death of libraries. Some may sense this as tiresome alarmism.

But I think now we’re actually seeing it happening. Previously, our response to this library-apocalypse thinking was often “Sure, traditional library services exactly as delivered may no longer be as important, but there is obviously an even greater need than ever for impartial information services and expertise for academic and civic communities, libraries can and will provide these services.” But at this point, I think by and large we’ve seen libraries fail to rise to this challenge, which is why we’re seeing the indicators of the beginning of actual, not just hypothetical future, sidelining of libraries in the university environment.

If the library effectively ceases to exist as an organization providing information expertise to the university community, one thing our host institutions lose is an organization which can facilitate research/information needs from a perspective of interests aligned to those of the host university community.

The library is one of the few information organizations involved in research life that does not have business interests based on selling our users something (or selling our users’ privacy to someone else), or on convincing users to buy a particular product, but only on facilitating our users’ own self-directed goals and needs. Libraries can thus, uniquely in the information environment, provide services with transparency, impartiality, assertive protection of user privacy, and a professional ethical responsibility to act always in the interests of our patrons, never sacrificing them to our own business interests.

The existence of libraries as such disinterested advisors is thus extremely valuable in making possible the impartial non-market-based free inquiry at the idealistic heart of academic research and learning itself.  I firmly believe it will be a loss to the academy and to society to see libraries fade to irrelevance.

But that’s not going to be enough to save the library, in the current environment,  if we can’t also provide cost-effective services that not only satisfy (and we are barely doing that) but go on to delight and excite our host communities by what we can do to make their work easier, more productive, and more pleasurable.  A library’s impartiality in failing to deliver services of value is naturally of  limited perceived value to the host organization.

It will require some disruptive changes to our business as usual to get there: some close attention to our patrons’ changing needs, habits, environments, and preferences, and some creativity and risk-taking in attempting to position ourselves to engage our patrons. There’s no guarantee of success and we will inevitably make mis-steps along the way, but how many library organizations are even seriously engaging in the attempt, with all the disruptive risk and challenge it entails?
