Extending your Solr indexing via a gem with traject

One of the goals of the traject MARC->Solr indexing tool was to support easy re-use of mapping logic and code between projects and institutions.

I’ll talk here about how I just used some code shared by Bill Dueber at the University of Michigan to add ‘physical carrier’ information to my indexing rules for a ‘format’ facet.

The Background

Getting form/format/carrier/genre information out of MARC is very tricky.  I’m talking about categories like “Newspapers”, “Pamphlets”, “Dissertations”, “CDs”, “DVDs”, “Print”, “Online”, “Video games,” and similar and related categories.

It’s tricky in part because the way we humans think of these things is not very clear or consistent.  Even one person’s internal categories aren’t necessarily very consistent if you actually try to tease them out, let alone consistent between people, between communities, and over time. Needs and categories have also changed over the historical sweep of MARC standardization and cataloging record creation, which makes it even harder for large collections whose cataloging was created over decades!

So anyhow. It’s a tricky problem, and it’s not totally MARC’s fault. But different organizations and software come up with different algorithms and heuristics. I took the one that we’ve been using where I work, and made it a built-in distributed option in traject, just to have something there, but it’s hardly the ultimate solution or anything.

The set of algorithms we’ve been using doesn’t cover physical ‘carrier’ types like CD, DVD, Laserdisc, VHS, LP, or what have you. Getting those out of MARC can be pretty tricky: in some cases, you’ve got to scan free-text fields for things like “sound disc. 12 in. 33 1/3 rpm” to know that means a standard vinyl long-playing record (LP).  There’s not necessarily one right way to do this; it takes some experimentation, development, and iteration to come up with the right set of rules.
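To give a feel for what that kind of free-text scanning looks like, here is a minimal sketch in plain Ruby. It is not the logic from any real gem: the helper name `looks_like_lp?` and the specific patterns are assumptions made up for illustration, and real catalog data would surely need more variants than this.

```ruby
# Hypothetical helper (not from traject or the umich gem): guess whether a
# free-text physical description, such as you might find in a MARC 300
# field, describes a standard vinyl LP.
def looks_like_lp?(physical_description)
  text = physical_description.downcase

  # All three clues must be present: it is a sound disc, it plays at
  # 33 1/3 rpm, and it is 12 inches across.
  text.include?("sound disc") &&
    text.match?(/33\s*1\/3\s*rpm/) &&
    text.match?(/12\s*in\b/)
end

puts looks_like_lp?("sound disc. 12 in. 33 1/3 rpm")       # LP
puts looks_like_lp?("1 sound disc : digital ; 4 3/4 in.")  # CD, not an LP
```

Even this toy version shows why the rules take iteration: every catalog has its own spellings, abbreviations, and punctuation habits to account for.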

Bill Dueber has put the University of Michigan’s logic for form/format/carrier categorization up as a traject-compatible ruby gem.  It’s also not the ultimate word or anything: it has its own idiosyncrasies, and in some cases contains logic based on local U of Michigan cataloging practices or local U of Michigan call numbers.

But the umich gem does include logic for detecting the particular physical carrier categories that we were most interested in: audio CD, video DVD, LP, VHS.

Adding Umich’s logic to my indexing

So I figured, why not try using Bill’s gem in my traject project, to add those classifications on to what I’ve already got?

Turns out it is quite simple and concise to do so.

First I added `gem "traject_umich_format"` to my local `Gemfile`, since I’m using bundler to manage gem dependencies in my traject project, as is common in ruby projects. Bundler is optional but recommended for traject.

Then I just added these lines to my traject indexing configuration, to ask the traject_umich_format gem to classify a record; take only the categories I’m interested in; and add them on top of the existing values already added by my own code to the Solr ‘format’ field.

# add in DVD/CD etc carrier types courtesy of umich gem
# https://github.com/billdueber/traject_umich_format
require 'traject/umich_format'
umich_format_map = Traject::TranslationMap.new('umich/format')
to_field "format" do |record, accumulator|
  types = Traject::UMichFormat.new(record).types
  # only keep the ones we want
  # (previously tried more that didn't work with our catalog)
  types = types & %w{RC RL VB VD VH VL}
  # translate to human with translation map
  accumulator.concat types.collect {|t| umich_format_map[t]}
end

That was really it. Bill’s gem (rather cleverly) separates the classification itself from the human labels used in the actual facets, so we use both parts of the process — classify, take just the classification codes we’re interested in from the output of classifying, run them through the map to turn them into human-presentable labels, and add them into the existing ‘format’ field.
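To see the classify / filter / translate steps in isolation, here is a small stand-alone sketch that runs without traject or the gem. A plain Hash stands in for the gem’s translation map, and the labels in it are illustrative guesses, not the gem’s real wording. The method name `carrier_labels` and the extra input codes in the usage line are also assumptions for the example.

```ruby
# Codes we want to keep, matching the list in the config above.
WANTED_CODES = %w[RC RL VB VD VH VL].freeze

# Stand-in for the 'umich/format' translation map. These labels are
# illustrative only; the real ones ship with the gem.
ILLUSTRATIVE_LABELS = {
  "RC" => "Audio CD",
  "RL" => "Audio LP",
  "VB" => "Video (Blu-ray)",
  "VD" => "Video (DVD)",
  "VH" => "Video (VHS)",
  "VL" => "Video (Laserdisc)"
}.freeze

# Given every code a classifier produced for one record, keep only the
# wanted ones (array intersection) and map them to human-readable labels.
# Codes with no label are dropped rather than added as nil.
def carrier_labels(all_codes, wanted: WANTED_CODES, labels: ILLUSTRATIVE_LABELS)
  (all_codes & wanted).map { |code| labels[code] }.compact
end

# "BK" and "XX" stand for codes we do not care about here.
p carrier_labels(%w[BK RC XX])
```

Keeping the codes and the labels separate means the list of wanted codes can stay stable even if the wording shown in the facet changes later.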

Neat, eh? On the downside, this is not necessarily as ‘high performance’ as it could be, because the way Bill’s code is written I end up asking it to calculate all the classifications, then throwing away the ones I’m not interested in. But I’m not too worried about it; I think it’ll be fine.

It spares me from having to re-invent the wheel of how the heck you figure out if something is an LP from MARC; and possibly even more importantly, by sharing code with Bill, when either of us finds bugs, or edge cases where our heuristics can be improved, we can easily share them with each other — in fact, more or less automatically share them with each other, by making improvements to the shared ruby gem.

Consider traject?

I have to admit, since we announced traject about a month ago, I am aware of nobody other than me and Bill trying it out.

I had hoped to get some other beta testers before I called it a 1.0.0 release, but what can you do. It will be tagged 1.0.0 soon, but regardless of when it gets that tag, traject is mature, robust, ready for business, and being used in production by both me and Bill. Consider taking it for a test drive?  If you have any frustrations with your current Solr indexing solution related to keeping your logic well-organized, tested, reusable and shareable between projects, and supporting rapid development and quick iteration, you may find some things you like in traject.

Of course, formats are still complicated

We don’t have this code in production here; we’re just demo’ing it. “Formats” remain complicated because our own mental models of them are so varied and inconsistent. It’s not entirely clear what the optimal UI for this stuff is, but we don’t necessarily have the time (or want to prioritize the time) to figure it out. We’re deciding whether the basic implementation based on our current UI, supplemented by Bill’s code, is already ‘good enough’ to add value for our patrons.
