cool range limit/profile function in Blacklight

I’ve deployed a pretty cool date range limit/profile function on our demo (not yet production, but public) Blacklight instance.

http://blacklight.mse.jhu.edu/demo

Click on “Publication Year” on the left, either before or after you’ve done a search; try both. (It may be slow; this Solr instance needs some tuning, plus see notes on performance at the bottom of this post.)

Thinking about this feature actually brings up a couple points worth mentioning.

Building a platform

In my opinion (others may disagree), Blacklight is still not very mature software. It’s not an install-and-it-just-works software package (yet! I think we can get there eventually…). It’s instead something that an institution can choose to cooperatively develop.

In this, it’s like much library sector open source software. If we had just bought a proprietary ‘discovery layer’, we’d undoubtedly have it up quicker than Blacklight. (Whether we’d have it up cheaper, accounting for staff time, is hard to say; I don’t know what anyone other than me gets paid!) Just getting Blacklight to the point of providing the same features that a proprietary solution would provide will take us more time.

But the potential, which I believe will be fulfilled, is that once we (‘we’ at my institution, ‘we’ the Blacklight development community) get Blacklight to that point, we will then be able to very quickly surpass those proprietary offerings. It’s taking more time, in part, because we’re setting up a very good platform for future development; setting up a good platform takes more up-front time, but pays off down the line. (Personally, I think this comparison holds, to a lesser extent, between VuFind and Blacklight too. Blacklight right now is harder to set up than VuFind, and is going to take longer to become as easy to set up as VuFind, but I think once it gets there, it’ll offer flexibility and power that VuFind doesn’t. But I could be wrong, who knows.)

So anyway, after a long preface, my point is that this date limit/profile feature is kind of a little packet from the future: we decided to really put our all into this little feature to provide a vision of the sorts of things that Blacklight will eventually allow us to do. We don’t have time to do this all over the place; we’ve got to spend time building the unfinished platform. But we needed a date limiter, Blacklight didn’t really have one, and we thought it was worthwhile to take it all the way to provide a glimpse of the future.

Sadly, the story “no time for the cool stuff, got to build the platform” is kind of the story of my library development career. I’ve spent lots of time helping to build up solid platforms that will enable really cool stuff (Umlaut, Xerxes), but then I don’t have time to add the cool stuff on top; it’s on to building another platform (Blacklight). I’m hoping that soon I’ll have enough fundamental platform pieces in place that I’ll be able to return to adding the cool stuff on top. (Catalog and local index search => Blacklight. Link resolver and known item service infrastructure => Umlaut. Broadcast search => Xerxes. All written to be interoperable with each other and with other software via good design and standard APIs.)

Common codebase open source is hard

So, okay, it takes more time to get up to ‘parity’ with complicated open source like Blacklight. But there are different ways to spend that time and get there.

What I see happen a lot in the library sector is that we download open source software and then hack the source as quickly as possible to meet our local needs. There are a variety of reasons for this: lack of software engineering skill in the library sector (due to the salaries offered), lack of developer resources, and organizations/administrators that want ‘results’ they can see as quickly as possible; and what they can see is the interface, not what’s ‘under the hood’.

But the problem there is that each library essentially ends up with its own “fork” of the original codebase. If a new version comes out, it’s very hard to upgrade, because your code has diverged so much from the common codebase. (And will a new version ever come out, if all the developer resources are spent on local forks, and not on new common features?) If someone else fixes a bug or adds a new feature, it’s hard for them to share it with you, because they wrote code against their local diverged fork, which is not the code you have. You no longer really get the advantages you were expecting from ‘community source’; instead you basically have a ‘homegrown’ software package, although one you got a jump start on by borrowing someone’s open source at the beginning.

So the desired alternative is everyone keeping a common codebase: you make your localizations as configuration to a common codebase, or as ‘plugins’ against a clear API of some kind, not as willy-nilly changes to the original source code. But this is actually a lot harder to do. It takes software engineering skill and experience to figure out how to write the common codebase this way, and it takes more time. It’s another version of “more time up front pays dividends down the line.” But it’s one that managers and administrators who are not savvy to software engineering may not understand; our task is perhaps getting them to understand it, as well as developing the skills to do it.

So I produced the Blacklight range limit code as a shareable plugin to Blacklight. It’s not in Blacklight core, but it’s not just a local hack either: it’s a plugin which other people can use (if they haven’t locally forked/customized their Blacklight too much to use it!), and which should keep working with minimal tweaks in future versions of Blacklight. This took more time, and it took some refactoring of Blacklight core to make it possible for a plugin to provide the features I needed, but I think it’s worth it.

MARC data can be hard

Getting a date out of MARC data ends up being kind of tricky! This is in part due to the inherent complexity of our real-world domain, with items that were written at one point but published at another, or maybe we don’t know exactly when an item was written or published, or maybe it was published over a bunch of years (like a serial), etc.

It’s also in part due to MARC’s 45-year-old orneriness: MARC doesn’t store even what we do know about an item’s date in a particularly easy-to-process way.

The best place to get the most complete date information in a MARC record is the 008 field. But since most of our legacy ILSs didn’t do much (if anything) with dates from the 008, it was a field often ignored: mistakes went unnoticed and unfixed, and perhaps even less care was taken in entering the data in the first place because, after all, nothing was using it.

[The lesson of this to me is: do data right, or don’t do it at all. If you’re taking the time to enter data, don’t consider it data that doesn’t really matter and doesn’t have to be right; if you aren’t going to exercise care with it, just don’t enter it at all. Leaving it out is a more legitimate choice to me than putting it in carelessly, because later software can know that data left out is simply data left out, but has no way to know which data is trustworthy and which is not if there’s a mixture.]

So, at least for now, my own Blacklight (this is not hardwired into the date limit plugin, but just up to how you choose to index into your Solr) tries to get a date from the 008 first. It ignores dates earlier than 500, or later than this-year-plus-6, because the vast majority of those are data errors. (Why +6? Surprisingly, there are some non-errors in the near future; the publisher would seem to be claiming a copyright date in the future, for instance. Probably not legal, but ours is just to report accurately.)

For most 008 date types, the software just chooses date1 as “the date,” which turns out about right most of the time. (Except serials, which are a whole different ball of wax.)

If it’s a “q”uestionable date type, my software says: if the range is less than 15 years, split the difference and call that estimate good enough; if it’s more than 15 years, consider the date ‘unknown’. (Future enhancements could be to allow a larger estimation range for earlier works, since a 50-year guess for a 13th-century book is pretty good, and to allow multiple dates or date ranges instead of being forced to pick a single date.)
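For concreteness, here’s a minimal Ruby sketch of the 008 logic described above, using the ruby-marc gem. The helper name and constants are my own illustration, not the plugin’s actual code (which, again, belongs in your indexing step, not in the plugin):

```ruby
require 'marc'

MAX_YEAR = Time.now.year + 6 # near-future dates can be legitimate (future copyright dates)
MIN_YEAR = 500               # almost everything earlier is a data error

# Returns a single integer year from the 008, or nil for 'unknown date'.
# (Partial dates like "19uu" simply fall out via the bounds check here.)
def year_from_008(record)
  fixed = record['008']&.value
  return nil unless fixed && fixed.length >= 15

  date_type = fixed[6]       # position 6: type of date
  date1     = fixed[7, 4].to_i
  date2     = fixed[11, 4].to_i

  year =
    if date_type == 'q' # 'questionable' date: the 008 gives a range
      # If the range is narrow enough, split the difference; otherwise punt.
      (date2 - date1) < 15 ? (date1 + date2) / 2 : nil
    else
      date1 # for most date types, date1 works as 'the date'
    end

  # Sanity bounds: out-of-range values are overwhelmingly data errors.
  (year && year >= MIN_YEAR && year <= MAX_YEAR) ? year : nil
end
```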

If an 008 date can’t be found, then the software tries to pull one out of the 260$c using regular expressions, which sometimes kind of works.
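A hedged sketch of that fallback, again with ruby-marc; the regular expression here is just a plausible stand-in for whatever the real indexing code uses:

```ruby
# Scrape a year out of 260$c, e.g. from "c1984." or "[1972]".
def year_from_260c(record)
  text = record['260'] && record['260']['c']
  return nil unless text

  # Take the first plausible 4-digit run as the year.
  match = text.match(/\b(\d{4})\b/)
  match && match[1].to_i
end
```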

If no dates can be found at all, it goes into the ‘unknown date’ bucket.

Not perfect, but it seems like it might be good enough; we’ll see.

Some details on code, Solr, performance

So in order to show a slider, or the neat chart, the code needs to know the min and max value of the result set, to set the slider appropriately and fetch the right sub-segments. The code could just use some hard-coded min/max, but it seemed a lot better to have the slider and chart ‘zoomed’ to your actual result set at any given time.

In order to do that, the plugin includes the Solr StatsComponent with nearly every Solr request, so it can get the min and max from the result set via the StatsComponent response. That’s enough to set the slider up.  Then a second request (triggered via AJAX) is made to fetch the sub-segments within that min and max, so the distribution chart can be displayed.
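To make that concrete, here’s roughly what the two requests look like, sketched with the rsolr gem against a hypothetical ‘pub_date_i’ integer field (the field name and the segmenting math are my illustration, not the plugin’s exact behavior):

```ruby
require 'rsolr'

solr = RSolr.connect(url: 'http://localhost:8983/solr')

# Request 1: ride along with the normal search, asking the StatsComponent
# for the min/max of the result set; that's enough to set up the slider.
response = solr.get('select', params: {
  q:             'history',
  rows:          10,
  stats:         true,
  'stats.field' => 'pub_date_i'
})
stats = response['stats']['stats_fields']['pub_date_i']
min, max = stats['min'].to_i, stats['max'].to_i

# Request 2 (the AJAX call): facet queries for each sub-segment between
# min and max, so the distribution chart can be drawn.
segment_size = [((max - min) / 10.0).ceil, 1].max
segments = (min..max).step(segment_size).map do |start|
  "pub_date_i:[#{start} TO #{start + segment_size - 1}]"
end
chart = solr.get('select', params: {
  q:             'history',
  rows:          0,
  facet:         true,
  'facet.query' => segments
})
# chart['facet_counts']['facet_queries'] maps each range to its count.
```

The first request piggybacks on the search the user already triggered, so the only extra round trip is the AJAX one for the chart.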

This works well enough, but with my very large corpus it seems to have some performance problems: including the StatsComponent sometimes significantly slows down my Solr response, and it’s not entirely clear to me whether the StatsComponent is able to use any Solr caches behind the scenes. But I also don’t have my Solr set up optimally. My Solr needs more RAM, which I think might speed up the StatsComponent on large result sets, but I don’t have enough hardware for that now. And using the Solr 1.4 Trie integer field for the integers seems like it couldn’t hurt, but I’m using a weird pre-1.4 nightly build which doesn’t include tries yet (long story).

So I want to address both of those things and re-investigate, but if ultimately necessary there are ways to write more complicated code that avoids using the StatsComponent at all, but still gets min/max, with more Solr queries. (Those might still end up being quicker than the StatsComponent; after all, the StatsComponent returns more than min/max, and some of the other things it returns are more expensive to compute than just min/max.)
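One such StatsComponent-free approach (my sketch, not code from the plugin) is simply two extra queries sorted on the date field, each returning a single document:

```ruby
# Read min and max off the ends of the sorted result set,
# again against a hypothetical 'pub_date_i' field.
def min_max_without_stats(solr, query)
  %w[asc desc].map do |dir|
    resp = solr.get('select', params: {
      q:    query,
      rows: 1,
      fl:   'pub_date_i',
      sort: "pub_date_i #{dir}"
    })
    doc = resp['response']['docs'].first
    doc && doc['pub_date_i']
  end # => [min, max]
end
```

Each of these is an ordinary sorted query that Solr’s query result cache can help with, which is part of why it might come out ahead of the StatsComponent despite the extra round trips.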


3 Responses to cool range limit/profile function in Blacklight

  1. Robin Sinn says:

    Jonathan,
    This really is cool. Very nicely done!
    Robin

  2. David Kennedy says:

    Jonathan,

    The date function is really cool.

    I appreciate this blog post most though for your discussion of the challenges with open source software and building the platform. I don’t think this is well understood in the library world, and I am glad you’re shedding light on it in your blog.

    Dave

  3. Pingback: idle thoughts: timeline visualization in a catalog « Bibliographic Wilderness
