traject, blacklight, and crazy facet tricks

So in our Blacklight-based catalog, we have a facet/limit for “Location”, based on the collection/location codes from holdings, and meant to limit to just items held at a particular sub-library of our Hopkins-wide system.

We’ve gotten a new requirement, which is that when you’ve limited to any of these location limits (for instance, only items in the “Milton S. Eisenhower Library”), the result set should also include all ‘Online’ items. No matter what Location limits you’ve applied, the result set should always include all Online items too. (Mine is not to reason why…). 

One thing that’s trickier than you might think is spec’ing exactly what counts as an ‘online’ item and how you identify it from the MARC records — but our Catalog already has an ‘Online’ limit, and we’ll just re-use that existing classification. How it classifies MARC records as ‘Online’ or not is a different discussion.

I can think of a couple approaches for making the feature work this way.

Option 1. Change how the Blacklight app makes Solr requests

Ordinarily in a Blacklight app, if you choose a limit from a facet — say “Milton S. Eisenhower Library” from the “Location” facet — it will add an `fq` param to the Solr query: say, `&fq=location_facet:"Milton S. Eisenhower Library"` (URL-encoded in the actual Solr URL, of course; the value needs quoting since it contains spaces).

This is done by the `add_facet_fq_to_solr` method, and that method gets called when building the Solr URL because it’s listed in the `solr_search_params_logic` array.

So Blacklight actually gives us an easy way to customize this. We could remove `add_facet_fq_to_solr` from the `solr_search_params_logic` array in our local CatalogController, replacing it with our own custom `local_add_facet_fq_to_solr` method.

Our custom local method would, for facet limits from the Location facet only, add a different `fq` to the Solr query, one that looks more like: `&fq=location_facet:"Milton S. Eisenhower Library" OR format_facet:Online`. For other facet limits, our custom local method would just call the original `add_facet_fq_to_solr`.
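To make that concrete, here’s a minimal sketch of just the fq-building piece, in plain ruby with no Blacklight dependency. The helper name `local_facet_fq` is made up for illustration; the field names are the ones from this post. A real `local_add_facet_fq_to_solr` would do something like this for each user-selected facet value:

```ruby
# Build the Solr fq for one facet limit. For Location limits only,
# OR in the Online items so they always come back too.
# (Helper name is hypothetical; a real version would plug into
# Blacklight's request-building logic.)
def local_facet_fq(field, value)
  fq = %Q{#{field}:"#{value}"}
  fq = "(#{fq}) OR format_facet:Online" if field == "location_facet"
  fq
end

local_facet_fq("location_facet", "Milton S. Eisenhower Library")
# => '(location_facet:"Milton S. Eisenhower Library") OR format_facet:Online'
local_facet_fq("format_facet", "Book")
# => 'format_facet:"Book"'
```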

This wouldn’t change our Solr index at all, and would still make it possible to implement some other (possibly hidden back-end) feature that really limited to the original location without throwing “Online” in too, in case eventually people realize they need that after all.

I am not sure if it would affect performance of applying those limits; I think probably not, since that expanded `fq` with the `OR` in it can be cached in the Solr filter cache the same as anything else.

I worry it might be a fragile solution though, one that could break in future versions of Blacklight (say, if Blacklight refactors/renames its request builder methods, so that our code is no longer successfully replacing the original `add_facet_fq_to_solr` logic) — and then be confusing for future developers who aren’t me to figure out why it’s broken and how to fix it. It’s potentially a bit too clever a solution.

Option 2. Change how location facet is indexed

The other option is changing how the location_facet Solr field is indexed, so every bib that is marked “Online” is also assigned to every location facet value.

Then, without any other changes at all to app code, limiting to a particular location facet value will always include every ‘Online’ record too, because all those records are simply included in every location facet value in the index.

We do our indexing with traject, and it’s fairly straightforward to implement something like this in traject.

In our indexing file, after the rule for possibly assigning ‘Online’ to the `format_facet`, we’d create a rule that looked something like this:

each_record do |record, context|
  # If this bib was classified "Online", assign it to every location too
  if (context.output_hash["format_facet"] || []).include? "Online"
    context.output_hash["location_facet"] ||= []
    context.output_hash["location_facet"].concat all_the_locations
  end
end
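You can see what the rule does by simulating it on a bare hash, no traject required. The field names are the ones from this post; the location values here are invented for the example:

```ruby
# Stand-in for traject's context.output_hash for one "Online" bib
output_hash = { "format_facet" => ["Online"] }

# Stand-in for the all_the_locations variable built at boot-time
all_the_locations = ["Eisenhower", "Welch", "Peabody"]

# The same logic as the each_record rule above
if (output_hash["format_facet"] || []).include? "Online"
  output_hash["location_facet"] ||= []
  output_hash["location_facet"].concat all_the_locations
end

output_hash["location_facet"]  # => ["Eisenhower", "Welch", "Peabody"]
```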

Pretty easy-peasy, eh? I think I would have had a lot more trouble doing this concisely and maintainably in SolrMarc, but maybe that’s just because I’m more comfortable in ruby and with traject (having written traject with Bill Dueber). But I think it actually might be because traject is awesome.

The only other trick is where I get that `all_the_locations` from. My existing code uses not one but TWO different translation maps to go from MARC data to Location facet values; the only place ‘all possible locations’ exists in code is in the values of those two hashes. If I just hard-code the list into a variable, it’ll be fragile and will easily get out of sync with them. So instead I’d write ruby code to look at both those location maps, collect all their values, and stick ’em in a variable at boot-time.

No problem, just in the traject configuration file anywhere before the indexing rule we define above:

all_the_locations = []
all_the_locations.concat Traject::TranslationMap.new("jh_locations").to_hash.values
all_the_locations.concat Traject::TranslationMap.new("jh_collections").to_hash.values
all_the_locations.uniq!
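The same value-collecting idea can be seen with plain hashes standing in for `Traject::TranslationMap.new(...).to_hash` (the map contents here are invented; the real maps live in the `jh_locations` and `jh_collections` files):

```ruby
# Stand-ins for the two translation maps' to_hash output (contents invented)
jh_locations   = { "eis" => "Eisenhower", "wel" => "Welch" }
jh_collections = { "spec" => "Special Collections", "eis-res" => "Eisenhower" }

# Collect every distinct facet value appearing in either map
all_the_locations = []
all_the_locations.concat jh_locations.values
all_the_locations.concat jh_collections.values
all_the_locations.uniq!

all_the_locations  # => ["Eisenhower", "Welch", "Special Collections"]
```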

The benefit of traject being just ruby is that you can just write ruby, and I’ve tried to make the traject classes and APIs flexible so you can do what you need with them. (I hadn’t considered this use case specifically when I wrote the TranslationMap API, but I gave it a `to_hash` figuring all sorts of things could be done with that, since ruby’s Hash has a flexible API.)

Anyhow. The benefit of this approach is that no fancy, potentially fragile “create a custom Solr query” code is needed, and the Solr `fq`s for facet queries remain ordinary “field:value” queries, with Solr performance characteristics we are well familiar with.

Disadvantages might be that we’re adding something to our index size with all these additional postings (probably not too much, though; Solr is pretty efficient with this stuff), and possibly changing the performance characteristics of our facet queries by changing the number and distribution of postings in location_facet.

Another disadvantage is that we’ve made it impossible to query the “real” location facet without the inclusion of “Online”; that limitation is acceptable under the specs we’ve currently been given, though.

So which approach to take?

I’m actually not entirely sure. I lean toward option 2, despite its downsides, because my intuition still says it’s less fragile and easier for future developers to understand (a huge priority for me these days), but I’m not entirely sure I’m right about that.

Any opinions?


6 Responses to traject, blacklight, and crazy facet tricks

  1. Trey Terrell says:

    Personally, I like option 1 – the second option feels like I’m indexing a lie. It’s not that every online item belongs to all locations, it’s just that the app needs to deal with a certain location in a special way. Option 2 opens the door for having to reindex every object in the catalog, and that seems a waste. As for compatibility and developer concerns, I think that can be nicely remedied with a method-level comment and an integration test.

  2. jrochkind says:

    Thanks for the feedback.

    Whether it’s a lie or not, it’s what the UI is going to be, it’s what stakeholders have chosen, the UI is not (for now) going to let you limit to any location without also including Online items. I think it is appropriate to set up Solr to support the actual UI you’ve got, even if it’s weird.

    We regularly reindex every item in the catalog, so that’s not a concern — we have to make indexing changes all the time that require reindexing, and do regular mass reindexes even when we aren’t making indexing changes.

    My own experience gives me far, far, far less faith than you in a method-level comment and integration test ‘nicely remedying’ issues of code that’s fragile or difficult for developers to understand.

    I guess as always, it depends on your context.

  3. Bill says:

    I’d index two fields: one for “location_facet” as you indicated above — include everything for online stuff, translating it basically into “available from” — and one for “physical location”, which would only list the actual buildings that items hanging off the record sit in. Maybe you won’t use the latter, at least right away, but it makes it easy to get at that dimension if you need to.

    We do this, too, in a variety of ways. I’m moving more and more toward finding a way to quantify “how hard is it for me to get this?” Online items are easy, no matter where you are. If you’re in a particular building, those things are easy. Things on your end of campus might be easy, too. Stuff in off-site storage is hard, but maybe no harder than stuff for which we have very speedy I.L.L. agreements. Stuff in any building that has overnight delivery is all equal. At the very bottom end would be stuff on microfilm in buildings without microfilm readers :-)

  4. jrochkind says:

    Thanks Bill.

    I’ve been thinking for a while along those same lines of ‘how hard is it for me to get’ — in particular, what I was thinking was assigning each item an actual time of “how long until it’s in my hands”: overnight, 3-5 days, ~7 business days, whatever.

    Just displaying this next to each item, or allowing filtering/limiting, either way.

    There are some challenges to doing this, though (let’s start with the fact that I don’t even have the data to know how long it will take to get any given item, in its situation and in relation to your location/status, or what the standard deviation or general level of consistency/predictability is).

    As far as the specific suggestions: I don’t see why I’d create the ‘physical location’ facet now if it’s not going to be reflected in the UI at all yet; might as well wait to create it until I need it, since there’s nothing saved by creating it in advance, is there? Although now I’m again divided on which option is better; I might just flip a coin or something, heh. Or pick whichever one seems like less work right now.

  5. jrochkind says:

    Ha, while in theory I have no problem with changes that require re-indexing…

    Some problems I had today with an unrelated feature that required re-indexing and didn’t work out right, and the way that, every time I think I have it fixed, I have to wait at minimum a few hours for a reindex to see if it’s REALLY all fixed (and I still haven’t figured out exactly what the problem was)….

    ….is definitely leaning me back in the direction of Option 1, which requires no index changes.

  6. jrochkind says:

    Hm, but I realized that option 1 won’t have the facet _counts_ displayed in the UI be accurate.

    The facet count next to “Milton S. Eisenhower Library” will be the count of things actually assigned to that facet, while then if you click on it, your result set will actually be those things PLUS the Online things, a number greater than the count shown next to the facet you clicked on.

    This seems unacceptable.

    I could do something fancy using Solr query facets, which Blacklight does support, but Option 2 is sounding a lot better to me.
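    A query facet along those lines might be configured something like the following. This is a hypothetical sketch against Blacklight’s facet configuration with `:query`; the facet key and labels are invented, the field names are from the post:

```ruby
# Hypothetical Blacklight query-facet config: the count shown for the
# "Location" entry is computed from the same expanded fq that clicking
# it applies, so facet counts and result counts stay in sync.
config.add_facet_field "location", label: "Location", query: {
  eisenhower: {
    label: "Milton S. Eisenhower Library",
    fq: 'location_facet:"Milton S. Eisenhower Library" OR format_facet:Online'
  }
  # ...one entry per location value
}
```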
