customizing Blacklight: “disable automatic stemming”

A second post with a walk through on customizing a Blacklight-based app to add a search feature. Of interest to programmers working on Blacklight, or those curious about Blacklight’s architecture (especially those familiar with Rails). The first post covered a simpler feature, and is probably a recommended pre-requisite to this one for those not expert in Blacklight/Rails already.

This is a long one. I try to explain everything, so let me know if it’s helpful at all.

The overview

In this post, we want to add a “disable automatic stemming” checkbox option.

Our ordinary/default searches use the Solr dismax query parser over a number of Solr fields with the ‘qf’ parameter. Some of these fields use Solr analysis that provides normalization including stemming (see the field definition currently in use here), such that a search for “book” will find documents including the term “books”, and vice versa a search for “books” will find documents including the term “book”. Normally this works pretty well, but sometimes a user will want to increase precision/decrease recall, and find only more exact matches, not alternate versions of the words.

For now, we’ve decided it’s sufficient to give the user a single checkbox labelled “disable automatic stemming”, that will affect their entire search. And we’ve decided to put it on our “advanced search” page, accessible via a “more search options” link, rather than on every search box.

It’s helpful to start with a basic idea of how we’ll make this happen at the Solr level. As we mentioned, a normal search searches over Solr fields with stemming analysis. To turn off stemming, we simply need to instead search over fields that contain the same data, but without stemming analysis in their solr field definitions.  In my Solr/Blacklight, I actually already have these fields in place — I use them to boost matches on unstemmed terms higher than matches on stemmed terms, using Solr dismax qf/pf boosts.

These un-stemmed fields are defined with this Solr schema definition.  I use copyField statements in my schema.xml to have the indexer only have to write to the stemmed version of the field, but have the input automatically copied into the unstemmed version at index time. For instance, <copyField source="title_t" dest="title_unstem"/>.

We’ll get into more details later of how we make that ‘unstemmed’ search happen in our app, but first we’ll dive into adding the interface elements alone to our app.

Adding checkbox to advanced search form

We’re going to have our app support a new URL parameter: When &unstemmed_search=1 (or really =anything) is present in the URL, we’ll make it an unstemmed search, when the param isn’t present at all, it’ll be a normal search.

We want to add a checkbox to control this to our ‘more search options’ page provided by the Blacklight advanced search plugin. 

To locally customize this form, we’ll copy the view from the plugin into our local app. (It is a bit much to copy this whole form with lots of layout and logic we don’t want to change, just to add this checkbox. Doing so will ‘freeze’ this into our application, even if the plugin itself improves things in the future. However, later, not as part of this walk-through, we are going to want to make even more changes to this form, so we don’t mind).

Copy the file from the blacklight advanced search plugin’s app/views/advanced/_advanced_search_form.html.erb into a new file in our local application, at app/views/advanced/_advanced_search_form.html.erb under the application’s root directory, creating the intermediate “advanced” directory if needed.   Now the file we copied will be used by the running app in preference to the one that comes with the adv search plugin.

And now we add our checkbox to it, wherever we want it to appear in the form, saying it should show up as checked whenever “unstemmed_search” appears in the current request params, and when checked and the form is submitted, should add unstemmed_search=1 to the request.

<%= check_box_tag "unstemmed_search", "1", params[:unstemmed_search] %> <%= label_tag "unstemmed_search", "Disable automatic stemming" %>

Almost, but not quite

Sadly, we’re not quite there yet. We can click on ‘more options’ to get advanced search, check this checkbox to select it, click on “search” to go back to results (where our new option isn’t reflected yet), then go back again to ‘more options’, and our choice remained! Great. But in fact we can never get rid of it, we try unchecking it, but it STILL remains in the URL. Oops.

This is because, similarly to the case in my first tutorial, the advanced search form is creating some html input ‘hidden’ fields storing your current search context, so when you switch to advanced search and then back you don’t lose your previous limits and such. But it’s adding our new “unstemmed_search” to these hidden fields, meaning it gets ‘stuck’ when selected regardless of our new checkbox.

There’s a helper method #advanced_search_context which comes up with the list of fields to include in these hidden fields (and also in the list, echoed back to the user, of your current search).  So, okay, we just want to over-ride this helper method to remove the :unstemmed_search param from its return values. Unfortunately, in Rails3 that seems to be impossible for reasons I can’t totally figure out.

So instead, we make another modification to our local view template, _advanced_search_form.html.erb: every time it calls advanced_search_context, we remove the :unstemmed_search param from it before passing it on:

advanced_search_context.tap{|h| h.delete(:unstemmed_search)}
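Why tap, rather than just calling delete? A minimal sketch of the difference, using a plain Hash as a stand-in for whatever advanced_search_context returns (the hash contents here are made up for illustration, and the .dup calls are only there so the example can show both behaviours side by side):

```ruby
# A plain Hash standing in for the return value of advanced_search_context
# (an assumption for illustration; the real helper builds this from the request).
context = { :q => "chomsky", :search_field => "author", :unstemmed_search => "1" }

# Hash#delete returns the removed value, not the remaining hash...
removed = context.dup.delete(:unstemmed_search)

# ...so tap is used to run the delete and still get the hash itself back.
cleaned = context.dup.tap { |h| h.delete(:unstemmed_search) }

puts removed.inspect
puts cleaned.inspect
```

Note that without the .dup, delete modifies the hash in place, which is what the one-liner in the view does to the hash the helper hands back.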

(We’ve actually left something broken here — the adv plugin tries to fetch facets from within the current search context, but NOT including things editable on the adv search page, to give facet values and counts appropriate for the fixed context. And we haven’t fixed that to leave off :unstemmed_search… but that whole feature ends up being so tricky to implement and extend, that I’m thinking it should be removed from the adv search plugin altogether, and am not going to bother to fix it for our unstemmed_search addition).

Add user feedback that option is selected

Okay, now we can add the option in, and it stays in. It doesn’t do anything yet, but we’ll get to that later. First, another problem is that once you’ve added it to your search, there’s no way for the user to know they’ve added it. And if a user looks at such a search with :unstemmed_search  in their search history, or adds it to their saved searches — there’s also no feedback saying that the search includes that option.

So I decided that we want the area under the search box that echoes back your current limits to include a bubble “Stemming disabled” when the :unstemmed_search option is checked. And since the search lists in Search History and Saved Searches are purely textual, they should also have some marker “stemming disabled” for searches with that option.

Turns out all of this is controlled by the Blacklight plugin helper module RenderConstraintsHelper.

Those bubbles on the search results are produced by the #render_constraints helper method. And the textual description of a search in ‘search history’ and ‘saved searches’ is produced by the #render_search_to_s.

So the basic idea is, I want to over-ride those methods in my local app, calling ‘super’ to get the default implementation, but adding on our new constraint, where applicable.

The bubble

So the default render_constraints first calls #render_constraints_query to output the bubbles for the query part, then calls #render_constraints_filters to output the bubbles for the facet part.

We could over-ride the whole thing, first call render_constraints_query, then output our new bubble, then call render_constraints_filters.  But I felt like being even more surgical. I first tried over-riding render_constraints_query … but it turns out that Advanced Search Plugin sometimes causes this method not even to be called, replacing it with its own. So okay, we’ll over-ride render_constraints_filters, adding our new constraint before the call to super.

(Blacklight 3.x). Create our own local app/helpers/render_unstemmed_constraints_helper.rb file. (It actually doesn’t matter what we call the file, all app/helpers/*.rb files are kind of munged together by Rails, by default in Rails2 and always in Rails3).

module RenderUnstemmedConstraintHelper

  # note: trying to over-ride render_constraints_query instead ended up
  # interfering with advanced_search_controller, which sometimes doesn't call
  # super. oh well.
  def render_constraints_filters(my_params = params)
    if my_params[:unstemmed_search]
      render_constraint_element(nil, "<em>Stemming disabled</em>", :escape_value => false, :remove => my_params.merge(:unstemmed_search => nil))
    else
      "".html_safe
    end + super(my_params)
  end
end

(note addition of “.html_safe” on the end of the empty string for Rails3)

If the parameter :unstemmed_search is present, we output a bubble for it. We use the existing Blacklight render_constraint_element method to output the bubble. This way we know we’re outputting a bubble in whatever Blacklight’s standard way is, and if Blacklight (or another local customization, or another plugin) changes the standard way a search constraint is output, no problem: we’re calling the method that does it without worrying about the implementation. I decided “Stemming disabled” should be in italics so it won’t be confused with a user-entered query for the string “Stemming disabled”. So we pass the “:escape_value => false” param to render_constraint_element, so we can pass in raw HTML. Then after we output our new constraint (or an empty string if it’s not selected), we call ‘super’ to output the usual output of render_constraints_filters.

In Blacklight/Rails3, this alone is enough, we’re done, our new helper method will be included just by being in the app/helpers directory, and call to ‘super’ will work to call ‘up’ to the original Blacklight engine implementation. In Blacklight/Rails2, you need to take another confusing step. In your app/controllers/application_controller.rb,  add this line inside the “class ApplicationController…” definition, to load my new helper (at the right time so call to ‘super’ will reach Blacklight’s original implementation). ‘include RenderUnstemmedConstraintHelper’
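The reason load order matters for ‘super’ can be seen in plain Ruby, outside of Rails entirely. This is only a sketch: StockConstraintsHelper, UnstemmedConstraintsHelper and FakeView are made-up stand-ins (not Blacklight or Rails classes), and the bubbles are reduced to bracketed strings.

```ruby
# Stand-in for Blacklight's helper module (hypothetical and much simplified).
module StockConstraintsHelper
  def render_constraints_filters(my_params)
    (my_params[:f] || {}).map { |field, value| "[#{field}: #{value}]" }.join
  end
end

# Our local override: emit our own bubble first, then defer to the stock version.
module UnstemmedConstraintsHelper
  def render_constraints_filters(my_params)
    if my_params[:unstemmed_search]
      "[Stemming disabled]"
    else
      ""
    end + super(my_params)
  end
end

# Include order matters: the module included last is searched first,
# so its call to super reaches the module that was included before it.
class FakeView
  include StockConstraintsHelper
  include UnstemmedConstraintsHelper
end

view = FakeView.new
with_option    = view.render_constraints_filters({ :f => { "format" => "Book" }, :unstemmed_search => "1" })
without_option = view.render_constraints_filters({ :f => { "format" => "Book" } })
puts with_option
puts without_option
```

If the two include lines were swapped, the stock method would be found first and our override would never run, which is roughly the situation the extra Rails2 step is working around.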

Bingo, it works, when “&unstemmed_search=1” is in the params, we get the feedback, after the query but before the filters:

Textual Representation

Pretty similar for the textual representation of a search, over-riding the analogous method for textual representation of a complete search, adding this to the render_unstemmed_constraints_helper.rb file we created in the last step:

  def render_search_to_s_filters(my_params)
    if my_params[:unstemmed_search]
      render_search_to_s_element(nil, "(Stemming disabled)")
    else
      ""
    end + super(my_params)
  end

The render_search_to_s_element doesn’t take an ‘:escape_value => false’ param (the comments suggest it should, but it doesn’t work, oops, my fault), so, oh well, just putting it in parens is good enough for now.

And that works reasonably well too, although the Search History/Saved Searches display is kind of messy before and after, it’s not an area of Blacklight that’s received much love.

Actual Functionality to disable stemming

Okay, we’ve got a “Disable automatic stemming” checkbox on the advanced search form. We can click it. Once we’ve clicked it, our search tells us it’s a “Stemming Disabled” search, and if we look at that search in the History or save it in Saved Searches, it keeps telling us that.

But it doesn’t actually do anything yet, a ‘stemming disabled’ search behaves exactly the same as any other search.  Okay, let’s make it actually do something.

Blacklight’s mapping from our URL parameters (where we put &unstemmed_search=1) to the actual Solr request parameters (that get us the results from Solr we’ll show to the user) is all implemented in the SolrHelper module. Despite the name, this is not a Rails template helper. It’s an ordinary Ruby module that gets mixed into Blacklight’s CatalogController (or potentially other controllers that want to support Solr searches) to provide methods to do Solr searches, including methods to map incoming URL request parameters to outgoing Solr request parameters.

The method that actually does this mapping is #solr_search_params.  Since Blacklight 2.9, this method actually calls out to a sequence of other methods that each do their part in the mapping. What sequence it calls out to is determined by an array of symbols representing method names, in the #solr_search_params_logic class method. This architecture is designed to make it easy for local apps and plugins to add and remove pieces of mapping logic, like we’re going to want to do here.
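To make that architecture concrete, here is a rough sketch of the general pattern in plain Ruby. It is not Blacklight’s actual code: FakeSearchController and the bodies of the two step methods are invented for illustration, and only the idea (an ordered list of method-name symbols, each called with the Solr params hash and the user params) is taken from the description above.

```ruby
# Simplified stand-in for a controller that builds Solr params from an
# ordered list of method names (hypothetical; not Blacklight's real code).
class FakeSearchController
  # Class-level list of mapping steps, in the order they should run.
  def self.solr_search_params_logic
    @solr_search_params_logic ||= [:default_solr_parameters, :add_query_to_solr]
  end

  # Call each step in turn, letting it add to the same outgoing hash.
  def solr_search_params(user_params)
    solr_parameters = {}
    self.class.solr_search_params_logic.each do |method_name|
      send(method_name, solr_parameters, user_params)
    end
    solr_parameters
  end

  def default_solr_parameters(solr_parameters, _user_params)
    solr_parameters[:rows] = 10
  end

  def add_query_to_solr(solr_parameters, user_params)
    solr_parameters[:q] = user_params[:q]
  end
end

built = FakeSearchController.new.solr_search_params({ :q => "books" })
puts built.inspect
```

Because the list is just an array, adding a step is a matter of appending a symbol, and removing one is a matter of deleting it.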

But I had a bit of trouble figuring out how to do it. Recall that we want to switch what Solr fields are referenced in the dismax ‘qf’ parameter sent to Solr, when the unstemmed_search option is checked. The #add_query_to_solr method is responsible for adding the ‘qf’ for a given search field.  So my first try was to de-activate #add_query_to_solr, and add in my own new #add_query_with_unstemmed_to_solr. This required me to copy-and-paste a bunch of logic I didn’t want to change, which is undesirable. But the real deal-breaker is that while this approach worked with a ‘simple’ search, it did NOT work to add “unstemmed search” logic to an “advanced search”, because the advanced search logic piece already switches out #add_query_to_solr for its own, when doing an advanced search.

So I thought about how to customize the advanced search solr param generation too, but it was looking dire: this was going to be complicated and fragile code. All this stuff is more complicated too because of the diversity of ways Blacklight lets you set up your Solr requests. For instance, you can use Solr LocalParams syntax, where the ‘qf’ gets embedded in the Solr ‘q’ parameter instead of being its own parameter, which makes it even harder to come in after the fact in our Blacklight process and twiddle it a bit by looking at the generated hash.

Thinking about that, and realizing that I was using the Solr LocalParams syntax in my setup, I realized there was a simpler way that relies on it. It has some downsides, but it was simple and it worked. Let’s take a look at what’s going on in my Blacklight.

The Blacklight to Solr flow for a fielded search

In my config/initializers/blacklight_config.rb, I configure my ‘title’ field, for an example, like this:

config[:search_fields] << {:display_label => 'Title', :key => "title",
    :solr_local_parameters => {
      :qf => "$title_qf",
      :pf => "$title_pf"
    },
    :solr_parameters =>{
      :"spellcheck.dictionary" => "title"
    }
  }

That means the actual search query to Solr looks something like this:

&q={! qf=$title_qf pf=$title_pf}some search terms

That stuff in the {! … } is Solr LocalParams syntax. Which means, for the dismax ‘qf’ parameter, use value found in another request parameter called “title_qf”.  But Blacklight isn’t actually sending a title_qf parameter normally, so where does this come from? A default value set in my solrconfig.xml:

    <lst name="defaults">
      <str name="title_qf">
        title1_unstem_search^80
        title1_t^30
        [...]
      </str>
      [...]
    </lst>

And for an advanced multi-field search, it ends up even more complicated, but still, the way I have things set up, using the LocalParams stuff. Say we searched for title:Consent AND author:Chomsky, what’s sent to Solr is something like this (but with all sorts of escaping I’m not including here for legibility):

  &defType=lucene&q=_query_:"{! qf=$title_qf pf=$title_pf}Consent" AND _query_:"{! qf=$author_qf pf=$author_pf}Chomsky"

It’s a whole lot more complicated, but still relying on those LocalParams, which get their value from another Solr request param — or in my case, from a default value set in the Solr request handler.

Aha! If, in the case of an unstemmed_search, we do send a value for “title_qf”, “author_qf”, etc., then it will use the value we send instead of using the default from the solr request handler. And we have a way to easily selectively change the ‘qf’, that doesn’t require us to care anything about the details of how some other part of Blacklight (or advanced search plugin) is actually constructing these queries. The downside is we’re going to get some really long ugly queries in our Solr logs for unstemmed_searches, but oh well, can’t have everything.
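The “request value wins over handler default” rule is the whole trick, so here is a tiny model of it in Ruby. This only imitates the lookup order described above; it is not Solr code, the dereference method is invented for the example, and the field lists are abbreviated stand-ins for what would be in solrconfig.xml.

```ruby
# Defaults as they might sit in the Solr request handler
# (abbreviated, hypothetical stand-ins for the solrconfig.xml values).
HANDLER_DEFAULTS = {
  "title_qf" => "title1_unstem^80 title1_t^30",
  "title_pf" => "title1_unstem^800"
}

# Resolve a "$name" reference: a value sent with the request wins,
# otherwise fall back to the handler default.
def dereference(value, request_params)
  return value unless value.start_with?("$")
  name = value[1..-1]
  request_params.fetch(name) { HANDLER_DEFAULTS[name] }
end

normal_request    = {}
unstemmed_request = { "title_qf" => "title1_unstem^80" }

normal_qf    = dereference("$title_qf", normal_request)
unstemmed_qf = dereference("$title_qf", unstemmed_request)
puts normal_qf
puts unstemmed_qf
```

Nothing about how the ‘q’ itself gets built needs to change; only the extra request params differ between the two cases.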

Implementation

Starting with title_search. So the full $title_qf defined in my solrconfig.xml is actually:

  • title1_unstem_search^80
  • title2_unstem^60
  • title3_unstem^30
  • title1_t^30
  • title2_t^25
  • title3_t^10
  • title_series_unstem^25
  • title_series_t^10

The *_t ones are stemmed fields, the *_unstem ones are not. First I create a new config variable in my Blacklight.config object to hold the value I want the title_qf to have in unstemmed_search mode: that same list, but including only the *_unstem Solr fields.

config/initializers/blacklight_config.rb:

config[:unstemmed_overrides] = {}
config[:unstemmed_overrides][:title_qf] = [
      "title1_unstem^80",
      "title2_unstem^60",
      "title3_unstem^30",
      "title_series_unstem^25"
    ]
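I wrote that list out by hand, but the rule behind it (“drop anything whose field name ends in _t”) is simple enough to sketch in Ruby. This is just an illustration of the filtering rule, assuming the naming convention holds for every field; unstemmed_only is a made-up helper, and the full list is hard-coded here rather than read from solrconfig.xml.

```ruby
# The full title qf list, hard-coded here for illustration only.
FULL_TITLE_QF = %w[
  title1_unstem_search^80 title2_unstem^60 title3_unstem^30
  title1_t^30 title2_t^25 title3_t^10
  title_series_unstem^25 title_series_t^10
]

# Keep only entries whose field name (the part before "^") does not end
# in "_t", i.e. drop the stemmed fields and keep the boosts on the rest.
def unstemmed_only(qf_entries)
  qf_entries.reject { |entry| entry.split("^").first.end_with?("_t") }
end

title_unstemmed_qf = unstemmed_only(FULL_TITLE_QF)
puts title_unstemmed_qf.inspect
```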

Okay, now we create the actual logic to apply this in case of unstemmed_search mode.

I create a file in my local app, lib/unstem_solr_params.rb, to define a ‘module UnstemSolrParams’.  I define a method like this in it:

    ##
    # If unstemmed_search is selected, then we add params to redefine
    # things like $author_qf, $title_qf, etc.. Normally those are supplied
    # by Solr solrconfig.xml defaults, but we define em explicitly in the
    # request to contain only unstemmed fields.
    def add_unstemmed_overrides_to_solr(solr_parameters, user_parameters)

      if user_parameters[:unstemmed_search]
        Blacklight.config[:unstemmed_overrides].each_pair do |key, value|
          solr_parameters[key] = value
        end
      end            

      return solr_parameters
    end

We’re going to include this module in our CatalogController, and when we do we want it to automatically add a symbol for our new method, :add_unstemmed_overrides_to_solr , to the end of the class’s solr_search_params_logic list, so Blacklight will call our logic to add these params.

We can use ruby self.included to do that:

  def self.included(klass)
    # Append our new method to the end of solr_search_params_logic,
    # so it runs after the stock mapping steps.
    klass.solr_search_params_logic << :add_unstemmed_overrides_to_solr
  end

[link to complete module definition]
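Since the complete module isn’t reproduced inline, here is a self-contained sketch of how the two pieces fit together, runnable without Blacklight. UnstemParamsSketch and FakeCatalogController are invented stand-ins, the OVERRIDES constant replaces Blacklight.config[:unstemmed_overrides], and the override list is abbreviated.

```ruby
module UnstemParamsSketch
  # Stand-in for Blacklight.config[:unstemmed_overrides] (abbreviated).
  OVERRIDES = { :title_qf => ["title1_unstem^80"] }

  # Runs once when the module is included; klass is the including class.
  def self.included(klass)
    klass.solr_search_params_logic << :add_unstemmed_overrides_to_solr
  end

  def add_unstemmed_overrides_to_solr(solr_parameters, user_parameters)
    if user_parameters[:unstemmed_search]
      OVERRIDES.each_pair { |key, value| solr_parameters[key] = value }
    end
    solr_parameters
  end
end

# Minimal fake controller: just enough to show the hook and the pipeline.
class FakeCatalogController
  def self.solr_search_params_logic
    @logic ||= [:add_query_to_solr]
  end

  # The class method above must exist before the include, since the hook calls it.
  include UnstemParamsSketch

  def add_query_to_solr(solr_parameters, user_parameters)
    solr_parameters[:q] = user_parameters[:q]
  end

  def solr_search_params(user_parameters)
    self.class.solr_search_params_logic.each_with_object({}) do |name, solr|
      send(name, solr, user_parameters)
    end
  end
end

controller = FakeCatalogController.new
stemmed   = controller.solr_search_params({ :q => "book" })
unstemmed = controller.solr_search_params({ :q => "book", :unstemmed_search => "1" })
puts FakeCatalogController.solr_search_params_logic.inspect
puts stemmed.inspect
puts unstemmed.inspect
```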

And now, in our local lib/controllers/catalog_controller.rb (which you already have in Rails3, or see here for how to add in Blacklight/Rails2) we just want to include the new module we just created:

lib/controllers/catalog_controller.rb

require 'unstem_solr_params' # BL/Rails2, use 'require_dependency' instead

class CatalogController < ApplicationController
  include UnstemSolrParams

[...]

Now we’ve just got to go back to our config[:unstemmed_overrides], and add definitions for each of the other search types we have that need unstemmed alternatives. For some search types (like author) I never use stemmed Solr fields, so they don’t need alternatives. One search type I had before, series_title, wasn’t using the :solr_local_parameters config for Solr LocalParams, it was using straight :solr_parameters instead, so it couldn’t use this technique. So I just changed it to use the LocalParams technique.

And the only weird one was our default “all fields” search. That one didn’t use LocalParams or ordinary params to send a ‘qf’, it didn’t send a ‘qf’ at all, it just relied on the default ‘qf’ defined in the solrconfig.xml request handler. Okay, so when we are in unstemmed mode, we send a whole new ‘qf’ to replace that default one, without any stemmed fields.

Here’s my complete config[:unstemmed_overrides] along with the relevant parts of my solrconfig.xml.

That’s it, and we try it, and it actually works.

One downside is we get REALLY long ugly URLs sent to Solr, in our Solr logs, when we’re in un-stemmed mode, because it includes ALL those over-rides on every unstemmed_search request (it doesn’t try to include just the ones for the particular field searched, because advanced search can include multiple fields, and we want to catch those too).

Pros of this approach

  • It works.
  • It’s not very much code.
  • It is very loosely coupled to Blacklight and advanced search plugin. We don’t care HOW they create their Solr request, as long as they use Solr LocalParams to do so, using the names we expect, it’ll keep working.

Cons of this approach

  • Very long ugly requests to Solr
  • Splitting our Solr ‘qf’ config in several places that need to be kept sync’d. The usual ones are in Solr’s solrconfig.xml, but the unstemmed over-rides  are in blacklight config ruby code — but are based on the ones in solrconfig.xml, and a change in solrconfig.xml will require a change in the ruby config. 
  • While it’s not sensitive to Blacklight code structure, it IS very sensitive to our Blacklight/Solr configuration. It only works using the LocalParams method of setting qf.  If we change the name of our various Solr params and such, it’ll break.

Improvements that could be made in Solr itself to help with this use case

Solr dismax only supports dollar-sign param de-referencing in LocalParams syntax, embedded in a ‘q’ using {! } syntax.  What if Solr dismax allowed param dereferencing in ordinary request params too, and even would follow a chain of multiple de-references? Then we could do something like (shown without necessary escaping for clarity):

?q={! qf=$title_qf}search words
&title_qf=$title_unstemmed_qf 

# And in the solrconfig.xml itself, define
# our title_unstemmed_qf as a default request param value

That would take out a couple of the ‘con’s if Solr dismax would do that. Probably not too hard to add to solr dismax in fact, but my own familiarity with the Solr codebase and Java in general is pretty low, I probably won’t be doing it anytime soon.
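To pin down what I mean by following a chain of de-references, here is the wished-for behaviour modelled in a few lines of Ruby. To be clear, this is not how Solr works today and not a patch for it; resolve is an invented function, and the values are abbreviated examples.

```ruby
# Follow "$name" references through request params and handler defaults
# until a literal value is reached. Purely a model of the proposed
# behaviour; Solr does not currently resolve ordinary params this way.
def resolve(value, request_params, defaults, depth = 0)
  raise ArgumentError, "reference chain too deep" if depth > 10
  return value unless value.is_a?(String) && value.start_with?("$")
  name = value[1..-1]
  target = request_params.fetch(name) { defaults[name] }
  resolve(target, request_params, defaults, depth + 1)
end

# Both lists live in one place, as handler defaults (abbreviated).
defaults = {
  "title_qf"           => "title1_unstem^80 title1_t^30",
  "title_unstemmed_qf" => "title1_unstem^80"
}

# The request only has to point title_qf at the other default.
unstemmed_request = { "title_qf" => "$title_unstemmed_qf" }

chained = resolve("$title_qf", unstemmed_request, defaults)
plain   = resolve("$title_qf", {}, defaults)
puts chained
puts plain
```

With something like this, the request would stay short and both field lists would live in solrconfig.xml, which is where the two ‘cons’ would go away.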

Improvements that could be made in Blacklight to help with this use case

We wound up using this somewhat hacky approach of over-riding default values for LocalParams because it was hard to surgically inject ourselves in Blacklight at the point of qf determination, what with there being basic search logic and advanced search logic and both handling both ordinary param literals and LocalParam de-references.

But both the basic search and advanced search use the exact same Blacklight call to come up with qf and other Solr parameters for a given search type:

Blacklight.search_field_def_for_key( search_type_key )

What if we could inject/over-ride there, to return a different search_field_def (the ‘qf’ etc) when we’re in unstemmed_search mode?

The problem is that this is on this weird global Blacklight object, which makes it somewhat hard to inject into, and also means the method definition of search_field_def_for_key doesn’t have any information about the current action context, the fact that :unstemmed_search is in the params, etc.

For a while I’ve wanted to move all this Blacklight config to be attached to the given controller (such as CatalogController) instead of a global object, and to be called by consumers not off a global class-level singleton, but via methods on the current controller instance.  This use case further pushes for that architecture: then it would be easy to over-ride a #search_field_def_for_key method on the controller, and have the logic return a different response if params[:unstemmed_search] is present.
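A sketch of what that could look like, in plain Ruby. None of this exists in Blacklight as described in this post: FakeBaseController, FakeUnstemController, and the "$title_unstemmed_qf" naming are all hypothetical, and the search field definition is a cut-down version of the title config shown earlier.

```ruby
# Hypothetical base controller where the search field lookup is an
# instance method, so it can see the current request's params.
class FakeBaseController
  SEARCH_FIELDS = {
    "title" => { :qf => "$title_qf", :pf => "$title_pf" }
  }

  attr_reader :params

  def initialize(params)
    @params = params
  end

  def search_field_def_for_key(key)
    SEARCH_FIELDS[key]
  end
end

# Local over-ride: same definition, but pointing qf at an unstemmed
# variant whenever :unstemmed_search is in the params.
class FakeUnstemController < FakeBaseController
  def search_field_def_for_key(key)
    field_def = super
    return field_def unless field_def && params[:unstemmed_search]
    field_def.merge(:qf => "$#{key}_unstemmed_qf")
  end
end

normal_def    = FakeUnstemController.new({}).search_field_def_for_key("title")
unstemmed_def = FakeUnstemController.new({ :unstemmed_search => "1" }).search_field_def_for_key("title")
puts normal_def.inspect
puts unstemmed_def.inspect
```

Both the basic and the advanced search code would then pick up the change for free, since they would both go through the same method.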

Conclusion

So, it works. And in actually very few lines of code, probably under 100. And it’s code that isn’t too sensitive to changes in Blacklight. We can probably update to future versions of Blacklight without breaking much, because we really surgically injected our code just where it should be for the behavior we wanted.  This shows that Blacklight’s architecture is pretty good for allowing this kind of customization.

Just because it’s few lines of code doesn’t mean it was easy though. It required a bunch of contextual knowledge about Solr, Blacklight, and Rails to come up with the code. I have pretty good contextual knowledge there, and it still took me a couple days to come up with.  In some ways, the more abstract and flexible we make Blacklight, the more contextual knowledge it takes to use that abstraction/flexibility — this is a trade-off I think is probably inherent to development of re-useable/customizable software frameworks.

Hopefully by writing this tutorial, I can help you improve your contextual knowledge of Blacklight, Solr, Rails, and their interactions, to make it easier for you to write such customizations too.

This was a really long post — I’m not sure anyone will actually read/use it. If you found it useful, please let me know in comments, so I know if it’s worth me spending the time to write such things!


One Response to customizing Blacklight: “disable automatic stemming”

  1. Dorothea says:

    Hard to know right off when a techie post is useful. I used to get email about my DSpace posts YEARS after their original posting.

    If I may suggest, there’s a good code4lib journal article in this post.
