Upgrading a Blacklight app from Solr 1.4 to Solr 4.3

Updated version of a post previously published, then unpublished; more content added.

Our public catalog is a Blacklight-based app. It went live using a Solr 1.4 index, and it still does.

Meanwhile, the latest Solr is now 4.3, and it's time to upgrade ours. I've spent a few days getting things working with Solr 4.3; I haven't deployed it to production yet, but it seems to be working out. I have a bit more testing to do, and then just need to figure out how to schedule the switchover.

Here are some notes on things I had to change or do for a switch all the way from 1.4 to 4.3. I think some of you are in the midst of just such a long-awaited jump-update too, or soon will be (because it's summer, maybe?), so maybe this will be helpful.

I can't cover everything I changed here exhaustively. I was basically guided by:

  • comparing my existing solrconfig.xml and schema.xml to the new example ones from the Solr 4.3 example app;
  • as well as comparing to the solrconfig.xml and schema.xml from the current Solr 4.x-ready Blacklight example solr app; although I am not sure any Blacklight committers are actually using the out-of-the-box example Solr in production, so I don't necessarily trust it to be optimal or error-free;
  • and, very usefully, checking the "Logging" section in the left-hand sidebar of the (amazingly improved) Solr admin console for deprecation notices and other warnings.

Update SolrMarc

If you use SolrMarc for indexing, you need to be using a version that’s compatible with Solr 4.3.

Blacklight comes with a SolrMarc.jar, but it can be confusing to figure out which version you have built-in, depending on what version of Blacklight you’re using, etc.

However, the built-in rake tasks in Blacklight that use solrmarc (such as `rake solr:marc:index`) will also use a local SolrMarc.jar if one is present.

I downloaded the most recent pre-built SolrMarc.jar, and put it in my local app at ./lib/SolrMarc.jar.  The built-in rake tasks will find it there, and use it.

Logging .jars

In Solr 4.3, for the first time, you need to supply your own logging-related jars; they aren't bundled in the Solr .war. At the time I was doing this, the Solr docs weren't entirely clear on this (to me, anyway), and I spent a bunch of time dealing with it.

If you just used the example jetty bundled with Solr 4.3, you wouldn't run into this issue, because it's already got the required jars. But we've been using tomcat here (not necessarily for any good reason; I don't necessarily recommend it), and were trying not to change all the moving parts at once as part of this switch.

So you've got to copy some jars from the example jetty to the relevant place in your tomcat (or other container). Otherwise you get some not-very-explanatory error messages.
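For the record, the jars in question live in `example/lib/ext` in the Solr 4.3 distribution: the slf4j/log4j family (slf4j-api, slf4j-log4j12, jcl-over-slf4j, jul-to-slf4j, and log4j itself, in whatever exact versions ship in your copy). You'll probably also want the `log4j.properties` from `example/resources` somewhere on your classpath, so log4j knows where to write.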

We have a tomcat "split install" where $CATALINA_HOME and $CATALINA_BASE are not the same. Ideally, it made sense to me to put the custom jars in the (instance-specific) $CATALINA_BASE's lib directory. But I could not get that to work (and could not figure out why not through debugging of tomcat's configuration files; it really looked like it ought to have worked), so I ended up giving up and putting them in the cross-instance $CATALINA_HOME.

New built-in ICU analyzers for unicode

Previously, with Solr 1.4, I was using Bob Haschart's custom unicode normalization jars for handling unicode.

(A number of us have been using these, but I am honestly not certain where, if anywhere, source and/or compiled jars for this code live on the web, or even what its official name is for googling. Sorry!)

These did a couple things:

  • Normalize all unicode to Normalization Form KC (NFKC). This is absolutely required for any indexing of unicode, really, to avoid unexpected behavior.
  • Additionally, normalize to an "English ascii" character set, for instance normalizing an accented "e" to plain "e". This is something you absolutely would not want to do in some non-English contexts, and even in English you may or may not want it. But for our use case it was essential: searching for "Simón Bolívar" should find "simon bolivar" and vice versa.

In Solr 1.4 there was no built-in way to do these things; thus Bob Haschart's custom code came to our rescue. Somewhere between then and Solr 4.3, a built-in way arrived: include the ICUFoldingFilterFactory in your analyzer chain, and it does both.

The ICUFoldingFilter, unlike Bob’s, also normalizes up/downcase, so I could eliminate the LowerCaseFilterFactory I was using before.

While I was at it, and noticing it was there, I also changed from the WhitespaceTokenizerFactory to the new multi-language-aware ICUTokenizerFactory. In addition to splitting on whitespace, for letters from non-Roman scripts it will sometimes tokenize in other language-appropriate places.

(Note, some, such as Tom Burton-West at umich and Naomi Dushay at Stanford, have found that the basic approach I’m using is not sufficient for effective Chinese/Japanese/Korean searching. But my Solr 1.4-based index wasn’t doing the things they are trying to do anyway, and at this point I’m not trying to improve that aspect of my index).

To use these ICU filters, you need to copy a number of optional jars from the Solr example distro to your Solr core `lib` folder (or other place on the classpath for your webapp).

  • icu4j-49.1.jar
  • lucene-analyzers-icu-4.3.0.jar
  • solr-analysis-extras-4.3.0.jar (Note: this one is not mentioned in some of the Solr documentation that mentions the other two, but it's needed too.)
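By the way, the core's `lib` folder gets picked up automatically, but if you'd rather keep the jars somewhere else, solrconfig.xml can load them with `<lib>` directives. A sketch with assumed locations; the `dir` paths are relative to the core's instance directory, so adjust to wherever you actually put the jars:

    <!-- in solrconfig.xml: load the ICU jars explicitly. These paths
         are assumptions; point them at your actual jar locations. -->
    <lib dir="./lib" regex="icu4j-.*\.jar" />
    <lib dir="./lib" regex="lucene-analyzers-icu-.*\.jar" />
    <lib dir="./lib" regex="solr-analysis-extras-.*\.jar" />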

Here’s my standard text field definition for Solr 4.3:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>

        <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>

        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

        <!-- folding needs to be after WordDelimiter, so WordDelimiter
             can do its thing with full cases and such -->
        <filter class="solr.ICUFoldingFilterFactory" />

        <!-- ICUFolding already includes lowercasing, no
             need for a separate lowercasing step
        <filter class="solr.LowerCaseFilterFactory"/>
        -->

        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

It is potentially an unusual choice to combine a WordDelimiterFilterFactory (which generally only does sensible things with English) with an ICUTokenizerFactory (which is intentionally language-agnostic, and may not be what you'd choose if you knew your input was English). But I decided it made sense for me. I want to optimize for English, and my corpus and users are mostly (but not entirely) English. Still, I want to do reasonable things with the non-English (which is generally mixed in with the English) where it can be done without causing significant problems for English, and I think I've done so here.

solr.UpdateRequestHandler replaces XML and Binary subclasses

This is just one example of things pointed out as warnings or deprecations when I checked the logs after starting up. I forget if this was just deprecated and would still work or not, but when doing a switch like this, I’d just as soon update anything deprecated that I notice.

Previously found in my 1.4 solrconfig:

    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
    <requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler" />

Now the standard solr.UpdateRequestHandler does both XML and binary, deciding which based on HTTP headers:

    <requestHandler name="/update" class="solr.UpdateRequestHandler" />
    <requestHandler name="/update/javabin" class="solr.UpdateRequestHandler" />
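(If I understand right, the consolidated handler keys off the Content-Type request header: XML bodies get the XML loader, `application/javabin` bodies get the javabin loader, and so on. That's why both mappings can point at the same class.)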

Solr schema 1.3 => 1.5

My schema.xml in Solr 1.4 had a version="1.3" attribute in the root <schema> tag.

This actually identifies the version of the Solr schema language itself, for backwards compatibility maintenance. I could have left it 1.3 for compatibility with my existing schema, but I find it makes for less confusing maintenance in the long run to try and keep everything up to date when you do a migration like this.

So I changed this to "1.5", the latest version for Solr 4.3, and then made one change to my actual schema.xml to keep it (for now) semantically the same as it was at schema version 1.3, after consulting the docs on changes between versions:

  • Added `autoGeneratePhraseQueries="true"` to the field type definitions of my text field types (shown below).
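Concretely, the two edits look like this (the schema and field type names are just whatever yours already are):

    <!-- schema.xml root element, bumped from version 1.3 to 1.5.
         "my-catalog" is a placeholder name. -->
    <schema name="my-catalog" version="1.5">

    <!-- at schema version 1.4+, autoGeneratePhraseQueries defaults to
         false; setting it true on text types keeps the old 1.3 behavior -->
    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">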

Update ISBN/LCCN normalization

I had been using some custom Solr analyzers from Bill Dueber to normalize ISBNs and LCCNs.  The idea is that ISBN-10 and ISBN-13 should be interchangeable, and so should various alternate forms of LCCN that mean the same thing according to LCCN normalization rules.

These custom analyzers had to be rejiggered a bit to work with Solr 4.3, which I somehow convinced Jay Luker to do for us out of pure kindness (my Java chops and my familiarity with Solr's under-the-hood source code are… not great).

The source is now available on GitHub if you want it: https://github.com/billdueber/solr-libstdnum-normalize
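If you're wondering what using it looks like, wiring a custom filter like this into schema.xml is the same as any built-in analyzer filter. A sketch only; the filter class below is a placeholder, not the real class name, so check the repo's README for the actual factory and jar installation:

    <!-- sketch of an ISBN field type with a normalizing filter.
         com.example.ISBNNormalizerFilterFactory is a made-up placeholder;
         use the actual class from solr-libstdnum-normalize. -->
    <fieldType name="isbn_t" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="com.example.ISBNNormalizerFilterFactory"/>
      </analyzer>
    </fieldType>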

Blacklight: Deal with weird escaping of Marc21 binary

We store Marc21 binary records in a Solr text field.

With Solr 1.4 and previous versions of SolrMarc, somehow this just worked. It's not clear to me if it should have, or what SolrMarc had to do (possibly using techniques not documented to be supported by Solr) to get the binary data, control characters included, into a Solr text field such that it somehow came out the other end as an intact string literal, control characters and all. I honestly don't really understand what was going on.

But I can say that with Solr 4.3 and the latest SolrMarc… it stopped working quite like that. The binary data still made it into the Solr text field, but when my Blacklight app on the other end made requests to Solr and received back responses with the stored field with the marc21 in it, the control characters came back oddly. For instance, the control character represented by hex `1D` (decimal 29) came back as the literal ASCII string "#29;". Which is some kind of escaping; I have no idea how it happens as an interaction of SolrMarc, Solr, and whatever else.

Long-term, the solution is to stop trying to use Solr in this weird undocumented way. Either store MarcXML or Marc-in-json (non-binary, ordinary text stream), or store Marc21 binary in a Solr ‘binary’ field (requiring Base64 encoding and decoding on the client end), or stop storing it in Solr at all, etc.
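For what it's worth, the binary-field option would look something like the following in schema.xml (the field name here is hypothetical). Solr's BinaryField stores bytes and hands them back Base64-encoded, which is why the client has to decode:

    <!-- sketch: store raw Marc21 bytes in a real binary field instead
         of smuggling control characters through a text field.
         "marc_binary" is a made-up field name. -->
    <fieldType name="binary" class="solr.BinaryField"/>
    <field name="marc_binary" type="binary" indexed="false" stored="true"/>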

In the near term, I just want to get things working as closely to how they did in my Solr 1.4 version as possible, with as little work as possible.

No problem: just add some code to my local Blacklight-based application to override the MARC-fetching code and properly unescape.

In my local `./app/models/solr_document.rb` (which Blacklight generated into the app), I added this (my Solr stored field holding the marc is called `marc_display`):

  # Custom hack to unescape Marc21 control characters that
  # SolrMarc escapes weirdly, and which somehow were automatically unescaped
  # in Solr 1.4, but no longer are with Solr 4.3. I don't understand it. This is a mess.
  # This is no longer needed when we stop using SolrMarc, or stop storing in binary Marc21,
  # or both.
  module LoadMarcEscapeFix
    def load_marc
      if _marc_format_type.to_s == "marc21"
        value = fetch(_marc_source_field)

        # SolrMarc escapes binary marc control chars like this, and we need to
        # unescape. Yes, we might theoretically improperly unescape literals too;
        # it's a hell of a system.
        value.gsub!("#29;", "\x1D")
        value.gsub!("#30;", "\x1E")
        value.gsub!("#31;", "\x1F")

        return MARC::Record.new_from_marc( value )
      else
        return super
      end
    end
  end
  use_extension(LoadMarcEscapeFix) do |document|
    document.key?( :marc_display )
  end

It might be nice to somehow contribute this fix back to Blacklight… but I can no longer comprehend what's going on: all the various interacting software, what's actually supposed to be happening at each point, what's intended by the code designers/maintainers (if anything), what the backwards-compatible choice is, etc. I'm overwhelmed. (And I'm still using Blacklight 3.5.0 here, not the latest BL.)

That’s about it

Doesn't sound too painful, does it? It still took me probably close to a week in aggregate (although spread out over a couple) to work through all this, plus some other details here and there, testing, etc. So it goes; it's why many of us have been running Solr 1.4 for so long. Hopefully these notes will reduce that time for someone else.

So now I mostly just need to figure out how to transition in production.

I already operate with a Solr master/slave replication, where I index to the master, and periodically replicate to the slave — only the slave serves queries from the production app.

Having these two Solr servers has definitely ended up a convenient setup for any kind of maintenance.
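For anyone who hasn't set this up, the relevant solrconfig.xml bits look something like the following sketch. The master URL and poll interval are made-up examples, and whether you poll automatically or trigger replication by hand (as I effectively do) is up to you:

    <!-- on the master -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,solrconfig.xml</str>
      </lst>
    </requestHandler>

    <!-- on the slave; masterUrl and pollInterval are placeholders -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master.example.edu:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>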

So to update with no downtime I’ll probably:

  • Turn off replication.
  • Upgrade my master to Solr 4.3, including switching to the core ‘conf’ directory I have in a “solr4.3” branch in my git project for my solr configurations.
  • Re-index my entire catalog to master (overnight).
  • Temporarily point the production apps at master instead of slave.
  • Upgrade my slave to Solr 4.3, replicate master to slave, and point the production apps back at slave.
  • Non-Profit!

One Response to Upgrading a Blacklight app from Solr 1.4 to Solr 4.3

  1. jrochkind says:

    Update:

    Turns out there was at least one more trick. Changing to the ICUTokenizer from my previous whitespace tokenizer made it eat punctuation, which meant my method of using synonyms to preserve things like “C++” started failing.

    Thanks to Naomi Dushay for pointing out the problem. Thanks to Bill Dueber for finding a pointer to the solution on the solr listserv. And thanks to Shawn Heisey for pointing out the solution on the solr listserv in the first place (and, I think, maybe helping to develop it too).

    Found the Latin-break-only-on-whitespace.rbbi file in the Solr source, moved it to my Solr core ‘conf’ directory, and changed the tokenizer definition as shown below.
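    In schema.xml terms, that's:

        <tokenizer class="solr.ICUTokenizerFactory"
                   rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>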

    …it LOOKS like it’s working in the Solr admin analyzer, but I need to reindex my db to confirm. Stay tuned.

    (What this should do, we think, is act like a whitespace tokenizer within substrings identified as Latin script, but still use other script-appropriate means of tokenization in substrings of graphemes identified as scripts other than Latin.)
