solr stop words/dismax gotcha

So in Solr, we’re used to stopwords just kind of magically working. If you enter a stopword in a query, it’ll be silently stripped out and ignored (unlike my legacy OPAC, which gives you zero results whenever you include a stopword!). If you include a stopword in a phrase search, it does even better: “kill a mockingbird” basically changes into “kill * mockingbird”, that is, kill and mockingbird separated by one word, and it successfully matches anything indexed with “kill a mockingbird” (along with any other “kill * mockingbird”).

Great! So normally we don’t have to think about it too much.
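To make the phrase behavior concrete, here is a toy sketch in Python. It is not Solr's or Lucene's code; the stopword list and function names are made up for illustration, and it assumes the analyzer keeps the original word positions when it drops a stopword (which is what produces the one-word gap).

```python
# Toy model of stopword-aware phrase matching; not Lucene's implementation.
STOPWORDS = {"a", "an", "the"}  # assumed list, for illustration only


def analyze(text):
    """Return (position, term) pairs; dropped stopwords leave a gap in positions."""
    return [(i, t) for i, t in enumerate(text.lower().split())
            if t not in STOPWORDS]


def phrase_matches(phrase, document):
    """True if the surviving phrase terms occur in the document at the same offsets."""
    query = analyze(phrase)
    if not query:
        return False
    positions = {}
    for pos, term in analyze(document):
        positions.setdefault(term, set()).add(pos)
    first_pos, first_term = query[0]
    for start in positions.get(first_term, ()):
        if all(start + (pos - first_pos) in positions.get(term, set())
               for pos, term in query):
            return True
    return False
```

In this model, `phrase_matches("kill a mockingbird", "to kill a mockingbird")` is true, and so is a match against "kill that mockingbird", but "kill mockingbird" with nothing in between does not match.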

An exception is when you throw dismax into the mix. Dismax lets you search multiple Solr fields at once (the qf parameter). It also lets you search with a multi-clause query where, depending on your “mm” settings, only SOME of those clauses have to match for a result to be included in the hit list.

So you have multiple Solr fields involved. As long as each of those fields is configured for stopwords (and the same stopwords), everything Just Works the way you’d expect. But if one of those fields does not have stopwords configured, then (depending on your mm settings) you can easily end up getting zero hits for any query with a (non-phrase) clause that is a stopword. This kind of makes sense when you think about it: since at least one field doesn’t strip stopwords, the query keeps a clause for the stopword you entered. That clause can’t possibly match on any of your stopword fields, since the word was removed from the query for those fields, so it can only match if the word happens to appear in a non-stopword field. When it doesn’t, it’s a clause that can’t match, which depending on your mm (and the contents of all your fields, phew) will result in no hits.
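Here is a rough simulation of that failure in Python. It is a simplified model of how dismax builds one clause per query word, not Solr's actual code. The field names `title` and `author_exact`, the stopword list, and the integer `mm` (a count of clauses that must match, with `None` meaning all of them) are assumptions for the example.

```python
# Toy model of dismax clause building with per-field stopwords; not Solr code.
STOPWORDS = {"a", "an", "the", "of"}  # assumed list

# Hypothetical fields: "title" strips stopwords, "author_exact" does not.
FIELD_STOPWORDS = {"title": STOPWORDS, "author_exact": set()}


def analyze(field, text):
    """Lowercase, split on whitespace, and drop that field's stopwords."""
    return [t for t in text.lower().split() if t not in FIELD_STOPWORDS[field]]


def build_clauses(query, qf):
    """One clause per query word, kept only if some field in qf keeps the word."""
    clauses = []
    for word in query.split():
        per_field = {}
        for field in qf:
            tokens = analyze(field, word)
            if tokens:
                per_field[field] = tokens[0]
        if per_field:  # a word dropped by every field vanishes from the query
            clauses.append(per_field)
    return clauses


def matches(doc, clauses, mm=None):
    """doc maps field -> raw text; mm is how many clauses must match (None = all)."""
    indexed = {f: set(analyze(f, text)) for f, text in doc.items()}
    hits = sum(
        any(term in indexed.get(field, set()) for field, term in clause.items())
        for clause in clauses
    )
    required = len(clauses) if mm is None else min(mm, len(clauses))
    return hits >= required


doc = {"title": "To Kill a Mockingbird", "author_exact": "Harper Lee"}
title_only = build_clauses("kill a mockingbird", ["title"])
mixed = build_clauses("kill a mockingbird", ["title", "author_exact"])
```

With `qf` set to just `title`, the word "a" disappears and the two remaining clauses both match. Add `author_exact` and a third clause for "a" survives; it can only match in `author_exact`, where it doesn't appear, so `matches(doc, mixed)` fails when all clauses are required and only passes with a looser setting like `mm=2`.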

A bit more information in this solr listserv thread.

If you have fields included in a dismax qf that all have stopwords configured, but with different stopword lists, the results could be even more confusing.

The solution?

If you are using dismax, make sure all fields included in a qf have exactly the same stopwords settings. Either they all need to have stopwords configured with the same stopwords file, or they all need to have stopwords not configured.
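One way to keep yourself honest about this is a small consistency check. The Python sketch below assumes you have already worked out, by hand or by your own script, which stopwords each qf field uses; it doesn't read a Solr schema, and the function and field names are made up for the example.

```python
from collections import defaultdict


def group_by_stopwords(field_stopwords):
    """Group field names by their (case-insensitive) stopword set."""
    groups = defaultdict(list)
    for field, words in field_stopwords.items():
        groups[frozenset(w.lower() for w in words)].append(field)
    return sorted(sorted(fields) for fields in groups.values())


def qf_is_consistent(field_stopwords):
    """True when every field shares one stopword configuration (including none)."""
    return len(group_by_stopwords(field_stopwords)) <= 1


# Hypothetical qf: two fields share a list, one has no stopwords at all.
qf_fields = {
    "title": {"a", "the"},
    "subject": {"The", "A"},
    "author_exact": set(),
}
groups = group_by_stopwords(qf_fields)
```

More than one group means the qf mixes stopword settings, which is exactly the situation that causes trouble with dismax.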

Just not using stopwords seems like the simplest solution to me. What’s the reason for stopwords in the first place? Generally performance: a very common word will end up with a huge result set when there’s a search clause on that word, which will slow down Lucene/Solr. My Solr is not as performant as I’d like, it’s true, but there are a whole bunch of different things I really need to look at for performance (so many that it’s kind of overwhelming to consider, honestly). Since using stopwords would make my Solr configuration more confusing and error prone, I think assuming that lack of stopwords is my most important bottleneck, without profiling of some kind, is a kind of “premature optimization”. So no stopwords for now.

Erik Hatcher suggested in an IRC chat that if very common words are a performance bottleneck, rather than stopwords it might make more sense to investigate Solr’s (or Lucene’s?) “commongrams capability”. Need to put that on my list to look into. I know little about that; I get the basic concept, but don’t know how it’s implemented in Solr/Lucene or how to set it up.
