More dismax gotchas: varying field analysis and ‘mm’

There is a known gotcha with the Solr dismax query parser: using fields with differing stopword definitions together in a ‘qf’ results in ‘mm’ not being met even when you think it should be, and in fewer results in the result set than you expect.

See here and here to understand that problem. It turns out to be an even bigger problem than that, not limited to differing stopword definitions: in some cases, other differences in field analysis between two fields used in a dismax ‘qf’ can result in the exact same problem, as I will explain.

I had a case where there were two fields involved in a dismax ‘qf’, neither of which had any stopwords in their analysis at all, so I thought I was exempt from the issue. But I had a problem report for a query not behaving as expected, which turned out to be another manifestation of exactly the same issue.

One field in the dismax qf used an analyzer that stripped punctuation. (I’m actually not positive at this point _which_ filter in my chain was stripping punctuation. I’m using a bunch, including some custom ones, but I was aware that punctuation was being stripped, and this was intentional. I think actually maybe the WordDelimiterFilterFactory, yeah?)

So “monkey’s” turns into “monkey”. “monkey:” turns into “monkey”. So far so good. But what happens if you have punctuation all by itself, separated by whitespace? “Roosevelt & Churchill” turns into ['roosevelt', 'churchill']. That ampersand in the middle was stripped out, essentially _just as if_ it were a stopword. Only two tokens result from that input.

You can see where this is going — another field involved in the dismax qf did NOT strip out punctuation. So three tokens result from that input, ['Roosevelt', '&', 'Churchill'].

Now we have exactly the situation that gives rise to the dismax stopwords mm-behaving-funny problem; it’s exactly the same thing. The “&” becomes a clause that only the non-stripping field can satisfy, and a strict ‘mm’ then demands that it match. The result was 0 hits for “Roosevelt & Churchill” when the field-that-didn’t-strip-punctuation was included in the qf, but hits when it was not.

(As we expect ‘qf’ to work, adding a field to ‘qf’ should never result in fewer hits than the same ‘qf’ list without that added field. These are cases where it does, whether from differing stopwords or other differing analysis with the same effect.)
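To make the mechanics concrete, here is a rough simulation in Python rather than Solr. It is only a simplified model of how dismax appears to behave in this situation: the query is split on whitespace, each chunk becomes one clause that may match in any ‘qf’ field where the chunk analyzes to a term, and ‘mm’ is assumed to be 100%. The field names `title_stripped` and `title_exact` and both analyzer functions are hypothetical stand-ins, not the real analysis chains, and both lowercase so that punctuation is the only difference.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punct(chunk):
    # Hypothetical stand-in for the field that strips punctuation.
    term = chunk.translate(_PUNCT_TABLE).lower()
    return [term] if term else []

def keep_punct(chunk):
    # Hypothetical stand-in for the field that keeps punctuation.
    return [chunk.lower()]

ANALYZERS = {"title_stripped": strip_punct, "title_exact": keep_punct}

def index_doc(text):
    # Index the same text into every field, each with its own analysis.
    return {
        field: {term for chunk in text.split() for term in analyze(chunk)}
        for field, analyze in ANALYZERS.items()
    }

def build_clauses(query, qf):
    # One clause per whitespace chunk; each clause lists (field, term)
    # alternatives. A chunk that yields no term in any qf field is dropped.
    clauses = []
    for chunk in query.split():
        alternatives = [(f, t) for f in qf for t in ANALYZERS[f](chunk)]
        if alternatives:
            clauses.append(alternatives)
    return clauses

def matches(doc, query, qf):
    # Assumed mm=100%: every clause must match in at least one field.
    return all(
        any(term in doc[field] for field, term in alternatives)
        for alternatives in build_clauses(query, qf)
    )

doc = index_doc("Roosevelt and Churchill")
query = "Roosevelt & Churchill"
print(matches(doc, query, ["title_stripped"]))                 # True
print(matches(doc, query, ["title_stripped", "title_exact"]))  # False
```

With only the stripping field in ‘qf’, the “&” chunk produces no clause at all, so two clauses must match and both do. Adding the second field creates a third clause for “&” that only that field could satisfy, and the document no longer matches.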

Now I’ve fixed this for punctuation just by making those fields strip out punctuation too, by adding these filters to the bottom of the analysis chain in those previously-not-stripping-punctuation field definitions:

<!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
<filter class="solr.PatternReplaceFilterFactory"
    pattern="([\p{Punct}])" replacement="" replace="all"
/>
<!-- if after stripping punctuation we have any 0-length tokens, make
     sure to eliminate them. We can use LengthFilter min=1 for that;
     we don't care about the max here, so just pick a generous number. -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/> 
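As a sanity check on what those two filters do to a token stream, here is a small Python imitation. It is a sketch only, not Solr code: `string.punctuation` is used on the assumption that it covers the same ASCII characters as Java’s `\p{Punct}`, and the function names are made up for illustration.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def pattern_replace_punct(tokens):
    # Imitates the PatternReplaceFilterFactory above: delete every
    # punctuation character inside each token (replace="all").
    return [token.translate(_PUNCT_TABLE) for token in tokens]

def length_filter(tokens, min_len=1, max_len=100):
    # Imitates the LengthFilterFactory above: keep only tokens whose
    # length is within [min_len, max_len], which drops empty tokens.
    return [token for token in tokens if min_len <= len(token) <= max_len]

tokens = ["Roosevelt", "&", "Churchill"]
stripped = pattern_replace_punct(tokens)
print(stripped)                 # ['Roosevelt', '', 'Churchill']
print(length_filter(stripped))  # ['Roosevelt', 'Churchill']
```

The intermediate empty string is why the length filter is needed: without it, the “&” would survive as a zero-length token instead of disappearing.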

And things are working how I expect again, at least for this punctuation issue. But there may be other edge cases where differences in analysis result in a different number of tokens from different fields, which, if both fields are included in a dismax qf, will have bad effects on ‘mm’.

Dismax potentially dangerous with fields with any variation in analysis

The lesson, I think, is that the only absolutely safe way to use dismax ‘mm’ is when all fields in the ‘qf’ have exactly the same analysis. But obviously that’s not very practical; it destroys much of the power of dismax. And some differences in analysis are certainly acceptable, but it’s rather tricky to figure out whether your differences in analysis are going to be significant for this problem, under what input, and if so how to fix them. It is not an easy thing to do. It may seem to mostly work while there are still undiscovered edge-case queries that will cause it to break (and maybe they will never be discovered: NOT getting hits you really ought to be getting is not always a noticeable thing). So dismax definitely has this gotcha potentially waiting for you whenever you mix fields with different analysis in a ‘qf’.
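One way to hunt for such edge cases might be to run sample queries through each field’s analysis and flag any whitespace-separated chunk that produces tokens in some fields but none in others. The Python sketch below does that with hypothetical stand-in analyzers and field names; against a real index, the analyzer functions would have to be backed by Solr’s own analysis of each field (for example via its field analysis request handler), which is not shown here.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strips_punctuation(chunk):
    # Hypothetical stand-in for a field that strips punctuation.
    token = chunk.translate(_PUNCT_TABLE).lower()
    return [token] if token else []

def keeps_punctuation(chunk):
    # Hypothetical stand-in for a field that keeps punctuation.
    return [chunk.lower()]

def risky_chunks(query, analyzers):
    # Return (chunk, fields_with_no_tokens) for every chunk that vanishes
    # in some fields but not all; those are the chunks that can trip 'mm'.
    risky = []
    for chunk in query.split():
        empty = [name for name, analyze in analyzers.items() if not analyze(chunk)]
        if empty and len(empty) < len(analyzers):
            risky.append((chunk, empty))
    return risky

analyzers = {
    "title_stripped": strips_punctuation,
    "title_exact": keeps_punctuation,
}
for sample in ["Roosevelt & Churchill", "Roosevelt Churchill"]:
    print(sample, "->", risky_chunks(sample, analyzers))
```

This only catches the chunk-disappears-in-one-field case described in this post. Other analysis differences that change how many clauses a query produces would need their own checks.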

