Notes on oddities of Solr WordDelimiterFilter

An edited version of a post I sent to the Blacklight listserv…

I have a WordDelimiterFilter configured in my analysis for the ‘text’ type. I thought I originally inherited that from the Blacklight suggested configuration, although it doesn’t appear to be there at the moment, if I’m looking in the right place:

I’m not sure if this repo represents Stanford’s current code for its Blacklight-based catalog, but it has a WordDelimiterFilter much like mine:

Note that it’s got `splitOnCaseChange="1"`, for both index and query time (no separate index/query analysis). Mine has the same. However, Stanford applies the ICUFoldingFilter (case-insensitivity) _before_ the WDF, which probably means splitOnCaseChange isn’t actually doing anything: by the time the filter gets the tokens, there are no more case changes. In mine, I apply the ICUFoldingFilter _after_ the WDF, so the WDF can still do its thing.
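To make the ordering concrete, here is a minimal sketch of a single shared analyzer with the WDF before the case-folding filter. The attribute values shown are illustrative assumptions, not my exact production config:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDF runs first, so splitOnCaseChange can still see case changes -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="1"
            catenateWords="1" preserveOriginal="1"/>
    <!-- case folding applied after the WDF -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

In the Stanford-style ordering, the ICUFoldingFilter line would simply come before the WordDelimiterFilter line, at which point `splitOnCaseChange="1"` has nothing left to act on.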

I’ve noticed something unexpected and probably undesirable with my setup:

Specifically, if the query includes a mixed-case term like “DuBois”, I expected it would match source term “dubois” OR source term “du bois”.

But it turns out it _only_ matches source term “du bois”. That was unexpected for one user who noticed it and knew that our search was generally ‘case insensitive’: a search for “dubois” would match source term “dubois”, but a search for “DuBois” would not, violating their expectations. And I agree this is probably bad.

I thought the WDF could do what I wanted. But after spending a bunch of time with the docs, playing around with different configurations, and trying to get advice on the solr-user listserv, frankly, I’m still really confused about exactly what the WDF will do in various configurations; it’s a complicated thing.

But I think the WDF is not capable of doing quite what I expected.

I think what I need to do is split into separate index- and query-time analysis, identical in every way except that the query-time analysis sets splitOnCaseChange=0, while it remains on in the index-time analysis.
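A sketch of what that split might look like, again with illustrative attribute values rather than my exact config:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- index time: still split on case changes, so "DuBois" also indexes as "du bois" -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="1"
            catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query time: identical, except splitOnCaseChange is off -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0" generateWordParts="1"
            catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With this, a query for “DuBois” is folded to a single token “dubois” rather than being split, so it behaves the same as a query for “dubois”.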

The result of this seems to be that query-time “DuBois” will only match the single-word source term “dubois” (in any case; it’ll also still match source term “DuBois”). If it’s only going to match one of the choices, I think this is the right one.

Source material “DuBois” will still be indexed such that both queries “dubois” (or “DuBois”) and “du bois” will match it: case changes in the source are expanded to two words in the index, as an alternate, alongside the single word. But you can’t quite do the same thing at query time to allow a query with a case change, “DuBois”, to match both variants in the source.

I think this is probably the right thing to do, although in general, the WordDelimiterFilter is scaring me enough that if I had it to do over, I either wouldn’t use it at all, or would use it only with a very specific configuration designed to support specific, tested cases. As it is, I’m not quite sure what all it’s doing, and am scared to change it much. It’s odd to me that the example analysis configuration suggested in the Solr wiki for the WordDelimiterFilter would seem to be subject to the same problem.

I am curious if anyone has dealt with this and has any feedback. Especially Stanford, since I know they have a great test suite on their Solr configuration, although if that GitHub repo represents the current Stanford conf, the splitOnCaseChange=1 is probably having no effect at index OR query time, since there’s a case-normalization filter BEFORE it.


One thought on “Notes on oddities of Solr WordDelimiterFilter”

  1. I think this behavior is not a problem with WordDelimiterFilter per se, but rather is related to a more general issue with phrase searching: LUCENE-7398.

    A variant form of the problem (and a workaround for some cases) is well described by Mike McCandless in a post on Multi-Token Synonyms and Graph Queries. The issue is partly (but not entirely) a consequence of the fact that token PositionLengthAttribute is ignored at index time, and various phrase query implementations implicitly assume that “According to DuBois, the”, indexed as “0:according 1:to 2:(du|dubois) 3:bois 4:the” has a gap between “dubois” and “the”; thus, a 0-slop phrase query for “according to dubois the” sees a spurious gap, and fails to match.

    I have written a series of posts about this issue, about the implementation (recently released) of a candidate fix to address the issue, and about some potential implications if the fix works as intended:
