Many text indexes end up removing punctuation, and thus have trouble retrieving terms like “C++” and “C#”. I think even Google used to, back in the day, although these days Google seems to index just about all punctuation just fine.
Many Solr-based indexes will have this problem, though. A common (default?) Solr ‘text’ analysis chain includes a WhitespaceTokenizer followed by a WordDelimiterFilter, which among other things ends up stripping punctuation (at least I think that’s what was doing it!). The StandardTokenizer will, I think, strip punctuation too.
When punctuation is stripped, a search for either “C” or “C#” will find all source documents containing either “C” or “C#” — there’s no way to distinguish them.
You could switch off punctuation stripping, but you’d probably have to give up all the other useful features of WordDelimiterFilter. And leaving punctuation in not only makes it harder to apply other reasonable normalization simply; in general it really is good to strip out punctuation. You don’t want “Title: Subtitle” to match only a search for the term “title:” and not “title”, or a word at the end of a sentence to match only when there’s a period.
The simplest thing to do is to keep stripping out punctuation in general. Not that there’s no other way to do it — Google leaves punctuation in, still with good results — but any really sophisticated alternate logic is something you’d have to spend time coming up with, and probably implement in Java to use with Solr!
So to deal with C++ and C#, the easiest thing to do is to whitelist certain known want-to-keep punctuation-including terms, using the solr synonym filter, as suggested at the end of this solr-user thread.
Keeping c# and c++
So I started out creating a punctuation-whitelist.txt synonym file to preserve c# and c++ like so:
c++ => cplusplus
C# => csharp
Include it in your Solr schema.xml analysis chain, before the WordDelimiterFilter, so “c++” is mapped to “cplusplus” before the WordDelimiterFilter removes punctuation (in both ‘query’ and ‘index’ time analysis):
<filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
Note ignoreCase, so we map both “C++” and “c++”. (It might seem you could instead make sure a LowerCaseFilter comes before this punctuation-whitelist synonym filter; but that would mess up the WordDelimiterFilter, which splits on mixed-case terms like HiFi or ThoughtWorks, so you don’t want to do that. My analysis chain does have a LowerCaseFilter, but after the WordDelimiterFilter.)
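Putting the ordering together, the relevant part of a fieldType in schema.xml might look something like the following. (A sketch, not my exact chain — I’ve left out stopwords, stemming, etc., and the WordDelimiterFilter attributes shown are just typical ones.)

```
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- map c++/C#/C♯ etc. to safe tokens BEFORE punctuation is stripped -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
    <!-- splits on punctuation and on case changes (HiFi, ThoughtWorks) -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" splitOnCaseChange="1"/>
    <!-- lowercase AFTER WordDelimiterFilter, so case-change splitting still works -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Since there’s a single &lt;analyzer&gt; with no type attribute, the same chain applies at both index and query time, which is what we want here.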
So this seemed good at first, but then I realized my actual corpus (from library cataloging records) sometimes used “C#”, but sometimes used a unicode musical sharp symbol instead: C♯. (Both are ‘correct’ for the programming language).
The musical sharp sign was also getting stripped out by something in my analysis chain, not even sure exactly what — but even if it wasn’t, I want “C#” and “C♯” to index the same. So no problem, include them both in the synonym line:
C#, C♯ => csharp
Putting utf-8 unicode in the synonym file works just fine — at least in this case. See below for some possible gotchas in the general case.
What this will do:
- Search for either “C#” or “C♯” will find anything in source material that included either “C#” or “C♯”. (It will also find anything in source material that included the actual word “csharp”, which I’m not sure is desired, but an unlikely word, and no disaster if it happens).
- Searching for “C” alone will not find source material “C#” or “C♯”,
- and searching for “C#” or “C♯” will not find source material “C” by itself. We think this is probably desirable behavior for our source material and use cases.
Oh but wait, musical keys too
So I started to add similar lines for F# and J# which are also programming languages, cause, why not. But then I realized that C# and F# are both musical keys too. And my corpus includes plenty of musical keys — almost always written using the actual unicode musical sharp, but occasionally using the ascii number-sign (#).
If we’re going to be doing this special normalization for C# and F#, I figured better do it for all the letters in musical notation A-G, or users who are searching for musical keys are going to be confused about why searching one way for C♯ works fine, but searching that same way for G♯ does not. Okay, no problem.
And then I realized: hey, if we care about consistency we’d better consider musical flats too. We now have a way to search for C♯ without having to figure out how to enter a musical sharp symbol on your keyboard (or copy and paste it, as I’ve been doing). But meanwhile B♭ is still in the situation C++ and C# used to be: it’s indexed identically to straight “B”. Even if you figure out how to enter a ♭, searches for either B♭ or B end up returning all the B♭’s and B’s in the source material; there’s no way to distinguish.
Okay, so the most common way to approximate musical flat sign in ascii seems to be lower-case ‘b’, so I added these synonyms for musical flat keys:
A♭ => Ab
...
G♭ => Gb
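So by the end, the whole punctuation-whitelist.txt ends up something like the following. (I’m guessing at the sharp-key right-hand-side tokens here — the important thing is just that each maps to a distinct, punctuation-free token unlikely to occur as a real word in the corpus.)

```
c++ => cplusplus
J# => jsharp
A#, A♯ => asharp
B#, B♯ => bsharp
C#, C♯ => csharp
D#, D♯ => dsharp
E#, E♯ => esharp
F#, F♯ => fsharp
G#, G♯ => gsharp
A♭ => Ab
B♭ => Bb
C♭ => Cb
D♭ => Db
E♭ => Eb
F♭ => Fb
G♭ => Gb
```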
What this will do:
- Searching for either B♭ or Bb (case-insensitive, ‘bb’ too, since my whole index is case insensitive), will find any source material that contained either B♭ or Bb,
- but will not find source material that contained a straight ‘B’.
- So we can distinguish between B♭ and B, but now we actually can’t distinguish between B♭ and Bb and bb, and there are some non-musical ‘bb’ or ‘cb’ or ‘ab’ etc in my source material. But this seemed good enough for our corpus and use cases.
Note on unicode and order of analysis gotcha
Any programmer working with unicode has got to understand unicode normalized forms, and the facts of unicode that make them necessary — that the exact same element of language (‘letter’, basically) can sometimes be represented by different combinations of unicode codepoints, and thus different bytes in UTF-8. Unicode’s own documentation on this is pretty accessible, although it’s a confusing enough topic that I periodically go back to review it for a refresher. (If that’s still confusing, check out Chapter 2 of the unicode standard itself for some background.)
Seriously, if you are doing anything with software and unicode and don’t understand composed vs. decomposed forms and normalization in general, you will mess it up at some point.
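For a quick concrete illustration of composed vs. decomposed forms — nothing Solr-specific, just Python’s stdlib unicodedata module:

```python
import unicodedata

composed = "\u00e9"        # é as one precomposed codepoint
decomposed = "e\u0301"     # 'e' followed by a combining acute accent

# They display identically, but are different codepoint sequences:
print(composed == decomposed)                                # False
# Normalizing both to the same form makes them compare equal:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# The musical sharp sign (U+266F) has no decomposition; it comes out
# unchanged under every normalization form, which is why I could get
# away without worrying about it here:
print(unicodedata.normalize("NFKD", "\u266f") == "\u266f")   # True
```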
So anyway, normally if you want to handle unicode properly in a Solr index, you’re going to need some unicode form normalization in your analysis chain, to make sure that different forms of the same characters are indexed the same, and searched the same when in queries. Probably form NFKC at least for fields meant for searching, although possibly a different normalized form on fields meant for sorting. It gets confusing.
I’m still using Solr 1.4, with some custom Java code by Bob Haschart that I believe takes care of the unicode normalization, although I don’t completely understand what it does, and am not sure it’s documented. In Solr 3.x, we have the various “ICU*” filters to do the same thing. I haven’t used those myself yet, and am not sure exactly what they all do, but I think various of them can be used to do unicode normalization in different ways. Absolutely vital if your corpus is going to include non-ascii you want to search properly.
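For what it’s worth, I believe the relevant Solr 3.x filter is ICUNormalizer2FilterFactory, from the analysis-extras contrib, configured something like this (untested by me):

```
<!-- normalize unicode to NFKC, early in the chain, before synonym
     mapping or word-splitting filters -->
<filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>
```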
Now, the musical sharp sign and musical flat sign I’m using here, I think do not have alternate forms, they do not have ‘decomposed’ forms, and are, I think, the same under all forms of unicode normalization. So I can get away without thinking about it.
But if you had a synonym whose left-hand side involved, say, an é, you could easily have problems if your synonym file matches input using only one of those forms and not all of them. So either you’re going to need to include variant terms using every possible utf-8 encoding of é on the left-hand side (a mess), or you’re going to have to make sure your synonym analysis comes after an analysis step that does unicode normalization, and then use the right normalized form in your synonym file. And I’m not even sure how you can be sure which form of a unicode letter is in your synonym file — I don’t know of a way to enter unicode in a synonym filter source file in an escaped hex or codepoint form, and otherwise you can’t tell by looking. Very confusing. And you want the synonym filter to come after whatever does unicode normalization, but still probably before analysis steps that do word-splitting — and if you’ve got one ICU* analyzer that does some of each, this can get really hard to do without writing Java. What a mess!
Thankfully I didn’t have to deal with that for musical sharp or flat — or at least things seem to be working fine without me dealing with it. Since musical sharp and flat don’t have ‘decomposed’ forms, I think we’re fine.
Alternate approaches: Multi-word, or writing Java
Another approach to using the synonym file but getting different searching behavior would be to use multi-word right-hand-sides:
C#, C♯ => c sharp
B♭ => B flat
Doing things this way:
- Searching for “C#” or “C♯” will find either “C#” or “C♯” in source material, and will also find “C sharp” in source material — which could be a good or bad side-effect.
- Searching for “C sharp”, even as two terms NOT phrase quoted, will find “C#” or “C♯”, as well as any other document including the terms “C” and “sharp”. Searching for “C sharp” phrase quoted will find “c sharp” phrase as well as “C#” or “C♯”.
- Search for “C#” or “C♯” will not find “C”.
- However, searching for “C” will find “C#”, “C♯” as well as just “C”.
- But phrase searching for “C programming language” as a phrase will not find “C# programming language” or “C♯ programming language”
- For flats, it’s even weirder: Searching for “B flat” will find “B♭” or literal “B flat” in source material, but will no longer find “Bb” or “bb”. Searching for “Bb” will only find “Bb” (case insensitive); it won’t find B♭. You might be able to do something with a query-time-only synonym (requiring a second synonym file) to restore the feature where searching for “Bb” still finds “B flat” while also finding “Bb”, while searching for B♭ will only find B♭ or “b flat”. But I get confused thinking about it. Maybe something like this:
Bb => Bb, B flat
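If you wanted to try that, it would mean wiring a second synonym file into only the query-side analyzer, with separate index- and query-time analyzers in schema.xml, something like this. (A sketch I haven’t tested; “query-flat-synonyms.txt” is a hypothetical file name, and I’ve elided the rest of each chain.)

```
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
  <!-- ... rest of index-time chain ... -->
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
  <!-- query-only expansion, e.g. Bb => Bb, B flat -->
  <filter class="solr.SynonymFilterFactory"
          synonyms="query-flat-synonyms.txt" ignoreCase="true" expand="true"/>
  <!-- ... rest of query-time chain ... -->
</analyzer>
```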
You could also of course use the single-word-left-hand-side approach for flats, but the two-word approach for sharps.
Note that the Solr synonym filter documentation warns against using two-word synonyms at query time, but my reading of the warning is that it only applies to two-word synonyms on the left-hand (input) side, not the right-hand (output) side as we’re doing here. I think.
Nonetheless, while some aspects of the two-word approach seem preferable, it gets complicated enough to understand all the implications that I think the one-word approach is better — and for my users probably has preferable behavior anyhow.
If you need different behavior than either of these synonym approaches will give you, you’re probably going to need to write your own analyzer in Java. In particular, if you want or need to use one of these “does a bunch of things at once” analyzers like the WordDelimiterFilter (or possibly one of the ICU filters that does a bunch of things at once too?), but need your special punctuation/weird-symbol replacement to happen in the middle of ’em, you’d probably need to write Java. Fortunately I’ve avoided ever having to write Java for Solr analysis thus far; it would be hard enough for me to do that I tend to err on the side of ‘good enough’ to avoid it.
Conclusion: Wow, complicated
The lesson here is that text indexing, including even some normalization, seems simple enough at first — but dealing with unusual cases can quickly become very complicated. There’s no one right answer to “how do I make C# searchable?”; there end up being at least a couple of easy approaches with different effects, as well as many more not-so-easy approaches if neither of those is suitable for you. You can end up with unexpected side-effects on other sorts of inputs, and whatever you do can be very sensitive to your exact analysis chain (including ordering), sometimes in unexpected ways. (And don’t forget unicode normalization.) Phew.
However, Solr’s analysis chain toolkit is pretty good, and there’s a lot you can do pretty simply. The two synonym filter approaches (single word or double word output) should be ‘good enough’ for many corpuses, I think, and are easy to implement.