It’s fairly simple in the end, but it took me a while to figure out as a Solr newbie. So I’ll document it here, in case someone else wants to do the same thing or finds it useful as a simple getting-started Solr example.
So I wanted a Solr indexed field for ISSNs. Mainly, the important thing here is that a query gets a match whether it uses a hyphen (1234-5678) or not (12345678), and whether the original data used a hyphen or not. That’s pretty much the only interesting part of an ISSN field in Solr.
Additionally, I wanted to make it possible for multiple ISSNs to be in one “value” in Solr, to make it easy to index records where they might be stored that way. For instance, you can send an 022$a and the new 022$l in one “value”, or even ‘bad’ data that has multiple ISSNs in a single 022$a. Sure, you could have the indexer split ’em up first, but I figure it’s better to offload as much of this work as possible to Solr, so it’ll be there for any indexer that uses it.
So here’s my annotated Solr field type definition for issn.
<fieldType name="issn" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Tokenize by splitting on whitespace only. So if multiple ISSNs are
         present separated by whitespace, we'll catch 'em all in their own
         tokens. But note that means you can't have an ISSN like "1234 5678";
         that'll end up being considered two ISSNs. We're not using the
         StandardTokenizer, because we want to keep "1234-5678" as one token,
         not split it into two! The whitespace tokenizer is sufficient. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>

    <!-- ISSNs can have an X as a last checksum 'digit'. While I think that's
         always supposed to be uppercased, I don't trust my data or my users
         entering queries to make it so. So change all lowercase x to X to
         make sure queries always match regardless of the case of the X. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(x)" replacement="X" replace="all"/>

    <!-- ISSNs are composed just of digits and X, so strip out anything that
         isn't one of those. This gets rid of hyphens, so we get hits whether
         or not the original data and the query agree on the hyphen. It also
         turns any token that has no digits or X at all into an empty
         string. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^0-9X])" replacement="" replace="all"/>

    <!-- Get rid of empty string tokens. At first I didn't think that
         mattered, but if you don't, you get weird behavior when someone
         enters a query that doesn't look like an ISSN. It gets analyzed into
         empty string tokens, which then match empty string tokens in the
         index, which gives you unexpected hits. I don't really care about
         the max chars, but the length filter requires one, so I just use a
         high number. -->
    <filter class="solr.LengthFilterFactory" min="1" max="100"/>
  </analyzer>
</fieldType>
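To make the effect of the analyzer chain above concrete, here is a rough Python imitation of it. This is only a sketch for reasoning about what ends up in the index, not what Solr actually runs: the function name `analyze_issn` is made up for illustration, and Python’s `str.split()` is assumed to be close enough to the whitespace tokenizer for ordinary data.

```python
import re


def analyze_issn(value):
    """Roughly imitate the issn analyzer chain and return the resulting tokens.

    Hypothetical helper for illustration only; Solr does the real analysis.
    """
    # WhitespaceTokenizerFactory: split on runs of whitespace.
    tokens = value.split()
    # First PatternReplaceFilterFactory: lowercase x becomes X.
    tokens = [t.replace("x", "X") for t in tokens]
    # Second PatternReplaceFilterFactory: drop everything except digits and X.
    tokens = [re.sub(r"[^0-9X]", "", t) for t in tokens]
    # LengthFilterFactory min=1 max=100: discard empty (or absurdly long) tokens.
    return [t for t in tokens if 1 <= len(t) <= 100]


# Hyphenated and unhyphenated forms end up as the same token.
print(analyze_issn("1234-5678") == analyze_issn("12345678"))
# Several ISSNs in one value become separate tokens, with x uppercased.
print(analyze_issn("0378-595x 1234-5678"))
# Input with no digits or X produces no tokens at all.
print(analyze_issn("hello world"))
```

Since the same chain runs at index time and at query time, a query matches whenever the two sides produce the same token, which is why the hyphen stops mattering.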
So that’s the fieldType; now we simply declare a field.
It’s multi-valued because a record can, and often does, have more than one ISSN. The field type above will sometimes hold multiple ISSNs as separate tokens within one value, but it’s still convenient to let the indexer send multiple values to Solr too, and that’s really the cleaner option when the indexer can do it. It’s not stored, because I only care about this as an index lookup; I display the actual ISSN by parsing the MARC at display time.
<field name="issn" type="issn" indexed="true" stored="false" multiValued="true"/>
And for the record, here’s the SolrMarc definition that fills up this issn field. I made the choice to assign any ISSNs recorded for a series the record belongs to as ISSNs for the record itself; I think this will lead to expected behavior? (I did not include the 776x in the ISSN index, although our legacy OPAC seems to have done so. That doesn’t make sense to me. Anyone know if I’m missing something?)
issn = 022al:490x:440x:800x:810x:811x:830x
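If the spec above looks cryptic, here is a small Python sketch that spells out how I read it: colon-separated entries, each a three-digit MARC tag followed by the subfield codes to pull from that tag. The function name `parse_field_spec` is made up for illustration, and it only handles the plain tag-plus-subfields form used here; SolrMarc itself accepts more syntax than this, which the sketch ignores.

```python
import re


def parse_field_spec(spec):
    """Split a spec like '022al:490x' into (tag, [subfield codes]) pairs.

    Hypothetical helper; covers only the simple form used in this post.
    """
    pairs = []
    for part in spec.split(":"):
        match = re.fullmatch(r"(\d{3})([a-z0-9]*)", part.strip())
        if match is None:
            raise ValueError(f"unrecognized field spec: {part!r}")
        pairs.append((match.group(1), list(match.group(2))))
    return pairs


ISSN_SPEC = "022al:490x:440x:800x:810x:811x:830x"

# Print each tag with its subfields in the usual $-notation.
for tag, codes in parse_field_spec(ISSN_SPEC):
    print(tag, " ".join("$" + code for code in codes))
```

So the record’s own ISSNs come from the 022 (subfields a and l), and the rest are subfield x of the series fields.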