ISSN search field in Solr

It’s fairly simple in the end, but it took me a while to figure out as a Solr newbie. So I’ll document it here, in case someone else wants to do the same thing, or finds it useful as a simple getting-started Solr example.

So I wanted a Solr indexed field for ISSNs. The important thing here is that a query gets a match whether it uses a hyphen (1234-5678) or not (12345678), and whether the original data used a hyphen or not. That’s pretty much the only interesting part of an ISSN field in Solr.

Additionally, I wanted to make it possible for multiple ISSNs to be in one “value” in Solr, to make it easy to index records where they might be stored that way. For instance, you could send an 022$a and the new 022$l in one “value”, or even ‘bad’ data that has multiple ISSNs in a single 022$a. Sure, you could have the indexer split ’em up first, but I figure it’s better to offload as much of this work to Solr as possible, so it’ll be there for any indexer that uses it.

So here’s my annotated Solr field type definition for issn.

<fieldType name="issn" class="solr.TextField" sortMissingLast="true" omitNorms="true">
 <analyzer>
     <!-- tokenize just splitting on whitespace. So if multiple ISSNs
            are present separated by whitespace, we'll catch em all
            in their own tokens. But note that means you can't have
            an ISSN like "1234 5678", that'll end up being considered
            two ISSNs. We're not using the StandardTokenizer,
            because we want to keep "1234-5678" as one token, not
            split it into two! Whitespace tokenizer is sufficient. -->
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>

     <!-- ISSNs can have an X as a last checksum 'digit'. While I
          think that's always supposed to be uppercased, I don't
          trust my data or my users entering queries to make it so,
          so change all lowercase x to X to make sure queries always
          match regardless of case of the X -->
     <filter class="solr.PatternReplaceFilterFactory"
         pattern="(x)" replacement="X" replace="all"
     />

     <!-- ISSNs are composed just of numbers and X. So strip out
          anything that isn't that. This will get rid of hyphens,
          so we get hits whether or not the hyphenation matches between
          original data and query. It will also turn any tokens
          that don't contain digits or X into empty strings. -->
     <filter class="solr.PatternReplaceFilterFactory"
       pattern="([^0-9X])" replacement="" replace="all"
      />

     <!-- get rid of empty string tokens. At first I didn't think that
          mattered, but if you don't, then you get weird behavior
          if someone enters a query that doesn't look like an ISSN.
          It gets analyzed into empty string tokens, which then
          match empty string tokens in the index, which gives you
          unexpected hits. I don't really care about the max
          chars, but the length filter requires one, so I
          just use a high number. -->
     <filter class="solr.LengthFilterFactory" min="1" max="100"/>
   </analyzer>
</fieldType>
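
To make the chain above concrete, here is a rough Python sketch that imitates the four analysis steps on a single value or query. It is only an approximation for illustration, not Solr's actual code: the function name analyze_issn is made up for this example, and Python's str.split() may treat a few exotic whitespace characters differently from Solr's whitespace tokenizer.

```python
import re

def analyze_issn(text):
    """Approximate the issn analyzer chain (hypothetical helper, not Solr code)."""
    # Step 1: split on whitespace, like WhitespaceTokenizerFactory
    tokens = text.split()
    # Step 2: uppercase any lowercase x, like the first PatternReplaceFilterFactory
    tokens = [t.replace("x", "X") for t in tokens]
    # Step 3: strip everything that is not a digit or X (this removes hyphens)
    tokens = [re.sub(r"[^0-9X]", "", t) for t in tokens]
    # Step 4: drop empty (or absurdly long) tokens, like LengthFilterFactory
    return [t for t in tokens if 1 <= len(t) <= 100]

print(analyze_issn("1234-5678"))           # hyphenated form
print(analyze_issn("12345678"))            # unhyphenated form, same token
print(analyze_issn("0317-847x 03178471"))  # two ISSNs in one value
print(analyze_issn("hello"))               # not an ISSN, no tokens left
```

Since the same chain runs at index time and at query time, "1234-5678" and "12345678" both come out as the single token 12345678, which is why they match each other. A query like "hello" ends up with no tokens at all instead of an empty-string token.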

So that’s the fieldType. Now we simply declare a field:

It’s multi-valued because a record can, and often does, have more than one ISSN. Although our field type above will in some cases have multiple ISSNs as different tokens in one value, it’s convenient to let the indexer send multiple values to Solr too (and that’s really preferable when the indexer can do it, since it’s cleaner). It’s not stored, because I just care about this as an index lookup; I display the actual ISSN by parsing MARC at display time.

<field name="issn" type="issn" indexed="true" stored="false" multiValued="true"/>

And for the record, here’s the SolrMarc definition that fills up this issn field. I chose to treat any ISSNs recorded for a series the record belongs to as ISSNs for the record itself; I think this will lead to expected behavior. (I did not include the 776x in the ISSN index, although our legacy OPAC seems to have done so; that seems weird to me and doesn’t seem to make sense. Anyone know if I’m missing something?)

issn = 022al:490x:440x:800x:810x:811x:830x
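
For anyone reading that spec string for the first time: each colon-separated piece is a three-character MARC tag followed by the subfield codes to pull from it. The Python sketch below just spells that out. It is my own illustration of how the string reads, not SolrMarc's parser, and parse_spec is a hypothetical helper name.

```python
def parse_spec(spec):
    """Split a SolrMarc-style field spec into (tag, subfield codes) pairs.

    Hypothetical helper for illustration only; it handles just the simple
    'tag + subfield letters' form used in this post.
    """
    pairs = []
    for part in spec.split(":"):
        part = part.strip()
        tag = part[:3]           # first three characters are the MARC tag
        codes = list(part[3:])   # each remaining character is a subfield code
        pairs.append((tag, codes))
    return pairs

for tag, codes in parse_spec("022al:490x:440x:800x:810x:811x:830x"):
    print(tag, codes)
```

So 022al reads as "subfields a and l of the 022", and the remaining entries each pull the x subfield from one of the series fields.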