A “print” format limit in a MARC-based catalog

So our Solr/Blacklight based catalog offers a ‘format’ facet, which is (much like our users’ own mental models of ‘format’ or ‘type’) a mish-mash of some media/carrier characteristics, some genre characteristics, etc.  Here’s a list of all of the categories over our entire corpus, but this list is of course more typically accessed as a ‘limit’ on a search.

Depending on when you read that, that list may or may not include the “Print” category. It doesn’t when I write this, but will in a day or two when reindexing completes.

Print was missing, and as we add more and more ebooks and other electronic content to our catalog, the lack of a print limit clearly left a user need unmet (“I want a book I can actually check out and take on the bus, I don’t want an ebook, especially when most of the things in your catalog you call ebooks I haven’t figured out how to load on my e-reader anyway.”)  And librarians were asking for it, on behalf of users.  So we asked them: “Do you really mean ‘print’, as in not including a DVD or a CD? Or do you mean ‘really in the library’, as in anything that’s a physical object, not online?”  They said they really meant ‘print’.

Now, the logic behind the existing ‘format’ facet was kind of ridiculous already — it is very difficult to get form/format/genre/carrier information out of AACR2-MARC. (To the extent that Michael Doran is proposing a talk on the subject at the Code4Lib Conf! Titled “Down the Rabbit Hole.” Yep, I feel ya Michael.)

(RDA may or may not make this easier, it’s trying to, and trying to in the right ways… but there are some significant problems with how the data is then encoded in MARC. But that’s not the topic of this blog post, and we’ll have a bunch of legacy AACR2-MARC data for the foreseeable future anyway).

The existing “figure out format categories from MARC” algorithms relied on an unholy combination of MARC leader bytes 6 and 7, 007 fields, and 008 fields, only occasionally supplemented by the GMD (245$h).  The details of how all those other limits were constructed are also not what this blog post is about.

What this blog post is about: How do you figure out if a bib is “print” or not from a MARC record?

Turns out it’s in some ways even harder than the other ones. The problem is that the origins of AACR2-MARC sort of assume print as a default: there are no leader bytes or 007 or 008 codes for ‘print’, so print is sort of the absence of anything else. But in a way that gets confusing to figure out. Especially when you keep in mind that a bib might have both a print copy hanging off it and something other than print hanging off the same bib (we have some bibs that represent both an online ‘copy’ and a print copy; perhaps others that have both a DVD and an accompanying book on the same bib; at different points in library history different ways of doing this were considered the ‘right’ way by different people).

What we ended up with mostly uses the GMD. For the non-print format designations, I had found the GMD was not sufficiently reliable, not as good as leader/007/008 at properly collocating things.  But for print, it really seemed to be the best option.

Thanks to the expert analytic and QA work of local cataloger Chris Case, here’s what we came up with:

  • If the record has at least one RDA-style 338 field with a subfield $2 of “rdacarrier”
    • then the bib should be considered “print” if and only if one of those RDA-style 338 fields has:
      • a subfield $a of one of “volume”, “card” or “sheet”;
      • or a subfield $b of the equivalent coded values (“nc”, “no”, and “nb”)
  • If the record does not have an RDA-style 338, then it should be considered print if and only if it has NO 245$h GMD. (Print is like the “null” GMD.)
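
A minimal Python sketch of the rule above. The record representation (a list of (tag, subfields) tuples) and the function names are assumptions made for illustration only; a real indexer would read these values through whatever MARC library it uses. Note that in this sketch a record with no 245 at all simply counts as having no GMD.

```python
from typing import List, Tuple

# Hypothetical in-memory shape: a field is (tag, [(subfield_code, value), ...])
# and a record is a list of such fields.
Field = Tuple[str, List[Tuple[str, str]]]

PRINT_CARRIER_TERMS = {"volume", "card", "sheet"}   # 338 $a
PRINT_CARRIER_CODES = {"nc", "no", "nb"}            # 338 $b equivalents


def subfield_values(field: Field, code: str) -> List[str]:
    """Return the normalized values of every subfield with the given code."""
    return [value.strip().lower() for c, value in field[1] if c == code]


def rda_carrier_fields(record: List[Field]) -> List[Field]:
    """Return the 338 fields whose $2 says they use the RDA carrier vocabulary."""
    return [
        f for f in record
        if f[0] == "338" and "rdacarrier" in subfield_values(f, "2")
    ]


def looks_like_print(record: List[Field]) -> bool:
    carriers = rda_carrier_fields(record)
    if carriers:
        # RDA record: print if and only if some 338 names a print carrier.
        return any(
            PRINT_CARRIER_TERMS & set(subfield_values(f, "a"))
            or PRINT_CARRIER_CODES & set(subfield_values(f, "b"))
            for f in carriers
        )
    # No RDA 338: print if and only if no 245 carries a $h GMD.
    return not any(f[0] == "245" and subfield_values(f, "h") for f in record)
```

For example, a record whose only 338 says “online resource” comes out as not print even though it has no GMD, because the 338 branch takes priority over the GMD check.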

Now, this surely isn’t perfect, it’s probably going to misclassify some things as print that aren’t, and miss some print things too. But it gets close enough for horseshoes and hand grenades.  Oh, wait, except….

Chris found it was especially likely to misclassify something as print that wasn’t, when that something was really an audio recording.  Especially for audio records older than 20 years or so; apparently because of different practices or common errors back in the day, records for audio recordings often had no GMD either. So additionally:

  • If the record has been classified as “Musical Recording” or “Non-Musical Recording” by the existing format classifier, then exclude it from consideration as ‘print’. Nothing marked Musical Recording or Non-Musical Recording will ever also be marked Print.
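
As a sketch, that exclusion is just a set check applied after the other format categories have been assigned. The function name, the `print_candidate` flag, and the “Book” label in the usage note are hypothetical; only the two recording category names come from the rule above.

```python
from typing import Iterable, Set

# Categories assigned by the existing format classifier that veto "Print".
AUDIO_FORMATS = {"Musical Recording", "Non-Musical Recording"}


def add_print_format(formats: Iterable[str], print_candidate: bool) -> Set[str]:
    """Return the format set, adding "Print" only when no audio category is present.

    `print_candidate` is assumed to be the result of the 338/GMD check.
    """
    result = set(formats)
    if print_candidate and not (result & AUDIO_FORMATS):
        result.add("Print")
    return result
```

So `add_print_format({"Book"}, True)` gains “Print”, while `add_print_format({"Musical Recording"}, True)` is returned unchanged.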

That might miss some records that legitimately have both an audio recording and a ‘print’ manifestation attached to the same record (book with accompanying CD?), but c’est la vie.

So there you have it.

Update: oops, it gets even more complicated than that. Turns out we’ve got about 300 records in our catalog that have no MARC 245 field at all.  This was tripping up my indexer, which assumed a 245 existed when checking for a 245$h.  So now the algorithm is enhanced: if there’s no 245 at all, we do not add the record to the “print” bucket. It’s a broken record (ha! “broken record”) and there’s nothing we can say about it.
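
A sketch of that guard in Python, again using a made-up record shape (a list of (tag, subfields) tuples) and a hypothetical function name. It only covers the GMD half of the check; the point is that a missing 245 is tested explicitly instead of being assumed away.

```python
from typing import List, Tuple

# Hypothetical in-memory shape: a record is a list of (tag, subfields) tuples,
# where subfields is a list of (code, value) pairs.
Field = Tuple[str, List[Tuple[str, str]]]


def print_by_gmd(record: List[Field]) -> bool:
    """GMD-based print check that tolerates records with no 245 at all."""
    title_fields = [subfields for tag, subfields in record if tag == "245"]
    if not title_fields:
        # Broken record: nothing can be said about it, so it is not print.
        return False
    # Print is the "null" GMD: no 245 may carry a $h.
    return not any(
        code == "h" for subfields in title_fields for code, _ in subfields
    )
```

Returning `False` for the broken records keeps them out of the “print” bucket without raising an error in the middle of an indexing run.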


2 Responses to A “print” format limit in a MARC-based catalog

  1. Alan Cockerill says:

    Our historical approach to format is a stone around our neck alright http://jculibrarytechnology.blogspot.com.au/2010/07/format-philosophy-of.html

    Don’t even get me started on kits.

  2. jrochkind says:

    > The format names are open to interpretation, overlap and ambiguity – for example ‘Government Document’ and ‘Report’ and ‘Web Resource’ could all be applied to the same document, but we can only map one to an item.

    At least we aren’t bound by that restriction in our custom Blacklight/Solr based solution: our documents can map to more than one format. Like I said, I haven’t tried to make a rational taxonomy of format/carrier/media/genre/form. Because our users’ own mental models have no such rationality and are just a big pile of overlapping categories, so is our format facet at the moment. Certainly that does cause some dilemmas for our interface and some confusion, but I think it’s the best we can do at the moment.
