So it can be very useful to have some broad categories as facet-style limits to partition your result set. When you get way too many results and want to narrow down, limiting to just “Art” or “Biology” or “Nursing” can be pretty useful. Actual LCSH are much too narrow a categorization to be useful this way (whether or not “de-coordinated” into subdivisions): the user may not be ready to narrow down to something as specific as an LCSH subdivision or heading, but may still want a broad categorization to explore the result set.
And a piece of traditional cataloging data actually provides some meaning which can be used for such a broad categorization as a limit, even though that’s not exactly how it was intended to be used: the classification, i.e. the “classified call number”, that is, a Dewey or Library of Congress Classification call number.
Most of our collection is LCC, although we’ve got a significant minority of Dewey (as well as plenty of records with no classified call number at all; more on that later). Dewey has a better hierarchical set of classes (which I don’t believe is publicly available in its entirety, but here’s the first couple of levels; you can pay OCLC for the rest), but LCC still has an okay couple of levels.
So as a first pass, I figured: for items that have an LCC, take the top-level LCC schedule name and use those as a set of broad categories. If you have an LC call number, you can take its first letter, easily determine the LC “schedule” it comes from, and use that as a broad category.
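In code, that naive first pass might look like this sketch (the function name is my own invention, but the schedule letters and names are the real LC top-level classes):

```python
# The top-level Library of Congress Classification schedules.
LCC_SCHEDULES = {
    "A": "General Works",
    "B": "Philosophy, Psychology, Religion",
    "C": "Auxiliary Sciences of History",
    "D": "World History",
    "E": "History: America",
    "F": "History: America",
    "G": "Geography, Anthropology, Recreation",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "L": "Education",
    "M": "Music",
    "N": "Fine Arts",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
    "T": "Technology",
    "U": "Military Science",
    "V": "Naval Science",
    "Z": "Bibliography, Library Science",
}

def naive_schedule(call_number):
    """Naively take the first letter of a call number and look up the
    LC schedule it (supposedly) comes from."""
    letter = call_number.strip()[:1].upper()
    return LCC_SCHEDULES.get(letter)
```

This is exactly the approach that goes wrong below, since it trusts that whatever it is handed really is a scheduled LCC.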
I figured I’d use both LCCs from the bib record itself (usually assigned by LC itself, or else by another cataloger and obtained by us from WorldCat), as well as any locally assigned LCCs, for maximum coverage.
Sounds simple, right? Oh, if only. It turns out that actually identifying an LCC in a bib record is anything but simple. And if your software takes the first letter from a call number that is NOT a Scheduled LCC (I say ‘Scheduled’ for reasons that will become apparent later), you’re going to put it in an arbitrary broad category that it wasn’t in fact classified as.
MARC 050 is titled “LC Call No.” At first I figured: great, so if something is in an 050, I can take the first letter and determine the LCC Schedule.
But it turns out not every “LC Call No.” is actually from the LC Schedules. This field is used for any call number assigned by LC, and not every call number assigned by LC is actually the kind of “Library of Congress Call Number” we think of; some of them are not from the schedules. (Thus I’ve started saying “Scheduled” or “Classified” LCC to be specific when talking about ones from the schedules. What do actual catalogers call these?)
The OCLC documentation has a few examples of such non-Scheduled LCCs, including “Microfilm 19072 E” and “Newspaper”. Have software just take the first letter of these, and you’ll wind up with M == “Music” and N == “Fine Arts”. Nope.
An additional example not mentioned in the OCLC documentation but found in the wild is an 050 of “MLCS 83/5180 (P)”. A local cataloger tells me this is called an “MLC”, and is sometimes assigned by LC as an “LC Call Number” when they don’t want to spend the time to actually create a classified call number. And it turns out these aren’t the extent of it; there are a whole bunch more non-Classified/Scheduled LC Call Numbers that can be in an 050. And they have exactly the same indicators and subfields as an actual Scheduled LCC; no clear distinction is made in the MARC record.
Oops. This is yet another example of MARC showing its age and origin as a format for specifying printed cards. If you’re just printing cards, who cares if the call number is classified or not? But it’s yet another example of catalogers spending expensive expert time creating data that cannot be unambiguously extracted from the records they create. There are all sorts of interesting things you might want software to do with a Classified/Scheduled LCC, but they all require being able to identify it as a Classified/Scheduled LCC!
So the MARC alone isn’t going to do that, but I still really want to. At first I considered trying to blacklist certain known starting strings, like “Newspaper”, “Microform”, and “MLCS”, having software know those are not Classified/Scheduled LCCs. But it is still very unclear to me whether any complete list of these “bad” prefixes exists, or even whether all the non-classified LCCs can be identified by prefixes at all.
So instead, we’re going to have to use a regular expression to have software positively identify 050s that do look like a classified LCC. This turns out to be trickier than you might think, too. It would be nice if LC itself provided some canonical regular expression guaranteed to match all legal Scheduled LCCs and unlikely to match other things, but they don’t, so you’ve got to reverse engineer one and hope you get it right. One or two letters followed by a number? Oops, sometimes it can be three letters. Followed by a number? Sometimes with no space between the letters and the number, but sometimes with a space.
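For illustration, here’s a much-simplified sketch of that kind of pattern. It is NOT Bill’s actual regex (which handles far more edge cases); it just checks the basic “one to three letters, optional space, then digits” shape, which is enough to reject things like “Microfilm 19072 E” and “MLCS 83/5180 (P)”:

```python
import re

# One to three capital letters, optional whitespace, then a class number
# (possibly with a decimal part). A fourth consecutive letter, as in
# "MLCS" or "Microfilm", causes the match to fail.
LCC_PATTERN = re.compile(r"^\s*([A-Z]{1,3})\s*\d+(\.\d+)?")

def looks_like_scheduled_lcc(s):
    """Return the class letters if s looks like a scheduled LCC, else None."""
    m = LCC_PATTERN.match(s.upper())
    return m.group(1) if m else None
```

For example, `looks_like_scheduled_lcc("QK3 .B592 1817")` accepts the call number and yields its class letters, while “Newspaper” and “MLCS 83/5180 (P)” are rejected.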
So, okay, after fighting with this for a while, I asked on the code4lib list, and several people had already experimented to come up with a good one (arriving at several different variations), but I eventually settled on my own local variation of an awfully crazy complicated one from Bill Dueber. (Thanks Bill for posting your work in public; not enough of us do that.)
And what about the 060? The 060 holds a National Library of Medicine (NLM) call number, and we’ve got a bunch of records in our database that don’t necessarily have an LCC anywhere on them, but do have an NLM number. NLM classification is sort of an ‘extension’ to LCC; its classes all begin with letters unused in the LCC.
For this initial broad categorization attempt, I figure: okay, anything with an NLM call number at all just gets called “Medicine”, one of the top-level LCC schedules, as a broad categorization.
I have no idea if the 060 suffers from the same issue as the 050, where sometimes it contains “NLM Call Numbers” that aren’t actually classified/scheduled. Since for now I’m just calling ANYTHING with an NLM call number “Medicine”, it doesn’t matter too much, but nonetheless, planning for the future, I run 060s through the same pattern and only pay attention to them if they match. (Do NLM call numbers follow a pattern that exactly matches Scheduled LCCs? I have no idea, but my code is assuming so right now.)
090 and 096
Okay, 050 and 060 are one place to find classified LCC/NLM call numbers. But if you forget the 090 and 096, you’ll end up failing to categorize some records whose only scheduled class numbers are in the 090/096.
090 is for “Locally Assigned LC-type Call Number”. How is this different from an 050 with second indicator 4, “Assigned by agency other than LC”? I have no idea. 096 is “Locally Assigned NLM-type Call Number”, and again I’m not sure how that’s different from an 060 with second indicator 4, “Assigned by agency other than NLM”. And on top of that, the OCLC documentation for 096 says “Call numbers based on National Library of Medicine (NLM) or Library of Congress (LC) classification schedules”. Wait, so a locally assigned LCC might be in an 096 instead of an 090, or instead of an 050 with second indicator 4? You got me.
Aside from being an interesting, ironic case of MARC seeming to make too many distinctions with unclear use, when it usually makes too few (like not being able to distinguish between classified and non-classified 050s), it doesn’t really matter for this particular project that I have no idea when a local LCC/NLMC would be in one field instead of the other. All that matters for this project is that I had better look at all of those fields if I want to minimize the “not able to categorize” records, and get a categorization out of anything that has a call number at all. Okay then.
But our own locally assigned LCC/NLMCs are even weirder than LC’s, in that we sometimes put prefixes in front of the actual Classified LCC. Apparently we’re not alone in this, because Bill Dueber’s code tried to account for it, using his own list of umich’s locally assigned prefixes. But we’ve got different prefixes here. For instance:
090 a| CAGE a| QK3 b| .B592 1817
Take that as it is and decide that C = “AUXILIARY SCIENCES OF HISTORY” (or, erm, “History”)? No, that’s not right. But we already learned that we had to put these things through a regex anyway to confirm “possibly a real classified LCC”. (Actually, the order I learned these things in isn’t nearly as neat as the order I’m recounting them for you, but anyway.) But if you put “CAGE QK3 .B592 1817” through the regex, it will just say “not a valid classified LCC”, and this thing might get thrown in the “unknown” bucket instead of possibly being correctly assigned to Q = “SCIENCE”.
So, okay, I could try to get a list of all the possible prefixes we use, and skip over them when determining a possibly valid classified LCC. That list might (or might not) exist locally, but not in any place easy for me to get at, and at this point (actually near the end of my travails, although only the middle of this blog post), I just wanted to be done with it.
So I noticed that they had handily put “$a CAGE” in a different $a subfield than “$a QK3”. I have no idea if that’s what everyone does, but it seems to be what has been done (mostly? always? I don’t know) here. So, okay, I just take each $a and match it against the pattern. “CAGE” is thrown out, then “QK3” is accepted by the pattern, and matches Q = Science. Great.
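In code, the per-$a pass might look something like this sketch (again using my simplified stand-in for the real regex). Note that it happily collects a schedule letter from every matching $a:

```python
import re

# Simplified stand-in for the real "looks like a scheduled LCC" regex.
LCC_PATTERN = re.compile(r"^\s*([A-Z]{1,3})\s*\d+")

def schedule_letters(subfield_a_values):
    """Run each $a through the pattern; collect the top-level schedule
    letter of every value that looks like a scheduled LCC."""
    letters = []
    for value in subfield_a_values:
        m = LCC_PATTERN.match(value.upper())
        if m:
            letters.append(m.group(1)[0])  # first letter = top-level schedule
    return letters

schedule_letters(["CAGE", "QK3"])  # "CAGE" is rejected, "QK3" gives "Q"
```

Because every matching $a contributes a letter, a record with a ‘cutter’ in a repeated $a gets two schedule letters, which is exactly the trap described next.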
Until I noticed that repeated $a’s are sometimes used in my local records, for some reason I don’t know, for the ‘cutter’ part of the call number:
096 a| QZ200 a| N277m no.53 1979
Wait, why is the “N277m” part in an $a instead of a $b? Because now what I arrived at to deal with the “$a CAGE” case is going to first decide that Q = Science (true!), then decide that N = “FINE ARTS”. No!
I think that’s just a cataloging error, and the N277 should be in a $b, but I’m not really sure. (The docs for 096 do say that $a is repeatable, although I’m not sure why. The docs for 090 say $a is not repeatable, which doesn’t mean it never repeats in our data.)
Nonetheless, at this point I’m really ready to be done with this and cry ‘good enough for now’, realizing I’m going down a bottomless pit of effort trying to make it ‘perfect’. So I just leave the algorithm as is, and that record indeed gets put into both the Science category (correct) and Fine Arts (incorrect).
Locally assigned call numbers
Now, our actual locally assigned call numbers on our individual items don’t necessarily appear in the MARC at all. Sometimes the locally assigned call number is put in the 090, but sometimes it isn’t, even when it is indeed a classified LC call number. I imagine practice here has changed several times over the many decades we’ve been cataloging.
Even after getting every single valid classified LCC/NLMC out of the records, there’s still a bunch of our database (nearly a third) with no LCC/NLMC at all, so I don’t want to miss any valid ones that are there.
So in addition to looking in the MARC, I want to get the call numbers actually assigned to our individual items, which our ILS does not put in the MARC ordinarily.
The ordinary export from our ILS doesn’t include these, but we had already paid John Craig of alpha-g consulting for a custom exporter that is capable of including local call numbers in the export, so it will be available for our Solr indexing stage for our discovery layer. Great.
Again, there’s the problem of telling the actual LCC/NLMCs from the chaff. At first (many months ago), I was thinking of the chaff as being Dewey, which would begin with a number and not map to an LC schedule anyway, so could just be ignored. But no such luck; there are all sorts of local call numbers that begin with a letter but are not a classified LCC/NLMC, from SuDoc classified numbers, to “VIDEO 1234”, to the aforementioned “CAGE [etc.] [followed by actual LCC!]”.
So, okay, we throw all of these through the regex too, and only use them for broad categorization if they match the pattern. In this case, the “CAGE” and the actual LCC aren’t in separate subfields, so at the moment my algorithm is missing out on classifying those. So be it.
I also realized that while SuDoc call numbers can’t be put into broad categories using the LCC Schedule, there is useful information there; these items didn’t need to just wind up in the “unknown” category. So everything with a SuDoc, let’s put into a “Government Publication” category. Not exactly a ‘discipline’, but it is a ‘perspective’, which is what ‘discipline’ really gets at; it seems more useful than not.
So how the heck do I tell if a local item call number is a SuDoc? Well, I could try to come up with a regex pattern for them, but man, I’m even less familiar with SuDocs, and not sure there’s any “prior art” to copy. So, okay: theoretically my ILS keeps track of “call number type” internally. So, okay, go back and reconfigure the alpha-g custom exporter to include that in the export too. (Possible to do because the alpha-g product is nice and flexible, although adding this additional element does slow down the already slow export.)
If the ILS internally says there’s a local item-level call number of type “SuDoc”, we throw it into the “Government Publication” category. Great. And in fact, for consistency, as long as we’re making a “Government Publication” category, if the record has any 086 at all, we’ll also throw it in “Government Publication”, even if it doesn’t have a locally assigned call number of type “SuDoc”.
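As a sketch, the decision rule is just a disjunction. The variable names and the “SuDoc” type label here are stand-ins for whatever your ILS export actually emits:

```python
def is_government_publication(item_call_number_types, marc_tags):
    """A record goes in the "Government Publication" category if any of its
    item-level call numbers is typed "SuDoc" in the ILS, or if the bib
    record has any 086 field at all."""
    return "SuDoc" in item_call_number_types or "086" in marc_tags
```

Keeping the 086 check in the disjunction is what makes the rule robust against the unreliable ILS call-number typing described next.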
That last bit ends up being handy, because it turns out the ILS-tracked “call number type” isn’t all that reliable. There can be a locally assigned SuDoc call number that isn’t actually marked as “SuDoc type” in our ILS. (After all, this wasn’t used for much before, except putting the call number in the right ‘call number browse’ list; apparently nobody noticed the many call numbers missing from those lists before.)
In fact, originally I had tried to just trust the internal ‘call number type’ to map to the LCC schedule too. But then I realized how unreliable it was (both false negatives and false positives). And by that time I had already been forced to use the regex pattern for the 050, so I swung back around and used it here too. Phew.
Did anyone actually make it this far? If you’re exhausted, you’re not half as exhausted as I got trying to implement it. It seemed so straightforward at first! And it’s still not perfect, it’s missing out on broad categorizations for some of those “[LOCAL PREFIX] [Classified LCC]” types, and it’s incorrectly categorizing some of those repeated-subfield-a types, and there are probably some other problems I have yet to discover. But I think it’s good enough for now. (Note: The final procedure I describe here hasn’t yet been applied to our live index as I write this, but it should be in place by tomorrow or the next day).
But I think this illustrates a couple of really unfortunate things about our current metadata environment that I keep harping on, because on some of the listservs I read, some catalogers are still resisting recognizing that they are unfortunate things at all.
1. Expensive cataloger time used to record things that can’t actually be used by software
A classified call number is actually a REALLY useful thing to have. Not just for printing on a spine and shelving a book, but for all sorts of interesting features: like this broad categorization, but also hierarchical navigation with multiple levels, or expanding the keyword index with additional words from the schedules. (I think someone wrote about the utility of doing that last one way back in the 80s, but I’m way too exhausted to go find a citation. Martha someone? Diane someone?)
And very expensive, very expert cataloger time is spent creating these classified call numbers, but then they’re stored in the record in such a way that software can’t reliably figure out whether or where they’re in there. I’m reduced to confused guessing.
I firmly believe that we will continue to need metadata experts (i.e., catalogers) for the foreseeable future. But inefficiently spending time recording info that cannot in fact be efficiently retrieved by software is not the way to make the case for that.
2. Documented… where?
To finally get to the bottom of all the stuff I’ve painfully recounted for you, it’s not like I just looked at some documentation; oh no. I did look at the OCLC documentation, but it was not nearly sufficient; while it hinted at the fact that non-classified LC Call Numbers might appear in the 050, I didn’t realize what it was really saying until I was forced to, going back to figure out why my initial naive approaches weren’t working. To really get to the bottom of it, I had to look at documentation in several different places, talk to several different catalogers (each of whom had to think carefully, as well as consult several different pieces of documentation, some of which are not in publicly available spots), and then talk to several different programmers who had tried to deal with this before too.
It’s not like you can just tell a programmer, “Oh, look at the documentation, it’ll explain what the data is so you can use it.” And you can’t even just say “ask a cataloger” either. It’s an incredibly complicated metadata regime we’ve got, where to figure out what’s really going on you’ve got to consult MARC documentation (both LC’s and OCLC’s doesn’t hurt), AACR2, ISBD, and LC documentation (some of which might be publicly accessible? but my catalogers just look at it on a paywalled site), and then there’s a whole bunch of stuff that’s still just “standard practice”, not necessarily documented anywhere, and not necessarily even clear to catalogers. (WHAT sorts of prefixed non-classified call numbers can appear in the 050, exactly?)
This LCC/NLMC problem isn’t even close to the worst one that I or other programmers have dealt with. All this makes trying to get the most out of our MARC data very expensive in terms of everyone’s time: the programmers’, the catalogers’ whom the programmers ask questions of, etc. Again, data that is very expensive to make use of is not the best way to make the case that cataloger staff positions are being used efficiently.
Anyhow, some next steps, further enhancements
So, I’ve got the broad categorization on LCC basically working. (Again, as of this writing it’s not deployed live yet, but will be in a day or two.) It’s mis-categorizing some things, which I could try to fix (necessarily by getting a list of all the local “prefixes” we use before LCCs; hmm, unless I can write a regex that ignores any arbitrary prefix while still matching LCCs and excluding most other things).
But in addition to that, I can think of two major enhancements that would certainly be interesting to spend time on, but that I probably won’t pursue until higher priorities are done, and not unless we have additional evidence that this kind of broad categorization is a significant benefit to users.
1. More useful categories
The LC top-level schedule names are better as broad categories than nothing, but they’re definitely a bit dated, both in terms of the labels they use (easy to fix) and, more importantly, in terms of the categories they represent. They’re not quite the categories that would match contemporary research needs well.
Umich has very usefully come up with a mapping from LCC to more useful modern categories. That mapping was pretty expensive for them to come up with and maintain, but certainly less expensive than trying to reclassify millions of documents according to some new system (outright infeasible).
So I might in the future want to put all my LCC’s through the umich mapping (which they’ve conveniently made available in XML), to get better categories. Umich’s mapping is customized for their own local academic divisions (schools, departments, programs etc), but it’s probably close enough to any other large research university to be useful. Or really close enough to just about any user group’s needs to be more useful than the straight LCC classification, and a lot easier than creating and maintaining a mapping of your own.
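I haven’t reproduced umich’s actual XML schema or ranges here, but once such a mapping is loaded, the core operation is a range-membership test on normalized call numbers. Here’s a sketch, with made-up ranges for illustration:

```python
import re

def lcc_key(call_number):
    """Normalize an LCC into a (letters, class number) tuple so that
    ranges can be compared with ordinary tuple ordering."""
    m = re.match(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)", call_number.upper())
    if not m:
        return None
    return (m.group(1), float(m.group(2)))

# Illustrative ranges only; NOT umich's actual mapping.
HYPOTHETICAL_RANGES = [
    ("QH301", "QH705", "Biology"),
    ("HF5001", "HF6182", "Business"),
]

def category_for(call_number, ranges=HYPOTHETICAL_RANGES):
    """Find the first range (start, end, label) containing the call number."""
    key = lcc_key(call_number)
    if key is None:
        return None
    for start, end, label in ranges:
        if lcc_key(start) <= key <= lcc_key(end):
            return label
    return None
```

The tuple comparison works because call numbers in the same class share their letters, so the comparison falls through to the numeric part; a real implementation would need to handle the edge cases in umich’s full range list.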
The umich mapping doesn’t include NLM call numbers, but the NLM schedule is a little less dated than the LCC one, so the actual labels from the NLM schedule could be used, instead of just calling all NLMs “Medicine”. (It might or might not be important to also do a bit of normalization between the NLM schedule labels and the health-related labels in the umich mapping.)
Also, I could similarly go further with SuDocs, using the SuDoc schedule to get more specific than “Government Publication”, perhaps normalizing labels to match the “Government Information” section of the umich mapping.
2. Classify more documents
Using the procedure described above, still around a third of our collection is left unclassified. I believe these are by and large items that really lack any usable classified LCC, NLM, or SuDoc call number, rather than items with classified call numbers missed by the procedure. Examples include third-party-vendor-supplied records for electronic books, which usually lack 050s in the bibs, and for which we don’t locally assign call numbers either (we don’t put them on a shelf, so why would we? And even with this new use, we simply don’t have the resources to). Or items that are on our shelves with unclassified call numbers like “VIDEO 1234”, and which also have no classified call numbers in the 050/060/086/etc.
I can think of a couple interesting ways to try to automatically classify these records into broad categories.
One might be to try using the OCLC Classify service. Although for best use that requires supplying an OCLC number, LCCN, or ISBN, which not all my records have. (And for those that do… oh boy, see my rant on the 020$z issue. If an 020$z is a print ISBN on an e-record, that would be the best one to use with the Classify service, most likely to get a match in WorldCat. But if the 020$z is truly a mis-assigned ISBN, then using it would get the wrong record. And there’s no good way to know which a given 020$z is.) Theoretically, author/title could be used instead, with presumably more risk of both false negatives and false positives; it’s still just trying to match a record in WorldCat and give you its classified call numbers.
Or the Classify service allows you to supply a FAST subject heading. But geez, people, I don’t have FAST subject headings; why don’t you let me give you an LCSH? Or even multiple LCSHs? The FAST Classify lookup is probably using something similar to another option I thought of, one that could be done locally:
The other option is one I can do in my local catalog. Say I have a record that has no classified call number, but does have subject headings. Then I look up records in my local catalog that share as many of those subject headings as possible (either all of them, or if none share all, whichever share the most). Then, from that result set, I see what the most popular broad categories are, and figure those broad categories are probably appropriate for THIS record too. Wow, that’s a pretty clever idea, huh? Once you have your records loaded in Solr anyway, as I do, Solr faceting can be used to answer this question. Or you could even use the Solr “more like this” function to do the same thing on records with no subject headings at all, just finding records with similar words as a whole, although I’m not sure how well that would work on library metadata records (as opposed to full text, which it’s intended for). Would be fun to play around with, though.
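A sketch of what that Solr lookup might be, as a facet query. The field names here (“subject_facet”, “broad_category_facet”) are stand-ins for whatever your own Solr schema uses:

```python
def broad_category_query(subject_headings, rows=0):
    """Build Solr query params: match records sharing any of these subject
    headings, then facet on broad category to see which categories
    dominate the result set."""
    clauses = ['subject_facet:"%s"' % s.replace('"', '\\"')
               for s in subject_headings]
    return {
        "q": " OR ".join(clauses),
        "rows": rows,  # we only want the facet counts, not the records
        "facet": "true",
        "facet.field": "broad_category_facet",
        "facet.mincount": 1,
    }
```

An OR query plus facet counts approximates “whichever records share the most headings”, since records matching more clauses score higher and every match contributes to the facet tallies; a stricter implementation would re-query with all headings ANDed first.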
Since that last idea maps to a broad category rather than to an actual call number, you’d first want to decide whether you’re going to use the straight LCC schedule labels, the umich mappings, or something else.
If anyone actually made it to the end of this hugely painful essay, let me know? I’m wondering if I’m writing for anyone other than myself when it gets this long and detailed.