Broad categories from class numbers

jrochkind General April 4, 2011

So, it can be very useful to have some kind of broad categories as facet-style limits to partition your result set. When you get way too many results, and want to narrow down, narrowing to just “Art” or “Biology” or “Nursing” can be pretty useful. Actual LCSH are way too narrow a categorization to be useful this way (whether or not “de-coordinated” into subdivisions): the user may not be ready to narrow down to something as specific as an LCSH subdivision or heading, but may still want to use a broad categorization to explore the result set.

And a piece of traditional cataloging data actually provides some meaning which can be used for such a broad categorization as a limit, even though that’s not exactly how it was intended to be used: the classification, i.e. the “classified call number”, that is, a Dewey call number or a Library of Congress Classification (LCC) call number.

Most of our collection is LCC, although we’ve got a significant minority of Dewey (as well as plenty of records with no classified call number at all, more on that later). Dewey has a better hierarchical set of classes (which I don’t believe is publicly available in its entirety, but here’s the first couple levels, you can pay OCLC for the rest), but LCC still has an okay couple of levels.

So as a first pass, I figured, for items that have an LCC, take the top-level LCC schedule name, and just use those as a set of broad categories. If you have an LC call number, you can just take the first letter of it, and easily determine the LC “schedule” it comes from, and use that as a broad category.
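A minimal sketch of that first pass, in Python, might look like the following. The schedule labels are abbreviated, and the names `LC_SCHEDULES` and `naive_broad_category` are made up for illustration; this is only the naive version, which trusts that whatever it is handed really is a scheduled LCC.

```python
# Top-level LCC schedule letters mapped to (abbreviated) schedule names.
# I, O, W, X and Y are not used as LCC schedule letters.
LC_SCHEDULES = {
    "A": "General Works",
    "B": "Philosophy, Psychology, Religion",
    "C": "Auxiliary Sciences of History",
    "D": "World History",
    "E": "History of the Americas",
    "F": "History of the Americas",
    "G": "Geography, Anthropology, Recreation",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "L": "Education",
    "M": "Music",
    "N": "Fine Arts",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
    "T": "Technology",
    "U": "Military Science",
    "V": "Naval Science",
    "Z": "Bibliography, Library Science",
}


def naive_broad_category(call_number):
    """Return the schedule name for the first character, or None.

    Naive on purpose: it does not check that the string is really an LCC.
    """
    first = call_number.strip()[:1].upper()
    return LC_SCHEDULES.get(first)
```

Called on something like `"QK3 .B592 1817"`, this returns `"Science"`; a call number starting with a digit (Dewey, say) falls through to `None`.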

I figured I’d use both LCC’s from the bib record itself (usually assigned by LC itself, or else by another cataloger and obtained by us from WorldCat), as well as any locally assigned LCC’s, for maximum coverage.

Sounds simple, right? Oh, if only. It turns out actually identifying an LCC from an actual bib record is anything but simple. And if your software takes the first letter from a call number that is NOT LCC Scheduled (I say ‘scheduled’ for reasons that will become apparent later), you’re going to put it in an arbitrary broad category that in fact it wasn’t classified as.

050

MARC 050 is titled “LC Call No.” At first I figured, great, so if something is in an 050, that means I can take the first letter and determine the LCC Schedule.

But it turns out not every “LC Call No.” is actually from the LC Schedules. This field is used for any call number assigned by LC, and not every call number assigned by LC is actually the kind of “Library of Congress Call Number” we think of — some of them are not from the schedules. (Thus I’ve started talking about “Scheduled” or “Classified” LCC to be specific when talking about ones from the schedules — what do actual catalogers call these?).

The OCLC documentation has a few examples of such non-Scheduled LCC’s, including “Microfilm 19072 E”, and “Newspaper”. You have software just taking the first letter of these, and you’ll wind up with M==”Music” and N==”Fine Arts”. Nope.

An additional example not mentioned in the OCLC documentation but found in the wild is 050 of “MLCS 83/5180 (P)”. A local cataloger tells me this is called an “MLC”, and is sometimes assigned by LC as a “LC Call Number” when they don’t want to spend the time to actually create a classified call number. And it turns out these aren’t the extent of it, there are a whole bunch more non-Classified/Scheduled LC Call Numbers that can be in 050. And they have exactly the same indicators and subfields as an actual scheduled LCC, there’s no clear distinction made in the Marc record.

Oops. This is yet another example of Marc showing its age and origin as a format for specifying printed cards. If you’re just printing cards, who cares if the call number is classified or not? But it’s yet another example of catalogers spending expensive expert time creating data that can not be unambiguously extracted from the records they create. There are all sorts of interesting things you might want software to do with a Classified/Scheduled LCC, but it requires being able to identify it as a Classified/Scheduled LCC!

So the Marc alone isn’t going to do that, but I still really want to. At first I considered trying to blacklist certain known starting strings, like “Newspaper”, “Microform”, and “MLCS”, having software know those are not Classified/Scheduled LCCs. But it is still very unclear to me if there is any such list of all these “bad” prefixes, or even if all the non-classified LCC’s can be identified by prefixes.

So instead, we’re going to have to try and use a regular expression to have software positively identify 050’s that do look like a classified LCC. This turns out to be trickier than you might think too. It would be nice if LC itself provided some canonical regular expression they guaranteed would match all legal scheduled LCC’s and be unlikely to match other things, but they don’t, so you’ve got to reverse engineer it hoping you get it right. One or two letters followed by a number? Oops, sometimes it can be three letters. Followed by a number? Sometimes with no space between the letter and number, but sometimes with space.

So, okay, after fighting with this for a while, I asked on the code4lib list, and several people had already experimented to come up with a good one (coming up with several different variations), but I eventually settled on my own local variation of this awfully crazy complicated one from Bill Dueber. (Thanks Bill for posting your work in public, many of us do that not enough).
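To give a flavor of the positive-identification approach, here is a deliberately simplified Python sketch. It is NOT Bill Dueber’s pattern, nor the local variation described above; the pattern and the name `looks_like_classified_lcc` are assumptions for illustration only, and a real pattern has to cope with many more cases.

```python
import re

# Simplified, hypothetical pattern for "looks like a scheduled LCC".
LCC_LIKE = re.compile(
    r"""
    ^\s*                            # allow leading whitespace
    (?P<letters>[A-Z]{1,3})         # one, two, or sometimes three class letters
    \s*                             # the space before the number is optional
    (?P<number>\d{1,4}(?:\.\d+)?)   # class number, with an optional decimal part
    (?!\d)                          # reject runs of five or more digits
    """,
    re.VERBOSE,
)


def looks_like_classified_lcc(call_number):
    """True if the string starts the way a scheduled LCC call number does."""
    return LCC_LIKE.match(call_number) is not None
```

With this sketch, `"QK3 .B592 1817"` and `"KF 4550 .A2"` are accepted, while `"Microfilm 19072 E"`, `"Newspaper"` and `"MLCS 83/5180 (P)"` are rejected. It will still accept some strings that are not scheduled LCCs (anything that happens to start with a few capital letters and a short number), which is part of why the real patterns end up so complicated.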

060

And why do I mention 060? Well, 060 is a National Library of Medicine (NLM) call number, and we’ve got a bunch of records in our database that don’t necessarily have an LCC anywhere on em, but do have an NLM. NLM classification is sort of an ‘extension’ to LCC: its classes (QS through QZ, and W) use letters that LCC leaves unused.

For this initial broad categorization attempt, I figure, okay, anything with an NLM call number at all, just call it “Medicine”, one of the top-level LCC schedules, as a broad-categorization.

I have no idea if 060 suffers from the same issue as 050, where sometimes it contains “NLM Call Numbers” that aren’t actually classified/scheduled. Since for now I’m just calling ANYTHING with an NLM call number “Medicine”, it doesn’t matter too much, but nonetheless planning for the future I run 060’s through the same pattern and only pay any attention to them if they match. (Do NLM call numbers have a pattern that exactly matches scheduled LCCs? I have no idea, but my code is assuming so right now.)
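In code, the 060 rule is about as simple as it sounds. The sketch below assumes the same kind of simplified class-number pattern for NLM as for LCC (exactly the untested assumption mentioned above); `CLASS_LIKE` and `category_for_060` are hypothetical names.

```python
import re

# Simplified, hypothetical "class letters + number" pattern, reused for NLM.
CLASS_LIKE = re.compile(r"^\s*[A-Z]{1,3}\s*\d{1,4}(?:\.\d+)?(?!\d)")


def category_for_060(values):
    """Given the 060 call number strings of one record, return "Medicine"
    if any of them looks like a classified call number, else None."""
    for value in values:
        if CLASS_LIKE.match(value):
            return "Medicine"
    return None
```

So a record with an 060 of `"QZ200 N277m"` or `"W1 AM123"` lands in “Medicine”, and one whose only 060 is a bare word does not.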

090 and 096

Okay, 050 and 060 are one place to find classified LCC/NLM call numbers. But if you forget the 090 and 096, you’ll end up not assigning some records that could be broadly categorized because of scheduled class numbers in 090/096.

090 is for “Locally Assigned LC-type Call Number”. How is this different than an 050 with second indicator 4 “Assigned by agency other than LC”? I have no idea. 096 is “Locally Assigned NLM-type Call Number”, which again I’m not sure how that’s different from an 060 with 2nd indicator 4 “Assigned by agency other than NLM”. And on top of that, OCLC documentation for 096 says “Call numbers based on National Library of Medicine (NLM) or Library of Congress (LC) classification schedules”. Wait, so a locally assigned LCC might be in 096 instead of 090, or instead of 050 with second indicator 4? You got me.

Aside from it being an interesting ironic case of MARC seeming to make too many distinctions with unclear use when it usually makes too few (like not being able to distinguish between classified and non-classified 050), it doesn’t really matter that I have no idea when a local LCC/NLMC would be in one field instead of the other, for this particular project. All that matters for this particular project is that I better look at all of those fields if I want to minimize the “Not able to categorize” records, and get the categorization out for anything that has a call number at all. Okay then.
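Looking at all four fields could be sketched like this in Python. The record representation (a list of `(tag, subfields)` pairs, with subfields as `(code, value)` pairs) is an assumption made to keep the example self-contained; it is not the API of any particular MARC library, and `call_number_candidates` is a made-up name.

```python
# Fields that may carry a classified LCC or NLM call number.
CALL_NUMBER_TAGS = {"050", "060", "090", "096"}


def call_number_candidates(record):
    """Collect (tag, call number text) for every call number field.

    `record` is assumed to be a list of (tag, subfields) pairs, where
    subfields is a list of (code, value) pairs.
    """
    candidates = []
    for tag, subfields in record:
        if tag not in CALL_NUMBER_TAGS:
            continue
        # Join the $a (class) and $b (item/cutter) parts into one string.
        text = " ".join(value for code, value in subfields if code in ("a", "b"))
        if text:
            candidates.append((tag, text))
    return candidates
```

Every candidate that comes back still has to go through the pattern check before its first letter is trusted.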

But our own locally assigned LCC/NLMC are even weirder than LC’s, in that we sometimes put prefixes in front of the actual Classified LCC. Apparently we’re not alone in this, because Bill Dueber’s code tried to account for it, using his own list of umich’s locally assigned prefixes. But we’ve got different prefixes here. For instance:

090 a| CAGE a| QK3 b| .B592 1817

You just take that as it is and decide that C=”AUXILIARY SCIENCES OF HISTORY” (or, erm, “History”), no that’s not right. But we already learned that we had to put these things through a regex anyway to confirm “possibly a real classified LCC.” (Actually the order I learned these things isn’t nearly as neat as the order I’m recounting them for you, but anyway). But if you put “CAGE QK3 .B592 1817” through the regex, it will just say “not a valid classified LCC”, and this thing might get thrown in the “unknown” bucket instead of the possibility of correctly assigning it to Q=”SCIENCE”.

So, okay, I could try to get a list of all the possible prefixes we use, and skip over them when determining a possibly valid classified LCC. That list might (or might not) exist locally, but not in any easy place for me to get, and at this point (actually near the end of my travails, although only the middle of this blog post), I just wanted to be done with it.

So I noticed that they had handily put “$a CAGE” in a different $a subfield than “$a QK3”. I have no idea if that’s what everyone does, but it seems to be what has been done (mostly? Always? I don’t know) here. So, okay, I just take each $a, and match it against the pattern matcher. “CAGE” is thrown out, then “QK3” is accepted by the pattern, and matches Q=Science. Great.

Until I noticed that repeated $a’s are sometimes used in my local records, for some reason I don’t know, for the ‘cutter’ part of the call number.

096 a| QZ200 a| N277m no.53 1979

Wait, why is the N277m part in an $a instead of a $b? Cause now what I arrived at to deal with the “$a CAGE” is going to first decide that Q=Science (true!), then decide that N=“FINE ARTS” — no!

I think that’s just a cataloging error, and the N277 should be in a $b, but I’m not really sure. (The docs for 096 do say that the $a is repeatable, although I’m not sure why. The docs for 090 say the $a is not repeatable, which doesn’t mean it never repeats in our data).

Nonetheless, at this point, I’m really ready to be done with this and cry ‘good enough for now’, realizing I’m going down a bottomless pit of effort to try and make it ‘perfect’, so I just leave the algorithm as is, that record is indeed incorrectly put into both the Science category (true) and Fine Arts (not true).
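The per-$a approach, warts and all, can be sketched in Python as below. The pattern is a simplified stand-in for the real one, the schedule table is only an excerpt, and `categories_from_subfields` is a hypothetical name; subfields are assumed to arrive as `(code, value)` pairs.

```python
import re

# Simplified, hypothetical pattern; group 1 captures the class letters.
CLASS_LIKE = re.compile(r"^\s*([A-Z]{1,3})\s*\d{1,4}(?:\.\d+)?(?!\d)")

# Excerpt of the top-level schedule table, enough for the examples.
SCHEDULES = {
    "C": "Auxiliary Sciences of History",
    "N": "Fine Arts",
    "Q": "Science",
}


def categories_from_subfields(subfields):
    """Check each $a on its own and map matching ones to a broad category."""
    found = []
    for code, value in subfields:
        if code != "a":
            continue
        match = CLASS_LIKE.match(value)
        if not match:
            continue  # e.g. a local prefix like "CAGE"
        name = SCHEDULES.get(match.group(1)[0])
        if name and name not in found:
            found.append(name)
    return found
```

For the `CAGE` / `QK3` / `.B592 1817` example this yields just `["Science"]`, since “CAGE” fails the pattern. For the `QZ200` / `N277m no.53 1979` example it yields `["Science", "Fine Arts"]`, reproducing the known mis-categorization rather than fixing it.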

Locally assigned call numbers

Now, our actual locally assigned call numbers on our individual items don’t necessarily appear in the MARC at all. Sometimes the locally assigned call number is put in the 090, but sometimes it isn’t, even if it is indeed a classified LC call number. I imagine practice here has changed several times over the many decades we’ve been cataloging.

Even if I get every single valid classified LCC/NLMC out of the record, there’s still a bunch (nearly 1/3rd of our database) with no LCC/NLMC at all, so I don’t want to miss any valid one in the record.

So in addition to looking in the MARC, I want to get the call numbers actually assigned to our individual items, which our ILS does not put in the MARC ordinarily.

The ordinary export from our ILS doesn’t include these, but we had already paid John Craig of alpha-g consulting for a custom exporter that is capable of including local call numbers in the export, so it will be available for our Solr indexing stage for our discovery layer. Great.

Again, there’s the problem of telling the actual LCC/NLMC’s from the chaff. At first (many months ago), I was thinking of the chaff being Dewey that would begin with a number and would not map to an LC schedule anyway, so just ignored it. But no such luck, there’s all sorts of local call numbers that begin with a letter but are not a classified LCC/NLM, from SuDoc classified numbers, to “VIDEO 1234”, to the aforementioned “CAGE[etc] [followed by actual LCC!]”.

So, okay, we throw all these through the regex too, and only use them for broad categorization if they match the pattern. In this case, the “CAGE” and the actual LCC aren’t in separate fields, so at the moment my algorithm is missing out on classifying those, so be it.

I also realized that while SuDoc call numbers couldn’t be put into broad categories using the LCC Schedule, there was useful information there, these items didn’t need to just wind up in the “unknown” category. So everything with a SuDoc, let’s put it into a “Government Publication” category; not exactly a ‘discipline’, but it is a ‘perspective’, which is what ‘discipline’ really gets at, seems more useful than not.

So how the heck do I tell if a local item call number is a SuDoc? Well, I could try to come up with a regex pattern for them, but man, I’m even less familiar with SuDocs and not sure if there’s any “prior art” to copy. So, okay, theoretically my ILS keeps track of “call number type” internally. So, okay, go back and reconfigure the alpha-g custom exporter to include that in the export too. (Possible to do because the alpha-g product is nice and flexible, although adding this additional element in the export does slow down the already slow export).

If the ILS internally says there’s a local item-level call number that’s type “SuDoc”, we throw it into “Government Publication” category, great. And in fact, for consistency, as long as we’re making a “Government Publication” category, if the record has any 086 at all, we’ll also throw it in “Government Publication”, even if it doesn’t have a locally assigned call num of type “SuDoc”.
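That rule boils down to an either/or check, sketched here in Python. The shape of the inputs is assumed for illustration: a collection of the MARC tags present in the bib, and the exported item call numbers as `(type, call number)` pairs. The type label `"SuDoc"` and the name `is_government_publication` are hypothetical; the actual labels depend on the ILS export.

```python
def is_government_publication(marc_tags, item_call_numbers):
    """True if the record has any 086, or any item-level call number
    that the ILS export marks as SuDoc type.

    marc_tags: iterable of tag strings present in the bib record.
    item_call_numbers: list of (call_number_type, call_number) pairs.
    """
    if "086" in set(marc_tags):
        return True
    return any(cn_type == "SuDoc" for cn_type, _ in item_call_numbers)
```

Checking the 086 first means a record still gets the category when the ILS type is missing or wrong on the item.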

That last ends up being handy, because it turns out the ILS tracked “call number” type isn’t all that reliable. There could be a locally assigned SuDoc call number, but it’s not actually marked as “SuDoc type” in our ILS. (After all, this wasn’t used for too much before — except putting the call num in the right ‘call number browse’ list, but apparently nobody noticed the lots of call numbers missing from these lists before).

In fact, originally I had tried to just trust the internal ‘call number type’ to map to LCC schedule too. But then I realized how unreliable it was (both false negative and false positive). And by that time I had already been forced to use the regex pattern for the 050, so I swung back around and used it here too. Phew.

Phew. Oy!

Did anyone actually make it this far? If you’re exhausted, you’re not half as exhausted as I got trying to implement it. It seemed so straightforward at first! And it’s still not perfect, it’s missing out on broad categorizations for some of those “[LOCAL PREFIX] [Classified LCC]” types, and it’s incorrectly categorizing some of those repeated-subfield-a types, and there are probably some other problems I have yet to discover. But I think it’s good enough for now. (Note: The final procedure I describe here hasn’t yet been applied to our live index as I write this, but it should be in place by tomorrow or the next day).

But I think this illustrates a couple really unfortunate things about our current metadata environment I keep harping on, because on some of the listservs I read some catalogers are still resisting recognizing they are unfortunate things at all.

1. Expensive cataloger time used to record things that can’t actually be used by software

A classified call number is actually a REALLY useful thing to have. Not just for printing on a spine and shelving a book, but for all sorts of interesting features. Like this broad categorization, but also for all sorts of other useful things, maybe a hierarchical navigation with multiple levels, maybe expanding the keyword index with additional words from the schedules. (I think someone wrote about the utility of doing that last one way back in the 80s, but I’m way too exhausted to go find a citation. Martha someone? Diane someone?)

And very expensive and very expert cataloger time is spent creating these classified call numbers — but then they’re stored in the record in such a way that the software can’t really figure out if or where they’re in there for sure, I’m reduced to confusing guessing.

I firmly believe that we will continue to need metadata experts (i.e., catalogers) in the foreseeable future. But inefficiently spending time recording info that can not in fact be efficiently retrieved by software is not the way to make the case for that.

2. Documented… where?

To finally get to the bottom of all of this stuff I’ve painfully recounted for you, it’s not like I just looked at some documentation, oh, no problem. I did look at the OCLC documentation, but it was not nearly sufficient. While it hinted at the fact that non-classified LC Call Numbers might appear in the 050, I didn’t realize what it was really saying until I was forced to, going back to figure out why my initial naive approaches weren’t working. To really get to the bottom of it, I had to look at documentation in several different places, talk to several different catalogers (each of whom had to think carefully, as well as consult several different pieces of documentation, some of which are not in publicly available spots), and then talk to several different programmers who had tried to deal with this before too.

It’s not like you can just tell a programmer “Oh, look at the documentation, it’ll explain what the data is so you can use it.” And you can’t even just say “ask a cataloger” either. It’s an incredibly complicated metadata regime we’ve got, where to figure out what’s really going on, you’ve got to consult MARC documentation (both LC and OCLC doesn’t hurt), AACR2, ISBD, LC documentation (some of which might be publicly accessible? But my catalogers just look at it on a paywalled site), and then there’s a whole bunch of stuff that’s still just “standard practice”, not necessarily documented anywhere and not necessarily even clear to catalogers (WHAT sorts of prefixed non-classified call numbers can appear in 050 exactly?).

This LCC/NLMC problem isn’t even close to the worst one that I or other programmers have dealt with. All this makes trying to get the most out of our MARC data very expensive in terms of everyone’s time, the programmers, the catalogers who the programmers ask questions to, etc. Again, very expensive to make use of the data is not the best way to make the case for cataloger staff positions being used efficiently.

Anyhow, some next steps, further enhancements

So, I’ve got the broad categorization on LCC basically working. (Again, as of the time of this writing it’s not deployed live yet, but will be in a day or two). It’s mis-categorizing some things, which I could try to fix (necessarily by getting a list of all local “prefixes” we use before LCC’s. Hmm, unless I can write a regex that ignores any arbitrary prefix while still matching LCC’s and excluding most other things).
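One way the “ignore an arbitrary prefix” idea might be tried, as an untested-against-real-data Python sketch: skip up to a fixed number of leading whitespace-separated tokens and see whether what remains looks like an LCC. The pattern is a simplified stand-in, and `strip_local_prefix` and `max_prefix_tokens` are made-up names.

```python
import re

# Simplified, hypothetical "class letters + number" pattern.
CLASS_LIKE = re.compile(r"^[A-Z]{1,3}\s*\d{1,4}(?:\.\d+)?(?!\d)")


def strip_local_prefix(call_number, max_prefix_tokens=1):
    """Return the call number with any leading prefix tokens dropped,
    or None if nothing LCC-like is found.

    Tries the whole string first, then skips one token at a time,
    up to max_prefix_tokens.
    """
    tokens = call_number.split()
    for skip in range(max_prefix_tokens + 1):
        candidate = " ".join(tokens[skip:])
        if candidate and CLASS_LIKE.match(candidate):
            return candidate
    return None
```

With this, `"CAGE QK3 .B592 1817"` comes back as `"QK3 .B592 1817"`, while `"VIDEO 1234"` and `"MLCS 83/5180 (P)"` come back as `None`. The trade-off is that skipping tokens can also manufacture false positives (a prefix followed by something cutter-like would now match), so it swaps one kind of error for another.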

But in addition to that, I can think of two major enhancements, that would certainly be interesting to spend time on, but I probably won’t until higher priorities are done, and unless we have additional evidence that this kind of broad categorization is a significant benefit to users.

1. More useful categories

The LC top-level schedule names are better as broad categories than nothing, but they’re definitely a bit dated, both in terms of the labels they use (easy to fix), and more importantly in terms of the categories they represent. They’re not quite the categories that well match contemporary research needs.

Umich has very usefully come up with a mapping from LCC to more useful modern categories. That mapping was pretty expensive for them to come up with and maintain, but certainly less expensive than trying to reclassify millions of documents according to some new system (outright infeasible).

So I might in the future want to put all my LCC’s through the umich mapping (which they’ve conveniently made available in XML), to get better categories. Umich’s mapping is customized for their own local academic divisions (schools, departments, programs etc), but it’s probably close enough to any other large research university to be useful. Or really close enough to just about any user group’s needs to be more useful than the straight LCC classification, and a lot easier than creating and maintaining a mapping of your own.

The umich mapping doesn’t include NLM call numbers, but the NLM schedule is a little bit less dated than the LCC one, so the actual labels from the NLM schedule could be used, instead of just calling all NLM’s “Medicine”. (It might or might not be important to also do a bit of normalization between the NLM schedule labels and the health-related labels in the umich mapping).

Also, I could similarly go further with SuDoc’s, using the SuDoc schedule to get more specific than “Government Publication”, perhaps normalizing labels to match the “Government Information” section of the umich mapping.

2. Classify more documents

Using the procedure described above, still around 1/3rd of our collection is left unclassified. I believe this is by and large items which really lack any useable classified LCC, NLM, or SuDoc call number, rather than items with classified call numbers missed by the procedure. Examples include third-party-vendor supplied records for electronic books, which usually lack 050’s in the bibs, and for which we don’t locally assign call numbers either (we don’t put em on a shelf, why would we? But even with this new use, we simply don’t have the resources to.) Or items that are on our shelves with unclassified call numbers like “VIDEO 1234”, and which also have no classified call numbers in 050/060/086/etc.

I can think of a couple interesting ways to try to automatically classify these records into broad categories.

One might be to try using the OCLC Classify service. Although for the best use, that requires supplying an OCLCnum, LCCN, or ISBN, which not all my records have. (And for those that do…. oh boy, see my rant on the 020$z issue. If an 020$z is a print ISBN on an e-record, that would be the best one to use with the classify service, most likely to get a match in WorldCat. But if the 020$z is truly a mis-assigned ISBN, then using it would get the wrong record. And there’s no good way to know which an 020$z is.) Theoretically, author/title could be used instead, with presumably more risk of both false negatives and false positives; it’s still just trying to match a record in Worldcat and give you its classified call numbers.

Or the Classify service allows you to supply a FAST subject heading — but geez people, I don’t have FAST subject headings, why don’t you let me give you an LCSH? Or even multiple LCSH’s? The FAST Classify lookup is probably actually using something similar to another option I thought of that could be done locally:

The other option is one I can do on my local catalog. If I have a record that does not have any classified call number but does have subject headings, then look up in my local catalog records that have as many of those same subject headings as possible (either all, or if none have all, whatever have the most). Then from the result set, see what the most popular Broad Categories are, and figure those broad categories are probably appropriate for THIS record too. Wow, that’s a pretty clever idea, huh? Once you have your records loaded in Solr anyway, as I do, Solr faceting can be used to answer this question. Or you could even use the Solr “more like this” function to do the same thing on records with no subject headings at all, just finding records with similar words as a whole, although I’m not sure how well that would work on library metadata records (as opposed to full text, for which it’s intended). Would be fun to play around with though.
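The voting idea can be sketched without Solr at all, just to show the logic. This Python version works on an in-memory list instead of a Solr facet query, which is an assumption made purely to keep it self-contained; `guess_categories` and the data layout (pairs of subject-heading set and category list) are hypothetical.

```python
from collections import Counter


def guess_categories(subjects, classified_records, top_n=1):
    """Guess broad categories for an unclassified record.

    subjects: subject headings of the record to classify.
    classified_records: list of (subject_headings, categories) pairs for
        records that already have broad categories.
    Only the records sharing the MOST headings with `subjects` get a vote.
    """
    subjects = set(subjects)
    best_overlap = 0
    votes = Counter()
    for rec_subjects, categories in classified_records:
        overlap = len(subjects & set(rec_subjects))
        if overlap == 0 or overlap < best_overlap:
            continue
        if overlap > best_overlap:
            # Found a closer match: discard votes from weaker matches.
            best_overlap = overlap
            votes = Counter()
        votes.update(categories)
    return [category for category, _ in votes.most_common(top_n)]
```

In a real setup the loop would be replaced by a query on the subject headings with a facet on the broad category field, but the decision rule (take the most common categories among the best-matching records) would be the same.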

Since that last idea maps to a Broad Category, not to an actual call number, you’d want to decide whether you were going to use LCC Schedule straight labels, or umich mappings, or something else, first.

Postscript

If anyone actually made it to the end of this hugely painful essay, let me know? I’m wondering if I’m writing for anyone other than myself when it gets this long and detailed.

29 thoughts on “Broad categories from class numbers”

Jerry Persons says:

April 4, 2011 at 6:41 pm

I follow your exploits with great interest. I’m working on projects aimed at how we map from data that served to print masses of catalog cards to a future that truly enables discovery and navigation across all the resources that research universities produce and consume. Publishing the work you do in this realm is extremely valuable.
Laurie Taylor says:

April 4, 2011 at 7:36 pm

This blog is the one I most look forward to reading because you deal with the same problems that have to be fought through. It’s refreshing to read any and all articulate and accurate examinations of both MARC and modern computing. Too many things slide into false oversimplification of one or the other (or abandon the fight with hopes of a new technology to come). For those of us working with making MARC work in modern terms, we want it to be simple, but getting it there is very complicated thanks to basic errors, the weird not-really-a-computer format known as MARC, and all of the odd practice that’s grown to support MARC . I greatly appreciate your blog and especially appreciative of your long and detailed posts on MARC because you often uncover problems I’ve yet to face (and hopefully help for when I’ll encounter them).
Candy Schwartz says:

April 4, 2011 at 9:32 pm

I read it all the way through – laughing (in a “nod, oh yes” kind of way). I too greatly appreciate your blog and your faith in the value of library metadata (not the container that currently holds it).
David says:

April 5, 2011 at 12:04 am

I read to the end, and always enjoy reading your posts. (The previous commenters have adequately summed up my sentiments. :-)

You say “LCC still has an okay couple of levels” – yes and no there. The page you link to offers versions in PDF, Word and Word Perfect. Not great for extracting the full range for easy manipulation.

I’ve been trying to play around with a visualisation where you can drill down by each section of the call number, with the display size weighted by the number of items in that classification in our collection. It worked fine for Dewey – I used the first three numbers to good effect. For LC the two (or three) letters are fairly easy to get the data for but it’s still too broad. I started manually processing the info from the LC site you linked but didn’t get very far before realising that it was simply too much effort. And haven’t been able to get my hands on anything that gives me that info in a (close to) machine manipulable form. :-/
(I have access to the paywalled ClassificationWeb, where the tabularised info should be parsable. But the very limited number of lines shown on a single page would require quite a bit of manual effort just to extract it all…)

BTW I haven’t had issues with getting the call numbers out of the database (so far) as we store ours in the holding (852$h) rather than bib record.
Owen Stephens says:

April 5, 2011 at 4:33 am

Of course I read to the end! How can you doubt it :)

I’d echo your words “Thanks Bill for posting your work in public, many of us do that not enough” and use the same praise for you – doing and sharing this stuff is key – so thank you.
Esther Arens says:

April 5, 2011 at 5:32 am

I too really like your posts (and read this one through to the bitter end). Although I don’t do such things myself I try to just about follow what you’re describing/explaining and learn a lot from it! It might even make my cataloguing a bit better…
Thx
Saskia says:

April 5, 2011 at 6:24 am

Posts like this are extremely important and I hope many catalogers read your blog, because as Esther said we get a glimpse into an area we’re not directly involved with but where what we do and especially how we do it makes a huge impact. We (catalogers and programmers/developers alike!) are in a real pickle and I wonder how to clean up all this mess? Anyway, keep up the great work!
suzi w. says:

April 5, 2011 at 8:10 am

Wow. That was long. As a self-trained cataloger, I had to read half way down and then come to the comments to see if you were a “good guy” or a “bad guy.” Then I read half way up and down. But yes. I did read the whole thing.

And I understood what you were talking about!! I work at a Dewey library, so the 082 is what would be the equivalent for us.

And yes. Catalogers and meta-data folk will be needed. Shout it from the rooftops, we STILL need libraries and librarians in this digital age, and maybe more so, as the information increases.
Alison says:

April 5, 2011 at 8:24 am

I find that a lot of these problems are caused by historical practices (changes in practice over time) rather than what we are currently doing as cataloguers. The best place for call number in our Voyager records seems to be in the 852 of the marc holdings format record which does have an indicator for whether or not the class number is LC, DDC, etc. However, for things like rare books that just have a shelving number, you would then want to look back at the 050 in the bib record to get a subject facet! Tricky, very tricky! After reading this I think I may have to go back and tweak our normalization rules in Primo… Great post!
Melanie says:

April 5, 2011 at 8:30 am

Like the rest, I do read it with great interest. As a cataloger who is buried in MARC, your travails are one of the best examples of why MARC does need a replacement. My biggest frustration? Nobody seems to be offering anything that I’ve heard of. People keep talking, but nobody seems to be doing anything about writing a new code. And it is not something I have the knowledge to do.
Michael Doran says:

April 5, 2011 at 10:03 am

Anything you write that is library-related is automatically on my reading list, Jonathan. I’ve wrestled with some of these same issues. In theory, certain data in the MARC record should be programmatically useful. In practice, we discover the complications that confound such use.
Jason Thomale says:

April 5, 2011 at 11:13 am

I find your posts so valuable *because* of the detail you put into them. To borrow from Michael’s comment, it’s exactly those practical complications confounding programmatic use of library data that are hugely important to know and document. For those of us who aren’t skilled catalogers and don’t know the complex history behind how various cataloging practices developed, the only way to learn is through investigations like yours. What you’re doing by posting it here in some part addresses the (great) points you make in the Documented…where? section. And sure, local practice is going to vary, but at least it gives us a foothold on what we need to watch out for. So…thank you. Please know that you’re providing a valuable service by documenting this stuff.
Julia says:

April 5, 2011 at 11:52 am

As others have said above, I find your posts valuable and always read them the whole way to the end. At MPOW we’re a good bit behind y’all at Johns Hopkins (although I aspire to do a lot of the same kinds of things you are doing), so I find it quite helpful to have your detailed warnings about the pitfalls we’re going to face if/when we tackle some of the things you’ve tackled.

Also, I frequently pass on posts like this to the technical services folks here, because you’re so much more patient and detailed at explaining just WHY MARC is not actually “machine readable” than I am. (I think I’m normally passably good at explaining technical things to non-technical people, but MARC so offends my coder/data-maven sensibilities that it’s a real struggle for me to explain why it’s preventing me from doing the things we want without devolving into “MARC is the root of all evil!” ranting.)
Dorothea says:

April 5, 2011 at 12:16 pm

I read every word you write. Every single one. Don’t stop writing; what you’re doing is crucially important.
Robin Sinn says:

April 5, 2011 at 1:37 pm

I read all of this, Jonathan! While I don’t understand everything (being a reference librarian), it makes me a better catalog and database user. And it lets me know how much you CARE about your work. I think all the ref librarians at JHU should be reading your blog.
Daniel Forsman says:

April 5, 2011 at 4:24 pm

I read the whole post with great interest as usual. The concepts of your posts are applicable even in Sweden.
Jonathan Rochkind says:

April 5, 2011 at 7:09 pm

Wow, I didn’t really mean to be begging for compliments, but thanks anyway, appreciated! So, okay, I know that someone is actually reading when I write this stuff, I’ll keep writing, thanks!
Catherine says:

April 6, 2011 at 9:52 am

Another cataloger here. I look forward to seeing your results on helpfulness to users published (whether here or in a journal), and of course would also be interested in seeing (1) what changes were suggested to local cataloging practice to facilitate the process and (2) whether they were adopted, and why or why not. I think that often catalogers are blamed for having a “stuck in the past” attitude, when the real problem, as I see it, is that we just don’t have R&D money. Not to mention that we have to maintain our 100+ years of legacy data because the cost of redoing it all would simply be astronomical. There are others out there who are, like you, working on these problems and sharing the results, but not enough–and certainly not enough people who are both catalogers and programmers. Thank you for sharing via your blog!
jrochkind says:

April 6, 2011 at 5:18 pm

Procedure described here is now represented in our live catalog.

https://catalyst.library.jhu.edu/
https://catalyst.library.jhu.edu/catalog/facet/discipline_facet
Christina Pikas says:

April 8, 2011 at 9:46 pm

I read the whole thing. God willing, I’ll never have to *know* this stuff, but it’s like reading a mystery novel. I love how you reveal all of the things you tried and the dead ends you encountered. I’ll definitely appreciate the discipline categories more now.
jrochkind says:

April 9, 2011 at 2:20 am

Thanks Christina. I’m still not actually sure the discipline facet works all that well; I’ve still got various ideas for improvement, it's just a question of how important it would be to users even if it were improved, compared to other things, with limited time to do everything. That's the nature of the business we’re in: you can spend a lot of time on something and it sometimes still needs even more to actually be good enough, and sometimes it’s not possible in the actual real world with the real data we’ve got. Other times you can find something clever with huge benefit for less work; those are the best ones.
Pingback: Why programmers hate free text in MARC records « Robot Librarian
Wayne Lam says:

June 9, 2011 at 11:36 pm

Thanks man,
This really helps me a lot when I am working with local MARC data.
This is exactly my problem and it provided a lot of valuable experience.
Paul Hoffman says:

June 17, 2011 at 10:09 pm

(Sorry to be so late chiming in here.)

The problem is, OCLC’s documentation does not describe any particular standard. The Library of Congress has the definitive word on the MARC21 standard:

http://www.loc.gov/marc/bibliographic/

For example, it’s quite clear from LC’s description of the 020 field that subfield $z is only supposed to be used for a cancelled or invalid ISBN. And here’s an excerpt from the documentation for the 050 field:

(begin quote)

050 – Library of Congress Call Number (R)

[…]

FIELD DEFINITION AND SCOPE

Classification or call number that is taken from Library of Congress Classification or LC Classification Additions and Changes. […]

Second indicator values distinguish between content actually assigned by the Library of Congress and content assigned by an organization other than LC.

(end quote)

You write (concerning the ambiguity of 050 field contents):

(begin quote)

Oops. This is yet another example of Marc showing it’s age and origin as a format for specifying printed cards. If you’re just printing cards, who cares if the call number is classified or not?

(end quote)

But IMO the blame lies with OCLC and with anyone putting a non-LCC call number in an 050 field — not with MARC21 (and USMARC before it).
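
A minimal sketch of the distinction discussed above, not anyone's production code: it reads the second indicator of an 050 field to say who assigned the number, and separately applies a very loose check of whether subfield $a even has the shape of an LCC class. The `Field050` class, the `describe_050` function, and the `LCC_SHAPE` pattern are all hypothetical names made up for illustration; the pattern in particular is an assumption and will both miss real LCC numbers and accept some non-LCC ones.

```python
import re
from dataclasses import dataclass, field

# Second indicator values for the 050 field, per LC's MARC 21 documentation.
SOURCE_BY_INDICATOR2 = {
    "0": "assigned by LC",
    "4": "assigned by agency other than LC",
}

# ASSUMPTION: a deliberately loose shape test, 1-3 capital letters then a digit.
# It says nothing about whether the class actually exists in the LCC schedules.
LCC_SHAPE = re.compile(r"^[A-Z]{1,3}\s*\d")


@dataclass
class Field050:
    """Hypothetical stand-in for a parsed 050 field."""
    indicator2: str
    subfields: dict = field(default_factory=dict)


def describe_050(f):
    """Return (who assigned the number, whether $a has an LCC-like shape)."""
    source = SOURCE_BY_INDICATOR2.get(f.indicator2, "unknown source")
    class_part = f.subfields.get("a", "").strip()
    return source, bool(LCC_SHAPE.match(class_part))


print(describe_050(Field050("0", {"a": "QA76.73.R83"})))
print(describe_050(Field050("4", {"a": "Microfilm 1234"})))
```

The two answers are kept separate on purpose: the indicator only records who assigned the number, so a field can be correctly coded and still carry something in $a that is not a classified call number.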
jrochkind says:

June 18, 2011 at 12:59 am

Apparently, it’s PCC, I think, Paul? Which isn’t OCLC. Or LC. We have too many different standards which are all theoretically supposed to go together, but make it hard to predict what ‘standardized’ data will look like. Yeah, ideally I think guidelines/standards by someone like PCC or OCLC should be consistent with the underlying Marc21 standard and not conflict with it, but it doesn’t end up working that way, maybe because those making the other guidelines find MARC (as in the Marc21 standard literally) not flexible enough or not clear enough to meet their needs.
Pingback: Playing with the Open Library | B&log
Pingback: On catalogers, programmers, and user tasks | Gavia Libraria
Pingback: A method to map from query to broad topic, and associated resources | Bibliographic Wilderness