So yet another in the long list of cases where actually-existing MARC data makes certain elements not really usable by software, because of encoding it in ambiguous ways. This one is particularly mysterious and frustrating to me, because it’s a fairly recent phenomenon, and I wonder — why the heck, catalogers?
It’s also one where I have no idea where the practice is documented. The difficulty of figuring out from documentation what to expect in MARC, and the fact that the documentation you do find often doesn’t match actually-existing MARC, is yet another problem with writing software to work with MARC.
In this case, I’m talking about the $z subfield in tag 020 (ISBN), or sometimes tag 022 (ISSN).
This is documented as “Canceled/invalid ISBN” (or ISSN). It might be used for a just plain WRONG ISBN: one that appeared on the title page but actually belongs to something else entirely (a typo), or that was errantly entered in a previous MARC record, etc. Just plain wrong ISBN.
But it seems also to be used for an alternate-format ISBN/ISSN. For instance, if the record is for an e-book, the ISBN for the print manifestation might be added in 020$z. Now, I can’t find this practice documented anywhere, but it is common, and I believe someone told me once it is “official” and “standard” — from PCC maybe? I don’t know! Any catalogers able to clear this up?
Matching foreign databases
Why is this a problem? Well, in our modern world, one of the most useful things to do with a library record database is to match it to other databases. In both directions. For instance, for a record in the catalog, I might want to know if a record exists in a foreign (possibly non-library) database — full text or partial text in Google Books, Amazon or OpenLibrary; holdings by other libraries in OCLC; many more. And in the other direction too, someone finds a book in some non-library interface, and I want software to be able to automatically tell them if we have that book in our library catalog.
Now, you can try to ‘match’ on author and title, but that doesn’t work all that well, because the exact same manifestation might be entered slightly differently in the different systems. Maybe the author has just initials, or maybe the author is spelled out. Maybe different punctuation or spacing is used, or a subtitle is present or absent. So you try to ‘normalize’ these things in order to compare them, but this is inherently a difficult and imperfect process; you can spend lots of time trying to get it right and still not be perfect. Inevitably, you’ll end up with some false matches (it looks like a match, but it’s an entirely different work with a similar author/title), as well as false negatives (it looks like there’s no match, but there was; the author/title was just entered differently). And trying to minimize false matches generally causes your false negatives to go up, and vice versa.
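To make the normalization problem concrete, here’s a minimal sketch of the kind of crude match key you end up building. The function name and the example strings are my own illustration, not anything from a real system; notice how two entries for the same person still produce different keys:

```python
import re
import unicodedata

def normalize(s):
    """Collapse an author or title string to a crude match key:
    strip accents, lowercase, drop punctuation, squeeze whitespace."""
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = re.sub(r"[^a-z0-9 ]", " ", s.lower())
    return " ".join(s.split())

# Even after normalizing, variant entries for the same author diverge:
normalize("Smith, John Q.")  # -> "smith john q"
normalize("Smith, J.Q.")     # -> "smith j q"  (no match!)
```

This is exactly the false-negative case: the same person, entered two ways, and no amount of generic cleanup reconciles initials with spelled-out names without also inviting false positives.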
So, what works a lot better is some type of identifier. And we have some types of identifiers available — not fancy new URIs, but identifiers nonetheless. The ones we typically have available are: LCCN, OCLCnum, ISBN, ISSN. Of those, LCCN and OCLCnum are less likely to be found in non-library databases, but ISBN and ISSN are quite popular, and if the work is new enough to have one at all, third-party databases usually include ’em too. So this makes ISBN and ISSN, especially ISBN, awfully useful for this kind of matching.
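Even identifier matching needs a little normalization, since the same book may carry a 10-digit ISBN in one database and a 13-digit one in another. A sketch of the standard conversion (prefix 978, recompute the EAN-13 check digit with alternating 1/3 weights), with hyphens stripped first:

```python
def isbn10_to_13(isbn10):
    """Convert a 10-digit ISBN to its 13-digit form: strip hyphens,
    prefix '978', drop the old check digit, recompute EAN-13 check."""
    core = "978" + isbn10.replace("-", "")[:9]
    total = sum((1 if i % 2 == 0 else 3) * int(d)
                for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

isbn10_to_13("0-306-40615-2")  # -> "9780306406157"
```

Reducing both sides of a match to 13-digit form like this is what makes ISBN a usable join key across databases — which is exactly why the $z ambiguity below hurts so much.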
So what’s the problem with the $z cancelled/invalid? Well, it’s that there’s no way to tell if a $z is just completely wrong, applying to some completely different work, or if it’s an intentionally added alternate-manifestation ISBN. And the software needs to treat those two things entirely differently when matching between databases. If I knew it was just completely wrong, I would not want to use it to find a match in Google/Amazon/etc. If I knew it was an alternate-manifestation ISBN, I might want to try to use it to find matches on Google/Amazon. If I in fact knew the current record was for an ebook (and even that alone is hard to get out of the MARC), that the $z was an ISBN for an alternate manifestation, and that the alternate manifestation was print, I might actually prefer to use the $z to find a match in Google/Amazon/etc. Ebook ISBNs are weird: I’m probably not going to find a match on one, but finding a match on the print book corresponding to this ebook is pretty good.
Likewise for incoming matches — some ISBN “X” is found in some external database, and I want to know if we have the book with that ISBN “X” in our local catalog. If we have a record in our local catalog with X in a 020$z, but X represents a completely wrong ISBN there, then the answer is ‘no’. But if it represents an alternate manifestation, the answer might be ‘yes’, or ‘yes, in an alternate manifestation’, or (if the data were there) even ‘yes, but we only have it as an e-book even though you asked for a print ISBN’.
But I have no idea which category a given 020$z falls into. So the two choices are to treat them all as “just wrong” (missing lots of matches, in either direction), or to treat them all as “alternate manifestation” (getting some false positives when the $z really was a completely wrong ISBN).
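The forced choice looks like this in code. This is a hypothetical sketch, not real MARC parsing: the record is a simplified dict keyed by subfield, the function name and policy labels are my own, and the point is just that the ambiguity forces a single global policy with a known failure mode either way:

```python
def candidate_isbns(record, treat_z_as):
    """Gather ISBNs to match against an external database.
    `record` is a simplified stand-in for a MARC record:
    {"020a": [valid ISBNs], "020z": [canceled/invalid ISBNs]}.
    Since we can't tell per-$z which kind it is, we must pick
    one blanket policy for all of them."""
    isbns = list(record.get("020a", []))
    if treat_z_as == "alternate_manifestation":
        # risk: false positives when the $z was just plain wrong
        isbns += record.get("020z", [])
    # treat_z_as == "just_wrong": skip $z; risk: missed matches
    return isbns

rec = {"020a": ["9781234567897"], "020z": ["9780306406157"]}
candidate_isbns(rec, "just_wrong")               # -> ["9781234567897"]
candidate_isbns(rec, "alternate_manifestation")  # -> both ISBNs
```

If the data distinguished the two cases (say, with a separate subfield or indicator for alternate-manifestation ISBNs), that `treat_z_as` switch could be a per-ISBN decision instead of a blanket gamble.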
This is frustrating, because someone spent the time to record this data, and this data would be very valuable to us and to our users if software were able to take advantage of it. Ebooks make things much more complicated, and having the print ISBN there on an ebook record is phenomenally valuable for being able to tell users if we have a certain book they found elsewhere, or for being able to link from a book in our catalog out to other services (Google Books, Amazon, etc.). That is, it would be phenomenally valuable if it had been recorded in an unambiguous way. Instead, by recording it ambiguously, expensive cataloger time has been spent recording something that would be valuable, but is, if not useless, of significantly diminished value.
The cataloger-hour-to-value-gained ratio is hugely decreased. Those worried about administrators deciding that cataloging is too expensive for what it gives us ought to care a lot about the cataloger-hour-to-value ratio. And in very many ways, that ratio is diminished by the manner in which we store things in MARC. You might worry that paying attention to that ratio would mean turning cataloging into even more of a sweatshop than it already is. But in fact, storing our data clearly instead of ambiguously (and this ISBN thing is just one example) can yield gains well beyond trying to make catalogers crank things out quicker while still maintaining quality — which is probably impossible anyway, and unkind to catalogers too.
Why, why why?
So this is a fairly recent phenomenon, this putting of alternate-manifestation ISBNs in the 020$z. We can’t blame it on legacy data. We just started doing this! And apparently it was actually officially promulgated by someone (who?!?) as a standard recommended practice — although it’s not mentioned in the OCLC or LC documentation for MARC 020, which is a huge problem in itself if you expect anyone to figure out how to deal with the data.
WHY did they do this? They should have known better: this takes data that was painstakingly entered by expensive human catalogers and makes it impossible for computers to use. If I knew who promulgated this standard, I’d guess they had nobody on their committee who was a software developer or otherwise familiar with the needs of software. And I’d email them to say “This was a big mistake; can it be changed? What is the process for proposing it be changed?” But I have no idea how to figure out who is behind it or how to change it, which is yet another problem with our complicated environment. To understand our actually-existing MARC data (even just current records, not even talking about legacy ones), you’ve actually got to understand MARC, AACR2, LC guidance, OCLC rules, PCC rules, rules from other bodies I don’t know about, things that are just ‘what everyone does’ even though it’s not written down anywhere, and maybe ISBD too — all of which are documented in different places, or in some cases no place at all.
Maybe it was done because there was no other place in MARC to put it, and hey, a print ISBN is technically “invalid” for an ebook, so it kind of fits, right? Except: is your concern fitting into a standard, or actually being useful? If there was no place in MARC for it, the correct thing to do would have been to get MARBI to add one — ideally one that lets you record the nature of the alternate manifestation too, so you could actually say, machine-readably, that this ISBN belongs to a print manifestation. If it’s so much work to get MARBI to add one that you decide, eh, just stick it in the $z — well, that’s yet another way in which MARC is holding us back and making it expensive or impossible to use our data effectively.