Why MARC makes computers cry: Exhibit #273: ISSN/ISBN $z

So here’s yet another entry in the long list of cases where actually-existing MARC data encodes an element so ambiguously that software can’t really use it. This one is particularly mysterious and frustrating to me, because it’s a fairly recent phenomenon, and I wonder: why the heck, catalogers?

It’s also one where I have no idea where the practice is documented. The difficulty of figuring out from documentation what to expect in MARC, and the fact that the documentation you do find often doesn’t match actually-existing MARC, is yet another problem with writing software to work with MARC.

In this case, I’m talking about the $z subfield in tag 020 (ISBN), or sometimes tag 022 (ISSN).

This is documented as “Canceled/invalid ISBN” (or ISSN). It might be used for a just plain WRONG ISBN: one that appeared on the title page but actually belongs to something else entirely (a typo), or one that was errantly entered in a previous MARC record, etc. Just plain wrong.

But it also seems to be used for an alternate-format ISBN/ISSN. For instance, if the record is for an e-book, the ISBN for the print manifestation might be added in 020$z. Now, I can’t find this practice documented anywhere, but it is common, and I believe someone once told me it is “official” and “standard”; from PCC, maybe? I don’t know! Any catalogers able to clear this up?

Matching foreign databases

Why is this a problem?  Well, in our modern world, one of the most useful things to do with a library record database is to match it to other databases. In both directions.  For instance, for a record in the catalog, I might want to know if a record exists in a foreign (possibly non-library) database — full text or partial text in Google Books, Amazon or OpenLibrary;  holdings by other libraries in OCLC; many more.  And in the other direction too, someone finds a book in some non-library interface, and I want software to be able to automatically tell them if we have that book in our library catalog.

Now, you can try to ‘match’ on author and title, but that doesn’t work all that well, because the exact same manifestation might be entered slightly differently in different systems. Maybe the author has just initials, or maybe the author is spelled out. Maybe different punctuation or spacing is used, or a subtitle is present or absent. So you try to ‘normalize’ these things in order to compare them, but this is an inherently difficult and imperfect process; you can spend lots of time trying to get it right and still not get it perfect. Inevitably, you’ll end up with some false matches (you think it’s a match, but it’s an entirely different work with a similar author/title) as well as false negatives (you think there’s no match, but there was; the author/title was just entered differently). And trying to minimize false matches generally causes your false negatives to go up, and vice versa.
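
To make the difficulty concrete, here’s a minimal sketch of the kind of normalization you end up writing. It’s plain Python, and the particular rules are just illustrative assumptions, not any standard:

```python
import re
import unicodedata

def normalize(text):
    """Crude author/title normalization: lowercase, strip accents
    and punctuation, collapse runs of whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Punctuation and casing differences wash out, as hoped:
assert normalize("Moby-Dick: or, The Whale") == normalize("Moby Dick, or the Whale")
# But initials vs. spelled-out names still fail to match (a false negative):
assert normalize("Melville, Herman") != normalize("Melville, H.")
```

And if you normalize more aggressively to close that gap (say, truncating names to initials), you start colliding distinct authors instead: the false-match/false-negative trade-off in miniature.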

So, what works a lot better is some type of identifier. And we have some identifiers available: not fancy new URIs, but identifiers nonetheless. The ones we typically have are LCCN, OCLC number, ISBN, and ISSN. Of those, LCCN and OCLC number are less likely to be found in non-library databases, but ISBN and ISSN are quite popular, and if the work is new enough to have one at all, third-party databases usually include them too. So this makes ISBN and ISSN, especially ISBN, awfully useful for this kind of matching.
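
As a sketch of what that looks like in practice, here’s how you might pull those identifiers out of records with the Python pymarc library. The filename is hypothetical, and the cleanup rules for 035 and 020 are assumptions about typical data, not guarantees:

```python
from pymarc import MARCReader  # third-party: pip install pymarc

def identifiers(record):
    """Collect the matchable identifiers from one MARC record."""
    ids = {"lccn": [], "oclc": [], "isbn": [], "issn": []}
    for f in record.get_fields("010"):
        ids["lccn"] += f.get_subfields("a")
    for f in record.get_fields("035"):
        # OCLC numbers conventionally carry an (OCoLC) prefix in 035$a.
        ids["oclc"] += [v for v in f.get_subfields("a") if v.startswith("(OCoLC)")]
    for f in record.get_fields("020"):
        # 020$a often has a qualifier after the number, e.g. "0316769487 (pbk.)".
        ids["isbn"] += [v.split()[0] for v in f.get_subfields("a") if v.strip()]
    for f in record.get_fields("022"):
        ids["issn"] += f.get_subfields("a")
    return ids

with open("records.mrc", "rb") as fh:  # hypothetical file of binary MARC records
    for record in MARCReader(fh):
        if record is not None:  # the reader can yield None for unparseable records
            print(identifiers(record))
```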

So what’s the problem with the $z canceled/invalid? Well, it’s that there’s no way to tell whether a $z is just completely wrong, applying to some completely different work, or is an intentionally added alternate-manifestation ISBN. And software needs to treat those two things entirely differently when matching between databases. If I knew it was just completely wrong, I would not want to use it to find a match in Google/Amazon/etc. If I knew it was an alternate-manifestation ISBN, I might want to use it to find matches in Google/Amazon. And if I in fact knew that the current record was for an ebook (and even that is hard to get out of MARC), that the $z was an ISBN for an alternate manifestation, and that the alternate manifestation was print, I might actually prefer to use the $z to find a match in Google/Amazon etc., because ebook ISBNs are weird and I’m probably not going to find a match on one, while finding a match on the print book corresponding to this ebook is pretty good.
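
Since the record won’t tell us which kind of $z we have, the software is forced into a blanket policy. A sketch of what that choice looks like (pymarc again; the z_policy argument and its values are my own hypothetical naming):

```python
def outgoing_isbns(record, z_policy="alternate"):
    """ISBNs to use when querying external services for this record.

    MARC can't tell us whether a given 020$z is a plain-wrong ISBN or
    an alternate-manifestation ISBN, so the caller picks one blanket
    policy: "wrong" (skip every $z, and miss real matches) or
    "alternate" (use every $z, and risk linking to the wrong book).
    """
    isbns = []
    for f in record.get_fields("020"):
        isbns += f.get_subfields("a")
        if z_policy == "alternate":
            isbns += f.get_subfields("z")
    # Strip "(pbk.)"-style qualifiers that often trail the number.
    return [i.split()[0] for i in isbns if i.strip()]
```

Neither setting is right for every record, and that’s exactly the problem.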

Likewise for incoming matches: some ISBN “X” is found in some external database, and I want to know if we have the book with that ISBN in our local catalog. If we have a record in our local catalog with X in a 020$z, and it represents a completely wrong ISBN there, then the answer is ‘no’. But if it represents an alternate manifestation, the answer might be ‘yes’, or ‘yes, in an alternate manifestation’, or even, if the data were there, ‘yes, but we only have it as an e-book, even though you asked about a print ISBN’.
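
For the incoming direction, imagine a lookup table built ahead of time from every 020$a and 020$z in the catalog. The best software can honestly answer for a $z hit is a shrug (the index structure here is an assumption for illustration):

```python
def local_match(isbn, catalog_index):
    """Answer "do we hold the book with this ISBN?" for an ISBN seen in
    an external database. catalog_index is assumed to map
    ISBN -> list of (record_id, subfield_code) pairs."""
    answers = []
    for record_id, code in catalog_index.get(isbn, []):
        if code == "a":
            answers.append((record_id, "yes"))
        else:
            # Found only in a $z. If it's a plain-wrong ISBN the honest
            # answer is "no"; if it's an alternate manifestation it's
            # "yes, in another format". The record can't tell us which.
            answers.append((record_id, "maybe: alternate format, or just a wrong ISBN"))
    return answers

# e.g. with an index built elsewhere:
print(local_match("0316769487", {"0316769487": [("rec123", "z")]}))
```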

But I have no idea which category a given 020$z falls into. So the two choices are to treat them all as “just wrong” (missing lots of matches, in either direction) or to treat them all as “alternate manifestation” (getting some false positives when the $z really was a completely wrong ISBN).

This is frustrating, because someone spent the time to record this data, and this data would be very valuable to us and to our users if software were able to take advantage of it. Ebooks make things much more complicated, and having the print ISBN right there on an ebook record is phenomenally valuable for telling users whether we have a certain book they found elsewhere, or for linking from a book in our catalog to services elsewhere (Google Books, Amazon, etc). That is, it would be phenomenally valuable if it had been recorded in an unambiguous way. Instead, by recording it ambiguously, expensive cataloger time has been spent recording something that would have been valuable, but is really, well, if not useless, of significantly diminished value.

The cataloger-hour-to-value ratio is hugely decreased. Those worried about administrators deciding that cataloging is too expensive for what it gives us ought to care a lot about the cataloger-hour-to-value ratio. And in very many ways, that ratio is diminished by the manner in which we store things in MARC. You might worry that paying attention to that ratio would mean turning cataloging into even more of a sweatshop than it already is, but in fact storing our data clearly instead of ambiguously (and this ISBN thing is just one example) can yield gains well beyond trying to make catalogers crank things out quicker while somehow maintaining quality (which is probably impossible anyway, and unkind to catalogers too).


Why, why, why?

So this is a fairly recent phenomenon, this putting of alternate-manifestation ISBNs in the 020$z. We can’t blame it on legacy data; we just started doing this! And apparently it was actually officially promulgated by someone (who?!?) as a standard recommended practice, although it’s not mentioned in the OCLC or LC documentation for MARC 020, which is a huge problem in itself if you expect anyone to figure out how to deal with the data.

WHY did they do this? They should have known better: this is taking data that was painstakingly entered by expensive human catalogers and making it impossible for computers to use. If I knew who promulgated this standard, I’d think they had nobody on their committee that was a software developer or otherwise familiar with the needs of software. And I’d email them to say “This was a big mistake, can it be changed, what is the process for proposing it be changed?” But I have no idea how to figure out who is behind it or how to change it, which is yet another problem with our complicated environment: to understand our actually-existing MARC data (even just current records, never mind legacy ones), you’ve got to understand MARC, AACR2, LC guidance, OCLC rules, PCC rules, rules from other bodies I don’t know about, things that are just ‘what everyone does’ even though they’re not written down anywhere, and maybe ISBD too; all of which are documented in different places, or in some cases no place at all.

Maybe it was done because there was no other place in MARC to put it, and hey, a print ISBN is technically “invalid” for an ebook, so it kind of fits, right? Except: is your concern fitting into a standard, or actually being useful? If there was no place in MARC for it, the correct thing to do would have been to get MARBI to add one, ideally one that also let you record the nature of the alternate manifestation: actually record, machine-readably, that this ISBN belongs to a print manifestation. If it’s so much work to get MARBI to add one that you decide, eh, just stick it in the $z, well, that’s yet another way in which MARC is holding us back and making it expensive or impossible to use our data effectively.
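
For what it’s worth, MARC does already have a linking field that can carry an alternate-format ISBN unambiguously: the 776 (Additional Physical Form Entry), where $z is defined as the related manifestation’s ISBN and $i can name the relationship. A sketch of reading it, again with pymarc:

```python
def alternate_format_isbns(record):
    """ISBNs of related manifestations from 776 linking fields, where
    $z is explicitly the other manifestation's ISBN and $i says what
    the relationship is (e.g. "Print version:")."""
    out = []
    for f in record.get_fields("776"):
        relation = " ".join(f.get_subfields("i")) or "related manifestation"
        for isbn in f.get_subfields("z"):
            if isbn.strip():
                out.append((isbn.split()[0], relation))
    return out
```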

16 thoughts on “Why MARC makes computers cry: Exhibit #273: ISSN/ISBN $z”

  1. This is yet another result of the Provider-Neutral record for e-books. The only part I can find quickly right now is: http://www.loc.gov/catdir/pcc/bibco/PN-Guide.pdf. I suspect there’s a more up-to-date version somewhere, but I can’t find it right now. But this provides the documentation for using 020|z for alternate formats. I think you’re right about no one on the committee being familiar with the needs of software. It’s a great idea, but pretty lousy execution.

  2. If I knew who promulgated this standard, I’d think they had nobody on their committee that was a software developer or otherwise familiar with the needs of software.

    Bingo.

  3. Okay, it’s some LC report. So who do I talk to to give them feedback that they got it wrong, and suggest they change it?

    It is cheapening the value of catalogers to have them spend time entering data that cannot be used by software.

  4. Your general concern is well placed. However, I’d reconsider not matching on wrong ISBNs. They are most likely in the record because they were printed wrong on the item. Someone getting their info from the item would use that as the ISBN in their description and database. Unless they did a bit of research or tried to validate the ISBN, they would have no idea that it was incorrect.

  5. Ah, but should that apply to OUTGOING matches too, David?

    If I have a record in my catalog that has a 020$z of X (and it was there as an incorrect/invalid ISBN, not an alternate format, say because it was printed wrong on the title page, okay), and then I query Amazon for X, and Amazon says oh yeah, we’ve got X, and I tell the user, hey, you can look at a preview of this item on Amazon, sending them to the Amazon page for X… aren’t I probably going to be sending the user to a different item?

    I understand what you mean for user-entered queries where the user manually enters an ISBN, and I give them a result list. But when I’m trying to match records from one database to another behind the scenes…. yeah, it’s too bad someone got an ISBN wrong on a title page, but it’s WRONG, it applies to an entirely different book, and I’m going to be giving the user bad links if I use it to link, no?

    I will say that I got used to these $z’s with LCCNs, where an ‘incorrect/invalid’ LCCN (in 010$z, the field that holds LCCNs) usually is Just Plain Wrong, and matching on it will give you the wrong item.

  6. 776$z is a legal place for that too, as far as I know. But apparently standard practice now, as a result of that LC report, is to put it in 020$z. And I’m certainly seeing that in my real-world records, in spades.

  7. But yeah, Laura, the fact that we can’t even be SURE where to expect this sort of element, and how to interpret the various elements we find… it does not speak well of MARC as it actually exists.

  8. Makes me weep as well. However, I’m not sure the LC report is the source of the problem (although it is probably responsible for popularising the practice as regards e-books). The problem seems to derive from the definition of valid/invalid in the MARC documentation:

    “ISBN may also be considered to be application invalid if it is not directly applicable to the bibliographic item represented by a particular record. Application invalidity is usually related to the cataloging treatment employed by a particular agency in terms of the number of records involved. For example, if there is a record for a multivolume set as well as separate records for each of the volumes in the set, the ISBN for the set is considered application invalid on the records for the volumes. Only the ISBN applicable to the entity represented by a particular record is considered valid on that record”

    I realise this isn’t particularly helpful in terms of the problem – just recording here for info really (and perhaps to partially absolve the authors of the ‘provider neutral’ report of some responsibility). It does suggest that one route (possibly the only one?) to changing the practice would be via a MARC21 change proposal (http://www.loc.gov/marc/chgform.html).

  9. Actually, it is not LC you need to contact but PCC. Although hosted on LC’s server space, it is PCC’s report. Contact the PCC Operating Committee, as the report was put forth by one of their task forces.

  10. Of course, this was all available and individually coded in UKMARC (field 021), which makes those of us who were used to using it really cry.

  11. The problem isn’t MARC per se (although MARC is out of date and clunky); the real problem is that most of the people using MARC don’t understand how machines use the information, or even why they should care how machines use it.

  12. Marian: That was my impression like 12 years ago, when I first started engaging with this stuff. Since writing this post in 2011 (7 years ago), I have mostly withdrawn from community discussions about MARC and bibliographic data formats. It’s a shame if that’s still true.

    For 10-15 years people have been saying libraries have to develop a different relationship to technology if they want to survive… I wonder how we know when the ‘too late’ deadline has passed….
