Of Identifiers, matching, OCLCnums, and Umlaut

So Lorcan Dempsey had some examples of discovery tools that include references to third party services. I thought it would be fun to use the same example he used and see what the Umlaut does with it.

http://catalog.library.jhu.edu/ipac20/ipac.jsp?index=BIB&term=724872

Now, our catalog record for this item doesn’t have an ISBN in it (the title is probably just a bit too old to have an ISBN). It also doesn’t have an OCLCnum in it (it was probably cataloged here before we used OCLC. The cataloging cognoscenti can perhaps see evidence of a regional ‘bibliographic utility’ in the MARC, but I couldn’t tell you).

It does have an LCCN.

So looking through what Umlaut could do with this, I saw that while Amazon might have a record for this title, Umlaut’s Amazon matching is based wholly on ISBN, so it wasn’t going to find a match. But, I noticed, it is finding a match in Worldcat and Worldcat Identities, based on OCLCnum. But wait! I didn’t have an OCLCnum! Where did that come from? I know my code for Worldcat doesn’t use LCCNs (not sure if they’re supported for Identities, and they are only supported for the Worldcat API if you can guarantee your end user is affiliated with your institution, which Umlaut can’t do).

So what’s going on? I looked into it. Turns out Umlaut is consulting OpenLibrary too. It’s really consulting OpenLibrary for possible cover images, but as long as it’s stopping at OpenLibrary, the OpenLibrary plug-in figures it might as well enhance the metadata with any useful info it gets from OpenLibrary too, storing it so it’s available to other plug-ins. And the OpenLibrary record for this LCCN happened to give me an OCLCnum. Neat! This is really just kind of serendipitous: it happened to consult OpenLibrary before it did its WorldCat thing (Umlaut generally tries to do as much as possible in parallel, for speed of response to the end user, so it can sometimes be unpredictable which comes first).
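To make the order-dependence concrete, here is a minimal Python sketch of plug-ins sharing per-request metadata. It is not Umlaut’s actual code: the `Request` class, the plug-in functions, the lookup table, and the placeholder identifier values are all hypothetical, and the plug-ins run one after another here rather than in parallel so that the effect of ordering is easy to see.

```python
# All names and values below are hypothetical; this is not Umlaut's actual code.

# Stand-in for an OpenLibrary lookup keyed by LCCN (placeholder values only).
FAKE_OPEN_LIBRARY = {"lccn-example": {"oclcnum": "oclcnum-example"}}


class Request:
    """Per-request state that every plug-in can read and enrich."""

    def __init__(self, **identifiers):
        self.identifiers = dict(identifiers)  # e.g. {"lccn": "..."}
        self.responses = []                   # services found so far


def open_library_plugin(request):
    """Fetch a cover, and store any extra identifiers for later plug-ins."""
    record = FAKE_OPEN_LIBRARY.get(request.identifiers.get("lccn"))
    if record is None:
        return
    request.responses.append("cover image (OpenLibrary)")
    for key, value in record.items():
        # setdefault: never overwrite an identifier the request already had.
        request.identifiers.setdefault(key, value)


def worldcat_plugin(request):
    """Can only link to WorldCat if some plug-in has supplied an OCLCnum."""
    oclcnum = request.identifiers.get("oclcnum")
    if oclcnum is not None:
        request.responses.append(f"WorldCat link for OCLCnum {oclcnum}")


def run(plugins_in_order):
    """Run the plug-ins against a fresh request that only knows an LCCN."""
    request = Request(lccn="lccn-example")
    for plugin in plugins_in_order:
        plugin(request)
    return request.responses


if __name__ == "__main__":
    print(run([open_library_plugin, worldcat_plugin]))  # both responses
    print(run([worldcat_plugin, open_library_plugin]))  # cover image only
```

When the WorldCat plug-in happens to run first, it finds no OCLCnum and contributes nothing, which is exactly the unpredictability described above.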

Future Enhancements

Another source of this same data is the OCLC xID services, which would also let Umlaut take an LCCN and figure out what OCLCnum or ISBN or ISSN might also apply to that title, and vice versa. Umlaut doesn’t currently use xID, but should.

The real power of xID isn’t just in cross-referencing identifiers for a single bibliographic record or manifestation, but also in identifying alternate editions/manifestations, and the identifiers that go with them. Umlaut really needs to make use of that info too, but there are some tricky architectural decisions I need to make with regard to how Umlaut is going to store that info for maximum reusability. Umlaut wants to not only know about all the other editions and their manifestations, but ideally have some human-readable description of what that other edition is (i.e., like a MARC 260, or like Amazon’s “paperback Viking Press 1957” style of edition statement. I’ll take whatever I can get.)

This, done well, would start to get at another point of Lorcan’s, the importance of these interfaces distinguishing editions for people. I’d like Umlaut to be able to say “Here is a digital copy of (or more metadata about) the exact edition you asked for. Over HERE is a digital copy of another edition, the Random House hardcover 1962 edition, or something.”

That gets tricky, because when trying to combine all these disparate bibliographic databases, you don’t always have enough information to really be sure what you’re doing. But it’s actually not as hard as you might think. Fortunately, the very useful identifiers OCLCnum, LCCN, and ISBN, all more or less correspond to a particular edition or manifestation. (There might be multiple ISBNs for the same ‘edition’, if you consider hardcover and paperback the same edition. But there should never be one ISBN for multiple editions).
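As a rough Python sketch of what matching on these identifiers can look like: two records from different databases are treated as describing the same manifestation if they share any one ISBN, LCCN, or OCLCnum. The record layout (plain dicts), the function names, and the placeholder values are assumptions for illustration. The normalization is deliberately crude: it does not convert between ISBN-10 and ISBN-13, and it does not apply the full LCCN normalization rules.

```python
# Hypothetical record layout: a dict whose identifier fields hold either a
# single string or a list of strings. Not taken from any real system.
IDENTIFIER_KEYS = ("isbn", "lccn", "oclcnum")


def normalize(value):
    """Drop punctuation and whitespace and uppercase, so trivial formatting
    differences (hyphens in an ISBN, a lowercase 'x') don't block a match."""
    return "".join(ch for ch in str(value) if ch.isalnum()).upper()


def identifiers(record):
    """Return the record's identifiers as a set of (type, value) pairs."""
    found = set()
    for key in IDENTIFIER_KEYS:
        values = record.get(key, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            found.add((key, normalize(value)))
    return found


def same_manifestation(record_a, record_b):
    """True if the two records share at least one identifier of the same type."""
    return bool(identifiers(record_a) & identifiers(record_b))


if __name__ == "__main__":
    ours = {"lccn": "ex-123"}                          # no ISBN, no OCLCnum
    theirs = {"lccn": "EX123", "oclcnum": "999"}       # same LCCN, plus more
    other = {"isbn": ["0-00-000000-0"]}                # nothing in common
    print(same_manifestation(ours, theirs))
    print(same_manifestation(ours, other))
```

Keeping the identifier type in each pair matters: a number that happens to appear as an OCLCnum in one record and as an LCCN in another should not count as a match.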

OCLC number as identifier

We deal with a lot of confused and incomplete metadata, recorded according to a variety of standards, and often in error according to those standards. I think this example demonstrates the extreme usefulness of what has become to me a sort of holy trinity of identifiers: ISBN, LCCN, and OCLCnum.

I recently had a discussion with Karen Coyle (over in the comments section here), where she was negative toward the idea of using an OCLCnum as an identifier.

If you consider the OCLC number the primary identifier for bibliographic records, then OCLC owns our identification system, as Jonathan points out, which is very frightening.

But let us remember that OCLC numbers, while useful, are not identifiers for bibliographic data, only for that bibliographic data that is in OCLC’s database. For some of us that is all of our records, but for many it is not. I think it is important to be clear that the OCLC number identifies the OCLC record; and while that can be handy for many services, it is not a generalized bibliographic record identifier, but specific to that one database.

Depending on these numbers, however, means continuing dependence on OCLC and it means continuing to see OCLC as the source of all things bibliographic.

Well, yes, and no. Let’s take that apart.

Yes, an OCLC number, by intention, identifies a particular record, not a particular edition/manifestation. Incidentally, so does an LCCN. However, in both cases, the record identified is supposed to have a more or less 1-to-1 relationship with an edition. There shouldn’t be more than one OCLC record for the same manifestation/edition (when there are, they combine them to fix it), and there shouldn’t be more than one LC record for the same manifestation/edition either (and I believe there seldom is, because of LC’s workflow). It is of course possible for a record for a given manifestation/edition not to exist in OCLC or LC, but in general most do.

So, generally, one edition/manifestation should have zero or one OCLC record with one OCLCnum, and zero or one LC record with one LC num, and generally the OCLC record is in fact the same record as the LC record, since LC contributes to OCLC.

So this makes both OCLC numbers and LCCNs in fact effective as edition/manifestation identifiers. Along with ISBN. They work very well for this. Sure, not perfectly. Sure, we could use a more rational standard not necessarily affiliated with any business interest, one that allows collective shared generation of the identifier by people not affiliated with OCLC or LC (an important point, to be sure). Sure, it’s possible that OCLC and LC and ISBN will disagree on where to draw edition/manifestation boundaries (although except for the previously mentioned multiple ISBNs for one “library” manifestation, which causes no real problem, I doubt this is much of a problem in practice). Sure, it would be nice if the important work of establishing manifestation identifiers were integrated into cataloger workflow, instead of being a lucky epiphenomenon.

But in the meantime, we’ve got what we’ve got, and it actually works pretty well.

(But again, I’m not saying that the OCLC number is necessarily the primary manifestation identifier, only that the trinity of OCLCnum, LCCN, and ISBN has been serving me pretty well, as external databases incorporate them. It’s nice that GBS allows matching on all three, thanks GBS. OCLC probably does have the broadest range of any of those, especially when you include pre-ISBN materials. We could use something better than this ad hoc trinity, but what we’ve got is surprisingly decent.)

Legal terrain of OCLC number usage

So should we be concerned, as Karen says, about such a useful identifier being tied to a particular business with its own interests? Well, here’s the interesting legal thing. The groundbreaking U.S. legal decision Bender v. West (departing from the earlier West v. Mead decision, as a result of the really groundbreaking Feist v. Rural) said that West Publishing could not in fact keep other publishers from citing page numbers in West’s books. (For why this mattered for the legal publishing business, follow those links). West had no copyright control over its page numbers; anyone who wanted to could cite those page numbers.

An OCLC number is an awful lot like a page number. They’re just assigned sequentially to every record that comes into WorldCat, in order.  There’s no creativity involved, they’re just a sequential list, just like a page number.

I’m no lawyer, but I’m pretty sure this means that, in U.S. law at least, nobody needs OCLC’s permission to ‘cite’ OCLC numbers in their own cataloging records.  You can just do it. OCLC can’t actually exert much control over it.

So, sure, as Karen points out, many libraries who are not OCLC members currently have corpuses of records that don’t include OCLC numbers.  But there’s nothing legally stopping them from adding them. You’d want to add it to your record, for clarity of our data, in a way that made clear the record it was attached to was not in fact the record originally identified by that OCLC number. Instead, it was a record describing the same manifestation as the OCLC record identified by that OCLC number, and the OCLC number is being included as a useful identifier for matching manifestations in different databases. And I don’t think there’s any reason you’d need any kind of a relationship with or permission from OCLC to do this (although I’m still not a lawyer at the end of this paragraph either).

Now, you’d have to get the OCLC identifiers from somewhere, and figure out what OCLC numbers go with what records of yours. You could certainly have humans look up records in various libraries’ catalogs, noting OCLC numbers, and adding them to your own database. Probably not very feasible, although LibraryThing showed you can get geeky volunteers to do a surprisingly large amount of bibliographic control for you, when something’s in it for them (like showing off their book collection).

But you can also get those correspondences, for some but not all of our collective corpus, from the Google Book Search API, from the OpenLibrary API, from the HathiTrust/Mirlyn API, from Z39.50 to lots of catalogs, and probably from an ever-increasing number of machine-accessible services.
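As one illustration, here is a hedged Python sketch of asking OpenLibrary which OCLC numbers go with an LCCN. It assumes the OpenLibrary Books API endpoint (`/api/books` with `bibkeys`, `format=json`, and `jscmd=data`) and assumes the response is keyed by bibkey with an `identifiers` object containing an `oclc` list; check both against the current OpenLibrary documentation before relying on them. The function names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the OpenLibrary Books API.
BOOKS_API = "https://openlibrary.org/api/books"


def books_api_url(lccn):
    """Build a Books API request URL for a single LCCN."""
    query = urlencode({"bibkeys": f"LCCN:{lccn}", "format": "json", "jscmd": "data"})
    return f"{BOOKS_API}?{query}"


def fetch(lccn):
    """Perform the HTTP request and decode the JSON body (needs network access)."""
    with urlopen(books_api_url(lccn)) as response:
        return json.load(response)


def oclc_numbers(api_response, lccn):
    """Pull OCLC numbers out of a decoded response; empty list if none found.

    Assumes the response shape {"LCCN:<lccn>": {"identifiers": {"oclc": [...]}}}.
    """
    record = api_response.get(f"LCCN:{lccn}", {})
    return list(record.get("identifiers", {}).get("oclc", []))
```

A caller would do something like `oclc_numbers(fetch(lccn), lccn)` and, per the point above about clarity, store the result as a matching identifier rather than as the record’s own control number.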

When OCLC folks read this, I hope that the lesson they take from this is not that they’ve got to try to start cracking down on services that allow mapping from ISBN or LCCN to OCLCnum — if they even have any legal way to do that.

Instead, the lesson is that our collective library efforts facilitated through OCLC have resulted in the OCLC number being an enormously useful quasi-identifier, and this is good. But it’s good only if we can use it, and if OCLC really did somehow succeed in cracking down on that, it would in fact ruin its usefulness for OCLC members, and for libraries and library patrons in general, rather than somehow protecting its value for OCLC members.


4 Responses to Of Identifiers, matching, OCLCnums, and Umlaut

  1. Thom Hickey says:

    Nice post. It’s pretty easy to find ISBNs (and OCLC Numbers for that matter) associated with boxed sets and other collections, so you have to be careful about associating them with an ‘edition’. And a single edition might have several manifestations. We generally equate the OCLC# with a manifestation.

    –Th

  2. jrochkind says:

    Yeah, I was using edition and manifestation interchangeably, which maybe was sloppy. But yeah, I consider a boxed set to in fact be a ‘manifestation’, if one that itself may contain other manifestations (a relationship that FRBR is still figuring out how to model, I think).

    Do you know if the xID services will give me info on that relationship? “The OCLC number you asked for [a boxed set] _contains_ these other OCLC numbers [records for the individual elements]”. Or vice versa, the one you asked for is contained by this other one. That would be an awfully useful thing for xID to do.

    I’m sure Xiaoming would find it an interesting problem, if not necessarily feasible to actually do. :)

  3. Xiaoming Liu says:

    xID uses Worldcat’s FRBR algorithm, which tries to group the boxed set and each volume as separate groups, such as:

    http://www.worldcat.org/oclc/51603388/editions

    vs.

    http://www.worldcat.org/oclc/48950119/editions

    We also try to apply the same concept in the xISBN service, especially by taking advantage of a cataloging rule about ISBNs in sets/volumes: “If you are cataloging a multivolume monograph, enter both the set number and the individual volume numbers, if available. Enter the number for the set first.” I have found this a very useful clue in handling sets and volumes.

    Regarding the question of “Here is a digital copy (or more metadata about) the exact edition you asked for,” it is something we want to work on but haven’t made good progress on yet. I think it approximates the FRBR expression concept.

  4. jrochkind says:

    Thanks Xiaoming. I guess my question was, when I ask xID about the oclcnum associated with the ‘complete trilogy’, can xID tell me that it’s _related to_ the group for The Two Towers?

    Sort of like I think xISSN can tell me about different ‘groups’ that are related? But maybe I’m wrong about xISSN.

    But it would clearly be of immense value if xID, while keeping the two groups separate, could tell me when I supply an identifier for one of the groups, that there’s this other group that contains the first one (in whole or in part) too. So xID could tell me that the book (group) for “the two towers” represents something that is contained in the book (group) for “the lord of the rings”.

    As far as ‘here is a digital copy of’, it wasn’t something I was thinking xID should do, that’s something client software (like Umlaut) could do, made easier by _using_ data from xID.
