normalize your LCCNs

A public service reminder.

You need to normalize an LCCN before using it as a general purpose identifier.  Otherwise there are multiple strings that can represent the same LCCN.
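For the curious, here is a minimal Python sketch of the normalization steps as I understand them from LC's documentation for info:lccn URIs: remove all blanks, drop a forward slash and anything after it, then remove the hyphen and left-pad the digits that followed it with zeros to six places. The function name normalize_lccn is my own invention, and this is an illustration under those assumptions, not a reference implementation.

```python
import re


def normalize_lccn(raw: str) -> str:
    """Normalize an LCCN string (hypothetical helper, see assumptions above)."""
    # Step 1: remove all blanks.
    lccn = raw.replace(" ", "")
    # Step 2: drop a forward slash and everything to its right.
    lccn = lccn.split("/", 1)[0]
    # Step 3: remove the hyphen and zero-pad the serial part to six digits.
    if "-" in lccn:
        left, right = lccn.split("-", 1)
        if not re.fullmatch(r"[0-9]{1,6}", right):
            raise ValueError(f"unexpected serial part in LCCN: {raw!r}")
        lccn = left + right.zfill(6)
    return lccn


# Different strings for the same LCCN collapse to one normalized form.
for raw in [" 85000002 ", "85-2 ", "n50-33448", "n 50033448"]:
    print(repr(raw), "->", normalize_lccn(raw))
```

The point is only that several different input strings end up as one comparable string.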

OCLC Identities does not seem to do this normalization, which will seriously inhibit matching from external citations via its API, etc. (Here is what the normalized form of the same LCCN should be in OCLC: a 404.)

I didn’t realize that LCCNs were like this for a while, and am glad I discovered it before I had written TOO much bad code. So take heed!

Anyone know who to point this out to at OCLC Identities?


24 Responses to normalize your LCCNs

  1. Simon Spero says:

    To be fair, LC doesn’t do this either…

  2. jrochkind says:

    LC doesn’t do it where? In their bib records, fine. But when you get a bib record, you’ve got to normalize it before you do anything ‘linked data-y’ with it.

    If LC is doing linked data-y things with LCCNs without normalizing, it’s a bug and they should be told so.

  3. Mike G. says:

    Well, minor correction:

    The ‘-’ character has been normalized to ‘0’, but there is still a space between the ‘n’ and the numerals.

  4. jrochkind says:

    The fact that LC’s LCCN string doesn’t match OCLC’s for this record is evidence of why normalization is important.

    If I started from the LC record for this item, and then tried to match it in OCLC Identities by automated means — I’d fail.

    If I started from the LC record, normalized it first, and then queried Identities, I’d still fail. But if Identities normalized the LCCN themselves, then I could start with the LC record, normalize it, query Identities, and succeed, even if LC themselves haven’t normalized their MARC records.

    Because normalization is essentially a one-way process, it’s important for the TARGET of any queries or linked data links. And targets can benefit from normalization even if sources haven’t normalized yet, because the software in the middle can normalize LCCNs from sources before querying targets.
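A rough sketch of that "software in the middle" step, in Python: take whatever LCCN string the source record holds, normalize it, and only then build the link to the target. The helper names normalize_lccn and target_url are hypothetical, the normalizer is a compact version with no input validation, and the URL pattern is the id.loc.gov one quoted elsewhere in this thread.

```python
def normalize_lccn(raw: str) -> str:
    """Compact normalizer (hypothetical helper, no validation of the input)."""
    # Remove blanks, then drop a slash and anything after it.
    lccn = raw.replace(" ", "").split("/", 1)[0]
    # Remove the hyphen and zero-pad the serial part to six digits.
    if "-" in lccn:
        left, right = lccn.split("-", 1)
        lccn = left + right.zfill(6)
    return lccn


def target_url(raw_lccn: str, template: str) -> str:
    """Normalize the source LCCN before it ever reaches the target."""
    return template % normalize_lccn(raw_lccn)


# URL pattern taken from an id.loc.gov link quoted in this thread.
ID_LOC_GOV = "http://id.loc.gov/authorities/names/%s.html"

print(target_url("n2002-112180", ID_LOC_GOV))
print(target_url("n 2002112180", ID_LOC_GOV))
```

Both source forms produce the same link, which only helps if the target accepts the normalized form.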

    Thanks to LC for documenting a reliable standard way to normalize alternate forms of LCCNs, to make it possible for us to use LCCNs as match points and identifiers effectively EVEN IF LC themselves haven’t normalized their entire database.

  5. Mike G. says:

    A few points about LCCNs and linked data at LC:

    1. The lccn.loc.gov service has some information about lccn normalization in the service FAQ: http://lccn.loc.gov/#n10

    2. The first official linked data projects will revolve around http://id.loc.gov/. If you want to inquire further about potential LCCN usage in the id.loc.gov effort, keep an eye on that page for contact information. I believe the Network Development and MARC Standards Office (NDMSO/NetDev) is handling development, if that helps.

    Thanks for the detailed analysis, Jonathan.

  6. Bill Dueber says:

    I just finished a custom solr filter to normalize lccns, so they get normalized at both index and query time; happy to share the code if you’d like it (although I’m almost certainly not integrating the resulting .jar file into solr and solrmarc in the most optimal way).

  7. Ralph LeVan says:

    Great suggestion! Here’s what I can do.

    I have zillions of URI’s out in the world using the pretty version of the LCCN in them. I don’t want to invalidate them. But, I can use a normalized version of the LCCN in my index and normalize the query that comes in as a result of the URI. This means that the URI’s that I’ve been handing out and Jonathan’s preferred URI will both work. If my code works right, I’ll even return a Content-Location header for Jonathan’s URI pointing to my prettier version.

    I can probably have that running on orlabs in a couple of days and put it into production in a couple of weeks.

  8. Pingback: A plea: use Solr to normalize your data » Robot Librarian

  9. Ralph LeVan says:

    Sorry, other things have gotten in the way.

    I decided to do it a little differently than described.

    I really like to browse indexes and the prettier the terms are in the index, the nicer they are to browse. So, I like the well-formatted LCCN’s we’re using in Identities.

    But, it was easy enough to put that normalizer into Identities. When you send me a normalized LCCN, I renormalize it to the pretty form and look that up in the index.

    The example URL you provided in your message (http://orlabs.oclc.org/identities/lccn-n81017073) works now.

  10. jrochkind says:

    Cool, that sounds like a fine solution too. If it really works. I’m a bit suspicious of your ability to reliably transform a normalized LCCN into a “pretty” one. The whole reason for normalization in the first place is because there can be _multiple_ “pretty” (only a librarian could consider those prettier!) forms that normalize to the same thing, that in fact represent the same LCCN. Right? I mean, that’s why normalization is an issue in the first place. And that suggests to me that if you’re still storing (one of many possible) pretty forms in the database, you are still going to run into false negatives on lookup.

    So maybe it doesn’t sound like a fine solution to me in the end after all, hm.

    I personally disagree with you that the non-normalized LCCNs are either “prettier” or “well-formatted” (either “more” or “at all”). I think they represent mistakes in LC practice for a pre-web world, and the normalized ones are actually the only CORRECT identifiers for an LCCN. And when you’re taking a query and matching your database to see if you have a match, you’re using LCCNs as identifiers.

  11. Ralph LeVan says:

    I don’t believe there are multiple pretty forms that map to the normalized form. At least I’ve not run into that problem so far :-)

    There’s a lot of semantics embedded in that LCCN that I think would be a shame to lose. Those initial alphas tell you what kind of object is being controlled, the next 2 or 4 digits tell you the year it was created and the remaining digits are just the sequence number for the year.

  12. jrochkind says:

    I’ll try to find some examples of multiple ‘pretty’ forms. Or get someone at LC to confirm? If that’s not the case, then I don’t understand why they implemented normalization in the first place? Nor do I understand the example from LC’s own catalog that didn’t map to your ‘pretty’ form.

    Okay, here’s the example from the documentation at:

    http://www.loc.gov/standards/uri/info.html

    “Of course sometimes two (or more) apparently different LCCNs are really the same — for example ” 85000002 ” and “85-2 “.

    To me, that suggests that both 85000002 and 85-2 can be found in a MARC record. But can you reliably map 85000002 to the “pretty” form 85-2 and guarantee no false negatives?

  13. jrochkind says:

    Ah, part of my confusion may have been that while Identities accepts _bib_ OCLCnums, and maps to an Identity, it doesn’t actually accept bib LCCNs, only Authority LCCNs. Perhaps Authority LCCNs have less variation and more predictability.

    It would be useful if Identities accepted bib LCCNs like it does bib OCLCnums! Many of our catalog records have an LCCN but not an OCLCnum in them, and there’s not necessarily any easy way for the catalog to reveal the Authority LCCN(s) associated with the record to my software, if that info is even in my catalog at all.

  14. jrochkind says:

    I was about to apologize for the multi-post, but, hey, I can post as many times as I want on my own blog!

    Okay, I definitely don’t entirely understand LCCNs or normalization. But tell me what you make of this. This is literally just the first random example I tried, not hard to find.

    Goldman, William 1931-

    Appears in Identities as:
    http://worldcat.org/identities/lccn-n50-33448

    Appears in LC authority records at authorities.loc.gov as LCCN:
    “n 50033448”

    Both “n50-33448” and “n 50033448” normalize to: “n50033448”

    http://worldcat.org/identities/lccn-n50-33448 resolves, obviously.
    http://worldcat.org/identities/lccn-n-50033448 does not resolve
    http://worldcat.org/identities/lccn-n%2050033448 does not resolve
    http://worldcat.org/identities/lccn-n50033448 does not resolve

    SRU search for: “n50-33448”: resolves:

    http://worldcat.org/identities/search/Identities?query=local.LCCN+exact+%22n50-33448%22&version=1.1&operation=searchRetrieve&recordSchema=info%3Asrw%2Fschema%2F1%2FIdentities

    SRU search for: “n 50033448”: does not resolve:

    http://worldcat.org/identities/search/Identities?query=local.LCCN+exact+%22n+50033448%22&version=1.1&operation=searchRetrieve&recordSchema=info%3Asrw%2Fschema%2F1%2FIdentities

    SRU search for: “n50033448”: does not resolve

    http://worldcat.org/identities/search/Identities?query=local.LCCN+exact+%22n50033448%22&version=1.1&operation=searchRetrieve&recordSchema=info%3Asrw%2Fschema%2F1%2FIdentities

    It would seem like everything isn’t quite there yet. And that perhaps there can indeed be more than one form of “non-normalized” LCCN in the wild. In this case, both “n50-33448” and “n 50033448”, although the latter is identical to the normalized form except that it contains a space. It would seem that LC’s own catalog already contains data “closer” to normalized than Identities does. If I can’t find a match in Identities by searching on the exact form found in the LC authorities catalog, and I can’t find a match by searching on the normalized version of the form found in the LC authorities catalog… something’s not right, right?

    It seems to me that the entire point of normalization is to allow two strings that represent the same LCCN, without being the exact same string, to be reliably matched, without having to understand the details of the history of LCCNs. If you have some understanding of the details that allows you to fulfill that reliably too (I personally have no idea how to map a “normalized” LCCN in the reverse direction to a non-normalized form found in the wild), then I guess that will suffice too. But it doesn’t seem to be there yet?

    If you want to offer browse based on the non-normalized form found in WorldCat (I’m still not convinced that this is necessarily the ONLY non-normalized form of a given LCCN), perhaps you need two indexes? One for browse, based on the exact strings in WorldCat, and one for lookup based on normalized forms?
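The two-index idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the class LccnIndex and the helper normalize_lccn are made-up names, the normalizer is a compact version with no validation, and a real service would use its search engine's indexes rather than a dict and a list.

```python
def normalize_lccn(raw: str) -> str:
    """Compact normalizer (hypothetical helper, no validation of the input)."""
    lccn = raw.replace(" ", "").split("/", 1)[0]
    if "-" in lccn:
        left, right = lccn.split("-", 1)
        lccn = left + right.zfill(6)
    return lccn


class LccnIndex:
    """Keeps stored display forms for browse, normalized forms for lookup."""

    def __init__(self):
        self._by_normalized = {}  # lookup index: normalized LCCN -> record
        self._display_forms = []  # browse index: strings exactly as stored

    def add(self, display_lccn, record):
        # Normalize at index time for lookup; keep the original for browse.
        self._by_normalized[normalize_lccn(display_lccn)] = record
        self._display_forms.append(display_lccn)

    def lookup(self, any_form):
        # Normalize at query time too, so every incoming form can match.
        return self._by_normalized.get(normalize_lccn(any_form))

    def browse(self):
        return sorted(self._display_forms)


index = LccnIndex()
index.add("n50-33448", "Goldman, William 1931-")
for form in ["n50-33448", "n 50033448", "n50033448"]:
    print(form, "->", index.lookup(form))
```

With this arrangement the stored form never has to be reconstructed from the normalized one, so there is no reverse mapping to get wrong.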

  15. Ralph LeVan says:

    I’m sorry for the confusion! That code is NOT in production yet, so all those worldcat.org links will definitely fail. But, if you try them on the Research server, you’ll see that they do work.

    http://orlabs.oclc.org/identities/lccn-n50-33448
    http://orlabs.oclc.org/identities/lccn-n-50033448
    http://orlabs.oclc.org/identities/lccn-n%2050033448
    http://orlabs.oclc.org/identities/lccn-n50033448

    Ralph

  16. Pingback: OCLC still not normalizing their LCCNs » Robot Librarian

  17. lawlesst says:

    Two plus years later, Ralph’s code doesn’t seem to have made it to production and the orlabs links now redirect to the main identities page. So his work seems to be lost.

    I have an LCCN in the local authority file which is stored normalized to n2002112180.

    But the Identities link remains:
    http://www.worldcat.org/identities/lccn-n2002-112180

    id.loc.gov has:
    http://id.loc.gov/authorities/names/n2002112180.html

  18. jrochkind says:

    thanks lawless. VERY frustrating. You want to try and email Ralph and pursue it?

    If the site can’t accept incoming links with lccn in normalized form, there’s no way to reliably link to it.

  19. jrochkind says:

    Verified that direct link to Identities using the normalized form does not work:

    http://www.worldcat.org/identities/lccn-n2002112180

  20. lawlesst says:

    I sent a message to OCLC DevNet – haven’t heard a response.

    This Python snippet seems to be doing the job of getting the ‘non-normalized’ LCCN for the name authorities I’m working with. Obviously not ideal.

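    # lccn holds a normalized LCCN, e.g. "n2002112180"
    right = lccn[-6:].lstrip('0')  # serial number without its zero padding
    left = lccn[:-6]               # alphabetic prefix plus year
    id = "http://www.worldcat.org/identities/lccn-%s-%s" % (left, right)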
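One way to sanity-check a reverse mapping like the snippet above is to confirm that normalizing its output gives back the input. The sketch below wraps the same slicing logic in a hypothetical function, identities_form, alongside a compact normalize_lccn helper (also a made-up name, no validation). It assumes a six-digit serial at the end of the normalized LCCN, which holds for the name authority examples in this thread but is not guaranteed in general.

```python
def normalize_lccn(raw: str) -> str:
    """Compact normalizer (hypothetical helper, no validation of the input)."""
    lccn = raw.replace(" ", "").split("/", 1)[0]
    if "-" in lccn:
        left, right = lccn.split("-", 1)
        lccn = left + right.zfill(6)
    return lccn


def identities_form(normalized: str) -> str:
    """Rebuild the hyphenated form, assuming a six-digit serial at the end."""
    left = normalized[:-6]
    right = normalized[-6:].lstrip("0")
    return "%s-%s" % (left, right)


# Round trip: the rebuilt form must normalize back to what we started with.
for normalized in ["n2002112180", "n50033448"]:
    pretty = identities_form(normalized)
    assert normalize_lccn(pretty) == normalized
    print(normalized, "->", pretty)
```

A passing round trip only shows the two functions agree with each other; it does not show that the hyphenated form is the one Identities actually stores for every record.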

  21. jrochkind says:

    Yeah, the reason we need normalization is because the same LCCN can actually appear in several forms ‘in the wild’, with regard to spacing and prefix/suffix/punctuation apparatus. If OCLC wants to use a form different than the published LC normalized form, if they can publish an algorithm by which you can take an arbitrary LCCN from ‘the wild’ and normalize it instead to the form OCLC is using… well, it would at least make it possible to reliably link from real world data to OCLC services, although certainly more complex/confusing than ideal, to have two different normalized forms in use.

    If our major industry players like OCLC and LC can’t manage to get this fairly simple issue down… it does give us an idea of some of the barriers in the way of an effective ‘linked data’ environment for library data. Just about the simplest and easiest pre-requisite for any kind of ‘linked data’ is the ability to take an identifier from one source of data and use it to look up information from another source of data. There are trickier things that need to be done to create a valuable eco-system beyond that, but if you can’t even get that straight….

  22. lawlesst says:

    This is back on OCLC’s radar. See Ralph’s 1/10/12 post to the DevNet mailing list. It’s working on development (again). Thanks for bringing attention to this.

    From my original example:
    http://orlabs.oclc.org/identities/lccn-n2002112180

  23. Ralph LeVan says:

    Turned back on for orlabs and about to go into production. All those orlabs links work now.
