normalize your LCCNs

A public service reminder.

You need to normalize an LCCN before using it as a general purpose identifier.  Otherwise there are multiple strings that can represent the same LCCN.
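Here is a minimal sketch in Python of what that normalization looks like, assuming the steps LC publishes for LCCNs used in info:lccn URIs: strip blanks, drop a forward slash and anything after it, then remove the hyphen and zero-pad the digits after it to six places. The function name `normalize_lccn` is made up for illustration, and this is not a validated implementation.

```python
import re


def normalize_lccn(raw: str) -> str:
    """Return a normalized LCCN string (sketch of LC's published steps)."""
    # Step 1: remove all blanks.
    value = raw.replace(" ", "")

    # Step 2: if there is a forward slash, drop it and everything after it.
    value = value.split("/", 1)[0]

    # Step 3: if there is a hyphen, remove it and left-pad the part that
    # followed it with zeros to six digits.
    if "-" in value:
        left, right = value.split("-", 1)
        if not re.fullmatch(r"[0-9]{1,6}", right):
            raise ValueError("unexpected LCCN serial part: %r" % right)
        value = left + right.zfill(6)

    return value


if __name__ == "__main__":
    for form in ("n50-33448", "n 50033448", " 85000002 ", "85-2 "):
        print(repr(form), "->", normalize_lccn(form))
```

The point is that several different input strings collapse to one output string, which is what makes the result usable as a match key.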

OCLC Identities does not seem to do this normalization, which will seriously inhibit matching from external citations via its API etc. (Here is what the normalized form of the same LCCN should be in OCLC. 404).

I didn’t realize that LCCNs were like this for a while, and am glad I discovered it before I had written TOO much bad code. So take heed!

Anyone know who to point this out to at OCLC Identities?


24 thoughts on “normalize your LCCNs”

  1. LC doesn’t do it where? In their bib records, fine. But when you get a bib record, you’ve got to normalize it before you do anything ‘linked data-y’ with it.

    If LC is doing linked data-y things with LCCNs without normalizing, it’s a bug and they should be told so.

  2. Well, minor correction:

    The ‘-’ character has been normalized to ‘0’, but there is still a space between the ‘n’ and the numerals.

  3. The fact that LC’s LCCN string doesn’t match OCLC’s for this record is evidence of why normalization is important.

    If I started from the LC record for this item, and then tried to match it in OCLC Identities by automated means — I’d fail.

    If I started from the LC record, normalized it first, and then queried Identities, I’d still fail. But if Identities normalized the LCCN themselves, then I could start with the LC record, normalize it, query Identities, and succeed, even if LC themselves haven’t normalized their MARC records.

    Because normalization is essentially a one-way process, it’s important for the TARGET of any queries or linked data links. And targets can benefit from normalization even if sources haven’t normalized yet, because the software in the middle can normalize LCCNs from sources before querying targets.

    Thanks to LC for documenting a reliable standard way to normalize alternate forms of LCCNs, to make it possible for us to use LCCNs as match points and identifiers effectively EVEN IF LC themselves haven’t normalized their entire database.

  4. A few points about LCCNs and linked data at LC:

    1. The service has some information about lccn normalization in the service FAQ:

    2. The first official linked data projects will be revolving around If you want to inquire further about potential LCCN usage in the effort, keep an eye on that page and for contact information. I believe the Network Development and MARC Standards Office (NDMSO/NetDev) is handling development, if that helps.

    Thanks for the detailed analysis, Jonathan.

  5. I just finished a custom solr filter to normalize lccns, so they get normalized at both index and query time; happy to share the code if you’d like it (although I’m almost certainly not integrating the resulting .jar file into solr and solrmarc in the most optimal way).

  6. Great suggestion! Here’s what I can do.

    I have zillions of URI’s out in the world using the pretty version of the LCCN in them. I don’t want to invalidate them. But, I can use a normalized version of the LCCN in my index and normalize the query that comes in as a result of the URI. This means that the URI’s that I’ve been handing out and Jonathan’s preferred URI will both work. If my code works right, I’ll even return a Content-Location header for Jonathan’s URI pointing to my prettier version.

    I can probably have that running on orlabs in a couple of days and put it into production in a couple of weeks.

  7. Sorry, other things have gotten in the way.

    I decided to do it a little differently than described.

    I really like to browse indexes and the prettier the terms are in the index, the nicer they are to browse. So, I like the well-formatted LCCN’s we’re using in Identities.

    But, it was easy enough to put that normalizer into Identities. When you send me a normalized LCCN, I renormalize it to the pretty form and look that up in the index.

    The example URL you provided in your message ( works now.

  8. Cool, that sounds like a fine solution too. If it really works. I’m a bit suspicious of your ability to reliably transform a normalized LCCN into a “pretty” one. The whole reason for normalization in the first place is because there can be _multiple_ “pretty” (only a librarian could consider those prettier!) forms that normalize to the same thing, that in fact represent the same LCCN. Right? I mean, that’s why normalization is an issue in the first place. And that suggests to me that if you’re still storing (one of many possible) pretty forms in the database, you are still going to run into false negatives on lookup.

    So maybe it doesn’t sound like a fine solution to me in the end after all, hm.

    I personally disagree with you that the non-normalized LCCNs are either “prettier” or “well-formatted” (either “more” or “at all”). I think they represent mistakes in LC practice for a pre-web world, and the normalized ones are actually the only CORRECT identifiers for an LCCN. And when you’re taking a query and matching your database to see if you have a match, you’re using LCCNs as identifiers.

  9. I don’t believe there are multiple pretty forms that map to the normalized form. At least I’ve not run into that problem so far :-)

    There’s a lot of semantics embedded in that LCCN that I think would be a shame to lose. Those initial alphas tell you what kind of object is being controlled, the next 2 or 4 digits tell you the year it was created and the remaining digits are just the sequence number for the year.

  10. I’ll try to find some examples of multiple ‘pretty’ forms. Or get someone at LC to confirm? If that’s not the case, then I don’t understand why they implemented normalization in the first place? Nor do I understand the example from LC’s own catalog that didn’t map to your ‘pretty’ form.

    Okay, here’s the example from the documentation at:

    “Of course sometimes two (or more) apparently different LCCNs are really the same — for example ” 85000002 ” and “85-2 “.

    To me, that suggests that either 85000002 or 85-2 can be found in a MARC record. But can you reliably map 85000002 to the “pretty” form 85-2 and guarantee no false negative?

  11. Ah, part of my confusion may have been that while Identities accepts _bib_ OCLCnums, and maps to an Identity, it doesn’t actually accept bib LCCNs, only Authority LCCNs. Perhaps Authority LCCNs have less variation and more predictability.

    It would be useful if Identities accepted bib LCCNs like it does bib OCLCnums! Many of our catalog records have an LCCN but not an OCLCnum in them, and there’s not necessarily any easy way for the catalog to reveal the Authority LCCN(s) associated with the record to my software, if that info is even in my catalog at all.

  12. I was about to apologize for the multi-post, but, hey, I can post as many times as I want on my own blog!

    Okay, I definitely don’t entirely understand LCCNs or normalization. But tell me what you make of this. This is literally just the first random example I tried, so it’s not hard to find.

    Goldman, William 1931-

    Appears in Identities as:

    Appears in LC authority records at as LCCN:
    “n 50033448”

    Both “n50-33448” and “n 50033448” normalize to: “n50033448” resolves, obviously. does not resolve does not resolve does not resolve

    SRU search for: “n50-33448”: resolves:

    SRU search for: “n 50033448”: does not resolve:

    SRU search for: “n50033448”: does not resolve

    It would seem like everything isn’t quite there yet. And that perhaps there can indeed be more than one form of “non-normalized” LCCN in the wild. In this case, both “n50-33448” and “n 50033448”. Although the latter is identical to the normalized form except that it contains a space. It would seem that LC’s own catalog already contains “closer” to normalized data than Identities. If I can’t find a match in Identities by searching on the exact form found in the LC authorities catalog, and I can’t find a match by searching on the normalized version of the form found in the LC authorities catalog…. something’s not right, right?

    It seems to me that the entire point of normalization is to allow two string forms that represent the same LCCN, without being the exact same string, to be reliably matched, without having to understand the details of the history of LCCN. If you have some understanding of the details that allows you to fulfill that with reliability too (I personally have no idea how to map a “normalized” LCCN in the reverse direction to a non-normalized form found in the wild), then I guess that will suffice. But it doesn’t seem to be there yet?

    If you want to offer browse based on the non-normalized form found in WorldCat (I’m still not convinced that this is necessarily the ONLY non-normalized form of a given LCCN), perhaps you need two indexes? One for browse, based on the exact strings in WorldCat, and one for lookup based on normalized forms?

  13. thanks lawless. VERY frustrating. You want to try and email Ralph and pursue it?

    If the site can’t accept incoming links with lccn in normalized form, there’s no way to reliably link to it.

  14. I sent a message to OCLC DevNet – haven’t heard a response.

    This Python snippet seems to be doing the job of getting the ‘non-normalized’ LCCN for the name authorities I’m working with. Obviously not ideal.

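    right = lccn[-6:].lstrip('0')  # serial number without its zero padding
    left = lccn[:-6]               # prefix and year
    id = "%s-%s" % (left, right)   # e.g. "n50033448" becomes "n50-33448"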

  15. Yeah, the reason we need normalization is because the same LCCN can actually appear in several forms ‘in the wild’, with regard to spacing and prefix/suffix/punctuation apparatus. If OCLC wants to use a form different than the published LC normalized form, if they can publish an algorithm by which you can take an arbitrary LCCN from ‘the wild’ and normalize it instead to the form OCLC is using… well, it would at least make it possible to reliably link from real world data to OCLC services, although certainly more complex/confusing than ideal, to have two different normalized forms in use.

    If our major industry players like OCLC and LC can’t manage to get this fairly simple issue down… it does give us an idea of some of the barriers in the way of an effective ‘linked data’ environment for library data. Just about the simplest and easiest pre-requisite for any kind of ‘linked data’ is the ability to take an identifier from one source of data and use it to look up information from another source of data. There are trickier things that need to be done to create a valuable eco-system beyond that, but if you can’t even get that straight….
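The approach suggested in the comments above (normalize at both index time and query time, while keeping the stored display form for browsing) can be sketched roughly as follows. This is only an illustration under assumptions: `LccnIndex` and `normalize_lccn` are hypothetical names, the normalizer follows the steps LC publishes for info:lccn URIs, and nothing here reflects how Identities or any Solr filter is actually implemented.

```python
import re


def normalize_lccn(raw: str) -> str:
    """Normalize an LCCN: strip blanks, cut at '/', zero-pad after '-'."""
    value = raw.replace(" ", "").split("/", 1)[0]
    if "-" in value:
        left, right = value.split("-", 1)
        if not re.fullmatch(r"[0-9]{1,6}", right):
            raise ValueError("unexpected LCCN serial part: %r" % right)
        value = left + right.zfill(6)
    return value


class LccnIndex:
    """Toy index: lookups use normalized keys, browse uses stored forms."""

    def __init__(self):
        # normalized LCCN -> display form exactly as it was stored
        self._by_normalized = {}

    def add(self, display_form: str) -> None:
        # Normalize at index time.
        self._by_normalized[normalize_lccn(display_form)] = display_form

    def lookup(self, query: str):
        # Normalize at query time, so any variant of the LCCN matches.
        return self._by_normalized.get(normalize_lccn(query))

    def browse(self):
        # Browsing still shows the stored display forms.
        return sorted(self._by_normalized.values())


if __name__ == "__main__":
    index = LccnIndex()
    index.add("n50-33448")
    for query in ("n50-33448", "n 50033448", "n50033448"):
        print(repr(query), "->", index.lookup(query))
```

With this arrangement it doesn't matter which form a source happens to hold, since every incoming string goes through the same one-way normalization before it is compared.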
