Google Book Search, LCCN and OCLCnumber

Google Book Search officially supports search by ISBN using a field search, either in the HTML interface or in the API (there is an old API and a new one; I mean the new one here).

But in a library context, you sometimes have a book you’re interested in that does not have an ISBN because it predates ISBNs, or one for which you don’t know the ISBN.

It would be useful to search on LCCN or OCLCnum. At first this appears impossible. In fact, although it is undocumented (I learned of this directly from a Google engineer years ago when GBS first debuted), Google often does know the LCCN or OCLCnumber of an item, and does index it. The catch is that these identifiers go into the general index, not a fielded index, so you can’t do a true fielded search on them.

Searching on LCCN or OCLCnum

Each is indexed as a single token: the prefix OCLC or LCCN immediately followed by the number, with no space in between. For instance “LCCN07020699” or “OCLC1246014”.

It turns out to be important to use double quotes for phrase quoting in these searches. Otherwise you can get false positives when Google’s query engine tries to match something close to what you entered instead. For instance, searching for OCLC 1246014 without quotes, you can get two hits. The first, “The Latin Language”, actually has OCLCnum 1246014. The second, “Labor market behavior…”, has a similar OCLCnum without the 0 in it: 124614.

(Looks like Google’s query parsing analyzers, optimized for actual text search, decide to try the search without any 0’s too?  Also for reasons I can’t explain, in HTML search, if you remove the mysterious “tbm=bks” parameter, you only get the one correct hit even without phrase quotes.  However, without quotes using the API in the standard way, you do get both hits including the false positive. So just use quotes!)
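A minimal sketch of building such a phrase-quoted query, in Python. It only constructs the URL and makes no network call. The endpoint is an assumption on my part (the public Google Books API v1 "volumes" endpoint), and the function names are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public Google Books API v1 "volumes" endpoint.
GBS_VOLUMES_URL = "https://www.googleapis.com/books/v1/volumes"


def gbs_identifier_query(kind, number):
    """Return the phrase-quoted query text for an LCCN or OCLC number."""
    kind = kind.upper()
    if kind not in ("LCCN", "OCLC"):
        raise ValueError("kind must be 'LCCN' or 'OCLC'")
    # Prefix and number are joined into one token, then wrapped in
    # double quotes so the query engine does not try "close" matches.
    return '"{}{}"'.format(kind, number)


def gbs_search_url(kind, number):
    """Return a full search URL with the quoted token as the q parameter."""
    query = urlencode({"q": gbs_identifier_query(kind, number)})
    return GBS_VOLUMES_URL + "?" + query


if __name__ == "__main__":
    print(gbs_search_url("OCLC", "1246014"))
```

The double quotes end up percent-encoded as %22 in the URL, which is what you want: they are part of the query, not decoration.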

Normalization?

LCCNs can appear in multiple forms with regard to punctuation and whitespace and some supplementary prefixes and suffixes.  It is not clear to me if Google normalizes them before indexing them. The nature of Google’s index is likely such that you do need to match whatever Google has indexed to get a hit.  It seems best to just normalize before doing the search. If Google isn’t normalizing, this might result in some false negatives, but in that case there’s no good way to avoid false negatives anyway.

OCLC numbers are simply incremented whole numbers.  They sometimes appear with left-padded zeroes in my own local data; it appears to me anecdotally that removing any left-padded zeroes is your best bet to getting a match in GBS.
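Here is a rough sketch of that normalization in Python. The LCCN part follows the Library of Congress's published info:lccn normalization steps as I understand them (drop whitespace, drop a slash and anything after it, remove the hyphen and left-pad the serial part to six digits). Whether Google's index actually holds this form is an assumption, and the function names are hypothetical.

```python
import re


def normalize_lccn(raw):
    """Normalize an LCCN along the lines of the LoC info:lccn rules."""
    s = re.sub(r"\s+", "", raw)   # 1. remove all whitespace
    s = s.split("/", 1)[0]        # 2. remove a slash and everything after it
    if "-" in s:
        # 3. remove the hyphen; the part after it must be 1-6 digits
        prefix, serial = s.split("-", 1)
        if not re.fullmatch(r"\d{1,6}", serial):
            raise ValueError("unexpected LCCN serial part: %r" % raw)
        s = prefix + serial.zfill(6)  # left-pad the serial to six digits
    return s


def normalize_oclcnum(raw):
    """Strip surrounding whitespace and left-padded zeroes."""
    digits = raw.strip().lstrip("0")
    if not digits.isdigit():
        raise ValueError("not a plain OCLC number: %r" % raw)
    return digits


if __name__ == "__main__":
    print(normalize_lccn("n78-890351"))
    print(normalize_oclcnum("001246014"))
```

This deliberately does not handle OCLC numbers carrying local prefixes; if your data has those, strip them before calling normalize_oclcnum.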

LCCN or OCLCnum in Google API response

So the Google systems know something about LCCN and OCLCnum, at least at indexing, in order to index them like this.

Will the Google API response return the LCCN or OCLCnumber when it knows about them?  Not usually, but sometimes. There are plenty of items, seemingly the majority, that can be fetched with an OCLC or LCCN token search but which do not include the LCCN or OCLCnum in the response.

But occasionally, an OCLCnum or LCCN is included in the response, via the “industryIdentifiers” array, type “OTHER”. 

An LCCN example:

"industryIdentifiers": [
     {
      "type": "OTHER",
      "identifier": "LCCN:72627172"
     }
]

An OCLCnum example:

"industryIdentifiers": [
     {
      "type": "OTHER",
      "identifier": "OCLC:12345678"
     }
]

Don’t count on these being there, though; they mostly are not. It’s a mystery why these are sometimes included in the API response and sometimes not. That is too bad, because it would be very useful to have the GBS API as a switching lookup between ISBN, OCLCnum, and LCCN.
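If you want to pick these up when they do appear, something like the following Python sketch would work. It assumes the identifiers always look like the two examples above (type “OTHER”, with an “LCCN:” or “OCLC:” prefix on the identifier), and that you pass in the object that holds the “industryIdentifiers” array (in the API responses I know of, that is the volumeInfo object). The function name is hypothetical.

```python
def extract_lccn_oclc(volume_info):
    """Pull LCCN and OCLC numbers out of an industryIdentifiers array.

    Returns a dict with keys "LCCN" and "OCLC"; a value is None when the
    response did not include that identifier, which is the common case.
    """
    found = {"LCCN": None, "OCLC": None}
    for ident in volume_info.get("industryIdentifiers", []):
        if ident.get("type") != "OTHER":
            continue  # ISBN_10, ISBN_13 and so on are not what we want here
        scheme, sep, value = ident.get("identifier", "").partition(":")
        # Keep only the first value seen for each scheme.
        if sep and scheme in found and found[scheme] is None:
            found[scheme] = value
    return found


if __name__ == "__main__":
    sample = {
        "industryIdentifiers": [
            {"type": "OTHER", "identifier": "LCCN:72627172"}
        ]
    }
    print(extract_lccn_oclc(sample))
```

Since the identifiers are usually missing, callers should treat a None as "Google didn't say", not as "this item has no LCCN or OCLCnum".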

Plea for OCLC to lobby Google (again)

So the indexed OCLC and LCCN tokens suggest that the Google system knows about OCLCnums and LCCNs for many more items than the API response reveals. Additionally, nearly every (or every?) item in Google Book Search has an outgoing “Find in a Library” link to worldcat.org, and these links use an OCLCnum directly in the URL.

So apparently the OCLCnum is available at HTML page display time for nearly every item, far more items than the GBS API includes OCLCnums for in its responses.

OCLC has a relationship with Google for the worldcat records in GBS, and for these links from GBS to worldcat.org.  I would be so pleased if they’d use this relationship to lobby hard for Google to include the OCLCnumbers in every API response where the Google system is aware of the OCLCnum, and to actually document this as a reliable feature (along with the ability to look up by OCLCnumber).

It would be incredibly valuable to have GBS as a lookup switching system between ISBN, LCCN, and OCLCnum.  It would probably be pretty neutral and irrelevant to Google’s business. Possibly OCLC would see it as a threat, wanting its walled worldcat system to be the only way to do this.  But I suggest OCLC ought to see it as an opportunity. The OCLC number is the key to OCLC services: the easier it is to find an OCLC number for an item of interest, and the more OCLC numbers get integrated into the larger web, the more useful and valuable OCLC’s properties become.

This is especially clear when we realize that if GBS API responses included OCLC numbers, this would also be a lookup switching system between Google’s internal item ID and OCLC number. As far as I know, there is no service available anywhere at present that allows you to match back and forth between a Google internal ID (like “HbcCPwAACAAJ”) and an OCLC number.  At least no machine-readable service: Google Books itself is a human-usable one, by looking up a book and then following the “Find in a Library” link. But an API that allowed software to take a known Google Books item and find the corresponding worldcat item (or vice versa) would seem to be positive for the business of both OCLC and Google, making both their services more valuable and accessible.

If OCLC had any energy to try and bring this up with Google and make it happen, it would be of great value to us libraries.

4 Responses to Google Book Search, LCCN and OCLCnumber

  1. Dorothea says:

    The G00g, it is inscrutable.

    (Seriously, I read this three times trying to figure out what combination of indexing and retrieval algorithms might produce these results, and I got nothin’.)

  2. Sherman says:

    Hello Source:
    I am a self-published author and I did submit my manuscript to the Library of Congress in 1991. At that time I was assigned a LCCN along with a TXu number. I was unable to find either in your blog.
    My question is: how far back do your records of LCCN go, and does the TXu number show up in your system?
    Thank you for your dedication and assistance in this vital area.
    Respectfully,
    Sherman

  3. Almost 3 years later it seems that Google still has not included LCCN or OCLC Number in their API. Based on this, is there another way to quickly retrieve the LCCN based on the ISBN? If I have an Excel file with hundreds of ISBNs, how can I quickly fill the LCCN column?
    Any help would be greatly appreciated.

  4. jrochkind says:

    Not from Google so far as I know, and I don’t expect it to show up anytime in the future either.

    OCLC may have such services, possibly even available for free to non-members. Or regarding LCCN, it may actually be available from the Library of Congress, maybe even using their Z39.50 server. That is an inconvenient legacy protocol, but it might work. I recommend you join the code4lib listserv and ask for suggestions there.
