concordances and centralized services

In a post on NGC4Lib, Eric Lease Morgan tells us about a new service in their catalog: concordances on item detail pages that give more information about an item. For instance, see the “Analyze using text mining techniques” link on this page: http://www.catholicresearch.net/Record/undmarc_000885039 , which will take you here: http://www.catholicresearch.net/concordances/?id=undmarc_000885039 , where one of the options, for instance, will show you this: http://www.catholicresearch.net/concordances/?cmd=words&id=undmarc_000885039&n=50
Eric suggests that this interface “provides information services similar to tables-of-contents and back-of-the-book indexes.”

I agree.  In particular, the “top N words” feature is interesting as a substitute/supplement for tables of contents or abstracts (especially for records that don’t already have such).  The “words beginning with [letter]” feature seems less obviously useful to me; to the extent that it’s sort of useful, it’s probably better to provide an actual ‘search for word in book’ returning page numbers, like Eric also does, and like several current digitized book providers, including HathiTrust, do even for in-copyright works.

In fact, I’d like to put the top 50 (or other N) words (or top N m-word phrases, etc) directly on a catalog page, not requiring the user to follow a possibly mystifying “Analyze using text mining techniques” link, and then make some more clicks too, to get there.
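
To make “top N words” and “top N m-word phrases” concrete, here is a rough sketch of how such lists could be computed from a book’s full text. This is not Eric’s implementation; the function names and the tiny stopword list are my own assumptions, just for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (ASCII letters only)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def top_words(text, n=50):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    words = [w for w in tokenize(text) if w not in STOPWORDS]
    return Counter(words).most_common(n)


def top_phrases(text, m=2, n=50):
    """Return the n most frequent m-word phrases as (phrase, count) pairs."""
    words = tokenize(text)
    phrases = (" ".join(words[i:i + m]) for i in range(len(words) - m + 1))
    return Counter(phrases).most_common(n)
```

Calling `top_words(full_text, 50)` and `top_phrases(full_text, m=2, n=50)` would give the two lists one might drop straight onto a catalog page. Note that the phrase version keeps stopwords, so a real service would probably want to filter phrases made up mostly of them.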

What Eric’s done locally is quite a bit of work, and it can only be done with books he can get his hands on the full text for. Rather than every library trying to do this individually (costing both programmer time and CPU time as well as storage cost for each library), it might be nice to have some central agency doing this once and sharing the results.  And it might be nice if that central agency had access to a whole lot of full text, including in-copyright works, to calculate the concordances.

I can think of only one entity with the proper infrastructure, access to full text, and potential interest in providing such a service in a way useful to libraries: HathiTrust.  (You could include Google if you have a different judgement than me about their potential interest in providing such a service in a way useful to libraries).

HathiTrust actually already does provide a ‘search inside the book’ feature, which works even on in-copyright texts, although the results are just page numbers with no results-in-context. Via their APIs, I already embed this search-inside-the-book from HathiTrust in my catalog, which, as I mentioned, I think serves much the same purpose as Eric’s “search” and “words beginning with letter” features.

But what I don’t have yet is the “top N words” or “top N m-word phrases” that Eric’s got.

It would be an interesting and useful service if HT were to pre-compute these “top N” concordances (choosing a couple of useful sets: top 50 words, top 50 2-word phrases, etc), and then provide an API service where you could look them up by ISBN/ISSN/OCLCnum/LCCN. They already have an API that lets you look up HT records by those identifiers; the new thing would just be computing the concordances and advertising them in the API.  Then individual libraries could use them to supplement their displays with this added metadata, without each library having to spend its own programmer time and CPU time doing all the work individually.  (HT could of course also provide a bulk download in addition to a per-item API).
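
As a rough sketch of the “look up by identifier” idea, here is an in-memory stand-in for such a service. Everything in it is hypothetical: the class name, the method names, and the simplified identifier normalization are my assumptions, not anything HathiTrust provides.

```python
import re

# Identifier types the hypothetical lookup accepts.
ID_TYPES = {"isbn", "issn", "oclc", "lccn"}


def normalize_id(id_type, value):
    """Return a (type, value) key so differently formatted identifiers match.

    The normalization here is simplified for illustration; real LCCN
    normalization in particular has more rules than this.
    """
    id_type = id_type.lower()
    if id_type not in ID_TYPES:
        raise ValueError("unsupported identifier type: " + id_type)
    value = value.strip()
    if id_type in ("isbn", "issn"):
        value = value.replace("-", "").replace(" ", "").upper()
    elif id_type == "oclc":
        # Drop prefixes such as "(OCoLC)" or "ocm", and any leading zeros.
        value = re.sub(r"\D", "", value).lstrip("0")
    else:
        value = value.replace(" ", "").lower()
    return (id_type, value)


class ConcordanceIndex:
    """Hypothetical store of pre-computed top-N lists, keyed by identifier."""

    def __init__(self):
        self._by_id = {}

    def add(self, identifiers, top_n):
        """Register one pre-computed list under every identifier of a title."""
        for id_type, value in identifiers:
            self._by_id[normalize_id(id_type, value)] = top_n

    def lookup(self, id_type, value):
        """Return the stored list for an identifier, or None if unknown."""
        return self._by_id.get(normalize_id(id_type, value))
```

The point of the design is that the expensive computation happens once, centrally, and each catalog only does a cheap keyed lookup with whatever identifier it happens to have in its record.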

Perhaps the pre-computed concordances could be provided in an XML format, where, for in-copyright works that HT chooses to show no full text for, the data would just reveal the number of times the word appears, and a list of page numbers that word appears on. I think this type of concordance would reveal no more information than HT’s current “search where results are only page numbers” service, which they provide for in-copyright works too, so presumably the same legal analysis that justified HT’s current service could justify concordances for in-copyright works.
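
Here is a minimal sketch of what such a counts-and-page-numbers-only XML file might look like when generated. The element and attribute names (`concordance`, `word`, `page`, `count`, `n`) are made up for illustration; no such format exists at HathiTrust as far as I know.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_concordance_xml(volume_id, pages, words):
    """Build XML giving, for each requested word, its count and its pages.

    pages is a list of page texts, page 1 first. No page text is copied
    into the output, only counts and page numbers.
    """
    wanted = {w.lower() for w in words}
    counts = defaultdict(int)
    page_hits = defaultdict(list)
    for page_number, text in enumerate(pages, start=1):
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in wanted:
                counts[token] += 1
                # Record each page only once per word.
                if not page_hits[token] or page_hits[token][-1] != page_number:
                    page_hits[token].append(page_number)

    root = ET.Element("concordance", {"id": volume_id})
    for word in sorted(counts):
        entry = ET.SubElement(
            root, "word", {"text": word, "count": str(counts[word])}
        )
        for page_number in page_hits[word]:
            ET.SubElement(entry, "page", {"n": str(page_number)})
    return ET.tostring(root, encoding="unicode")
```

For a word that appears three times, on pages 1 and 3, the output would hold one `word` element with `count="3"` and two `page` children, and nothing from the text itself.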

For public domain titles HT does show full text for, these pre-computed and XML-delivered files could actually include “search in context” excerpts for each page number, as well as direct links to the relevant page on the HathiTrust platform.

(Of course, one thing that makes this more complicated is that whether HT chooses to show actual text from a work also depends on their best guess (via IP geocoding) of the user’s location: some texts can only be viewed by people in certain countries. I guess they could only provide the hypothetical “XML with search in context” as a bulk download for titles they share with the entire world.)

Whether HathiTrust chooses to do this or not (and it would be legitimate if they chose to do it sharing the data only with the HT member institutions that are funding it), the fact that this would be quite feasible for HT to do shows the wisdom of the creators of HT in creating it. The library community needs a library community institution that is placed to do such things, and we’ve got HathiTrust, nice.  The fact that one could imagine adding such a service to HathiTrust’s existing infrastructure and having it fit in nicely to their existing APIs and such is evidence of how the HT developers have done a great job of designing their infrastructure with an eye on the future. (I personally think it is no coincidence that this infrastructure was not “designed by committee”, but by a team at umich. I hope now that HT has formal membership governance bodies and such, they can nonetheless continue to avoid the perils of ‘design by committee’ that affect so many collaboratively funded library software projects, with deleterious results.)


2 Responses to concordances and centralized services

  1. Even more than concordances and top-N words, I’d like statistically improbable phrases: they show more about the contents than anything else.

    I like, in Eric’s interface, that you can click through to get the KWIC, e.g.
    http://www.catholicresearch.net/concordances/?cmd=search&id=undmarc_000885039&query=john

    There might be intelligent ways to group these to use word sense (not just words)–getting even closer to a proper index.

    Thanks for pointing this out, Jonathan!

  2. jrochkind says:

    Excellent point on ‘statistically improbable’, Jodi. I wonder if Eric’s tried doing that?
