What I wish I could do with Internet Archive

To explain my current frustration with Internet Archive a bit more, here’s what I want to do.

What we want to do

I have library interfaces that display detail pages for known items: my catalog and my link resolver. When my known item is a book, I’d like to search the IA corpus, see if a publicly available digitized full text exists, and tell my users about it. (Check for a publicly available audio book version too while we’re at it, why not.) Maybe search on identifiers if available, but I know most of the time I’ll be searching on title/author keywords, and in my experiments against IndexData’s OCA index that works Good Enough™, so good. This would be a really great feature. Users shouldn’t have to know to go to the IA search to find IA content, and shouldn’t have to do a second search; I want to put it right in my own searches.
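To make the idea concrete, here is a rough sketch of how a known-item lookup could be turned into a keyword query. Everything specific in it is my assumption for illustration: the `SEARCH_URL` placeholder, the `q` and `rows` parameters, the Lucene-style syntax, and the `title`, `creator`, and `collection` field names. None of it is a documented IA interface.

```python
import re
from urllib.parse import urlencode

# Placeholder endpoint (assumption): the real search URL is not documented here.
SEARCH_URL = "https://example.org/search"

# Characters that carry meaning in Lucene-style query syntax.
_QUERY_SYNTAX = re.compile(r'[+\-!(){}\[\]^"~*?:\\/&|]')


def keywords(text):
    """Replace query-syntax characters with spaces and collapse whitespace."""
    return " ".join(_QUERY_SYNTAX.sub(" ", text).split())


def build_query(title, author=None, collections=()):
    """Build a title/author keyword query, optionally limited to collections.

    The field names (title, creator, collection) are assumed, not confirmed.
    """
    clauses = ["title:(%s)" % keywords(title)]
    if author:
        clauses.append("creator:(%s)" % keywords(author))
    if collections:
        clauses.append("collection:(%s)" % " OR ".join(collections))
    return " AND ".join(clauses)


def search_url(title, author=None, collections=()):
    """Return the full request URL for one known-item lookup."""
    params = {"q": build_query(title, author, collections), "rows": 10}
    return SEARCH_URL + "?" + urlencode(params)
```

The catalog or link resolver would call `search_url()` with whatever title and author it already has for the item on the detail page, so the user never has to run a second search.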

Brewster suggests an access method

So how do I do this? At code4lib, after Brewster Kahle’s keynote, I asked him this, and he sent me a nifty URL to a search of IA that would let me limit to collections (like just the full text ones, perhaps), and that actually returned some XML with a very basic set of metadata on the items (title, collection it came from, format, a URL to get associated assets including more metadata). Okay, I said, I wish that was actually documented somewhere, but that’s good enough, that’ll work, so good!
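For a sense of how little code it takes to consume a hit list like that, here is a minimal sketch. The XML shape, the element names, the sample item, and the `details_base` URL are all invented for illustration; the actual response format may look quite different.

```python
import xml.etree.ElementTree as ET

# Invented sample (assumption): not the real response format.
SAMPLE_XML = """<results>
  <item>
    <identifier>example-item-1</identifier>
    <title>An Example Book</title>
    <collection>example-texts</collection>
    <format>PDF</format>
    <format>DjVu</format>
  </item>
</results>"""


def parse_hits(xml_text, details_base="https://example.org/details/"):
    """Turn an XML hit list into a list of plain dicts, one per item."""
    hits = []
    for item in ET.fromstring(xml_text).iter("item"):
        identifier = item.findtext("identifier")
        if not identifier:
            continue  # nothing to link to, so skip the hit
        hits.append({
            "identifier": identifier,
            "title": item.findtext("title"),
            "collections": [c.text for c in item.findall("collection")],
            "formats": [f.text for f in item.findall("format")],
            # Hypothetical URL for fetching assets and fuller metadata.
            "url": details_base + identifier,
        })
    return hits


if __name__ == "__main__":
    for hit in parse_hits(SAMPLE_XML):
        print(hit["title"], hit["url"])
```

With structured results like these, the detail page only needs the title, the formats, and a link, which is exactly the basic metadata the XML search returned.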

The Case of the Disappearing Interface

So, several months later, I (or actually Jason Ronallo, interning for me) finally get around to writing the functionality that’s going to use this access, and the access has been removed! First of all, if I had already spent time writing to this interface (at Brewster’s recommendation!) only to find it disappear, I’d be really mad, and that’s an indication of a lack of proper communication between IA and developers like me. So I guess it’s fortunate that I hadn’t, and that it disappeared before I started. But still, how do I do what I need now?

It ain’t in the OL

Perhaps IA thinks that the OpenLibrary APIs are sufficient, and that’s why they got rid of their other search. Indications are that, with OpenLibrary, the IA is to some extent “getting” that they actually need to understand and provide for machine-access needs if they want developers like us to interface with it. But the OpenLibrary is not good enough for me, because the OpenLibrary database doesn’t in fact include all of the free digital content hosted by the Internet Archive! It is only intended to include Internet Archive content that has MARC, but Alexis Rossi of the OpenLibrary project estimates that there are upwards of 100,000 (~25% of entire corpus) digital texts hosted by the Internet Archive which do not have MARC, and thus aren’t included in the OpenLibrary db, and there are no plans to include them. Well, gee, I want to search over these too! Not to mention audio book versions: I don’t know if any of these are included in the OpenLibrary database. Why would I want to leave them out of my search, and keep them from my users, just because they don’t have MARC?

So I’m (or Jason is) basically reduced to a screen scrape of the Internet Archive HTML search results: an HTML page which doesn’t even have quite sufficient metadata to do everything I could do with the previous XML hit list, even once I do successfully screen scrape. And I don’t think this ‘advanced’ search lets me limit to a set of _multiple_ collections like the last one did, so I’m going to have to do several searches to search several collections (putting more load on the IA servers while we’re at it). But that’s what I’m going to do, because I see little other choice. And I hate screen scraping. It is a testament to how valuable I believe the IA content is to my users that I’m even willing to go that route. I wouldn’t do it for just anyone.

A little attention and support could go a long way

So this is an example of why I get the feeling that the Internet Archive is uninterested in developers like me and our attempts to expose Internet Archive content to our users–even as they complain, mystified, about why more developers aren’t hooking into IA content! It’s no mystery to me. I don’t think anyone at the IA even realizes what a problem they caused by eliminating this XML search, because I don’t think anyone is even paying attention to what people like us are trying to do–while complaining that we’re not doing more.

And, yeah, sure, I could harvest the whole darn IA with OAI-PMH, find just the pieces I want in it (full text and possibly audio book collections), and index them locally. Maybe eventually I’ll do that. IA does give me permission. But if IA actually gave me a machine-searchable interface to their own indexes, so I don’t have to figure out how to do that, it’d be a lot faster and easier for me to build, and easier to maintain locally. I’d really rather not run my own local index of IA content. Experience with the IndexData OpenContent OCA index shows that such things really do require attention and effort to keep them up and working (that OCA index keeps breaking, going down, and being heavily out of date), attention and effort that are in short supply where I am.
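For reference, the harvesting half of that plan looks roughly like the sketch below, which pages through `ListRecords` responses by following `resumptionToken`, as the OAI-PMH protocol defines. The base URL and set name in the usage comment are placeholders, not IA’s actual values, and the sketch leaves out everything that makes a real harvest hard: error responses, retries, deleted records, and the local index itself.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def harvest(fetch, base_url, set_spec=None, metadata_prefix="oai_dc"):
    """Yield the identifier of every record, following resumption tokens.

    `fetch` is any callable that takes a URL and returns XML text, so the
    loop can be exercised without touching the network.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = (root.findtext(".//" + OAI_NS + "resumptionToken") or "").strip()
        if not token:
            break  # a missing or empty token marks the last page
        # A resumption request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token}


# Usage (placeholder URL and set name, both assumptions):
# for identifier in harvest(http_get, "https://example.org/oai", "example-texts"):
#     print(identifier)
```

Even in this stripped-down form, it is a long-running crawl that I would have to schedule, monitor, and re-run to stay current, which is the maintenance burden I’d rather avoid.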

If IA wants to encourage people to write hooks into their system, a little bit of support on their end (which would require caring about what we are trying to do in the first place, instead of just telling us what they think we should be trying to do) would go a long way–you get enough infrastructure there for us, we can and want to put all sorts of cool things on top of it.

But you’ve got to get enough actually supported infrastructure there to make it easy enough to do that we can find time to do it, ’cause we’re busy and underfunded, like everyone. You don’t need to write our applications for us; we really will write open source apps on top of the services you provide, but you’ve got to give us those services. Creating our own index on top of your OAI-PMH feed (which, incidentally, is also, according to colleagues who have tried to use it, a bear to deal with that often doesn’t behave as expected) is apparently just a bit too much for us out here in library developer world to realistically accomplish successfully with our resources.

Update 26 May: The XML search is back! Thanks Internet Archive! Thanks Jason for finding it somehow! I haven’t been checking my email or the lists over the long weekend, so I’m not sure how Jason discovered it was back, or if it’s been advertised, or if it’s a response to this complaint, or what. But thanks!


4 thoughts on “What I wish I could do with Internet Archive”

  1. I don’t even need full text search, I just need metadata search of what’s in the IA. That’s great that they are adding new searches to the Open Library project, but my concern remains with searching the materials hosted by the IA that are NOT in the Open Library database. That’s great they are adding a full text search to the Open Library, but that’s orthogonal to my concerns here—which are also about removing an existing interface that had been recommended to us, without warning, and without a replacement.

    I should join the Open Library email list, but this post is in fact not about the Open Library project, it’s about search of the entire Internet Archive corpus (at least 25% of which is not included in the Open Library database). Indeed, from everything I’ve seen the Open Library project is honestly interested in providing useful APIs.
