To explain my current frustration with Internet Archive a bit more, here’s what I want to do.
What we want to do
I have library interfaces that display detail pages for known items: my catalog and my link resolver. When my known item is a book, I’d like to search the IA corpus, see if a publicly available digitized full text exists, and tell my users about it. (Check for a publicly available audio book version too while we’re at it, why not.) Maybe search on identifiers if available, but I know most of the time I’ll be searching on title/author keyword, and from experimenting against IndexData’s OCA index, that works Good Enough™, so good. This would be a really great feature. Users shouldn’t have to know to go to the IA search to find IA content, and shouldn’t have to do a second search; I want to put it right in my own searches.
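A minimal sketch of that lookup logic, in Python: prefer an identifier when the known item has one, otherwise fall back to title/author keywords. Everything named here is an assumption for illustration, not a documented IA interface: the Lucene-style field names (`isbn`, `title`, `creator`), the `q` and `rows` parameters, and the `base_url` passed in by the caller are all hypothetical.

```python
import re
from urllib.parse import urlencode


def _keywords(text):
    # Replace punctuation with spaces so stray quotes, colons, or
    # parentheses in a title can't break the query syntax.
    return " ".join(re.sub(r"[^\w\s]", " ", text).split())


def build_query(item):
    """Build a Lucene-style query string for a known item (a plain dict).

    The field names (isbn, title, creator) are assumptions.
    """
    if item.get("isbn"):
        # Identifier search: keep only digits and the X check character.
        return "isbn:" + re.sub(r"[^0-9Xx]", "", item["isbn"])
    clauses = []
    if item.get("title"):
        clauses.append("title:(%s)" % _keywords(item["title"]))
    if item.get("author"):
        clauses.append("creator:(%s)" % _keywords(item["author"]))
    return " AND ".join(clauses)


def search_url(base_url, item, rows=10):
    """Return a search URL; base_url and parameter names are hypothetical."""
    return base_url + "?" + urlencode({"q": build_query(item), "rows": rows})
```

For a book with no usable identifier, `build_query({"title": "Moby-Dick; or, The Whale", "author": "Melville, Herman"})` yields a title clause ANDed with a creator clause. Keyword matching like this is fuzzy by design, which is the Good Enough trade-off described above.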
Brewster suggests an access method
So how do I do this? At code4lib, after Brewster Kahle’s keynote, I asked him this, and he sent me a nifty URL to a search of IA that would let me limit to collections (like just full text ones, perhaps), and that actually returned some XML with a very basic set of metadata on the items (title, collection it came from, format, a URL to get associated assets including more metadata). Okay, I said, I wish that was actually documented somewhere, but that’s good enough, that’ll work, so good!
The Case of the Disappearing Interface
So, several months later, I (or actually Jason Ronallo, interning for me) finally get around to writing the functionality that’s going to use this access, and the access has been removed! First of all, if I had already spent time writing to this interface (at Brewster’s recommendation!) only to find it disappear, I’d be really mad, and that’s an indication of a lack of proper communication between IA and developers like me. So I guess it’s fortunate that I hadn’t, and it disappeared before I started. But still, how do I do what I need now?
It ain’t in the OL
Perhaps IA thinks that the OpenLibrary APIs are sufficient, and that’s why they got rid of their other search. Indications are that, with OpenLibrary, the IA is to some extent “getting” that they actually need to understand and provide for machine-access needs if they want developers like us to interface with it. But the OpenLibrary is not good enough for me, because the OpenLibrary database doesn’t in fact include all Internet Archive hosted free digital content! It is only intended to include Internet Archive content that has MARC, but Alexis Rossi of the OpenLibrary project estimates that there are upwards of 100,000 digital texts (~25% of the entire corpus) hosted by the Internet Archive which do not have MARC, and thus aren’t included in the OpenLibrary db, and there are no plans to include them. Well, gee, I want to search over these too! Not to mention audio book versions; I don’t know if any of these are included in the OpenLibrary database. Why would I want to leave them out of the search that puts them in front of my users just because they don’t have MARC?
So I’m (or Jason is) basically reduced to a screen scrape of the Internet Archive HTML search results. That’s an HTML page which doesn’t even have quite sufficient metadata to do everything I could do with the previous XML hit list, even once I do successfully screen scrape. And I don’t think this ‘advanced’ search lets me limit to a set of _multiple_ collections like the last one did, so I’m going to have to do several searches to search several collections (putting more load on the IA servers while we’re at it). But that’s what I’m going to do, because I see little other choice. And I hate screen scraping. It is a testament to how valuable I believe the IA content is to my users that I’m even willing to go that route; I wouldn’t do it for just anyone.
A little attention and support could go a long way
So this is an example of why I get the feeling that the Internet Archive is uninterested in developers like me and our attempts to expose Internet Archive content to our users–even as they complain, mystified, about why more developers aren’t hooking into IA content! It’s no mystery to me. I don’t think anyone at the IA even realizes what a problem they caused by eliminating this XML search, because I don’t think anyone is even paying attention to what people like us are trying to do–while complaining that we’re not doing more.
And, yeah, sure, I could harvest the whole darn IA with OAI-PMH, find just the pieces I want in it (full text and possibly audio book collections), and index them locally. Maybe eventually I’ll do that. IA does give me permission. But if IA actually gave me a machine-searchable interface to their own indexes so I don’t have to figure out how to do that, it’d be a lot faster, easier, and more maintainable locally for me. I’d really rather not run my own local index of IA content. Experience with the IndexData OpenContent OCA index shows that such things really do require attention and effort to keep them up and working (that OCA index keeps breaking and going down and being heavily out of date), attention and effort that are in short supply where I am.
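For a sense of what the harvest route involves, here is a rough Python sketch of the paging logic from the OAI-PMH 2.0 protocol: request `ListRecords`, then keep following the `resumptionToken` until the server returns an empty one. The `verb`, `metadataPrefix`, `set`, and `resumptionToken` arguments and the XML namespace come from the protocol itself; the `base_url` and any set name are placeholders, since IA’s actual endpoint and set names aren’t given here. Fetching, retrying, and indexing are left out, and that is exactly the part that needs ongoing care.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def first_request(base_url, set_spec=None):
    """URL for the first page of a harvest; base_url is a placeholder."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Pull the header identifier of every record on one response page."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter(OAI_NS + "identifier")]


def next_request(base_url, response_xml):
    """URL for the next page, or None once the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(".//%sresumptionToken" % OAI_NS)
    if token is None or not (token.text or "").strip():
        # A missing or empty token marks the final page.
        return None
    # A resumptionToken is sent with the verb alone, without the
    # metadataPrefix or set arguments from the first request.
    params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
    return base_url + "?" + urlencode(params)
```

A harvester would loop: fetch `first_request(...)`, process the page, then fetch whatever `next_request(...)` returns until it gets `None`. Filtering down to just the full text and audio book collections, and keeping the local index fresh, all sits on top of this.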
If IA wants to encourage people to write hooks into their system, a little bit of support on their end (which would require caring about what we are trying to do in the first place, instead of just telling us what they think we should be trying to do) would go a long way–you get enough infrastructure there for us, we can and want to put all sorts of cool things on top of it.
But you’ve got to get enough actually supported infrastructure there to make it easy enough to do that we can find time to do it, ’cause we’re busy and underfunded, like everyone. You don’t need to write our applications for us; we really will write open source apps on top of the services you provide, but you’ve got to give us those services. Creating our own index on top of your OAI-PMH feed (which, incidentally, is also, according to colleagues who have tried to use it, a bear to deal with that often doesn’t behave as expected) is apparently just a bit too much for us out here in library developer world to realistically accomplish successfully with our resources.
Update 26 May: The XML search is back! Thanks Internet Archive! Thanks Jason for finding it somehow! I haven’t been checking my email or the lists over the long weekend, so I’m not sure how Jason discovered it was back, or if it’s been advertised, or if it’s a response to this complaint, or what. But thanks!