So there’s lots of open access scholarly article content on the web now. But there’s no good way to actually search all that content, and only that content. Sure, you can use Google (and other web searches), but your results are a mixed bag. You can get non-scholarly info mixed with your scholarly results.
Use Google Scholar (which, unlike other Google searches, provides no API at all), and you can eliminate that. But you've still got a mix of open access and non-open access stuff. The non-open stuff might be licensed by the user's library, or it might not be. And even if it is, the actual link Google Scholar provides might get them to it (if they're on campus; otherwise unlikely), or might ask them to pay $40 for it, even though their library is already paying for it!
So using web searches to find open access content is inherently flawed for providing a good user experience.
What else is there? At first I thought OAIster (recently taken over, or perhaps purchased, by OCLC), but it turns out that much of the content in OAIster is not open access, and there's no good way to tell which is which except clicking on it and finding out, just like Google. And if it's not open access, it has the same problem of potentially asking the user to pay for content that's already licensed for them. We're trying to save the time of the user here, people!
DOAR? It doesn't have its own indexes, but it claims to be a directory of open access repositories, with OAI feeds, so you could build your own index, right? Except it turns out that an incredibly high percentage of the content you'd get this way wouldn't be open access after all. While DOAR claims that "sites where any form of access control prevents immediate access are not included", this just isn't true. Many of the repositories listed in DOAR mix open access content with access-controlled licensed content, and again, there's no way to tell which is which except clicking and praying.
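To see why the build-your-own-index idea runs aground, consider what harvesting actually looks like. A minimal sketch of talking OAI-PMH to a repository (the repository URL is a made-up placeholder): you request ListRecords, and the only rights information you can hope for is a free-text dc:rights field, which is very often simply absent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # Continuation requests carry only the token, per the protocol.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)

def rights_statements(oai_xml):
    """Pull any dc:rights values out of a ListRecords response.
    In practice these are frequently missing, which is the whole problem."""
    root = ET.fromstring(oai_xml)
    return [el.text for el in root.iter("{%s}rights" % DC_NS) if el.text]
```

Even when dc:rights is present, it's uncontrolled free text, so you're back to heuristics rather than a reliable machine-readable open access flag.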
Google and Yahoo CC searches at least try to eliminate non-open stuff (if not limiting to scholarly-type content), but their reach is still too narrow: most open access content isn't yet advertised as open access in the machine-readable way that Google CC needs.
So at first I thought there was some easy hackable solution to this, some way to create an open access scholarly search. But I don't think there is; it's just too tricky. (And this is largely the fault of our community's repositories not actually tagging open access content in a machine-recognizable way, in their OAI feeds etc. We've REALLY dropped the ball here.)
So I think there’s a business opportunity for someone to create such a search, essentially an aggregator service, potentially much like the many we are already paying for, but one focused on open access. I see two possible technical approaches.
Start with OAIster content, add in DOAR-listed content, add in content from DOAJ, arXiv, and other places that provide significant collections of known open access scholarly material.
But the vendor will then 'filter' it to be really, truly, only open access stuff, using various algorithmic heuristics and manually maintained 'whitelists' of URLs known to serve open access content (again, this is hard for DOAR content, where a single repository mixes open access and closed content). This isn't easy, but that's why we'd be willing to pay for it, right? I think it's doable to get something 'good enough'.
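A crude sketch of what such heuristic filtering might look like. The host whitelist here is a two-entry toy standing in for what would really be a large, manually maintained list, and the rights-text checks are just illustrative guesses at the kind of signals a vendor would look for:

```python
from urllib.parse import urlparse

# Toy whitelist of hosts whose content is known to be open access;
# a real service would maintain and curate thousands of entries.
OPEN_ACCESS_HOSTS = {"arxiv.org", "doaj.org"}

def looks_open_access(record):
    """Heuristic: keep a record if its URL is on a whitelisted host
    (or a subdomain of one), or if its rights statement says so outright.
    `record` is a dict with optional 'url' and 'rights' keys."""
    host = urlparse(record.get("url", "")).hostname or ""
    if any(host == h or host.endswith("." + h) for h in OPEN_ACCESS_HOSTS):
        return True
    rights = (record.get("rights") or "").lower()
    return "open access" in rights or "creativecommons.org" in rights

def filter_open_access(records):
    return [r for r in records if looks_open_access(r)]
```

Notice what this can't do: for a mixed repository that isn't whitelisted and supplies no rights statement, the record simply falls through, which is exactly the DOAR problem above.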
Provide a simple HTML search, but also provide suitable APIs to include the content in federated search and other applications. Ideally, provide BETTER APIs than most of our vendors do: OpenSearch, SRU, Atom, etc.
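For a sense of how little it takes to support something like SRU on the client side, here's what a searchRetrieve request looks like. The endpoint URL is hypothetical (whatever the aggregator would expose); the parameters are standard SRU 1.1:

```python
from urllib.parse import urlencode

def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve request URL.
    `base_url` is whatever endpoint the hypothetical aggregator exposes;
    `cql_query` is a query in CQL, SRU's query language."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": str(max_records),
        "recordSchema": "dc",
    }
    return base_url + "?" + urlencode(params)
```

Because it's all plain HTTP GET with standard parameters, any federated search product can consume it without vendor-specific glue, which is the point of preferring these standards over proprietary APIs.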
We already pay for all sorts of aggregation services. If the price were right, wouldn't customers be willing to pay for this particular 'database' too? If I were a decision maker (which I am not), I would. It's a needed service for our users.
Another approach: a vendor could provide a 'filter' on top of Google, Yahoo, etc. searches, somehow sifting the truly open access (and ideally scholarly) content out of the cruft, using the same kind of heuristic approaches talked about above.
Google and Yahoo both offer APIs (although again, not for Google Scholar). Google's and Yahoo's terms of service may or may not allow a vendor to access them on a large scale to provide this 'filter' on top.
But we’re in the Web 2.0 world of APIs now. A vendor could provide a customer with simple software that ran on the customer’s site, accessed Google and Yahoo through their APIs with the customer’s own API key from the customer’s own server, and then passed the results to this hypothetical vendor, who would filter them to include only open access stuff.
And then present them to the user, displayed according to Google's, Yahoo's, etc. terms of service (with appropriate branding and so on). Except I'm not sure that Google's ToS would let you do this; I think you might not be allowed to change their results at all when presenting them to the user. And when we talk about using this in federated search, like we'd really like to, the ToS issues get even more problematic.
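Setting the ToS question aside, the customer-side glue in that round trip is simple. A sketch, with every name made up for illustration; the vendor's service is modeled as a pluggable function that takes a batch of URLs and returns the subset it judges open access:

```python
def filter_via_vendor(hits, vendor_check_batch):
    """Customer-side glue for the hypothetical filter architecture.
    `hits` are results already fetched from the search engine's own API
    using the customer's key; `vendor_check_batch` stands in for the
    vendor service, taking a list of URLs and returning those judged
    open access (one batched round trip, to keep latency down)."""
    open_urls = set(vendor_check_batch([h["url"] for h in hits]))
    return [h for h in hits if h["url"] in open_urls]
```

Batching matters here: per-result calls to the vendor would add a network round trip for every hit, while one batched call keeps the filter step to a single extra request per search.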
So this 'filter' idea is a neat idea. But with the potential ToS problems, on top of the more complex technical architecture, all in all I think the aggregator approach is probably a better bet.