This is worth pulling out into a post of it’s own. Thanks to Dorothea Salo for the comments on the post where I broached this issue sort of in passing. Good to know that I’m indeed not alone in worrying about this stuff.
But there are actually a few different (but related) issues Dorothea has identified here, some of which aren’t a problem for my projects at all, others of which are. Let’s analyze them out:
1. Some faculty are unwilling to publish open access.
This might be a problem, but despite this problem there’s plenty of free-web publicly accessible scholarly content available. (I use this phrase because the specific licensing might be unclear, but an unauthenticated user can get it on the web.) I’m thinking specifically about so-called preprint/postprint public accessible versions of articles that also appear in not-open-access journals. There’s lots of it. This is in fact what motivates my desires in the first place.
2. Some repository software doesn’t allow control of access to the level desired by repository managers.
This might be a problem too, but despite it, most supposed “open access” repositories do contain material that the repository does not in fact make available to the general unauthenticated public! So the software might not be flexible enough, but it is often restricting access to contents in it anyway. And including metadata for those restricted items in the general OAI-PMH feed, without any predictable machine-readable way to tell that it is in fact restricted content.
So it’s in fact the ability of many repositories now to restrict content that brings me to my issue:
3. I have no way to identify the universe of actually publically accessible ‘open access’ scholarly content.
Even if I created an aggregate index of OAI-PMH feeds from all “open access” repositories—it would include content which is not viewable by an unauthenticted user! What I want to do in my software is, I have a known-item citation, I want to tell the user if there’s a publically-viewable copy of this citation online. I have no way to find/identify such a copy though! I have no way to weed out the stuff that isn’t really publically accessible. I don’t want to send the user to something they cant’ access—some repositories listed in DOAR actually have the majority of their items (in the OAI-PMH feed) not available to the unauthenticated off-campus user!
So 1 and 2 might be issues in general, but aren’t what’s providing the roadblock for me. 3 is. There are a couple other issues worth nothing, one that is an inconvenience (but not a roadblock) for my project, one that is not.
4. Difficulty of identifying articles in repositories matching a citation.
When I experimentally tried doing a search against OAISter (before I realized that OAISter didn’t even limit itself to so-called open access repositories; and before I realized that even open access repositories weren’t)—I had to do a search based just on title and author keywords. It would be better if I could search based on an identifier (DOI or pmid) when present—or based on structured publication data for the actual publication of the pre/postprint: ISSN, vol, issue, page number. But these things aren’t available in the OAI-PMH feed, and in fact probably aren’t even in most repositories metadata. Most repository metadata doesn’t try to connect a pre or post-print to the actual published version in any way.
This is annoying, but I found that author/title keyword search worked good enough to be useful even without this, so it wasn’t a roadblock.
5. Might be publically accessible, but is it open access?
This gets at what the SPARC/DOAJ initiative is trying to solve. Okay, I’m a reader, I can look at this article online on the free-web, but what am I allowed to do with it? Am I allowed to reproduce it? This matters to readers and is a real issue, but doesn’t in fact matter to my project. All I care about is if I can show them the full text on the public web—once I can do that, I can worry about helping them understanding the license and their access rights, but first I need to help them discover the article in the first place!