
More on open access discoverability May 8, 2008

Posted by jrochkind in Practice, open access, programming.

This is worth pulling out into a post of its own. Thanks to Dorothea Salo for the comments on the post where I broached this issue sort of in passing. Good to know that I’m indeed not alone in worrying about this stuff.

But there are actually a few different (but related) issues Dorothea has identified here, some of which aren’t a problem for my projects at all, others of which are. Let’s tease them apart:

1. Some faculty are unwilling to publish open access.

This might be a problem, but despite it there’s plenty of free-web publicly accessible scholarly content available. (I use this phrase because the specific licensing might be unclear, but an unauthenticated user can get it on the web.) I’m thinking specifically about so-called preprint/postprint publicly accessible versions of articles that also appear in not-open-access journals. There’s lots of it. This is in fact what motivates my desires in the first place.

2. Some repository software doesn’t allow control of access to the level desired by repository managers.

This might be a problem too, but despite it, most supposed “open access” repositories do contain material that the repository does not in fact make available to the general unauthenticated public! So the software might not be flexible enough, but it is often restricting access to its contents anyway. And it includes metadata for those restricted items in the general OAI-PMH feed, without any predictable machine-readable way to tell that the content is in fact restricted.

So it’s in fact the ability of many repositories now to restrict content that brings me to my issue:

3. I have no way to identify the universe of actually publicly accessible ‘open access’ scholarly content.

Even if I created an aggregate index of OAI-PMH feeds from all “open access” repositories, it would include content which is not viewable by an unauthenticated user! What I want to do in my software is this: I have a known-item citation, and I want to tell the user if there’s a publicly viewable copy of it online. I have no way to find or identify such a copy, though, because I have no way to weed out the stuff that isn’t really publicly accessible. I don’t want to send the user to something they can’t access, and some repositories listed in DOAR actually have the majority of their items (in the OAI-PMH feed) not available to the unauthenticated off-campus user!

So 1 and 2 might be issues in general, but aren’t what’s providing the roadblock for me. 3 is. There are a couple of other issues worth noting, one that is an inconvenience (but not a roadblock) for my project, one that is not.

4. Difficulty of identifying articles in repositories matching a citation.

When I experimentally tried doing a search against OAISter (before I realized that OAISter didn’t even limit itself to so-called open access repositories, and before I realized that even open access repositories weren’t fully open), I had to do a search based just on title and author keywords. It would be better if I could search based on an identifier (DOI or PMID) when present, or based on structured publication data for the actual publication of the pre/postprint: ISSN, volume, issue, page number. But these things aren’t available in the OAI-PMH feed, and in fact probably aren’t even in most repositories’ metadata. Most repository metadata doesn’t try to connect a preprint or postprint to the actual published version in any way.

This is annoying, but I found that author/title keyword search worked well enough to be useful even without this, so it wasn’t a roadblock.
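To make the author/title keyword approach concrete, here is a minimal sketch of matching a known-item citation against one unqualified Dublin Core (oai_dc) record of the kind an OAI-PMH feed carries. The sample record, the stopword list, the 0.8 overlap threshold, and the function names are all assumptions made up for illustration; this is not what OAISter or any repository actually does.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the unqualified Dublin Core elements used in oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"

# Made-up record, for illustration only.
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Measuring the Discoverability of Self-Archived Postprints</dc:title>
  <dc:creator>Example, Alice</dc:creator>
  <dc:creator>Sample, Bob</dc:creator>
  <dc:identifier>http://repository.example.edu/handle/1234/5678</dc:identifier>
</oai_dc:dc>
"""

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def tokens(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return {w for w in re.split(r"[^a-z0-9]+", text.lower()) if w}


def matches_citation(record_xml, title, author_surname, threshold=0.8):
    """Return True if the record plausibly describes the cited article.

    Requires that at least `threshold` of the citation's title keywords
    appear in the record title, and that the surname appears among the
    record's creators. The threshold is an arbitrary assumption.
    """
    root = ET.fromstring(record_xml.strip())
    record_title = " ".join(el.text or "" for el in root.iter(DC + "title"))
    record_creators = " ".join(el.text or "" for el in root.iter(DC + "creator"))

    wanted = tokens(title) - STOPWORDS
    if not wanted:
        return False
    overlap = len(wanted & tokens(record_title)) / len(wanted)
    return overlap >= threshold and author_surname.lower() in tokens(record_creators)
```

Calling `matches_citation(SAMPLE_RECORD, "Measuring the discoverability of self-archived postprints", "Example")` matches on keywords alone. Nothing here can tell a preprint from the published version, or an open item from a restricted one, which is exactly the gap described above.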

5. Might be publicly accessible, but is it open access?

This gets at what the SPARC/DOAJ initiative is trying to solve. Okay, I’m a reader, I can look at this article online on the free-web, but what am I allowed to do with it? Am I allowed to reproduce it? This matters to readers and is a real issue, but doesn’t in fact matter to my project. All I care about is whether I can show them the full text on the public web. Once I can do that, I can worry about helping them understand the license and their access rights, but first I need to help them discover the article in the first place!

Google feature changes; open access discoverability May 7, 2008

Posted by jrochkind in Practice, open access, programming.

So, I’ve found out about a couple of new things from Google I hadn’t known about. (Google is such a prominent player in our space that we need to keep up with what’s going on there so we know how to exploit it to maximum effect. I need to remember to explore Google’s interfaces and documentation more regularly to see changes.) 1. The Google search API now allows server-side access. 2. Google search allows limiting by usage license. Both of these things got me started on open access discoverability again.

1. Google API allows server-side access!

Thanks to Kent Fitch for alerting us on the code4lib listserv.

http://code.google.com/apis/ajaxsearch/documentation/#fonje

“For Flash developers, and those developers that have a need to access the AJAX Search API from other Non-Javascript environments, the API exposes a simple RESTful interface….

“An area to pay special attention to relates to correctly identifying yourself in your requests. Applications MUST always include a valid and accurate http referer header in their requests. In addition, we ask, but do not require, that each request contains a valid API Key.”
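As a rough sketch of what a server-side call honoring those two requirements might look like, the Python below builds (but does not send) a request carrying a Referer header and an optional API key. The endpoint URL in the usage lines and the `v`, `q`, and `key` parameter names are my assumptions, not something quoted above, so check them against the documentation linked earlier; `build_search_request` and the example.edu referer are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(endpoint, query, referer, api_key=None):
    """Build a GET request for a REST search endpoint.

    The parameter names ("v", "q", "key") are assumptions about the API,
    not confirmed by the passage quoted above.
    """
    params = {"v": "1.0", "q": query}
    if api_key:
        # The quoted documentation asks for a key but does not require one.
        params["key"] = api_key
    req = Request(endpoint + "?" + urlencode(params))
    # The quoted documentation says a valid, accurate referer is mandatory.
    req.add_header("Referer", referer)
    return req


# Assumed endpoint and a made-up referer identifying the calling application.
req = build_search_request(
    "http://ajax.googleapis.com/ajax/services/search/web",
    "open access repositories",
    "http://library.example.edu/findit",
)
```

Passing `req` to `urllib.request.urlopen` would actually send it. The point is only that the referer is set explicitly by the server-side code, since there is no browser to supply one.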

This is huge. I’ve complained before about how difficult it was to incorporate Google features into my own service-oriented software in a maintainable way when only JavaScript AJAX functions were allowed.

Now if only they’d do the same thing for the Google Books Discoverability API. That’s where I really need it; it’s still not clear to me how I might usefully incorporate automated general Google search (including Google Scholar) into my library applications dealing with scholarly materials, because of the high chance that what Google returns will be for-pay and not available to my users. I don’t want to show them that.

So it was with interest I noticed a new feature:

2. Google search supports usage rights limit

Take a look at the Google advanced search page. Click on “Date, usage rights, numeric range, and more”. Look, there’s a “usage rights” limit which filters by CC licenses. When did that show up? Of course, the filter can only include things that advertise a CC license in a way that Google’s bots can recognize. (Not sure how this is done, Google doesn’t say; I think I recall there’s a standard CC-endorsed way to do this?)
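For what it’s worth, the markup Creative Commons recommends is a link carrying `rel="license"` that points at the license URL. Whether that is what Google’s crawler keys on isn’t something Google documents, so treat the following as a sketch of the convention only. The sample page, the class and function names, and the simple "creativecommons.org/" substring test are assumptions for illustration.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values from <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. "license nofollow".
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rels and href:
            self.licenses.append(href)


def cc_licenses(html):
    """Return the Creative Commons license URLs a page advertises via rel="license"."""
    parser = LicenseLinkParser()
    parser.feed(html)
    # Crude filter: keep only links that point at creativecommons.org.
    return [href for href in parser.licenses if "creativecommons.org/" in href]


# Made-up landing page for a postprint, for illustration only.
PAGE = """
<html><head><link rel="stylesheet" href="/style.css"></head>
<body>
<p>Postprint of an article.</p>
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>
</body></html>
"""

found = cc_licenses(PAGE)
```

A page that merely links to a license without the `rel="license"` attribute is not picked up, which is one plausible reason a repository’s content could be openly licensed and still be invisible to a license filter.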

Unfortunately, some initial test searches revealed that this is a tiny piece of the actual open access pie. Many scholarly materials that ARE available online open access are not in fact in Google’s indexes. Probably because they don’t advertise their license properly in a machine-readable way? Still, this is a great step by Google, and indicates that Google recognizes users are increasingly having trouble with getting too much restricted content in their Google search results.

But my frustration remains with the scholarly open access community. If the problem is that open access repositories aren’t advertising CC licenses properly, why aren’t these software packages (many of them open source) being fixed? Why isn’t there a general, concerted, funded effort from the open access repository community to solve the general problem? And the general problem is that there’s no good place to search aggregated open access content and ONLY open access content, for use in software that wants to answer the question “Is there an open access version of the article with this title and author available?” There’s no good way to do it. This lack of discoverability is a huge problem for the utility of the existing open access repository domain. I don’t understand why there isn’t more concerted effort to solve it.

Although, in fairness, I did recently become aware of a European initiative, that’s apparently actually funded, to address at least part of this issue. Registering in machine-readable format whether content is open access is the first step to building aggregated indexes. (It’s a dirty secret of the ‘open access repository’ domain that much of the content in so-called “open access repositories” is not in fact open access at all; it’s behind IP and password based restrictions. A cursory sample of items in repositories listed in OpenDOAR, whose collection policies say that a reason for EXCLUSION from OpenDOAR is “Site requires login to access any material (gated access) - even if freely offered”, will reveal that that collection policy is quite often honored in the breach.) Although I guess DOAJ has less of a problem with that, and that SPARC/DOAJ initiative is just about DOAJ, so it’s not clear to me that the SPARC project will really address my problem. I guess the SPARC project is about people not being sure if they can re-use material in DOAJ journals. My problem is being able to do a meta-search limited to publicly available open access content in the first place, and I don’t care if it’s licensed for re-use, I just want to find only stuff that is actually viewable online for free!

Hmph. What can we do to get the open repositories communities to take note of this problem and direct resources toward it?

OAISter -> points to plenty of non open access stuff January 25, 2008

Posted by jrochkind in General, Link Resolvers, Practice, open access.

So I had been operating under the incorrect assumption that OAISter only aggregated feeds which claimed to be of open access materials.

After embarrassingly sending them a letter (and cc’ing code4lib) asking for clarification, I noticed their collection development policy page. (Embarrassing because I should have checked first.)

http://www.oaister.org/restricted.html

  • We harvest and retain all records that point to digital resources.
  • This includes freely-available and restricted-access digital resources.
