Google Scholar linking open access pre-prints to citations?

Hmm, it kind of looks like Google Scholar may have accomplished something I’ve been trying to figure out how to do for quite a while — linking a published citation to an open access pre-print found somewhere.

Am I understanding the results properly?

Check out this result list for a topic that came into my head and that I decided to search Scholar for.

It looks to me that if you click on the title, you get taken to a publisher paywall. But in the right-hand column next to many of the hits (5 of 10 on this page) is a link which seems to be an open access pre-print. No?  And they actually manage to link right to the PDF too, not to an annoying DSpace/Fedora/Whatever landing page that makes it really confusing to find the additional click to the actual PDF.

Anyone have any clues as to what’s going on, or how they’re doing it?

Man, I really wish Google Scholar had an API.  I’d really like my link resolver (Umlaut) to be able to alert someone to an open access pre-print for the citation they’ve found. But I haven’t been able to find a reliable way to do it. (It is deeply sad to me that searching, for instance, DOAR isn’t good enough, because repos listed in DOAR actually do have all sorts of embargoed and otherwise restricted content in them too, despite their advertised collection policy, and there’s no way to tell which is which.  They also don’t provide a search service, other than a Google custom search, but that could maybe be worked around if one could be confident hits in their collection really were OA.  (OAIster also includes non-OA content, although OAIster doesn’t pretend otherwise).

Really, this is a failing in common metadata: most OAI-PMH harvests will include embargoed and otherwise restricted content along with OA content, with absolutely no metadata advertisement of which is which.  And then there’s the fact that most OAI-PMH harvests will advertise links to the aforementioned annoying landing page, not to the actual assets. Again, no common metadata schema is being used to advertise what a link will actually lead to.  Really, I’m deeply disappointed that this kind of thing, namely good metadata that will allow software to know if an item really is OA and to get a link directly to the content as well as the landing page, doesn’t seem to be a concern of the repository and communities. This has been a problem for YEARS, and if any of the various organizations involved in this stuff are even making any efforts to address it, I haven’t heard about it.

I could screen-scrape Google Scholar, but I think they’d rate limit me. Although I could try sending proxy headers advertising the fact that I’m proxying for a particular client IP address (not exactly a lie, depending on the meaning of ‘proxying’), and see if that gets me through. Or I could do it solely with JavaScript in the browser, so it really is coming from the individual client and I won’t get rate-limited, but I hate providing functionality with no non-JavaScript fall-back.)
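To make the OAI-PMH metadata gap concrete, here is a minimal sketch, assuming a typical unqualified Dublin Core (oai_dc) record. The sample record, the repository hostname, and the `summarize` helper are all invented for illustration; only the two XML namespaces are the real oai_dc and Dublin Core ones. The point is that a harvester gets free-text `dc:rights` and an opaque `dc:identifier`, and is reduced to guessing from the URL whether a link is a landing page or the file itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical oai_dc record, made up for this example.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Article</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:identifier>http://repository.example.edu/handle/1234/5678</dc:identifier>
  <dc:rights>Copyright the author</dc:rights>
</oai_dc:dc>
"""

# Dublin Core element namespace in ElementTree's {uri}tag notation.
DC = "{http://purl.org/dc/elements/1.1/}"


def summarize(record_xml):
    """Collect what a harvester can actually see about access and links."""
    root = ET.fromstring(record_xml)
    identifiers = [el.text for el in root.iter(DC + "identifier")]
    rights = [el.text for el in root.iter(DC + "rights")]
    # Crude heuristic (an assumption, not part of any standard):
    # a URL ending in .pdf probably points at the file, not a landing page.
    direct_pdf_links = [u for u in identifiers if u and u.lower().endswith(".pdf")]
    return {
        "identifiers": identifiers,
        "rights": rights,
        "direct_pdf_links": direct_pdf_links,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_RECORD))
```

Nothing in the returned dictionary says whether the item is embargoed, restricted, or open, and the only link is a handle-style URL that gives no hint of what sits behind it.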

Grey OA

The other interesting thing that occurs to me, as I play around with this more, is that many of the PDF links G. Scholar finds are in fact NOT pre-prints. They appear to be the actual page images of the final published version, often hosted in the personal web areas of one of the authors (guessing from the URLs, which include a tilde and lastname, or the name of a lab or research group).  I wonder for how many of these the author actually has the publisher’s permission to post them, and for how many not?  Dorothea, you reading this? What do you think?

20 thoughts on “Google Scholar linking open access pre-prints to citations?”

  1. Interesting new UI element. The fact that Google Scholar found preprints on the web isn’t new — one could always follow the “All x versions” link to a display of publisher-based and self-archived versions of the title. It looks like what they are doing now is calling out one of the open access versions in the main UI.

    I don’t know how they are doing this, but I’m betting it is based on the PDF-to-text indexing that they already do in the main index. They can match on article title and author names (and perhaps even abstract) to get a pretty clear understanding of close matches. Remember — Google does a lot of metadata stuff based on probabilities. (For a refresher on what they do with bibliographic data, take a look at my summary from Midwinter: Mashups of Bibliographic Data: A Report of the ALCTS Midwinter Forum.)
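    A rough illustration of that kind of probabilistic matching, as a minimal sketch only: nothing here is known to be what Google actually does. The `normalize`, `title_similarity`, and `likely_same_work` helpers and the 0.9 threshold are assumptions chosen for the example; it simply compares normalized titles with the standard library's `difflib` and requires at least one shared author surname.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same_work(citation, candidate, threshold=0.9):
    """Guess whether a PDF found on the web matches a published citation.

    Both arguments are dicts with a "title" string and a "surnames" list.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    shared = {s.lower() for s in citation["surnames"]} & {
        s.lower() for s in candidate["surnames"]
    }
    if not shared:
        return False
    return title_similarity(citation["title"], candidate["title"]) >= threshold


if __name__ == "__main__":
    published = {
        "title": "Open access and openly accessible: a study of scientific "
                 "publications shared via the internet",
        "surnames": ["Wren"],
    }
    found_pdf = {
        "title": "Open Access and Openly Accessible - A Study of Scientific "
                 "Publications Shared via the Internet",
        "surnames": ["wren"],
    }
    print(likely_same_work(published, found_pdf))
```

    A real system would presumably weigh more signals (abstract text, year, venue) and tolerate OCR noise, but even this much shows why exact metadata is not required to pair a citation with a copy found elsewhere.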

  2. It’s a concern! It’s a concern! It’s just that getting anything DONE about it in the software communities concerned is like wading through hip-deep peanut butter. This repo-rat is completely burned out on engaging with open-source repo software.

    The other problem is that OAI-PMH yet again didn’t think through its problem space very well. They should have built availability flags (as well as error reporting) into the protocol. It’ll never be changed; those responsible are playing with new toys.

  3. Oh, and as for permissions: your suspicion is well-founded. You can check SHERPA’s list of publishers allowing final-PDF deposition, but for most of what you’re seeing on GScholar, dollars to doughnuts it’s not from a cooperating publisher.

    There’s a good article about this: Wren, J. D. (2005). Open access and openly accessible: a study of scientific publications shared via the internet. BMJ, 330, 1128.

    I wouldn’t be surprised if this phenomenon became the next lawsuit after the Georgia State one gets dealt with. GScholar has just made detection rather easy. My suspicion is that the publishers will try to sue an IR rather than faculty members, though they may find that difficult because IRs (in spite of Harnad) tend to flout publisher wishes considerably less than do faculty. We’ll see!

  4. I proposed something akin to an availability flag at a JCDL pre-conference workshop 8 or so years ago, and was surprised to receive blank looks. Truly unfortunate, since I found this by far the most difficult part of managing OAIster. And why we changed our “collection development” policy after realizing that embargoed and such items were impossible to keep out/remove using current protocol features.

  5. “I proposed something akin to an availability flag at a JCDL pre-conference workshop 8 or so years ago, and was surprised to receive blank looks.”

    I think there’s a major problem here, which is that the people working on creating things like OAI-PMH, or at JCDL or what have you, are largely R&D types who aren’t actually working on _real stuff_. That’s the only way they could not realize this is an issue; anyone who tries to _use_ the infrastructure they invent, for things that serve our user needs the most, will run into this pretty quick.

    And as Dorothea mentions, to make matters worse, the R&D crowd considers something “done” and goes on to other things, and the library world extends no more resources to fixing it so that it actually _works_; our resources have moved on to inventing other things that may or may not actually work out.

    No doubt many people consider “OAI-PMH” a success because it’s deployed all over the place. No matter if it’s deployed consistently enough or semantically richly enough to actually DO anything.

  6. “My suspicion is that the publishers will try to sue an IR rather than faculty members”

    The articles I noted, I was suspicious about precisely because they weren’t in an IR, they were on a faculty member’s personal or research team website, judging by URL.

    Of course, the publishers could still sue the university, which is hosting those web pages. Or send them a bunch of DMCA takedown notices. I wonder if a university is an ISP with regards to its hosting of faculty content, with regards to DMCA. Probably not, since faculty are employees. Students, who knows.

  7. As an aside, one way to get this fixed might be to run OAI-PMH through an actual standards process (e.g. NISO) rather than having it be a de facto standard.

    Hey, come to think of it, I sit on the NISO Discovery-To-Delivery Topic Committee and could suggest this. Does anyone think it would have legs?

  8. I have not generally found things that go through “actual standards processes” to be improved at all. (*cough openurl*).

    You just need the right people who understand where the rubber meets the road, not just R&D CS theory, and the right people who feel a responsibility for seeing it through and not abandoning it. In my observation/experience, going through standards body bureaucracy doesn’t necessarily help with that, and often actively harms it.

    The real problem is that the pool of technical people who understand the right stuff available/interested to work on library problems is not that big.

  9. Well, the bureaucracy is only as good as the people involved. (There is a lesson in politics there somewhere.) Would you be interested in participating in a working group that worked on putting OAI-PMH through the standards vetting process?

  10. Thanks, but probably not. I’m skeptical of OAI-PMH and of official standards-making processes.

    If the bureaucracy is only as good as the people involved, then if you have good people, do you need the bureaucracy? Ideally maybe to guarantee some kind of institutional support, so if the individual people involved lose interest, it’ll still be well supported for future revisions. But I have seen that not happen.

    Part of the problem is that a standard these days needs to be an iterative process. Develop, standardize, develop to standard, re-standardize based on experience, repeat. The standards-making-body process seems to actively work against this: standardize-and-done. That is in a sense what happened to OAI-PMH. I’m curious in what ways you think the OAI-PMH product might have been different had it gone through an official standards organization, or in what ways you are optimistic that it will be different putting it through one now. If the end result is pretty much what we have anyway, but with a lot more time and effort that went into getting a standards body to put a stamp of approval on it, then the stamp of approval doesn’t actually add much to what we’ve already got, right?

    (One actual downside of certain standards bodies is when the standard, once approved, is not available for free. That is counter-productive).

  11. There’s been a gap all along between how Great Minds (not all CS types, either!) thought that the repository ecosystem Should Work, and how it actually Did(n’t) Work in practice.

    We haven’t been very good at bridging that gap. I’ve written at length about why, but the gap persists.

    OAI-PMH’s trajectory is a symptom, not the disease. Fixing OAI-PMH, either by making an IETF-ish fork of it or by throwing it at OASIS or NISO, won’t cure the disease. I don’t know what will.

  12. About the availability flag / metadata field… This is also something that the DLF Best Practices group made recommendations on and that Jenn Riley and I pushed heavily in our shareable metadata workshops. I also tend to be skeptical of standards solving this problem, except that software communities are more likely to pick up a standard and implement it (because it’s a standard, you know!).

  13. Thanks Sarah. Can you give me, or point me to, a very concise, to-the-point description of how to provide an availability metadata field in an OAI-PMH feed that accords with the best practice recommendations your group made?

    That’s the kind of thing that helps get it into software, I believe, make it as quick and easy as possible for developers to see how to do it in an interoperable way.

    Although OAI-PMH may be a lost cause anyway, because why put stuff into your software that no clients will ever use, while meanwhile no clients will ever use what’s not there in the first place — and hey, we implemented OAI-PMH, shouldn’t that be enough, what do you want more for?

    The downside of “standards”, I think, is what happens when they are not written well. (What’s a well-written standard? Concise, clear, providing solutions to real problems, with a difficulty of implementation that corresponds to their value to our businesses and users.) When a standard is not written well, all that happens is that software developers (and _especially_ vendors, who love this sort of thing) can say “Well, we implemented the standard, the end, we’re done, what are you complaining about?” without needing to actually think about the problem. That holds whether or not they really did implement the standard, whether or not the standard is clear enough to be sure if they implemented it, whether or not the standard is actually successful at solving the problem, and even whether or not the standard is _meant_ to solve the problem that the vendor is trying to get out of solving by saying “Hey, we implemented the standard.” (Another criterion of a good standard: be clear about what problem it means to solve and in what ways, and what things must still be solved beyond simply ‘following the standard’. With some library-world standards, it’s not clear to me that the drafters were clear on what problems they were meaning to solve; or if they were, those weren’t the same problems that people now erroneously assume the standards solve.)

  14. There is a basic Google hack, for searching only for full-text, at the level of the individual repository:

    site: filetype:pdf your keywords here

    It’s also possible to rip all the DOAR and ROAR URLs and use them to make a Google CSE, then use the ‘filetype:pdf your keywords here’ search method in the resulting search-engine. I’ve done it here, in alpha:

    Of course, you can’t guarantee that the Googlebot has crawled the entire repository.
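    The per-repository search above can be scripted as a query builder. This is a minimal sketch: the `pdf_query` and `search_url` helpers and the example hostname are hypothetical, and it only constructs the query string and a search URL for a person to open in a browser; it does not fetch or scrape anything.

```python
from urllib.parse import urlencode


def pdf_query(repo_host, keywords):
    """Build a 'site:<host> filetype:pdf <keywords>' query string."""
    return f"site:{repo_host} filetype:pdf {keywords}"


def search_url(repo_host, keywords):
    """Wrap the query in a Google search URL suitable for a browser link."""
    return "https://www.google.com/search?" + urlencode(
        {"q": pdf_query(repo_host, keywords)}
    )


if __name__ == "__main__":
    # repository.example.edu is a placeholder, not a real repository.
    print(pdf_query("repository.example.edu", "open access"))
    print(search_url("repository.example.edu", "open access"))
```

    The same caveat applies as for the manual search: results are limited to whatever the Googlebot has crawled, and a PDF being indexed says nothing about whether it is legitimately open access.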
