On approaches to Bridging the Gap in access to licensed resources

A previous post I made reviewing the Ithaka report “Streamlining access to Scholarly Resources” got a lot of attention. Thanks!

The primary issue I’m interested in there: getting our patrons from a paywalled scholarly citation on the open, unauthenticated web to an authenticated library-licensed copy, or to other library services. “Bridging the gap”.

Here, we use Umlaut to turn our “link resolver” into a full-service landing page for both books and articles: licensed online copies, local print copies, and other library services.

This means we’ve got the “receiving” end taken care of (here’s a book and an article example of an Umlaut landing page), so the problem reduces to getting the user from the open, unauthenticated web to an Umlaut page for the citation in question.

Which is still a tricky problem.  In this post, a brief discussion of two things: 1) the new “Google Scholar Button” browser extension from Google, which is interesting in this area, but I think ultimately not enough of a solution to keep me from looking for more, and 2) the possibilities of Zotero’s open source code toward this end.

The Google Scholar Button

In late April Google released a browser plugin for Chrome and Firefox called the “Google Scholar Button”.

This plugin will extract the title of an article from a page (either from text you’ve selected on the page first, or by trying to scrape a title from the HTML markup), and give you search results for that article title from Google Scholar, in a little popup window.

Interestingly, this is essentially the same thing a couple of third-party software packages have done for a while: the LibX “Magic Button”, and Lazy Scholar.  But now we get it in an official Google release, instead of open source hacks working around Google’s lack of an API.

The Google Scholar Button is basically trying to bridge the same gap we are. It provides a condensed version of Google Scholar search results, with a link to an open access PDF if Google knows about one (I am still curious how many of these open access PDFs are not-entirely-licensed copies put up by authors or professors without publisher permission).

And in some cases it provides an OpenURL link to a library link resolver, which is just what we’re looking for.

However, it’s got some limitations that keep me from considering it a satisfactory ‘Bridging the Gap’ solution:

  • In order to get the OpenURL link to your local library link resolver while you are off campus, you have to set your Google Scholar preferences in your browser, which is pretty confusing to do.
  • The title has to match in Google Scholar’s index, of course. That index is definitely extensive enough to still be hugely useful, as evidenced by the open source predecessors to the Google Scholar Button trying to do the same thing.
  • But most problematic of all, Google Scholar Button results will only show the local library link resolver link for some citations: the ones registered as having institutional fulltext access in your institution’s holdings registration with Google.  I want to get users to the Umlaut landing page for any citation they want, even if we don’t have licensed fulltext (and we might have it even if Google doesn’t think we do; the holdings registrations are not always entirely accurate). I want to show them local physical copies (especially for books), and ILL and other document delivery services.
    • The full Google Scholar interface does give a hard-to-find (but at least it’s there) OpenURL link for “no local fulltext” citations under a ‘more’ link, but the Google Scholar Button version doesn’t offer even this.
    • Books/monographs might not be the primary use case, but I really want a solution that works for books too. Books are something users may especially want a physical copy of instead of online fulltext, and our holdings registration with Google pretty much doesn’t include books, even ebooks.  Book titles are also a lot less likely to return hits in Google Scholar at all.

I really want a solution that works all or almost all of the time to get the patron to our library landing page, not just some of the time, and my experiments with Google Scholar Button revealed more of a ‘sometimes’ experience.

I’m not sure if the LibX or Lazy Scholar solutions can provide an OpenURL link in all cases, regardless of Google institutional holdings registration.  They are both worth further inquiry for sure.  But Lazy Scholar isn’t open source, and I find its UI not great for our purposes. And I find LibX a bit too heavyweight for solving this problem, and have some other concerns about it.

So let’s consider another avenue for “Bridging the Gap”….

Zotero’s scraping logic

Instead of trying to take a title and find a hit in a mega-corpus of scholarly citations  like the Google Scholar Button approach, another approach would be to try to extract the full citation details from the source page, and construct an OpenURL to send straight to our landing page.
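
To make “construct an OpenURL” a little more concrete, here is a rough sketch (TypeScript, purely illustrative) of turning already-extracted citation elements into an OpenURL 1.0 KEV link aimed at a resolver. The resolver base URL is a made-up placeholder, and the field list is just the common journal-article subset:

```typescript
// Illustrative sketch only: build an OpenURL 1.0 (KEV) link to a resolver
// from citation elements we've already extracted somehow.
interface Citation {
  atitle?: string;  // article title
  jtitle?: string;  // journal title
  volume?: string;
  issue?: string;
  spage?: string;   // start page
  date?: string;    // year
  issn?: string;
  doi?: string;
  pmid?: string;
}

// Placeholder, not our real resolver address.
const RESOLVER_BASE = "https://findit.library.example.edu/resolve";

function openUrlFor(cit: Citation): string {
  const params = new URLSearchParams({
    url_ver: "Z39.88-2004",
    ctx_ver: "Z39.88-2004",
    rft_val_fmt: "info:ofi/fmt:kev:mtx:journal",
  });
  const fields: Array<[string, string | undefined]> = [
    ["rft.atitle", cit.atitle],
    ["rft.jtitle", cit.jtitle],
    ["rft.volume", cit.volume],
    ["rft.issue", cit.issue],
    ["rft.spage", cit.spage],
    ["rft.date", cit.date],
    ["rft.issn", cit.issn],
  ];
  for (const [key, value] of fields) {
    if (value) params.append(key, value);
  }
  // Identifiers travel as rft_id "info" URIs, so the resolver can do its own
  // DOI/PMID lookups and metadata enhancement downstream.
  if (cit.doi) params.append("rft_id", `info:doi/${cit.doi}`);
  if (cit.pmid) params.append("rft_id", `info:pmid/${cit.pmid}`);
  return `${RESOLVER_BASE}?${params.toString()}`;
}
```

The result looks just like the OpenURLs our resolver already receives from licensed databases, with any scraped DOI or PMID riding along as an rft_id so the resolver can do its own lookups.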

And, hey, it has occurred to me, there’s already software that can scrape citation data elements from quite a long list of web sites our patrons might want to start from: Zotero. (And Mendeley too, for that matter.)

In fact, you could use Zotero as a method of ‘Bridging the Gap’ right now. Sign up for a Zotero account, install the Zotero extension. When you are on a paywalled citation page on the unauthenticated open web (or a search results page on Google Scholar, Amazon, or other places Zotero can scrape from), first import your citation into Zotero. Then go into your Zotero library, find the citation, and — if you’ve properly set up your OpenURL preferences in Zotero — it’ll give you a link to click on that will take you to your institutional OpenURL resolver. In our case, our Umlaut landing page.

We know from some faculty interviews that some faculty definitely use Zotero; it’s hard to say whether a majority do or not. I do not know how many have managed to set up their OpenURL preferences in Zotero, or whether that’s part of how they use it.

Even of those who have, I wonder how many have figured out on their own that they can use Zotero to “bridge the gap” in this way.  But even if we undertook an education campaign, it is a somewhat cumbersome process. You might not want to actually import into your Zotero library; you might want to take a look at the article first. And not everyone chooses to use Zotero, and we don’t want to require them to for a ‘bridging the gap’ solution.

But that logic is there in Zotero: the pretty tricky work of compiling and maintaining ‘scraping’ rules for a huge list of sites likely to be desirable as ‘Bridging the Gap’ sources has already been done. And Zotero is open source. Hmm.

We could imagine adding a feature to Zotero that let the user choose to go right to an institutional OpenURL link after scraping, instead of having to import and navigate to their Zotero library first.  But I’m not sure such a feature would match the goals of the Zotero project, or how to integrate it into the UX in a clear way without distracting from Zotero’s core functionality.

But again, it’s open source.  We could imagine ‘forking’ Zotero, or extracting just the parts of Zotero that matter for our goal, into our own product that did exactly what we wanted. I’m not sure I have the local resources to maintain a ‘forked’ version of plugins for several browsers.

But Zotero also offers a bookmarklet.  It doesn’t have as good a UI as the browser plugins, and it doesn’t support all of the scrapers. But unlike a browser plugin, you can install it on iOS and Android mobile browsers (it’s a bit confusing to do so, but at least it’s possible).  And it’s probably ‘less expensive’ for a developer to maintain a ‘fork’ of: we really just want to take Zotero’s scraping behavior, as implemented in the bookmarklet, and completely replace what happens after the scrape. Send the citation to our institutional OpenURL resolver instead.
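
To make the shape of that idea concrete, here is a rough sketch of what a stripped-down “scrape and redirect” bookmarklet could do. This is emphatically not Zotero’s code: it only reads the Highwire-style meta tags that some publisher pages expose (the whole point of borrowing Zotero’s translators would be covering the many sites that don’t), and the resolver URL is again a placeholder:

```typescript
// TypeScript sketch of the logic a bookmarklet could run; the deployed
// bookmarklet would be this compiled and minified into a javascript: URL.
(() => {
  // Read a Highwire-style citation meta tag, if the page provides one.
  const meta = (name: string): string | undefined =>
    document.querySelector<HTMLMetaElement>(`meta[name="${name}"]`)?.content;

  const params = new URLSearchParams({ url_ver: "Z39.88-2004" });
  const title = meta("citation_title");
  const doi = meta("citation_doi");
  if (title) params.append("rft.atitle", title);
  if (doi) params.append("rft_id", `info:doi/${doi}`);

  // Placeholder resolver address; a fuller field mapping is sketched above.
  window.location.href =
    "https://findit.library.example.edu/resolve?" + params.toString();
})();
```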

I am very intrigued by this possibility; it seems at least worth some investigatory prototypes to have patrons test.  But I haven’t yet figured out where to actually find the bookmarklet code, and the related code in Zotero that may be triggered by it, let alone the next step of figuring out if it can be extracted into a ‘fork’.  I’ve tried looking around on the Zotero repo, but I can’t figure out what’s what.  (I think all of Zotero is open source?)

Anyone know the Zotero devs, and want to see if they want to talk to me about it with any advice or suggestions? Or anyone familiar with the Zotero source code themselves and want to talk to me about it?


23 Responses to On approaches to Bridging the Gap in access to licensed resources

  1. The simple fact of the official Google Scholar Chrome plugin makes me wonder if there *is*, in fact, an (almost certainly undocumented) Scholar API now. Anyone taken a look at the source?

  2. jrochkind says:

    There must be something, but it would presumably be something Google doesn’t want you to use, and Google is pretty good at rate-limiting and blocking automated user-agents they don’t want there. I suppose if you used it from the client, like the plug-in itself, you might get away with it.

  3. Aaron Tay says:

    “Instead of trying to take a title and find a hit in a mega-corpus of scholarly citations like the Google Scholar Button approach, another approach would be to try to extract the full citation details from the source page, and construct an OpenURL to send straight to our landing page.”

    Can’t you do both? The latter approach would imply you could parse out the article/book title at least, so you definitely have the title. Personally I would prefer the former approach for a quick and dirty search matching the title against the mega-index (Summon might substitute, though GS is way superior of course), and only if that fails then try the OpenURL approach, which seems more prone to error since you would need to extract a lot more details, plus the usual fail-ability of OpenURL.

  4. FWIW, I just sniffed my outgoing traffic and found

    ```
    https://scholar.google.com/scholar?oi=gsb90&q=dueberb%20william&output=gsb&hl=en
    ```
    No idea what oi=gsb90 means, but you get JSON back.

  5. ace0cc says:

    @Bill

    I found an exposed Google Scholar API that works, but as I recall it didn’t have all the details that you get on the page.

    The new Google Scholar Button doesn’t use an API; it just scrapes the page like my extension (Lazy Scholar) does. The difference is that I query the paper URL first and then the page title if the URL fails, whereas the button only uses the title.

  6. ace0cc says:

    Hi Jonathan,

    Interesting post. I will add this to my list of things to look into for Lazy Scholar, as I would like to rely less on Google Scholar if possible. Regarding the UI: I agree it is less than ideal, and I’m hoping to improve that this summer. Any suggestions would be appreciated!

    Colby

  7. FWIW, RefWorks and EndNote do a similar thing by scraping the page, but then they check what they get against CrossRef and fill in details from there. I think Mendeley might do that too. They for sure check their own database. I think the preferred logic would be to look for identifiers (DOI, ISBN, PMID… ) and send them to authoritative sources, and short of that, try to fill out the OpenURL directly. Dunno – is this cost prohibitive with CrossRef?

  8. jrochkind says:

    I have heard from users that the RefWorks version doesn’t work all that well, not nearly as well as Mendeley’s and Zotero’s. I haven’t spent much time with any of them. Not all articles have DOIs; in some disciplines/fields they are much less common (and few books have DOIs at all). We can make as many DOI metadata lookups as we want — in fact SFX, Umlaut, and most link resolvers do that already.

    Where possible, I’d prefer to scrape exactly what’s on the page, including the DOI (and/or PMID), and send it all to our link resolver in an OpenURL. The link resolver is already set up to do the DOI lookup, if a DOI is present, and in some cases it decides to enhance or correct the metadata already; I like centralizing all that logic in the link resolver (in this case Umlaut). But if all you could scrape was a DOI, sure, just send that to the link resolver.

  9. jrochkind says:

    Interesting, thanks Colby. Are you at all considering releasing your source code as open source? There are several reasons that would make it easier for some of us to use, and to collaborate on.

  10. jrochkind says:

    Thanks Aaron. You can possibly do both, but my strategy is to centralize all this sort of logic in the Umlaut link resolver — if you can just get the metadata to Umlaut, then Umlaut can be configured with plugins and adapters to search whatever sources you want, if needed. I have thought about an Umlaut plugin to search a licensed mega-index which has an API (e.g. Summon) to try to find a match, and/or try to find a more reliable licensed fulltext link. For my particular setup and environment, it hasn’t seemed useful enough to move from idea to implementation. But it could be easily added as a plugin to Umlaut. Umlaut’s got a plugin-based architecture, where each plugin can enhance citation metadata, or provide various service responses that will be added to the page representing fulltext links or other services. And new plugins can be easily added, and you select and configure plugins in your own installation.

    Writing and maintaining reliable scraping code for so many sites would seem like an insurmountable task — if Zotero hadn’t already surmounted it, in open source code too, which is what intrigues me. I haven’t tried the Zotero Bookmarklet, which is documented as more limited, but the browser plug-in successfully scraped everything in my initial exploratory sample of 8 or 10 sites I thought our patrons would likely want to scrape from.

    When you say “only if that fails then try the Openurl approach” — the trick here is that the “quick and dirty search matching title” basically just returns a list of 0 or more hits. If it’s 0, you know it failed. But if it’s one or more hits, it’s hard for software to know if one or more of them are accurate/correct hits or not without human intervention confirming it. And if you need to show the user the list of hits and have them decide if any are what they were looking for, before deciding to go on to “the OpenURL approach”, that’s an extra user interaction.

    Umlaut is already pretty good at checking multiple sources of information and assembling/aggregating them into a decent UI. There’s still room for improvement, but if we think of Umlaut as the place to do that improvement, and as something that can take input from multiple sources (traditional OpenURL sources, our catalog or other local services, this hypothetical scraper), then the task of each source, such as this hypothetical scraper, simplifies to just getting the citation metadata to Umlaut. Then Umlaut will take care of the rest, the same way, with the same code we only had to write once, whether the citation comes from a traditional OpenURL source, or a scraper, or our catalog, etc. Does that make some sense?

    I’m curious to know more about what you mean by “the usual fail-ability of OpenURL”?

  11. This isn’t exactly what you are looking for, but we have a tool that will take a copied-and-pasted citation and attempt to parse it, and then of course that can be fed into a search … http://search.grainger.uiuc.edu/linker/default.asp?paste

  12. Pingback: Latest Library Links 15th May 2015 | Latest Library Links

  13. jrochkind says:

    Thanks Lisa! Yes, that is a third approach to “Bridging the Gap” I’ve also considered and experimented with: citation parsing. How well do you find your tool works, and how well does it work for your patrons? How is it implemented? I’m having trouble getting it to do anything; not sure if I’m doing something wrong.

  14. adam3smith says:

    I’m in a bit of a rush right now, so just the essentials — Aaron pinged me on Twitter. I don’t think all of the Zotero bookmarklet code is on GitHub, but you should consider it all open source (AGPL3). The relevant files are the code of the bookmarklet itself (which you can just look at) and the three files it calls from the Zotero server:
    https://www.zotero.org/bookmarklet/loader.js
    https://www.zotero.org/bookmarklet/inject.js
    and especially for IE
    https://www.zotero.org/bookmarklet/inject_ie.js
    (and of course the Zotero translator files which are on github)
    You can probably get the open source status of this confirmed on zotero-dev.
    The bookmarklet actually works very well for about 90% of everything Zotero can import via browser add-on; the only thing it can’t do is cross-domain requests.

  15. jrochkind says:

    Thanks Adam, very helpful! The accessible bookmarklet code itself, along with the JS at those URLs, is ‘compressed’ JS with no whitespace, maybe different variable names, etc. It would be nice to have an uncompressed, unobfuscated source file. But yeah. It also occurred to me, as I try to find my way around the Zotero architecture, that another possibility would be running my own Zotero translation server to do all the scraping server-side, and writing my own simple bookmarklet that just uses that translation server. There are pros and cons to both approaches; I’d probably start with whichever one is going to be _easier_.
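
    To sketch what I mean by that translation-server variant (very much an assumption-laden sketch; the endpoint, port, payload, and JSON field names below are guesses for illustration, not Zotero’s documented API):

    ```typescript
    // Hypothetical: ask a locally hosted Zotero translation server to scrape a
    // page URL, then build an OpenURL from whatever it returns. Endpoint and
    // response shape are assumptions, and the resolver URL is a placeholder.
    async function bridgeViaTranslationServer(pageUrl: string): Promise<string> {
      const resp = await fetch("http://localhost:1969/web", {
        method: "POST",
        headers: { "Content-Type": "text/plain" },
        body: pageUrl,
      });
      const items: Array<{ title?: string; DOI?: string }> = await resp.json();
      const item = items[0] ?? {};

      const params = new URLSearchParams({ url_ver: "Z39.88-2004" });
      if (item.title) params.append("rft.atitle", item.title);
      if (item.DOI) params.append("rft_id", `info:doi/${item.DOI}`);
      return "https://findit.library.example.edu/resolve?" + params.toString();
    }
    ```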

    Adam, are you a Zotero committer and/or very familiar with Zotero architecture and code? Would you have any time or interest in having a conversation with me about this sometime, to help me make sure I understand what’s up and to make some suggestions? I’m not in a hurry. Or could you introduce me to someone who might?

  16. adam3smith says:

    I’m somewhat familiar with the Zotero architecture and have worked a lot on the translators, though the bookmarklet is the part that I know least about.
    My first suggestion would be to post to zotero-dev https://groups.google.com/forum/#!forum/zotero-dev
    also to ask whether it’d be possible to put the bookmarklet code on github. That’s also where the people who wrote the actual code and several others who’ve actively worked on it are.
    If you don’t get an answer there, we can talk and I can do the best I can to help.

  17. ace0cc says:

    @ Jonathan

    “Are you at all considering releasing your source code as open source? There are several reasons that would make it easier for some of us to use, and to collaborate on.”

    Yes, I would like to do this eventually. I am a pretty terrible programmer and need to rewrite large portions of it :). Hoping to make progress on this over the summer.

  18. jrochkind says:

    @ace0cc: I’d encourage you to release it as open source immediately. Put it up on GitHub with an open source license, and say clearly in the README that this is early code, that you plan a major rewrite, that you cannot support this code, but that you are putting it out as open source in case it is helpful to someone.

    I am reluctant to use code, or officially promote code to my patrons, if it is not open source, and has no clear business model; it could disappear or become unsupported in the future, or you could decide to start charging for it, which would be painful if my institution had come to rely on it.

    If you release it open source, not only will it allay that concern, but maybe some people will start contributing suggestions in the form of code (pull requests) to help you improve it too, so you don’t need to do it all by yourself. Or, maybe that won’t happen, but what do you have to lose?

    On the other hand, I know of many many projects, some of which are my own, where a major rewrite was “planned” but never happened (and may or may not have been a good idea anyway).

  19. Phil H. says:

    Some bad news – just tried the Zotero bookmarklet from Safari on my iPhone – did not work. Tried an ACS journal article, WSJ, and a third example. If this was user error (a possibility), the effort needed to create the bookmarklet is enough of a barrier to render it an unviable option.

    IMHO, I think this is initial evidence that purposeful inoperability will kill whatever you create.

    Educating on the Google Scholar button for articles, and your discovery layer for books, ebooks, AND articles, et al, might be your/your colleagues’ best bet.

    Overall, I don’t agree that seamless access to full-text is a high-need when most students aren’t accessing scholarly resources to learn, but rather to meet the minimum requirements of assignments that rarely relate to appropriate outcomes for a scholarly environment, if we can still accurately describe modern academia as such.

    That said, what about a bookmarklet that sends scraped/parsed data to a generic email address, which then processes the data and sends an automated reply (of the Umlaut screen) to the student’s email? Links could be proxied, an option to import to a citation service could be available, etc.

  20. Phil H. says:

    Great post.

    Tried the Z bookmarklet in iOS via Safari – did not work. I’d be very cognizant of the fact that purposeful inoperability could kill anything you create.

    IMHO, I think you’re overvaluing seamless access to full-text – there are numerous gaps to bridge before that one. One humorous example: maybe JHU could hire you a colleague or two if they merely canceled Scopus or WOS? ;)

    You’re also putting the cart before the horse. Work on making your Blacklight interface easier to digest before trying to get more patrons sent there via a magical cure-all. E.g. Do you really need 5 links to email forms?

  21. jrochkind says:

    Thanks for the comments. I hadn’t tried zotero bookmarklet yet from iOS; I had been thinking that if we do successfully create something based on that technology, we will still need to provide some in-person service for help installing a bookmarklet on a mobile device; iOS and Android both make it confusing.

    I’m not sure what you mean about 5 links to email forms in Blacklight; and in general in this discussion I’m not talking about Blacklight (our library catalog) at all, but in fact workflows that probably won’t involve the catalog at all. Oh, do you mean our Find It/Umlaut page, not Blacklight? Sure, there is room for improvement there, I could get into the business/organizational challenges to having one ‘contact a librarian’ link when we support patrons from 5 entirely organizationally independent libraries making their own decisions about, say, systems for connecting patrons to a librarian.

    But I think I disagree with you about which is the cart and which is the horse. A great interface does no good to someone who never arrives there; and while there is room for improvement in our Find It/Umlaut interface (like there is room for improvement in every single UI we, or anyone, has anywhere; none are perfect), the existing interface consistently gets rated highly in feedback from our patrons. I think we are probably at the point where getting to the Find It interface in the first place — for many types of research finding activities — is a bigger barrier than being confused by it once you do get there. But we did have to improve our “link resolver” interface greatly, from the out of the box SFX UX, to get to that point.

    There are always more things we could be doing than we have the time and resources to do, this is true of everyone. It’s always a question of identifying the current biggest barrier/bottleneck/need, and indeed it is not always obvious and we can sometimes be wrong, and researching our researchers is important to try to stay on top of what that is.

  22. Pingback: III report: “WE LOVE THE LIBRARY, BUT WE LIVE ON THE WEB.” | Bibliographic Wilderness

  23. Pingback: Linked Data Caution | Bibliographic Wilderness
