The dangers of the ‘free’ cloud: The Case of CrossRef

SFX, like most link resolvers, depends on CrossRef for DOI-resolving functionality. This is pretty core functionality for the link resolver, required to take an incoming link that has only a DOI (as many do these days, most notably from Google Scholar), and actually DO anything useful with it.

Over the past year or so, every two or three months, CrossRef goes down. For 6 hours, or 24 hours, or 48 hours. During this period, depending on how the link resolver software is written/configured, you either get a timeout from the link resolver itself (as it waits forever for CrossRef), or you get a response from the link resolver which in many cases is pretty useless as it lacks DOI resolution.
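
To make the two failure modes concrete, here is a minimal sketch in Python of a lookup that degrades gracefully instead of hanging. The endpoint URL and query parameters are illustrative assumptions for this post, not a documented CrossRef API:

    import requests  # third-party HTTP client

    # Illustrative endpoint and parameters; not a documented CrossRef API.
    CROSSREF_LOOKUP = "http://www.crossref.org/openurl"

    def enrich_with_crossref(doi, timeout_seconds=5):
        """Fetch metadata for a DOI, degrading gracefully if CrossRef is down."""
        try:
            # The bounded timeout is the crucial part: without it, the link
            # resolver blocks for as long as CrossRef takes to answer.
            response = requests.get(
                CROSSREF_LOOKUP,
                params={"id": "doi:" + doi, "noredirect": "true"},
                timeout=timeout_seconds,
            )
            response.raise_for_status()
            return response.text  # the metadata payload (e.g. unixref XML)
        except requests.RequestException:
            # CrossRef is down or slow: return nothing, and let the resolver
            # proceed with whatever metadata it already has.
            return None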

This is a danger of relying on third-party network services for core functionality in general. It’s an especially acute danger when you’re relying on a free third-party service, for which you have no contract and no service-level agreement.

I’m actually not sure if we libraries pay anything for CrossRef resolution services. If we do, it’s a nominal amount. But if it’s even a nominal amount, that might encourage me to complain to them that we need better service, including failover provisions on their end. If it’s entirely free, I wouldn’t even bother; they don’t really owe us anything. But either way, it’s a problem for our software.

To make matters more inconvenient, actually getting from “the service isn’t working right” to a diagnosis of “it’s because CrossRef is down” is increasingly time-consuming in our ever more complex infrastructure stacks. Good logs and hidden-but-on-page debugging info are crucial for saving the programmer’s time.
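
Even something as simple as logging every upstream call with its outcome and elapsed time (a generic sketch, not any particular product’s logging) turns “the service isn’t working right” into “CrossRef timed out 40 times in the last hour”:

    import logging
    import time

    logger = logging.getLogger("link_resolver.upstream")

    def timed_lookup(doi, lookup_fn):
        """Wrap an upstream call so outages show up plainly in the logs."""
        start = time.monotonic()
        result = lookup_fn(doi)  # e.g. the enrich_with_crossref sketch above
        elapsed = time.monotonic() - start
        if result is None:
            logger.warning("CrossRef lookup FAILED for %s after %.1fs", doi, elapsed)
        else:
            logger.info("CrossRef lookup ok for %s in %.1fs", doi, elapsed)
        return result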

(Incidentally, in the case of JSTOR, which sends significant numbers of ‘bad’ DOIs to Google Scholar, you get the same kind of brokenness as a matter of course, even when CrossRef is performing fine. I had my first actual report from a patron of a malfunctioning service that turned out to be caused by JSTOR’s bad metadata sent to Google. I assume that for every report from a user, there are 10x or 100x more users encountering the same problem who don’t bother to report it.)

17 thoughts on “The dangers of the ‘free’ cloud: The Case of CrossRef”

  1. FYI, you do pay CrossRef if you go over your quota for DOI resolution (it’s in the license agreement you sign when you become a CrossRef member). Most institutions don’t go over that quota, though.

  2. I seem to recall that they recently raised the quota to be AWFULLY high, and/or said that non-profit libraries wouldn’t really have to pay even if they went over. I can’t find the source for that, but it’s in my memory somewhere.

  3. Almost no libraries pay for lookup transactions with CrossRef. The very few that do are large commercial libraries associated with industry, e.g. Bayer, which performs tens of thousands of queries regularly. The fee that was mentioned on our web site for high-volume activity by other non-profit library accounts was, to my knowledge, never imposed on anyone, and has in fact been dropped.

    Even though CrossRef receives little revenue from libraries, I believe it is not acceptable for these outages to occur, and I would like everyone to know that we are working towards a better solution.

    At the start, CrossRef’s service was never meant to be a real-time service. Members queried CrossRef, retrieved DOIs, and then built static links into their content. In this model it did not matter if CrossRef was down for short periods, because the links were cached within the content. This drove our belief that ~98% availability was adequate (which is what you get if you are down for 48 hours every 3 months).

    Over time users have ‘migrated’ to new models which depend on more robust reliability. As anyone in IT knows, going from 98% to 99.5% availability is expensive, and going above 99.5% is VERY expensive.

    CrossRef’s system is currently 6+ years old and is showing signs of age. In response, the board has set in motion a plan which will address several issues, one of which is availability. This plan is expected to cost just under $1 million over 5 years. The plan has us deploying new systems related to query transactions in Q1 of 2010.

    In the meantime we will be making changes to correct problems and keep the system operating as well as possible. Problems such as occurred this past weekend (4/17-4/19) will hopefully become less frequent, and will likely require interim changes while we wait for the new system.

    Please note that CrossRef’s availability is not the same as DOI availability. Resolutions via dx.doi.org are HIGHLY available, based on a distributed system of which CrossRef maintains one redundant part. Having said that, however, I realize that for local link servers CrossRef’s availability is just as critical. Growth in use of our OpenURL resolver has recently (over 3-4 months) gone from less than 100k queries per month to over 1 million per month. This was somewhat unexpected.

    Respectfully,
    Chuck Koscher
    Director of Technology
    CrossRef

  4. Thanks for your response, Chuck; that’s quite encouraging. Nothing’s perfect, but if people know it needs improvement and are actively working on it, that puts me at ease.

    So, when you compare “CrossRef availability” to “DOI availability”, it’s the difference between:

    1) Looking up metadata associated with a DOI (“CrossRef availability”),

    and

    2) Simply looking up/redirecting to the publisher-provided URL associated with a DOI (“DOI availability”).

    Do I have that right? I’m not an expert in the whole system; mostly it Just Works and I ignore it.

    But that does bring up another issue — the apparently known issue that not every legally registered DOI is in fact resolvable via CrossRef? I guess that would mean that it would resolve at dx.doi.org, but its metadata would not be retrievable via CrossRef.

    I’ve seen hints of this before, but never entirely understood what was going on. Care to shed any light on that issue, Chuck?

  5. I’m also curious if anyone has any explanation for Chuck’s observation:

    “Growth in use of our OpenURL resolver has recently (over 3-4 months) gone from less than 100k per month to over 1 million per month. This was somewhat unexpected.”

    I’m not sure we’re talking about the same things. Is your “OpenURL resolver” what you call the component that my software is querying to get metadata back for a DOI?

    But no matter what component Chuck is talking about, I’m curious what may have made its usage increase so much. For us to guess, we’d have to be sure what we’re talking about, though.

  6. An observation – as sort of an aside – we do, in the end, pay for CrossRef, because we pay publishers for journals and we expect them to assign DOIs to their records. In all of the various conversations about why journals cost what they do, this is another thing our subscriptions pay for. (And I’m very glad they do!)

  7. 4) Yes, CrossRef lookups are where you ask for the metadata for a given DOI, or ask what the DOI is for some set of metadata. DOI resolution is the act of going to dx.doi.org/10.1234/abcd, which leads one to the publisher’s web site. These are separate systems.
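
    In rough code terms, the distinction looks like this (a Python sketch; the dx.doi.org redirect behaviour is standard HTTP, while the CrossRef query parameters shown are purely illustrative, not a documented contract):

        import requests

        doi = "10.1234/abcd"  # illustrative DOI

        # (1) DOI *resolution*: dx.doi.org answers with an HTTP redirect to
        # whatever URL the publisher registered for this DOI.
        resolution = requests.get("http://dx.doi.org/" + doi, allow_redirects=False)
        publisher_url = resolution.headers.get("Location")  # publisher's site

        # (2) CrossRef *lookup*: a separate system that returns the registered
        # metadata (title, journal, volume...) for the DOI.
        lookup = requests.get(
            "http://www.crossref.org/openurl",
            params={"id": "doi:" + doi, "noredirect": "true"},
        )
        metadata_xml = lookup.text  # e.g. a unixref record

        # If dx.doi.org is up but CrossRef is down, (1) succeeds while (2) fails.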

    DOIs can be registered without going through CrossRef (not advisable) and some publishers do it for weird reasons. In these cases all I need is a few example DOIs and CrossRef will chase down our member and get things fixed. I’ve already emailed JSTOR asking what might be going on.

    5) I was scanning the logs today and found a user who has automated harvesting and is hitting our service many times a second, continuously. I’ve asked them to stop and contact me to find another approach. It’s these ‘inventive’ activities that cause the trouble, and they interfere with the more normal activity that comes from local link servers and such.

    Our OpenURL resolver is the component that was put in place when the appropriate copy problem was addressed way back. For most of the past several years, local link servers were almost the exclusive users of this service. We had an episode a few months back where we tried to limit OpenURL requests and to require login accounts. This effort had some unintended consequences and we quickly rescinded the action. However, this raised awareness of our no-account-required OpenURL service, which has encouraged folks to experiment, and I’m hesitant to suppress such activity (I need to find a way to service it).

    6) I fully agree with your observation. End users (readers) are CrossRef’s real customers.

  8. I just realized that I may not have answered one question fully.

    Not all DOIs are registered through CrossRef. So a DOI may be real, may be registered with the DOI system and thus resolve via dx.doi.org, but CrossRef had nothing to do with its creation.

    In this case CrossRef cannot supply the metadata for such DOIs.

    This can happen because CrossRef is only one of several Registration Agencies (RAs) allowed to create DOIs. The other RAs may not have implemented the appropriate copy solution, and thus local link servers will have trouble with such DOIs. At the moment CrossRef DOIs represent probably 99% of all DOIs. That will change, but CrossRef should continue to represent DOIs for scholarly content.

  9. Thanks, Chuck.

    Is there any way to tell which RA created a given DOI? Do the other RAs offer services where you can get metadata for a given DOI?

    It occurs to me that the way SFX and other link resolvers _are_ using DOIs through the CrossRef metadata service — which, as you’ve mentioned, was not the original intention of either DOIs or the CrossRef service — is kind of unsustainable, and is going to start getting us into trouble, especially as other RAs become more prominent.

    I’m still not sure we’re speaking exactly the same language. Here’s what SFX (and similar products, which are confusingly called “OpenURL link resolvers”) does in order to handle the ‘appropriate copy’ problem. When it gets a DOI, it uses CrossRef to get full metadata for that DOI — title, author, journal ISSN, volume, issue, page number. Then it uses its OWN database to decide the ‘appropriate copy’ for the particular institution’s users. It generally tries to avoid the actual dx.doi.org resolution, since that resolution may send users to a “non-appropriate copy”. Now, if it gets a DOI that uses a non-CrossRef RA, obviously this isn’t going to work.
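
    In rough pseudocode, that flow looks something like this (a sketch; the helper names are hypothetical stand-ins, not actual SFX internals):

        def resolve(citation, lookup_crossref, knowledge_base):
            """citation: dict of OpenURL fields. Helpers are hypothetical."""
            doi = citation.get("doi")
            if doi and "volume" not in citation:
                # Sparse citation (the Google Scholar case): flesh it out first.
                extra = lookup_crossref(doi)  # returns a metadata dict, or None
                if extra:
                    citation = {**extra, **citation}  # incoming fields win
            # The resolver's OWN knowledge base, not dx.doi.org, decides which
            # copies are "appropriate" (i.e. licensed) for this institution.
            targets = knowledge_base.licensed_targets(citation)
            if targets:
                return targets              # known-licensed full-text links
            return {"ill_form": citation}   # fall back to an ILL request form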

    But it seems that CrossRef has a different solution to ‘appropriate copy’, using your own ‘openurl resolver’. I don’t think I’m actually familiar with that service, and I don’t think that SFX uses it (although I could be wrong there). Could you point me to documentation or information about the CrossRef ‘openurl resolver’, so I can learn more?

  10. A couple of points… if I’m right, CrossRef offers 2 distinct services to libraries, which I think need mentioning:

    1) DOI look-ups (OpenURL-resolver-like): given an OpenURL, return a DOI.

    2) Metadata look-ups: given a DOI, return the metadata.

    (Chuck can correct me on this).

    A question which I have for Chuck, which I think others would be interested in:

    Are we able to retain a copy of the unixref response as a file, as a fall-back option?

    No network is perfect, and CrossRef’s services might be unavailable in our part of the world for any number of reasons, so if we could, then that would be nice!

  11. 9) I’ll have to ask Larry Lannom at CNRI (they run DOI) about how to tell which RA is responsible for a DOI. Offhand, I’m not coming up with a technique. Unfortunately not all RAs offer a metadata service like CrossRef’s; in fact I don’t know of any others that do. You are right that the appropriate copy solution put in place by CrossRef/DOI has not been made a core part of DOI operations by all RAs. However, given the growth rate of other RAs, I think we have time to work this problem before non-CrossRef DOIs start showing up.

    CrossRef is only part of ‘THE’ appropriate copy solution (we don’t have our own stand-alone solution); you’re right that it’s the link resolver that presents choices to the reader, who then chooses the copy appropriate for them. CrossRef’s OpenURL resolver is simply the interface used by link resolvers to get metadata for the DOI.

    10) Right, our main function is to supply metadata if you know a DOI, or to return a DOI if you know the metadata. As I said earlier, this started as a batch transaction service, but over time it came to be used in ‘real time’.

    YES, you can cache the data you get from us. In fact, local link servers could build this into their databases and then not have to connect to CrossRef when a reader clicks on a link. But, as I understand it, local link servers do not store article-level data in their databases.
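
    A minimal sketch of such a cache, assuming a filesystem store keyed by DOI (the paths and helper names are hypothetical):

        import hashlib
        import os

        CACHE_DIR = "/var/cache/crossref"  # assumed location

        def cache_path(doi):
            # Hash the DOI so it is always safe to use as a filename.
            return os.path.join(CACHE_DIR, hashlib.sha1(doi.encode()).hexdigest() + ".xml")

        def cached_lookup(doi, live_lookup):
            """Prefer the live CrossRef response; fall back to the cached copy."""
            response = live_lookup(doi)
            if response is not None:
                os.makedirs(CACHE_DIR, exist_ok=True)
                with open(cache_path(doi), "w") as f:
                    f.write(response)  # refresh the cached unixref record
                return response
            try:
                with open(cache_path(doi)) as f:
                    return f.read()  # CrossRef is down: serve the stale copy
            except FileNotFoundError:
                return None  # never-seen DOI, and CrossRef is down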

  12. A couple of folks have sent me a link to this thread over the last few days (ripples from some stone dropped in the water somewhere?). Chuck has done a good job of explaining the current state of things – I wanted to add one comment and give a short update.

    As one of the people involved in setting this up ‘way back’, I wanted to point out that the local link resolvers were not supposed to be dead ends for the resolution process. If the end user has the cookie set, the dx.doi.org proxy server will take that HTTP GET and turn it into a redirect to the local server. But if the local server can’t do anything useful with it, it is supposed to send the request back to the system of dx.doi.org proxies with a code appended, meaning ‘no local action for this’, and the resolution continues. This would seem to be a reasonable way to deal with outages.

    Re. update – the current approach, now 10 years old, was always supposed to be temporary. So every few years we look and see if it’s time to try to update it, esp. vis-a-vis non-Crossref DOIs. This would seem to be one of those times, and we will begin another round of talks, beginning with what we can reasonably do inside the DOI system but also, I hope, involving some of the other players.

    Larry

  13. Thanks, Larry, that makes sense. From your comments, though, I’m not sure you actually understand the _primary_ use case in which our link resolvers are getting DOIs.

    You mention “If the end user has the cookie set, the dx.doi.org proxy server will take that HTTP GET and turn it into a redirect to the local [OpenURL link resolver] server.” True enough. But that hardly ever happens, because 1) it ends up being hard to ensure that a user WILL have the right cookie set, and 2) there’s a whole parallel infrastructure of OpenURL stuff that has vendor content pages sending users directly to our link resolver.

    So here’s the use case I deal with more often: a content page somewhere (an Elsevier page, Google Scholar, etc.) sends an OpenURL directly to my link resolver. An OpenURL (in the context we’re talking about) is basically a structured scholarly citation.

    That OpenURL might have a DOI in it. When it’s coming from Google Scholar, it will almost always have a DOI in it, and will often have very little other identifying information: usually one (and only one) author and an article title, occasionally an ISSN, almost never (from Google Scholar at least) volume/issue/page number. So we often don’t have enough information to know if we have a licensed copy, or really to fill out an ILL (inter-library loan) request.
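
    For illustration, a Google-Scholar-style OpenURL might look something like this (a made-up example; real requests vary):

        http://resolver.library.example.edu/resolve?sid=google
            &genre=article
            &atitle=Some+Article+Title
            &aulast=Smith
            &id=doi:10.1234/abcd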

    So, the local link resolver makes a metadata request to CrossRef, solely for the purpose of fleshing out the metadata so we know what the article is. Then we can send the user to locally licensed fulltext, or figure out if we have a print copy on the shelves, or fill out an ILL form for the user, etc.

    Why do we do any of this? Why not just redirect via dx.doi.org and let the DOI system bring the user right to the electronic copy, as it was intended? Because the copy that DOI redirects to might not be licensed by our library — but we might have a copy from another vendor that IS licensed. Or the dx.doi.org redirect might end up showing the user a _list_ of places to get this article, only some (or none) of which we license, but our users don’t know which is which. We don’t want to make them click aimlessly. We don’t want to deliver them to a vendor/publisher that tells them they don’t have access, and offers to sell it to them for $30, when we may very well already pay for access to that very article from another vendor. And if we really DO have no access, we want to offer them the option of making an ILL request, etc. The DOI system can’t do any of that on its own, because it depends on what we license from whom, which DOI doesn’t know; only we know.

    Hope that makes some sense. I’d be happy to discuss it further as you consider improvements to the Crossref system — it hasn’t been clear to me that the use cases that are important to academic libraries with link resolvers, like my own, are use cases that CrossRef completely understands and is designing for.

    So your last suggestion of how to deal with an outage, just redirecting to dx.doi.org — maybe. But again, I’m reluctant to do that when it might end up being a dead end because of bad data (see JSTOR’s non-DOIs-posing-as-DOIs let loose into the wild), or might deliver users to a non-licensed copy. In some cases, I might have enough metadata already to let them make an ILL request even if metadata lookup isn’t working, and in those cases I’d rather do that before delivering them to something that may or may not end up being a dead end for them. Make some sense?

  14. Thanks for the clear and detailed explanation. A couple of reactions and comments:

    1. In reaction to my description of what I think of as the canonical appropriate copy use case, your response was that that ‘hardly ever happens.’ We go through a fair amount of effort to enable that and you make me wonder if that effort is misplaced. We’ll go back and look at our logs, or change what we collect in our logs if needed, but I’d appreciate it if you could provide a little more detail. How confident are you of that conclusion and do you think it varies by institution?

    2. The title of this thread concerns the dangers of free cloud services, and that’s the reason I was alerted to it, but so far what I see is more a lack of clarity than a business-model problem. Everybody wants it to work, as far as I know. It’s true that it may need more resources devoted to it than are currently available, but that is a common situation. I think what is needed at this point, and the conversation above seems to reflect that, is more discussion and understanding of use cases and requirements. This is always the most important, and most difficult, piece.

    3. There are multiple players on both sides of that metadata lookup process. On the DOI side we have Crossref, which has the bulk of the DOIs in the shared IDF resolution system and the corresponding metadata in their own system. Then there are all the other IDF registration agencies, which you probably don’t have much contact with at the moment but whose role may loom larger in the future (see the recent TIB announcement about data sets). CNRI, where I sit, is responsible for the handle resolution system that underlies the resolution piece of DOIs and we do that under the aegis of the IDF, which is the collection of all of the RAs and other members. So when you mention ‘changes to the Crossref system’ you’re talking about a multiplicity of possible changes. As Chuck notes above, Crossref is in the process of upgrading their internal processes, which will likely include a more robust metadata lookup service. But that is external to the DOI resolution system, which crosses all of the RAs. There the issue is how to return metadata, or at least a coherent answer about lack of metadata, from all of the other RAs, and not just from Crossref.

    On the library side, at least when we did the initial project, we had the libraries and the local link server makers. What role do they play here? Do they understand the use cases? Do you think they understand the DOI side of the equation?

    4. I understand the need for libraries to focus on their users and not send them off on pointless quests, but I’ll stick to my suggestion anyway. I have two concerns here. First is the behavior of the local link servers. What is their reaction to no response from Crossref? Metadata coming back from Crossref should be the normal case, but if it doesn’t come back or if the local server can’t make any sense of it or otherwise has no local information for the end user, it still does have one important piece of data – the DOI, which should resolve to something useful. I assume you could wrap that up with all sorts of caveats or suggestions that it be sent off to the ILL dept. for interpretation, etc., but it seems a waste to just stop. Secondly, in the ‘hardly ever happens’ case, this was the rationale for the collaboration – ship the link back to the library to see what local services it can provide and depend on it to send the user back to the DOI system if the answer is ‘none’ or provide it as one of the alternatives.

    Thanks again.

    Larry

  15. This all starts to get kind of complicated and uncertain, doesn’t it? You want to have a phone conversation, Larry? I’d be happy to chat about it; might be easier than exchanging epic comments on a blog.

    But trying to be succinct:

    1) I am pretty sure that the use case I identified happens _more_ than the use case you had been counting on (cookies set in a user’s browser, user follows a link to dx.doi.org). Your use case may happen more than I think, but I AM positive mine happens more than you had been thinking.

    (And this surely explains the higher-than-you-expected usage of what I think you guys call your “OpenURL interface”, but which I would call simply the “metadata resolution interface”. It’s convenient for link resolvers that it takes OpenURL as input, since that’s the ‘native language’ of link resolvers, but that’s just a convenience, not a necessity — link resolvers could and would make use of a metadata resolution service even if they had to ‘translate’ the citation into another input format first.)

    2) Absolutely. Phone call?

    3) The _typical_ library doesn’t understand anything; they just expect the software they buy from link resolver vendors to work. Some libraries (like, un-humbly, my own) have the resources and expertise to try to get into it a bit more. I think the typical link resolver vendor has a _reasonably_ good idea of what DOIs are — but tends to be focused on the use case I identified, not the one you identified. There may be other use cases you think are important that they are disregarding.

    4) The typical link resolver, when it does not succeed at enhancing metadata from CrossRef, will just proceed with the metadata it DOES already have. Not every request that comes into the typical link resolver has a DOI in it at all, and having minimal metadata is sadly not that uncommon; sometimes there’s insufficient metadata to really identify the citation. So the link resolver proceeds as best it can: with insufficient metadata it won’t be able to identify a full-text copy, but will probably offer the user a link to an ILL request pre-filled with the available metadata — which may or may not be enough for the ILL dept to do anything with, depending on what’s there and how much work the ILL dept is willing to do.

    Now, when there _is_ a DOI, the link resolvers I’m familiar with will indeed _also_ (if so configured) provide the user with a link to dx.doi.org resolution. I am not certain how many installations are configured to do so, or whether it typically comes configured that way by default. In my case, I _do_ provide the user with this link, but in an intentionally less prominent location on the screen, labelled “publisher’s link”, and with a disclaimer: “publisher’s link may not take you to a copy licensed for Johns Hopkins University users.” I expect this dissuades most users from clicking it, which, frankly, is the intent. Perhaps ideally my link resolver could make this link more prominent if and only if it has NOT come up with any more reliable full-text links (reliable meaning licensed for viewing by my users). But this would take some extra logic that I would need to find time to implement; and most standard vendor-supplied link resolvers are probably not capable of that degree of selectivity — so I’m not sure how many local admins just turn off the DOI link entirely, because of how often it sends users to non-licensed copies.
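
    (If I were to sketch that extra logic, it would be something like the following; hypothetical, and not something SFX actually implements:)

        def doi_link_prominence(licensed_fulltext_links):
            """Decide how to present the dx.doi.org publisher's-link option."""
            if licensed_fulltext_links:
                # We already have links we KNOW are licensed; bury the DOI link
                # behind a disclaimer so users aren't sent to a paywall.
                return "secondary-with-disclaimer"
            # No known-licensed copy: the publisher's link is the best remaining
            # bet, so surface it prominently alongside the ILL option.
            return "prominent"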

    And it’s worth mentioning the _pathological_ case — SFX, until recently, when CrossRef was not responding, would simply wait for it for many minutes before timing out. So the user would just get a blank browser screen until eventually the browser probably gave up with a timeout message. But I consider this a clear bug in SFX, so it’s probably not worth further discussion.

  16. Okay, time I waded in here too! I can only talk about my DIY code behind textseeka (which I’m still hoping will be made open-source by my former employer, but anyway):

    In response to Jonathan’s last post:

    1) Use-case ++
    When I was building textseeka, I didn’t bother with the whole cookie-pusher concept (there are cookie-cutters, etc.), and I didn’t like the concept AT ALL as an end-user.

    4) If textseeka doesn’t get CrossRef metadata, then there is a tailspin if there is no other data in the request, but that’s to be expected.
    Given that Chuck has said we can cache CrossRef’s DOI metadata, this is what textseeka does, using a filesystem for ease as much as anything else. If there’s no ‘live’ response, then the cached copy is used, if any.
    textseeka doesn’t use the resolved resource to link to; it uses the DOI, as this is more reliable, esp. if someone wants to bookmark it (this assumes that textseeka has found the publisher’s DOI to be valid for the request).

    I’d also like to point out that textseeka does use some rules for generating valid URLs for resources which use a standard REST-like URL, such as BMC titles (using knowledgebase metadata, OpenURL metadata, and a rules file to mash them up and generate a valid link). I’m not sure how common this is among other link resolvers, but I’d hope textseeka is not alone in this; it’s useful when we don’t have access through the publisher, and also as a fall-back for some normally CrossRef-DOI titles.
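
    As a sketch of the idea (the provider key, template syntax, and URL shape here are all made up, not textseeka’s actual rules format):

        # One rule per provider whose article URLs follow a predictable pattern.
        RULES = {
            "exampleprovider": "http://journals.example.com/{issn}/{volume}/{issue}/{spage}",
        }

        def build_direct_url(provider, metadata):
            """Generate a direct article URL from a rule, or None if fields are missing."""
            template = RULES.get(provider)
            if template is None:
                return None
            try:
                # metadata: merged knowledge-base + OpenURL fields, as a dict
                return template.format(**metadata)
            except KeyError:
                return None  # a required field is absent; fall back to other linking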

  17. Tom, your description of using rules for generating URLs into content providers matches what I thought _all_ commercial link resolvers do; even for content providers which _don’t_ provide “REST-like URLs” and make it exceedingly difficult to do, they try to do it anyway.

    I’m curious what textseeka does when it’s NOT doing that, especially if, as you say, you aren’t counting on the dx.doi.org resolution. Or are you using the dx.doi.org resolution, but trying to keep track of which destination URLs your users actually have access to before redirecting them there?
