Free Covers? From Google?

Tim Spalding writes about Google Book Search API, cover availability, and terms of service:

In NGC4Lib:

Basically, I’ve been told [I can’t help but wonder: Told by whom? -JR] that I was wrong to promote the use of GBS for covers in an OPAC, that the covers were licensed from a cover supplier (ten guesses!) and should not have allowed this, and that the new GBS API terms of service put the kibosh on the use I promoted.

And on his blog:

The back story is an interesting one. Soon after I wrote and spoke about the covers opportunity, a major cover supplier contacted me. They were miffed at me, and at Google. Apparently a large percentage of the Google covers were, in fact, licensed to Google by them. They never intended this to be a “back door” to their covers, undermining their core business. It was up to Google to protect their content appropriately, something they did not do. For starters, the GBS API appears to have gone live without any Terms of Service beyond the site-wide ones. The new Terms of Service is, I gather, the fruit of this situation.

Now, I am not surprised. As soon as I heard the Google staff person on the Talis interview implying that Google had no problem with use of cover images in a library catalog application, I knew that something would come through the pipeline to put the kibosh on that. Not least because I too had had backchannel communications with a certain Large Library Vendor, about Amazon, where they revealed (accidentally I think), that they had had words with Amazon about Amazon’s lack of terms of service for their cover image. Even then, I wondered exactly what the legal situation was, in the opinion of this Large Vendor, of Amazon, or of any other interested parties.

More questions than answers

But here’s the thing. When I read GBS’s new Terms of Service looking for something to prevent library catalog cover image use… I don’t see anything. And if there WAS going to be something, what the heck would it even look like anyway?

Amazon tried to put the kibosh on this by having their terms of service say “You can only use our whole API if the primary purpose of your website is to sell books from Amazon.” Making not just cover usage, but really any use of the Amazon API by libraries pretty much a violation. If Google goes _that_ route, it’d be disastrous for us.

But I doubt they would–not even just trying to limit cover image usage by those terms–because, first of all, they intended from the start for this service to be used by libraries, and had development partners from the start that included libraries and library vendors. Secondly, what would the equivalent be for Google? You can only use this service if your primary business is sending users to Google full text? Ha! That’s nobody’s primary business!

What terms could possibly restrict us anyway?

Without restricting what Google was trying to do in the first place.

And besides, the whole point of the GBS API having cover images was to let people put a cover image next to a link to GBS. The utility of this is obvious.

But isn’t that exactly what we library catalog users are doing, no more and no less? So what could their terms say?

“you are allowed to use these covers next to a link to GBS, but ONLY if you are not a library or someone else who is the target market for Large Library Market Vendors. You can only use it if Large Library Market Vendor is NOT trying to sell you a cover image service.”

Or, “you can only use it if you’re not a library.”

Can they really get away with that? Just in terms of PR, especially since, unlike Amazon, they get most of their content from library partners?

I know the Major Library Vendors want to keep us on the hook for paying them Big Money for this service. And they’re the same ones selling these images to Google. But it’s unclear to me what terms could possibly prevent us from using the covers, while allowing the purposes that Google licensed them for in the first place.

And what’s the law, anyway?

Then we get to the actual legal issues. To begin with, a “terms of service” that you never have to click through to use the service (and thus may never even have read) may not be enforceable at all. But they could fix that by requiring an API key for the GBS API, and making you click through to get the key.

But the larger issue is that the legal situation around cover image usage is entirely unclear to begin with.

I remain very curious what the Large Library Vendor’s agreements with the publishers (who generally own the intellectual property for cover images) actually are, and what makes them think they have an exclusive right to provide libraries with this content. It also remains an unanswered question exactly what “fair use” rights we have with cover images. Of course, that’s all moot if you have a license agreement with your source of cover images; the license trumps fair use (thus the “terms of service”). But again, I don’t see anything in the terms of service to prevent cover image use by libraries.

Search hints/related search?

So Google and Yahoo both sometimes offer “related” searches, in a nice AJAXy popup.

I don’t have time to find an example to show you, but I think most of you have seen it with Google at least. The Firefox Google OpenSearch toolbar, for instance: I put in “library” and in a popup it suggests “library of congress; librarything; library thing; library journal,” etc. Maybe that wasn’t the best example, but sometimes this is useful.

It strikes me that it would be really nice to have a similar feature in our various library search functions (including catalog and federated search?). First thought is, gee, can I just use the Yahoo and/or Google apis to do this? But I seriously doubt that would be consistent with either of their Terms of Service, to use this service for something that has nothing to do with google/yahoo and isn’t going to lead to a search of google/yahoo, but instead use these suggestions for search of our own content.

So, that gets me thinking: how do you do this? Obviously Google and Yahoo are coming up with these suggestions by analyzing their own data—either their corpus of indexed stuff, their query logs, or likely a combination of both. Anyone know if there are any public basic algorithms for doing this kind of thing? Anyone have enough “information retrieval” knowledge to hazard a guess as to what sorts of algorithms are used for this? How would we go about adding this to our own apps?
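I don’t know what Google or Yahoo actually do, but one commonly described baseline is mining your own query logs: suggest other queries that share terms with the user’s input, ranked by term overlap and popularity. A toy Ruby sketch of that idea (the log data, scoring weights, and threshold here are all invented for illustration):

```ruby
require 'set'

# Toy "related search" suggester: given a query log (one query per search),
# suggest other logged queries that share terms with the input, ranked by
# overlap and frequency. Real systems also mine session co-occurrence,
# at vastly larger scale.
class RelatedSearch
  def initialize(query_log)
    # Count how often each full query string appears in the log.
    @freq = Hash.new(0)
    query_log.each { |q| @freq[q] += 1 }
  end

  def suggest(query, limit = 4)
    terms = tokens(query)
    scored = @freq.keys.reject { |q| q == query }.map do |candidate|
      overlap = (tokens(candidate) & terms).size
      [candidate, overlap * 10 + @freq[candidate]] # overlap dominates frequency
    end
    scored.reject { |_, score| score < 10 } # require at least one shared term
          .sort_by { |_, score| -score }
          .first(limit)
          .map(&:first)
  end

  private

  def tokens(query)
    Set.new(query.downcase.split)
  end
end

log = ["library of congress", "library journal", "library of congress",
       "open access", "library thing"]
suggester = RelatedSearch.new(log)
puts suggester.suggest("library").inspect # strongest match first
```

A real implementation would work against millions of log entries and weight by session co-occurrence, but the basic shape — score candidates from your own corpus/logs, threshold, rank — is the same.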

Update: It also occurs to me that this would be ANOTHER natural service for OCLC to provide. To provide “related search” suggestions well, you need a good corpus and some data mining. OCLC has a giant corpus of not only book metadata, but search query history from their database offerings. An OCLC “search suggestion” API where you give it a query, and it gives you search suggestions, which you are licensed to use in any search your library has? I’d recommend my library pay for that, if the price was right.

More on open access discoverability

This is worth pulling out into a post of its own. Thanks to Dorothea Salo for the comments on the post where I broached this issue sort of in passing. Good to know that I’m indeed not alone in worrying about this stuff.

But there are actually a few different (but related) issues Dorothea has identified here, some of which aren’t a problem for my projects at all, others of which are. Let’s analyze them out:

1. Some faculty are unwilling to publish open access.

This might be a problem, but despite it there’s plenty of free-web, publicly accessible scholarly content available. (I use this phrase because the specific licensing might be unclear, but an unauthenticated user can get it on the web.) I’m thinking specifically about so-called preprint/postprint publicly accessible versions of articles that also appear in not-open-access journals. There’s lots of it. This is in fact what motivates my desires in the first place.

2. Some repository software doesn’t allow control of access to the level desired by repository managers.

This might be a problem too, but despite it, most supposed “open access” repositories do contain material that the repository does not in fact make available to the general unauthenticated public! So the software might not be flexible enough, but it is often restricting access to content anyway—and including metadata for those restricted items in the general OAI-PMH feed, without any predictable machine-readable way to tell that it is in fact restricted content.

So it’s in fact the ability of many repositories now to restrict content that brings me to my issue:

3. I have no way to identify the universe of actually publicly accessible ‘open access’ scholarly content.

Even if I created an aggregate index of OAI-PMH feeds from all “open access” repositories, it would include content which is not viewable by an unauthenticated user! What I want to do in my software is this: I have a known-item citation, and I want to tell the user if there’s a publicly viewable copy of that citation online. I have no way to find or identify such a copy, though! I have no way to weed out the stuff that isn’t really publicly accessible. I don’t want to send the user to something they can’t access—some repositories listed in OpenDOAR actually have the majority of their items (in the OAI-PMH feed) unavailable to the unauthenticated off-campus user!
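To make the problem concrete: even with the OAI-PMH records in hand, deciding “is this item actually publicly viewable?” comes down to guesswork. Here’s a Ruby sketch that scans a Dublin Core record for an affirmative rights statement. The markers it looks for (a dc:rights field mentioning “open access” or a creativecommons.org URI) are my own invented heuristic, and that’s exactly the problem: there’s no reliable convention to key off of.

```ruby
require 'rexml/document'

# Invented heuristic markers of an affirmative public-access statement.
# No such convention reliably exists in repository metadata, which is
# the whole problem this post complains about.
OPEN_MARKERS = [/open access/i, %r{creativecommons\.org}].freeze

def publicly_accessible?(record_xml)
  doc = REXML::Document.new(record_xml)
  rights = doc.root.elements.to_a
              .select { |e| e.expanded_name == 'dc:rights' }
              .map    { |e| e.text.to_s }
  rights.any? { |text| OPEN_MARKERS.any? { |marker| text =~ marker } }
end

cc_licensed = <<XML
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A preprint</dc:title>
  <dc:rights>http://creativecommons.org/licenses/by/3.0/</dc:rights>
</record>
XML

no_rights = <<XML
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Restricted thesis</dc:title>
</record>
XML

puts publicly_accessible?(cc_licensed) # an affirmative rights statement
puts publicly_accessible?(no_rights)   # no way to tell; assume restricted
```

Note the second record: absent any rights statement, the only safe assumption is “restricted,” which means a lot of genuinely open content would be filtered out too.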

So 1 and 2 might be issues in general, but they aren’t what’s providing the roadblock for me. 3 is. There are a couple of other issues worth noting: one is an inconvenience (but not a roadblock) for my project; the other isn’t really an issue for it at all.

4. Difficulty of identifying articles in repositories matching a citation.

When I experimentally tried doing a search against OAIster (before I realized that OAIster didn’t even limit itself to so-called open access repositories, and before I realized that even open access repositories weren’t entirely open), I had to do a search based just on title and author keywords. It would be better if I could search based on an identifier (DOI or PMID) when present—or based on structured publication data for the actual publication of the pre/postprint: ISSN, volume, issue, page number. But these things aren’t available in the OAI-PMH feed, and in fact probably aren’t even in most repositories’ metadata. Most repository metadata doesn’t try to connect a pre- or postprint to the actual published version in any way.

This is annoying, but I found that author/title keyword search worked well enough to be useful even without this, so it wasn’t a roadblock.
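For what it’s worth, the fallback strategy I’m describing fits in a few lines of Ruby: prefer a strong identifier when the citation has one, otherwise build an author/title keyword query. (The citation hash keys and the query syntax here are hypothetical, not from any particular metadata schema or search API.)

```ruby
# Hypothetical sketch: turn a known-item citation into a repository search.
# Prefer an identifier (DOI, PubMed ID) when present; otherwise fall back to
# author-last-name plus title keywords, with common stopwords dropped.
STOPWORDS = %w[a an the of on in and for].freeze

def repository_query(citation)
  return "doi:#{citation[:doi]}"   if citation[:doi]
  return "pmid:#{citation[:pmid]}" if citation[:pmid]

  title_terms = citation[:title].to_s.downcase.scan(/[a-z0-9]+/) - STOPWORDS
  author_last = citation[:authors].to_s.split(',').first.to_s.strip.downcase
  ([author_last] + title_terms).reject(&:empty?).join(' ')
end

puts repository_query(doi: '10.1000/xyz123')
puts repository_query(title: 'On the Discoverability of Open Access',
                      authors: 'Rochkind, Jonathan')
```

The identifier branch is the one most repositories can’t actually serve, since (as noted above) DOIs and PMIDs mostly aren’t in the OAI-PMH metadata; the keyword branch is what I ended up living with.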

5. Might be publically accessible, but is it open access?

This gets at what the SPARC/DOAJ initiative is trying to solve. Okay, I’m a reader, I can look at this article online on the free-web, but what am I allowed to do with it? Am I allowed to reproduce it? This matters to readers and is a real issue, but doesn’t in fact matter to my project. All I care about is if I can show them the full text on the public web—once I can do that, I can worry about helping them understanding the license and their access rights, but first I need to help them discover the article in the first place!

“Freedom Summer of Code”

Freedom Summer of Code is a summer-of-code-style distributed collaboration for technology projects benefiting radical/progressive movements. Exciting idea.

(en) Riseup Labs is excited to announce the Freedom Summer of Code! We aim to advance critical movement technology projects and tools that benefit a wide variety of radical social justice organizations and movements.


Modeled after the successful Google Summer of Code, the Freedom Summer of Code adds a radical social justice twist. We will be working with select tech activist organizations to generate interesting ideas as well as help people develop several projects over the next three months.

The [Freedom Summer of Code ->] aims to advance critical movement technology projects and tools that benefit a wide variety of radical social justice organizations and movements; inspire developers to become more interested in directly participating in social-justice tech organizations; contribute back, for the benefit of all, to the free software world which sustains us while simultaneously honoring individuals’ labor; increase the social ownership and democratic control over information, ideas, technology, and the means of communication; and empower organizations and individuals to use technology in struggles for liberation. We are developing software that is geared specifically to the needs of network organizing and democratic collaboration, providing new services that greatly enhance your security and privacy.

Consider this is a call-out!

To get started, think about how you would like to participate. Regardless of your technical skills, we need your help and have numerous ways to plug into FSoC:

* We want your proposals, dream big! Submit any and all politically important technology project proposals for and by the radical tech community! Individuals, or organizations, can submit ideas for what they would like to see done during FSoC. We will collect these proposals and put them online for potential programmers to check out.

* Interested organizations should sign up: we want your organization to join FSoC, to not just submit project ideas, but also be an organizational contact person who can act as a facilitator if your project/organization is chosen.

* Do you want to participate? Come apply to the program, submit an individual project proposal, and when the time comes, you can pick projects that you are interested in working on.

* We also need facilitators, you are encouraged to apply to help individual participants through the process.

To learn about what kinds of things we are looking for, how to submit a proposal, to sign up as participating organization/facilitator or to apply as a participant in the first FSoC, come visit the [Freedom Summer of Code site ->].


Google feature changes; open access discoverability

So, I’ve found out about a couple new things from Google I hadn’t known about. (Google is such a prominent player in our space, we need to keep up with what’s going on there so we know how to exploit it to maximum effect. I need to remember to explore Google’s interfaces and documentation more regularly to see changes.) 1. The Google search API now allows server-side access. 2. Google search allows a limit on usage license. And both of these things got me thinking about open access discoverability again.

1. Google API allows server-side access!

Thanks to Kent Fitch for alerting us on the code4lib listserv.

“For Flash developers, and those developers that have a need to access the AJAX Search API from other Non-Javascript environments, the API exposes a simple RESTful interface….

“An area to pay special attention to relates to correctly identifying yourself in your requests. Applications MUST always include a valid and accurate http referer header in their requests. In addition, we ask, but do not require, that each request contains a valid API Key.”

This is huge. I’ve complained before about how it was difficult to incorporate Google features into my own service-oriented software in a maintainable way when only javascript AJAX functions were allowed.
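Building such a server-side request in Ruby, with the referer header the quoted terms require, might look like the sketch below. (The endpoint URL is my reading of the AJAX Search API documentation at the time, and the referer and user-agent values are placeholders; verify all of it before relying on this.)

```ruby
require 'net/http'
require 'uri'

# Build (but don't send) a REST request to the AJAX Search API, including the
# referer header the terms of service require. Endpoint URL and header values
# are illustrative assumptions -- check Google's docs before use.
uri = URI.parse('http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=library')

request = Net::HTTP::Get.new(uri.request_uri)
request['Referer']    = 'http://catalog.example.edu/' # the page the request is "for" (hypothetical)
request['User-Agent'] = 'example-opac/0.1'            # identify your application

# To actually issue it (network required):
#   response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
#   json = response.body  # parse server-side, as discussed above
puts request['Referer']
```

The point is just that this is a plain HTTP GET any server-side stack can issue, so the feature finally fits into service-oriented software instead of being trapped in browser JavaScript.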

Now if only they’d do the same thing for the Google Books discoverability API. That’s where I really need it; it’s still not clear to me how I might usefully incorporate automated general Google search (including Google Scholar) into my library applications dealing with scholarly materials, because of the high chance that what Google returns will be for-pay and not available to my users: I don’t want to show them that.

So it was with interest I noticed a new feature:

2. Google search supports usage rights limit

Take a look at the Google advanced search page. Click on “Date, usage rights, numeric range, and more”. Look, there’s a “usage rights” limit which filters by CC licenses. When did that show up?  Of course, it can only include things in the filter that advertise a CC license in a way that Google’s bots can recognize. (Not sure how this is done, Google doesn’t say; I think I recall there’s a standard CC-endorsed way to do this?).

Unfortunately, some initial test searches revealed that this is a tiny piece of the actual open access pie.  Many scholarly materials that ARE available online open access are not in fact in Google’s indexes. Probably because they don’t advertise it properly in a machine-readable way? Still, this is a great step by Google, and indicates that Google recognizes users are increasingly having trouble with getting too much restricted content in their google search results.

But my frustration remains with the scholarly open access community. If the problem is that open access repositories aren’t advertising CC licenses properly, why aren’t these software packages (many of them open source) being fixed? Why isn’t there a general, concerted, funded effort from the open access repository community to solve the underlying problem? And the underlying problem is this: there’s no good place to search aggregated open access content, and ONLY open access content, for use in software that wants to answer the question “Is there an open access version of the article with this title and author available?” There’s no good way to do it. And this lack of discoverability is a huge problem for the utility of the existing open access repository domain. I don’t understand why there isn’t more concerted effort to solve it.

Although, in fairness, I did recently become aware of a European initiative, apparently actually funded, to address at least part of this issue. Registering in machine-readable format whether content is open access is the first step to building aggregated indexes. (It’s a dirty secret of the ‘open access repository’ domain that much of the content in so-called “open access repositories” is not in fact open access at all; it’s behind IP- and password-based restrictions. A cursory sample of items in repositories listed in OpenDOAR, whose collection policies say that a reason for EXCLUSION from OpenDOAR is “Site requires login to access any material (gated access) – even if freely offered,” will reveal that that policy is quite often honored in the breach.) I guess DOAJ has less of a problem with that, and the SPARC/DOAJ initiative is just about DOAJ, so it’s not clear to me that the SPARC project will really address my problem. The SPARC project, I gather, is about people not being sure whether they can re-use material in DOAJ journals. My problem is being able to do a meta-search limited to publicly available open access content in the first place; I don’t care if it’s licensed for re-use, I just want to find only stuff that is actually viewable online for free!

Hmph.   What can we do to get the open repositories communities to take note of this problem and address resources toward it?

rails debugging

I know other Rails devs read this blog. I LOVE ruby-prof. It rocks. You have to use the ‘graph’ profile to really get its power; in default mode it doesn’t do much. I haven’t even tried it yet with KCachegrind visualization, since I haven’t had the energy to go over THAT learning curve. Like everything else in the Rails world, there’s a bit of a learning curve to figure it out; for me, the hurdle was mainly finding the right documentation (which turned out to be an excellent blog post). From there, it flowed smoothly.

It’s really helping me figure out where the bottlenecks are in Umlaut’s resolve action.

The query_trace plugin is pretty great too.

And, in that vein, I still don’t understand how some of my fellow coders get along without ruby-debug. But if I were better at conscientiously writing the unit tests I should be writing, maybe I wouldn’t be using ruby-debug so much.

Rails gotcha — assigning relationships

You’ve got Employees and Departments. Each Employee has one Department, each Department has many Employees. Very many.  Let’s say thousands, or even tens of thousands.

So you want to create a new Employee and assign it to a Department.

department = Department.find_existing_dept_somehow # an existing record fetched from the db

employee = Employee.new # newly created, not yet saved

Now you have two choices:

1: department.employees << employee
2: employee.department = department

Either way you end up with the employee associated with the department.

Those might look equivalent, but I think the first ends up being a huge performance problem. I believe that is because the first call will end up requiring a fetch of all the department’s employees (thousands or more) before adding the new employee to the collection—and possibly doing an implicit save of one or more objects too. The second never forces the potentially expensive fetch of all the department’s employees. But I’m just guessing here. All I know is that when I changed the #1 style to the #2 style, I erased one mysterious performance hit in my app.

Can licensing make an API useless?

As I discussed in a previous essay, it’s the collaborative power of the internet that makes the open source phenomenon possible. The ability to collaborate cross-institution and develop a ‘community support’ model is what can make this round of library-developed software much more successful than the ‘home grown’ library software of the 1980s.

So how does this apply to APIs? Well, library customers are finally demanding APIs, and some vendors are starting to deliver. But the point of an API is that a third party will write client code against it. If that client code is only used by one institution, as I discussed, it’s inherently a more risky endeavor than if you have client code that’s part of a cross-institutional collaborative project. For all but the smallest projects involving API client code, I think it is unlikely to be a wise managed risk to write in-house software that is only going to be used or seen by your local institution.

The problem is if a vendor’s licenses, contracts, or other legal agreements require you to do this, by preventing you from sharing the code you write against the API with other customers.

On the Code4Lib listserv, Yitzchak Schaffer writes:

here’s the lowdown from SerSol:

“The terms of the NDA do not allow for client signatories to share of any information related to the proprietary nature of our API’s with other clients. However, if you would like to share them with us we can make them available to other API clients upon request. I think down the road we may be able to come up with creative ways to do this – perhaps an API user’s group, but for now we cannot allow sharing of this kind of information outside of your institution.”

To me, this seriously limits the value of their APIs. So limiting, that I am tempted to call them useless for all but the simplest tasks. For any significant development effort, it’s probably unwise for an institution to undertake work under these terms. That’s assuming an individual institution even has the capacity to do it—the power of internet collaboration is that it increases our collective capacity by making that capacity collective. Both the increased capacity and the managed risk of a shared support scenario require active collaboration between different institutions; even SerSol’s offer to perhaps make a finished product available to other clients “upon request” is not sufficient. Active and ongoing collaboration between partners is required.

If I’m involved in any software evaluation processes where we evaluate SerSol’s products, I am going to be sure to voice this opinion and its justification. If any existing SerSol API customers are equally disturbed by this, I’d encourage you to voice that concern to SerSol. Perhaps they will see the error of their ways if customers (and especially potential, not-yet-signed customers) complain.

Ross Singer notes that this is especially ironic when SerSol claims their APIs are “standards based.” What proprietary information could they possibly be trying to protect (from their own customers!)?

Think you can use the Amazon API for library service book covers?

Update 19 May 2008: See also Alternatives To Amazon API, including prudent planning for if Amazon changes its mind.

Update: 17 Dec 2008: This old blog post is getting a LOT of traffic, so I thought it important to update it with my current thoughts, which have kind of changed.

Lots of library applications out there are using Amazon cover images, despite the ambiguity (to be generous; or you can say prohibition if you like) in the Amazon ToS.  Amazon is unlikely to care (it doesn’t hurt their business model at all). The publishers who generally own copyright on covers are unlikely to care (in fact, they generally encourage it).

So who does care, why does Amazon’s ToS say you can’t do it?  Probably the existing vendors of bulk cover image to libraries. And, from what I know, my guess is that one of them had a sufficient relationship with Amazon to get them to change their terms as below. (After all, while Amazon’s business model isn’t hurt by you using cover images for your catalog, they also probably don’t care too much about whether you can or not).

Is Amazon ever going to notice and tell you to stop? I doubt it. If that hypothetical existing vendor notices, do they even have standing to tell you to stop? Could they get Amazon to tell you to stop? Who knows.  I figure I’ll cross that bridge when we come to it.

Lots of library apps are using Amazon cover images, and nobody has formally told them to stop yet. Same for other Amazon Web Services other than covers (the ToS doesn’t just apply to covers).

But if you are looking for a source of cover images without any terms-of-service restrictions on using them in your catalog, a couple good ones have come into existence lately. Take a look at CoverThing (with its own restrictive ToS, but not quite the same restrictions) and OpenLibrary (with very few restrictions). Also, the Google Books API allows you to find cover images too, but you’re on your own trying to figure out what uses of them are allowed by their confusing ToS.

And now, to the historically accurate post originally from March 19 2008….

Think again.

Jesse Haro of the Phoenix Public Library writes:

Following the release of the Customer Service Agreement from Amazon this past December, we requested clarification from Amazon regarding the use of AWS for library catalogs and received the following response:

“Thank you for contacting Amazon Web Services. Unfortunately your application does not comply with section 5.1.3 of the AWS Customer Agreement. We do not allow Amazon Associates Web Service to be used for library catalogs. Driving traffic back to Amazon must be the primary purpose for all applications using Amazon Associates Web

There are actually a bunch of reasons library software might be interested in AWS. But the hot topic is cover images. If libraries could get cover images for free from AWS, why pay for the expensive (and more technically cumbersome!) Bowker Syndetics service to do the same? One wonders what went on behind the scenes to make Amazon change their license terms in 2007 to result in the above. I am very curious as to where Amazon gets their cover images and under what, if any, licensing terms. I am curious as to where Bowker Syndetics gets their cover images and on what licensing terms–I am curious as to whether Bowker has an exclusive license/contract with publishers to sell cover images to libraries (or to anyone else other than libraries? I’m curious what contracts Bowker has with whom). All of this I will probably never know unless I go work for one of these companies.

I am also curious about the copyright status of cover images and cover image thumbnails in general. Who owns copyright on covers? The publisher, I guess? Is using a thumbnail of a cover image in a library catalog (or online store) possibly fair use that would not need copyright holder permission? What do copyright holders think about this? This we may all learn more about soon. There is buzz afoot about other cover image services various entities are trying to create with an open access model, without any license agreements with publishers whatsoever.

Google Book Search API

So Google has announced a much-awaited API for pre-checking availability of full text in Google Books. Here is one post with more detail than other announcements I’ve found.

I note that the API is described as a javascript api, and examples are provided where the request to the API is made on the client-side with javascript.

However, there’s no technical reason why you couldn’t do this server-side as well. It’s just an HTTP GET request with certain query parameters which returns JSON. I can certainly parse JSON server-side.

It makes a big difference to me whether I can do this server-side or not. Why? One example is because my software wants to query multiple sources of digital text (including our own licensed e-text from our catalog), and do something different depending on whether there is any available text or none. In some contexts, the user may even get an entirely different page depending on the answer to that. It’s difficult or impossible to implement that kind of logic only on the client-side (plus it would only work for those with javascript).
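The server-side work really is trivial. Here’s a Ruby sketch of the branching logic I’m describing, run against a canned response shaped the way I understand the GBS API to answer; the structure and field names (“preview” and so on) are my assumption from the announcement, so check them against Google’s actual documentation.

```ruby
require 'json'

# Canned response in the shape I understand a GBS availability lookup to
# return for an ISBN. Structure and the "preview" field are assumptions for
# illustration -- verify against Google's documentation.
sample_response = <<JSON
{"ISBN:0451526538": {"bib_key": "ISBN:0451526538",
  "info_url": "http://books.google.com/books?id=abc123",
  "preview_url": "http://books.google.com/books?id=abc123&printsec=frontcover",
  "preview": "full"}}
JSON

# True if any returned record offers full-text viewability.
def full_text_available?(json_body)
  JSON.parse(json_body).values.any? { |book| book['preview'] == 'full' }
end

if full_text_available?(sample_response)
  puts 'send the user straight to the online full text'
else
  puts 'fall back to print holdings / ILL options'
end
```

This is exactly the kind of server-side decision (different page, different options) that can’t reasonably be made in client-side JavaScript alone.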

So there’s no technical reason I can’t do it server-side. But Google may certainly stop me with policy. They could rate-limit requests to the API from any given IP (and it sounds like they DO: “Because developers often issue an atypical quantity of requests, you may accidentially tip the security precautions found in Google Book Search.”). Google certainly has its own business reasons to want to aggregate as much individual data as possible, and not to let my application be an intermediate proxy. (Google is in fact in the business of collecting usage data, not of providing search. Think about how they make their money.) So hmm, time will tell.

Interestingly, a couple of the examples on the announcement are Google Books pre-checks integrated into SFX! It sounds as if this was done by Ex Libris, not by the individual customer. And when I attempt to reverse-engineer the HTML to see what’s going on, it looks to me like SFX is indeed making a server-side pre-check, not doing it in javascript on the client side. Which would be encouraging. Unless Ex Libris somehow has special permission. Hmm.

Eagerly awaiting more information about this. Not quite sure how to get it.

updates (14 Mar):


Got a reply from Google:

You can do something similar to this on the client side, just add some logic in the JavaScript on whether or not to show the div with the books dependent on the viewability information.

Unfortunately, we don’t support server side querying of the API, because viewability is based on local rights limitations (different countries consider different books to be public domain), and we think it hurts the user experience to provide incorrect viewability information.

Doh! This doesn’t really answer my concern I’m afraid, I really can’t do what I need to do client side, at least not without extreme difficulty or loss of functionality. But I guess that’s how it’ll be!


Had the idea of asking the KSU SFX example for an XML response, to see what that tells us. Of course everything in the XML response is necessarily generated server-side. The Google Books section looks like this:

<target_public_name>Google Book Search</target_public_name>


Hmm. It’s hard to say. The XML response does not include what’s in the HTML response telling you for sure what kind of access is available. But it does include a Google Books URL that sure looks like it required talking to Google on the server side to generate—it doesn’t have an ISBN in it, it has a Google unique ID. How would SFX know that Google unique ID without talking to Google on the server side? Which Google tells me they don’t allow. Hmm. Curiouser.

Another update: If I turn off javascript and look at the KSU SFX page, I still get the Google Books link. It would definitely appear to be server-side. Is Ex Libris SFX allowed to do something that I am not?