Posted by jrochkind in Practice, business, programming.
Tim Spalding writes about Google Book Search API, cover availability, and terms of service:
In NGC4Lib:
Basically, I’ve been told [I can't help but wonder: Told by whom? -JR] that I was wrong to promote the use of GBS for covers in an OPAC, that the covers were licensed from a cover supplier-ten guesses!-and should not have allowed this, and that the new GBS API terms of service put the kibosh on the use I promoted.
And on his blog:
The back story is an interesting one. Soon after I wrote and spoke about the covers opportunity, a major cover supplier contacted me. They were miffed at me, and at Google. Apparently a large percentage of the Google covers were, in fact, licensed to Google by them. They never intended this to be a “back door” to their covers, undermining their core business. It was up to Google to protect their content appropriately, something they did not do. For starters, the GBS API appears to have gone live without any Terms of Service beyond the site-wide ones. The new Terms of Service is, I gather, the fruit of this situation.
Now, I am not surprised. As soon as I heard the Google staff person on the Talis interview implying that Google had no problem with use of cover images in a library catalog application, I knew that something would come through the pipeline to put the kibosh on that. Not least because I too had had backchannel communications with a certain Large Library Vendor, about Amazon, where they revealed (accidentally I think), that they had had words with Amazon about Amazon’s lack of terms of service for their cover image. Even then, I wondered exactly what the legal situation was, in the opinion of this Large Vendor, of Amazon, or of any other interested parties.
More questions than answers
But here’s the thing. When I read GBS’s new Terms of Service looking for something to prevent library catalog cover image use… I don’t see anything. And if there WAS going to be something, what the heck would it even look like anyway?
Amazon tried to put the kibosh on this by having their terms of service say “You can only use our whole API if the primary purpose of your website is to sell books from Amazon.” Making not just cover usage, but really any use of the Amazon API by libraries pretty much a violation. If Google goes _that_ route, it’d be disastrous for us.
But I doubt they would, not even to limit just cover image usage by those terms, because, first of all, they intended from the start for this service to be used by libraries, and their development partners from the start included libraries and library vendors. Secondly, what would the equivalent be for Google? You can only use this service if your primary business is sending users to Google full text? Ha! That’s nobody’s primary business!
What terms could possibly restrict us anyway?
Without restricting what Google was trying to do in the first place.
And besides, the whole point of the GBS API having cover images was to let people put a cover image next to a link to GBS. The utility of this is obvious.
But isn’t that exactly what we library catalog users are doing, no more and no less? So what could their terms say?
“you are allowed to use these covers next to a link to GBS, but ONLY if you are not a library or someone else who is the target market for Large Library Market Vendors. You can only use it if Large Library Market Vendor is NOT trying to sell you a cover image service.”
Or, “you can only use it if you’re not a library.”
Can they really get away with that? Just in terms of PR, especially since, unlike Amazon, they get most of their content from library partners?
I know the Major Library Vendors want to keep us on the hook for paying them Big Money for this service. And they’re the same ones selling these images to Google. But it’s unclear to me what terms could possibly prevent us from using the covers, while allowing the purposes that Google licensed them for in the first place.
And what’s the law, anyway?
Then we get to the actual legal issues. To begin with, I’m not sure a “terms of service” is enforceable at all when you do NOT in fact need to “click through” it to use the service, and thus never have to have READ it. But they could fix that by requiring an API key for the GBS API, and making you click through to get the key.
But the larger issue is that the legal situation around cover image usage is entirely unclear to begin with.
I remain very curious what the Large Library Vendor’s agreements with the publishers (who actually own the intellectual property for cover images, generally) are, and what makes them think they have an exclusive right to provide libraries with this content. It also remains an unanswered question exactly what “fair use” rights we have with cover images. Of course, that’s all moot if you have a license agreement with your source of cover images that trumps fair use (thus the “terms of service”). But again, I don’t see anything in the terms of service to prevent cover image use by libraries.
Posted by jrochkind in Practice, programming.
So Google and Yahoo both sometimes offer “related” searches, in a nice AJAXy popup.
I don’t have time to find an example to show you, but I think most of you have seen it with Google at least. The Firefox Google OpenSearch toolbar, for instance. I put in “library” and in a popup it suggests “library of congress; librarything; library thing; library journal” etc. Maybe that wasn’t the best example, but sometimes this is useful.
It strikes me that it would be really nice to have a similar feature in our various library search functions (including catalog and federated search?). First thought is, gee, can I just use the Yahoo and/or Google apis to do this? But I seriously doubt that would be consistent with either of their Terms of Service, to use this service for something that has nothing to do with google/yahoo and isn’t going to lead to a search of google/yahoo, but instead use these suggestions for search of our own content.
So, that gets me thinking, how do you do this? Obviously Google and Yahoo are coming up with these suggestions by analyzing their own data: either their corpus of indexed stuff, their query logs, or likely a combination of both. Anyone know if there are any public basic algorithms for doing this kind of thing? Anyone have enough “information retrieval” knowledge to hazard a guess as to what sorts of algorithms are used for this? How would we go about adding this to our own apps?
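One simple, well-known starting point is to mine your own query log: count how often each past query was issued, then suggest the most frequent logged queries that begin with what the user typed. Here is a minimal Ruby sketch of that idea. The class name `QuerySuggester` and the sample log are invented for illustration, and this is surely much cruder than whatever Google or Yahoo actually do.

```ruby
# Suggests past queries that begin with the user's input, ranked by frequency.
# QuerySuggester is a hypothetical name; the log is assumed to fit in memory.
class QuerySuggester
  def initialize(query_log)
    # Normalize and count each distinct query in the log.
    @counts = Hash.new(0)
    query_log.each { |query| @counts[normalize(query)] += 1 }
  end

  # Returns up to `limit` logged queries starting with `prefix`,
  # most frequent first (ties broken alphabetically).
  def suggest(prefix, limit = 5)
    wanted = normalize(prefix)
    return [] if wanted.empty?
    matches = @counts.select { |query, _count| query.start_with?(wanted) && query != wanted }
    matches.sort_by { |query, count| [-count, query] }.first(limit).map(&:first)
  end

  private

  # Lowercases, trims, and collapses repeated spaces.
  def normalize(text)
    text.to_s.downcase.strip.squeeze(" ")
  end
end

# Made-up sample log, loosely based on the toolbar example above.
log = ["library of congress", "Library of Congress", "librarything",
       "library journal", "library of congress", "librarything"]
suggester = QuerySuggester.new(log)
suggestions = suggester.suggest("library")
puts suggestions
```

A real service would want something beyond prefix matching (co-occurrence of queries within a session, say), but even this much needs a decent query log to work from.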
Update: It also occurs to me that this would be ANOTHER natural service for OCLC to provide. To provide “related search” suggestions well, you need a good corpus and some data mining. OCLC has a giant corpus of not only book metadata, but search query history from their database offerings. An OCLC “search suggestion” API where you give it a query, and it gives you search suggestions, which you are licensed to use in any search your library has? I’d recommend my library pay for that, if the price was right. Natural service from OCLC.
Posted by jrochkind in Practice, Theory, business, cataloging.
Eh, this comment was long enough I might as well post it here too, revised and expanded a bit. (I’ve been flagging on the blogging lately). Karen Schneider thinks about “tagging in a workflow context“
Tagging in library catalogs hasn’t worked yet for a number of reasons…
Karen goes on to discuss much of the ‘when’ of tagging, but I still think the ‘why’ of tagging is more relevant. Why would a user spend their valuable time adding tags to books in your library catalog?
I think the vast majority of successful tagging happens when users tag to aid their OWN workflow. Generally to keep track of things. You tag on delicious to keep track of your bookmarks. You tag on LibraryThing to organize your collections. The most successful tagging isn’t done to help _other_ people find things, but to keep track of things yourself; at least not at first, not the tagging that builds the successful tag ecology. In most cases of a successful tagging community where people do tag to help others find things, I’d suggest it is because it somehow benefits them personally to help people find things. Such as, maybe, tagging your blog posts on wordpress.com because you want others to find your blog posts: still a personal benefit.
A successful tag ecology is generally built on tagging actions that serve very personal interests, interests which do not depend on the tagging ecology built on top of them and which are served even if you are the only one tagging. The successful tagging ecology which builds out of it (and which goes on to provide collective benefit that was not the original intent of the taggers) is an epiphenomenon.
Amazon might be a notable exception to this hypothesis, perhaps because it was already such a universally used service before it added tagging. (Unlike our library catalogs). I would be interested to understand what motivates users to tag in Amazon. Anyone know of anyone who’s looked into this? It’s also possible that if Amazon’s tags are less useful, it is in fact because of this lack of personal benefit from tagging.
So what personal benefit can a user get in tagging in a library catalog? If we provided better ‘saved records’ features, perhaps: keeping track of books you’ve checked out, books you might want to check out, etc. But I’m not sure if our users actually USE our catalogs enough to find this useful, no matter how good a ‘saved records’ feature we provide. In an academic setting, items from the catalog no longer necessarily make up a majority of a user’s research space.
To me, that suggests a question: can we capture tags from somewhere else? My users export items to RefWorks. Does RefWorks allow tagging yet? If it did, is there a way to export (really re-import) these tags BACK to the catalog, when a user tags something? But even if so, it would be better if RefWorks somehow magically aggregated tags from _different_ catalogs, of the same work. But that relies on identifier issues we haven’t solved yet. If our catalogs provide persistent URLs (which they don’t usually, which is a tragedy), users COULD tag in delicious if they wanted to. Is there a way to scan delicious for any tags including your catalog’s URL, and import those back in?
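The import half of that idea is easy to sketch once you have the bookmarks in hand. The Ruby below assumes you have already fetched bookmarks somehow as hashes with `:url` and `:tags` (how you get them out of delicious is left open), and that the catalog has persistent record URLs. The host `catalog.example.edu` and the `/record/<id>` path pattern are hypothetical.

```ruby
require "uri"

CATALOG_HOST = "catalog.example.edu"    # hypothetical catalog host
RECORD_PATH  = %r{\A/record/(\d+)\z}    # hypothetical persistent URL pattern

# Given bookmarks as hashes with :url and :tags, returns a hash mapping
# catalog record IDs to tag counts, ignoring bookmarks for other sites.
def tags_by_record(bookmarks)
  result = Hash.new { |hash, id| hash[id] = Hash.new(0) }
  bookmarks.each do |bookmark|
    begin
      uri = URI.parse(bookmark[:url])
    rescue URI::InvalidURIError
      next # skip bookmarks whose URL cannot be parsed
    end
    next unless uri.host == CATALOG_HOST
    match = RECORD_PATH.match(uri.path.to_s)
    next unless match
    bookmark[:tags].each { |tag| result[match[1]][tag.downcase] += 1 }
  end
  result
end

# Made-up bookmarks for illustration.
bookmarks = [
  { url: "http://catalog.example.edu/record/42", tags: ["Poetry", "to-read"] },
  { url: "http://catalog.example.edu/record/42", tags: ["poetry"] },
  { url: "http://example.com/elsewhere",          tags: ["unrelated"] }
]
harvested = tags_by_record(bookmarks)
```

None of this works without persistent URLs, of course, which is the tragedy already mentioned.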
In addition to organizing one’s research and books/items of interest, are there other reasons it would serve a patron’s interest to tag, other things they could get out of it? A professor might tag books of interest for their students, perhaps (not that most professors are looking for more technological things to spend time on helping students, but some are). And librarians themselves might tag things with non-controlled-vocabulary topic areas they know would be of use to a particular class or program or department, with terms of use to those classes or programs or departments. Can anyone think of any other reasons tagging could be of benefit to a user (not whether a successful tagging ecology would be of collective benefit–but benefits an individual user can get from assigning tags in a library catalog).
WorldCat covers a much larger share of my academic users’ research universe than my own catalog. And WorldCat has solved the “aggregating different copies of this work from different libraries” problem to some extent. Which is why it would make so much sense for WorldCat to offer a tagging service, one which can be easily incorporated into your own local catalog for both assigning and displaying tags (if not for searching), à la LibraryThing. It is astounding to me that OCLC hasn’t provided this yet. It seems to be very ‘low hanging fruit’ (a tagging interface on worldcat.org with a good API is not rocket science) that is worth a try.
Posted by jrochkind in Practice, open access, programming.
This is worth pulling out into a post of its own. Thanks to Dorothea Salo for the comments on the post where I broached this issue sort of in passing. Good to know that I’m indeed not alone in worrying about this stuff.
But there are actually a few different (but related) issues Dorothea has identified here, some of which aren’t a problem for my projects at all, others of which are. Let’s analyze them out:
1. Some faculty are unwilling to publish open access.
This might be a problem, but despite this problem there’s plenty of free-web publicly accessible scholarly content available. (I use this phrase because the specific licensing might be unclear, but an unauthenticated user can get it on the web.) I’m thinking specifically about so-called preprint/postprint public accessible versions of articles that also appear in not-open-access journals. There’s lots of it. This is in fact what motivates my desires in the first place.
2. Some repository software doesn’t allow control of access to the level desired by repository managers.
This might be a problem too, but despite it, most supposed “open access” repositories do contain material that the repository does not in fact make available to the general unauthenticated public! So the software might not be flexible enough, but it is often restricting access to contents in it anyway. And including metadata for those restricted items in the general OAI-PMH feed, without any predictable machine-readable way to tell that it is in fact restricted content.
So it’s in fact the ability of many repositories now to restrict content that brings me to my issue:
3. I have no way to identify the universe of actually publicly accessible ‘open access’ scholarly content.
Even if I created an aggregate index of OAI-PMH feeds from all “open access” repositories, it would include content which is not viewable by an unauthenticated user! What I want to do in my software is this: I have a known-item citation, and I want to tell the user if there’s a publicly viewable copy of it online. I have no way to find/identify such a copy though! I have no way to weed out the stuff that isn’t really publicly accessible. I don’t want to send the user to something they can’t access, and some repositories listed in DOAR actually have the majority of their items (in the OAI-PMH feed) not available to the unauthenticated off-campus user!
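Lacking any machine-readable flag, about the only thing software can do today is probe each item as an unauthenticated user and guess from the response. A rough Ruby sketch of that workaround follows. The classification rules are my own assumptions rather than any standard, and the probe only means something when run from an off-campus address, since IP-based restrictions are invisible from inside.

```ruby
require "net/http"
require "uri"

# Classifies the response to an unauthenticated request for an item.
# The content-type heuristic is an assumption: an HTML response to a request
# for a full-text file is often a landing page or a login form.
def classify_access(status_code, content_type)
  case status_code
  when 200
    content_type.to_s.start_with?("text/html") ? :maybe_landing_or_login_page : :public
  when 401, 403 then :restricted
  when 300..399 then :redirect # often a bounce to a login form
  else :unknown
  end
end

# Performs the actual request with no cookies or credentials.
def probe(url)
  response = Net::HTTP.get_response(URI.parse(url))
  classify_access(response.code.to_i, response["Content-Type"])
end
```

Probing every item in every feed this way is slow and rude to the repositories, which is exactly why a predictable flag in the metadata would be so much better.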
So 1 and 2 might be issues in general, but aren’t what’s providing the roadblock for me. 3 is. There are a couple other issues worth noting, one that is an inconvenience (but not a roadblock) for my project, one that is not.
4. Difficulty of identifying articles in repositories matching a citation.
When I experimentally tried doing a search against OAIster (before I realized that OAIster didn’t even limit itself to so-called open access repositories, and before I realized that even open access repositories weren’t really open access), I had to do a search based just on title and author keywords. It would be better if I could search based on an identifier (DOI or PMID) when present, or based on structured publication data for the actual publication of the pre/postprint: ISSN, volume, issue, page number. But these things aren’t available in the OAI-PMH feed, and in fact probably aren’t even in most repositories’ metadata. Most repository metadata doesn’t try to connect a pre- or postprint to the actual published version in any way.
This is annoying, but I found that author/title keyword search worked well enough to be useful even without this, so it wasn’t a roadblock.
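The matching logic described here (use an identifier when both sides have one, fall back to title and author keywords otherwise) can be sketched in a few lines of Ruby. The field names `:doi`, `:title`, `:creator` and `:author_last` are assumptions about how the citation and the harvested records might be represented, and the sample data is made up.

```ruby
# Lowercases text and splits it into distinct keyword tokens.
def keywords(text)
  text.to_s.downcase.scan(/[a-z0-9]+/).uniq
end

# Returns the records matching the citation: by DOI when both sides have one,
# otherwise by requiring every title keyword and the author surname to appear.
def match_citation(citation, records)
  records.select do |record|
    if citation[:doi] && record[:doi]
      citation[:doi].downcase == record[:doi].downcase
    else
      title_words = keywords(citation[:title])
      title_hit   = !title_words.empty? && (title_words - keywords(record[:title])).empty?
      author_hit  = keywords(record[:creator]).include?(citation[:author_last].to_s.downcase)
      title_hit && author_hit
    end
  end
end

# Made-up repository records; the DOI is a placeholder, not a real one.
records = [
  { id: 1, title: "Discoverability of Open Access Content", creator: "Smith, Jane", doi: nil },
  { id: 2, title: "An Unrelated Article", creator: "Doe, John", doi: "10.9999/example" }
]
by_keywords = match_citation({ title: "open access discoverability", author_last: "Smith" }, records)
by_doi      = match_citation({ doi: "10.9999/EXAMPLE", title: "x", author_last: "y" }, records)
```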
5. Might be publicly accessible, but is it open access?
This gets at what the SPARC/DOAJ initiative is trying to solve. Okay, I’m a reader, I can look at this article online on the free-web, but what am I allowed to do with it? Am I allowed to reproduce it? This matters to readers and is a real issue, but doesn’t in fact matter to my project. All I care about is if I can show them the full text on the public web. Once I can do that, I can worry about helping them understand the license and their access rights, but first I need to help them discover the article in the first place!
Posted by jrochkind in Practice, open access, programming.
So, I’ve found out about a couple new things from Google I hadn’t known about. (Google is such a prominent player in our space, we need to keep up with what’s going on there so we know how to exploit it to maximum effect. I need to remember to go explore google’s interfaces and documentation more regularly to see changes). 1. Google search API now allows server-side access. 2. Google search allows limit on usage license. And both these things got me started about open access discoverability again.
1. Google API allows server-side access!
Thanks to Kent Fitch for alerting us on the code4lib listserv.
http://code.google.com/apis/ajaxsearch/documentation/#fonje
“For Flash developers, and those developers that have a need to access the AJAX Search API from other Non-Javascript environments, the API exposes a simple RESTful interface….
“An area to pay special attention to relates to correctly identifying yourself in your requests. Applications MUST always include a valid and accurate http referer header in their requests. In addition, we ask, but do not require, that each request contains a valid API Key.”
This is huge. I’ve complained before about how it was difficult to incorporate Google features into my own service-oriented software in a maintainable way when only javascript AJAX functions were allowed.
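For what it’s worth, here is roughly what a server-side call might look like in Ruby. This sketch only builds the request and does not send it. The endpoint URL and the `v`, `q` and `key` parameters are how I read the documentation at the time of writing, so treat them as assumptions and check the current docs; the referer value is a placeholder.

```ruby
require "net/http"
require "uri"

# Endpoint as I understand it from the documentation; an assumption that may
# change or disappear.
SEARCH_ENDPOINT = "http://ajax.googleapis.com/ajax/services/search/web"

# Builds (but does not send) a GET request carrying the required Referer
# header and, optionally, an API key.
def build_search_request(query, referer, api_key = nil)
  params = { "v" => "1.0", "q" => query }
  params["key"] = api_key if api_key
  uri = URI.parse(SEARCH_ENDPOINT)
  uri.query = URI.encode_www_form(params)
  request = Net::HTTP::Get.new(uri.request_uri)
  request["Referer"] = referer
  [uri, request]
end

uri, request = build_search_request("open access", "http://library.example.edu/")
# To actually send it:
#   Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```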
Now if only they’d do the same thing for the Google Books Discoverability api. That’s where I really need it; it’s still not clear to me how I might usefully incorporate automated general google search (including google scholar) into my library applications dealing with scholarly materials, because of the high chance that what Google returns will be for-pay and not available to my users: I don’t want to show them that.
So it was with interest I noticed a new feature:
2. Google search supports usage rights limit
Take a look at the Google advanced search page. Click on “Date, usage rights, numeric range, and more”. Look, there’s a “usage rights” limit which filters by CC licenses. When did that show up? Of course, it can only include things in the filter that advertise a CC license in a way that Google’s bots can recognize. (Not sure how this is done, Google doesn’t say; I think I recall there’s a standard CC-endorsed way to do this?).
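The convention I’m half-remembering is, I believe, marking the license link with `rel="license"`. Assuming that is what a crawler looks for (Google doesn’t say), a repository page could be checked with something like the Ruby sketch below. The regular expressions are a shortcut for illustration; real code should use a proper HTML parser.

```ruby
# Finds Creative Commons license URLs advertised with rel="license" on
# <a> or <link> tags. Regex-based for brevity; not a robust HTML parser.
def cc_licenses(html)
  tags = html.scan(/<(?:a|link)\b[^>]*>/i)
  found = tags.map do |tag|
    next unless tag =~ /\brel\s*=\s*["'][^"']*\blicense\b[^"']*["']/i
    href = tag[/\bhref\s*=\s*["']([^"']+)["']/i, 1]
    href if href && href.include?("creativecommons.org/licenses/")
  end
  found.compact
end

# Made-up page fragment for illustration.
page = '<p><a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a> ' \
       '<a href="http://example.com/">not a license</a></p>'
licenses = cc_licenses(page)
```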
Unfortunately, some initial test searches revealed that this is a tiny piece of the actual open access pie. Many scholarly materials that ARE available online open access are not in fact in Google’s indexes. Probably because they don’t advertise it properly in a machine-readable way? Still, this is a great step by Google, and indicates that Google recognizes users are increasingly having trouble with getting too much restricted content in their google search results.
But my frustration remains with the scholarly open access community. If the problem is that open access repositories aren’t advertising CC licenses properly, why aren’t these software packages (many of them open source) being fixed? Why isn’t there a general, concerted, funded effort from the open access repository community to solve the general problem? And the general problem is that there’s no good place to search aggregated open access content and ONLY open access content, for use in software that wants to answer the question “Is there an open access version of the article with this title and author available?” No good way to do it. And this lack of discoverability is a huge problem with the utility of the existing open access repository domain. I don’t understand why there isn’t more concerted effort to solve it.
Although, in fairness, I did recently become aware of a European initiative, that’s apparently actually funded, to address at least part of this issue. Registering in machine readable format whether content is open access is the first step to building aggregated indexes. (It’s a dirty secret of the ‘open access repository’ domain that much of the content in so-called “open access repositories” is not in fact open access at all; it’s behind IP and password based restrictions. A cursory sample of items in repositories listed in OpenDOAR, whose collection policies say that a reason for EXCLUSION from OpenDOAR is “Site requires login to access any material (gated access) - even if freely offered”, will reveal that that collection policy is quite often honored in the breach.) Although I guess DOAJ has less of a problem with that, and that SPARC/DOAJ initiative is just about DOAJ, so it’s not clear to me that the SPARC project will really address my problem. I guess the SPARC project is about people not being sure if they can re-use material in DOAJ journals. My problem is being able to do a meta-search limited to publicly available open access content in the first place, and I don’t care if it’s licensed for re-use, I just want to find only stuff that is actually viewable online for free!
Hmph. What can we do to get the open repositories communities to take note of this problem and address resources toward it?
Posted by jrochkind in Practice, programming.
I know other Rails devs read this blog. I LOVE ruby-prof. It rocks. You have to use the ‘graph’ profile to really get its power; in default mode it doesn’t do much. I haven’t even tried it yet with KCachegrind visualization, haven’t had the energy to go over THAT learning curve. Like everything else in the Rails world, there’s a bit of a learning curve to figure it out, for me mainly in finding the right documentation. Which was that excellent blog post referenced above. After that, it flowed smoothly.
It’s really helping me figure out where the bottlenecks are in the Umlaut resolve action.
The query_trace plugin is pretty great too.
And, in that vein, I still don’t understand how some of my fellow coders get along without ruby-debug. But if I were better at conscientiously writing the unit tests I should be writing, maybe I wouldn’t be using ruby-debug so much.
Posted by jrochkind in Practice, Rails, programming.
You’ve got Employees and Departments. Each Employee has one Department, each Department has many Employees. Very many. Let’s say thousands, or even tens of thousands.
So you want to create a new Employee and assign it to a Department.
dept = Department.find_existing_dept_somehow() # existing one fetched from db
employee = Employee.createAndInit() # newly created not yet saved
Now you have two choices
1: dept.employees << employee
2: employee.department = dept
Either way you end with: employee.save!
Those might look equivalent, but I think the first ends up being a huge performance problem. I believe that is because the first call will end up requiring a fetch of all the department’s employees (thousands or more) before adding the new employee to it, and possibly doing an implicit save of one or more objects too. While the second never forces the potentially expensive fetching of all the department’s employees. But I’m just guessing here. All I know is that when I changed the #1 style to the #2 style, I just erased one mysterious performance hit in my app.
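To make the guess concrete, here is a toy model in plain Ruby. It is NOT ActiveRecord and says nothing certain about what Rails really does internally; `ToyDB` and `ToyDepartment` are invented stand-ins that just count how many rows get fetched under each style, assuming the collection-append style loads the collection first.

```ruby
# Toy stand-in for a database table of employees; counts rows fetched.
class ToyDB
  attr_reader :rows_fetched

  def initialize(employee_rows)
    @employee_rows = employee_rows # array of { id:, department_id: } hashes
    @rows_fetched = 0
  end

  # Simulates SELECT * FROM employees WHERE department_id = ?
  def employees_for(department_id)
    rows = @employee_rows.select { |row| row[:department_id] == department_id }
    @rows_fetched += rows.size
    rows
  end

  # Simulates a single-row INSERT.
  def insert_employee(row)
    @employee_rows << row
  end
end

# Toy stand-in for a model with a has-many style collection.
class ToyDepartment
  attr_reader :id

  def initialize(db, id)
    @db = db
    @id = id
  end

  # Style 1 (assumed behavior): load the whole collection, then append.
  def add_employee(row)
    loaded = @db.employees_for(@id)
    loaded << row.merge(department_id: @id)
    @db.insert_employee(loaded.last)
  end
end

db   = ToyDB.new((1..10_000).map { |i| { id: i, department_id: 1 } })
dept = ToyDepartment.new(db, 1)

dept.add_employee({ id: 10_001 })                           # style 1
fetched_by_style_one = db.rows_fetched

db.insert_employee({ id: 10_002, department_id: dept.id })  # style 2: set the key only
fetched_by_style_two = db.rows_fetched - fetched_by_style_one
```

In the toy, style 1 drags ten thousand rows across just to add one, while style 2 fetches nothing. Whether that matches your Rails version is something to confirm with a profiler or the SQL log rather than take from this sketch.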
Posted by jrochkind in Practice, business, catalogs, programming.
Think again.
http://listserv.nd.edu/cgi-bin/wa?A2=ind0803&L=ngc4lib&T=0&O=D&X=77132057060E3A8667&P=6033
Jesse Haro of the Phoenix Public Library writes:
Following the release of the Customer Service Agreement from Amazon this past December, we requested clarification from Amazon regarding the use of AWS for library catalogs and received the following response:
“Thank you for contacting Amazon Web Services. Unfortunately your application does not comply with section 5.1.3 of the AWS Customer Agreement. We do not allow Amazon Associates Web Service to be used for library catalogs. Driving traffic back to Amazon must be the primary purpose for all applications using Amazon Associates Web Service.”
There are actually a bunch of reasons library software might be interested in AWS. But the hot topic is cover images. If libraries could get cover images for free from AWS, why pay for the expensive (and more technically cumbersome!) Bowker Syndetics service to do the same? One wonders what went on behind the scenes to make Amazon change their license terms in 2007 to result in the above. I am very curious as to where Amazon gets their cover images and under what, if any, licensing terms. I am curious as to where Bowker Syndetics gets their cover images and on what licensing terms–I am curious as to whether Bowker has an exclusive license/contract with publishers to sell cover images to libraries (or to anyone else other than libraries? I’m curious what contracts Bowker has with whom). All of this I will probably never know unless I go work for one of these companies.
I am also curious about the copyright status of cover images and cover image thumbnails in general. Who owns copyright on covers? The publisher, I guess? Is using a thumbnail of a cover image in a library catalog (or online store) possibly fair use that would not need copyright holder permission? What do copyright holders think about this? This we may all learn more about soon. There is buzz afoot about other cover image services various entities are trying to create with an open access model, without any license agreements with publishers whatsoever.
Posted by jrochkind in Practice, programming.
So Google has announced a much-awaited api for pre-checking availability of full text in Google Books. Here is one post with more detail than other announcements I’ve found.
I note that the API is described as a javascript api, and examples are provided where the request to the API is made on the client-side with javascript.
However, there’s no technical reason why you couldn’t do this server-side as well. It’s just an HTTP GET request with certain query parameters which returns JSON. I can certainly parse JSON server-side.
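As a sketch of what “server-side” would mean here, the Ruby below builds the request URL and parses a callback-wrapped JSON response. Nothing is actually fetched. The parameter names (`bibkeys`, `jscmd=viewapi`, `callback`) and the `preview` field are my reading of the announced JavaScript API and should be treated as assumptions; the sample response and its ISBN are made up.

```ruby
require "json"
require "uri"

# Builds the request URL. Parameter names are assumptions based on the
# announced JavaScript API.
def gbs_url(isbn, callback = "cb")
  query = URI.encode_www_form("bibkeys" => "ISBN:#{isbn}",
                              "jscmd" => "viewapi",
                              "callback" => callback)
  "http://books.google.com/books?#{query}"
end

# Strips the JavaScript callback wrapper and parses the JSON inside it.
def parse_gbs_response(body)
  json = body[/\A\s*[\w.$]+\s*\((.*)\)\s*;?\s*\z/m, 1]
  raise ArgumentError, "not a callback-wrapped response" unless json
  JSON.parse(json)
end

# Hypothetical response body with a placeholder ISBN.
sample = 'cb({"ISBN:0000000000":{"bib_key":"ISBN:0000000000","preview":"full"}});'
info = parse_gbs_response(sample)["ISBN:0000000000"]
has_full_text = info["preview"] == "full"
```

Whether Google’s policy allows calling it this way is a separate question from whether it can be done.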
It makes a big difference to me whether I can do this server-side or not. Why? One example is because my software wants to query multiple sources of digital text (including our own licensed e-text from our catalog), and do something different depending on whether there is any available text or none. In some contexts, the user may even get an entirely different page depending on the answer to that. It’s difficult or impossible to implement that kind of logic only on the client-side (plus it would only work for those with javascript).
So there’s no technical reason I can’t do it server-side. But Google may certainly stop me with policy. They could rate-limit requests to the API from any given IP (and it sounds like they DO: “Because developers often issue an atypical quantity of requests, you may accidentally tip the security precautions found in Google Book Search.”). Google certainly has its own business reasons to want to aggregate as much individual data as possible, and not let my application be an intermediate proxy. (Google is in fact in the business of collecting usage data, not of providing search. Think about how they make their money). So hmm, time will tell.
Interestingly, a couple of the examples on the announcement are Google Books pre-check integrated into SFX! It sounds as if this was done by Ex Libris, not by the individual customer. And when I attempt to reverse-engineer the HTML to see what’s going on, it looks to me like SFX is indeed making a server-side pre-check, not doing it in javascript on the client side. Which would be encouraging. Unless Ex Libris somehow has special permission. Hmm.
Eagerly awaiting more information about this. Not quite sure how to get it.
updates (14 Mar):
1.
Got a reply from Google:
You can do something similar to this on the client side, just add some logic in the JavaScript on whether or not to show the div with the books dependent on the viewability information.
Unfortunately, we don’t support server side querying of the API, because viewability is based on local rights limitations (different countries consider different books to be public domain), and we think it hurts the user experience to provide incorrect viewability information.
Doh! This doesn’t really answer my concern I’m afraid, I really can’t do what I need to do client side, at least not without extreme difficulty or loss of functionality. But I guess that’s how it’ll be!
2.
Had the idea of asking the KSU SFX example for an XML response, to see what that tells us. Of course everything in the XML response is necessarily generated server side. The Google Books section looks like this:
<target>
<target_name>LOCAL_GOOGLE_BOOK_SEARCH</target_name>
<target_public_name>Google Book Search</target_public_name>
<target_service_id>22170000000000005</target_service_id>
<service_type>getCitedReference</service_type>
<parser>GoogleBookSearch::Isbn</parser>
<parse_param/>
<proxy>no</proxy>
<crossref>no</crossref>
<note/>
<authentication/>
<char_set/>
<displayer>GoogleBookSearch::Reference</displayer>
<target_url>
http://books.google.com/books?id=BpPwy8t3OtIC&ie=ISO-8859-1&source=gbs_ViewAPI
</target_url>
</target>
Hmm. It’s hard to say. The XML response does not include what’s in the HTML response telling you for sure what kind of access is available. But it does include a Google Books URL that sure looks like it required talking to Google on the server-side to generate: it doesn’t have an ISBN in it, it has a Google unique ID. How would SFX know that Google unique ID without talking to Google on the server-side? Which Google tells me they don’t allow. Hmm. Curiouser.
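Pulling the target URL apart makes the point. This bit of Ruby uses only the URL from the XML above; the ISBN pattern is a rough check of my own, not anything from SFX or Google.

```ruby
require "uri"

target_url = "http://books.google.com/books?id=BpPwy8t3OtIC&ie=ISO-8859-1&source=gbs_ViewAPI"

# Decode the query string into a hash of parameter names and values.
params = URI.decode_www_form(URI.parse(target_url).query).to_h

google_id = params["id"]

# Rough test for anything shaped like a 10- or 13-digit ISBN among the values.
isbn_like = /\A(?:ISBN:)?(?:\d{9}[\dX]|\d{13})\z/i
has_isbn  = params.values.any? { |value| value.match?(isbn_like) }

puts "Google ID: #{google_id}, ISBN present: #{has_isbn}"
```

The only book identifier in there is Google’s own, and the `source=gbs_ViewAPI` parameter hints at where it came from.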
Another update: If I turn off javascript and look at the ksu SFX page, I still get the Google Books link. It would definitely appear to be server side. Is Ex Libris SFX allowed to do something that I am not?
Posted by jrochkind in Practice, business, cataloging.
Did you guys know that issn.org sold Z39.50 access to the ISSN registry/portal? I didn’t.
What might you want to use this for? Well, if the “linking ISSN” is deployed successfully, and the information is successfully included in the information available from the ‘issn portal’, then this is a machine-actionable source of correspondences between ISSNs that really represent the same title in different formats. I trust that many of my readers can think of all sorts of uses they could make of being able to embed that information in their various discovery applications.
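As one small illustration of such a use, here is a Ruby sketch that validates an ISSN’s check digit (the standard weights 8 down to 2, modulo 11) and then asks whether two ISSNs share a linking ISSN. The lookup table is entirely made up. In practice it would be loaded from whatever the registry (or xISSN) gives you, and the two ISSNs in it are placeholders chosen only because their check digits work out.

```ruby
# Validates an ISSN's check digit: weights 8 down to 2 over the first seven
# digits, modulo 11, with "X" standing for a check value of 10.
def valid_issn?(issn)
  chars = issn.to_s.upcase.delete("-").chars
  return false unless chars.size == 8 &&
                      chars[0, 7].all? { |c| c.match?(/\d/) } &&
                      chars[7].match?(/[\dX]/)
  sum = chars[0, 7].each_with_index.sum { |c, i| c.to_i * (8 - i) }
  expected = (11 - sum % 11) % 11
  chars[7] == (expected == 10 ? "X" : expected.to_s)
end

# Hypothetical ISSN => linking ISSN table (placeholder values, not real data).
LINKING_ISSN = {
  "1234-5679" => "1234-5679", # pretend print edition
  "0000-0019" => "1234-5679"  # pretend online edition of the same title
}.freeze

# True when both ISSNs are known and share the same linking ISSN.
def same_title?(issn_a, issn_b)
  link = LINKING_ISSN[issn_a]
  !link.nil? && link == LINKING_ISSN[issn_b]
end

puts same_title?("1234-5679", "0000-0019")
```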
OCLC xISSN also can potentially provide some of this data in machine actionable form. (Haven’t explored it yet myself). I assume that xISSN correspondences are currently algorithmically/heuristically generated from what information is available in a cataloging record, as opposed to the “linking ISSN” based metadata, which presumably will be manually controlled? But then an interesting question is the cost comparison of these two services licensed for the uses we’d want to put them to. Would be nice to have two competing metadata web services available for a change, instead of usually having NONE that do what we need.