http://www.nytimes.com/2010/12/17/books/17words.html
The digital storehouse, which comprises words and short phrases as well as a year-by-year count of how often they appear, represents the first time a data set of this magnitude and searching tools are at the disposal of Ph.D.’s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words that are contained in books published between 1800 and 2000 in English, French, Spanish, German, Chinese, Russian and Hebrew.
A side issue
from that same article:
Google says the culturomics project raises no copyright issue because the books themselves or even sections of them cannot be read.
That would kind of make sense if it were so, but the publishers suing Google would probably beg to differ. Copyright says you can’t make a copy without the rights holder’s permission. It doesn’t say you can make a copy as long as you don’t share it publicly. (Although the penalties might be worse if you do.)
If it weren’t for this, the mere couple-sentence excerpts that Google showed for arbitrary books in Google Book Search would almost certainly, in each individual case, be fair use. What gave the publishers a case was that Google had to copy the whole thing to provide that service. As they do for the ‘culturomics’ database.
Have you ever wondered why the publishers are not also suing the Google partner libraries for basically aiding and abetting Google’s wholesale copying of in-copyright works? I mean, it’s not like the libraries were ignorant of Google’s intentions. Do you think partner libraries ever thought that they would also be at risk in the lawsuit?
Presumably they would argue that this is a transformative use (see http://en.wikipedia.org/wiki/Transformation_(law) ).
Bryan: Libraries have special protections under copyright law that probably apply here. See http://www.law.cornell.edu/uscode/17/108.html
Well, the partner libraries ARE allowed to loan the books out to whomever, including Google. I don’t think the partner libraries are liable even if they ought to know the borrower plans to copy it. Plus, libraries are still well liked enough that suing them tends to look bad.
I think Google is relying on the part of fair use that asks whether a use is transformative. Google will say that it has completely transformed the originals and the final product does not try to substitute for the originals in any way. For example, this is how parody manages to qualify as fair use: it is transformative.
It would be very difficult for the copyright holder of any particular book to say that because of Google’s new tool their book has been superseded and is no longer valuable, because Google has created something entirely new in value and purpose.
Making the full-text scans generally available is, of course, a completely different matter.
Quite possibly, James.
Fair use, perhaps sadly for us users, doesn’t have any simple measurements. It is not just about transformation — in fact, the four standard measures, which are balanced amongst themselves, for determining fair use don’t even include transformation as such, although they include “purpose and character of the use”, with transformation being part of the character. And they certainly also include “the effect of the use upon the potential market for or value of the copyrighted work.”
That one gets really tricky with the newfangled internet and its newfangled markets. There is/was no real market for n-grams created from texts. There ALSO is/was no market for searches over texts revealing excerpts, the use that the publishers sued Google for. That use certainly didn’t impact the market for actually buying texts, but I’d guess the publishers suing Google argued that they thought they could create such a market: that they could charge SOMEONE for including their texts for searching purposes even though they weren’t yet. It doesn’t really help the case for a market that most of the big publishers are willingly giving Amazon free licenses to copy texts for full-text searching and excerpts, though.
Many people thought Google could probably prevail on a fair use case for their searching-with-excerpts, and were disappointed they chose to settle instead. In part, the legal precedent saying that image thumbnails on the internet, used as previews, are a fair use would seem to be pretty parallel to three-sentence excerpts on the internet used as previews. As Wikipedia summarizes:
http://en.wikipedia.org/wiki/Fair_use#Fair_use_on_the_Internet
Sounds a lot like using three-sentence excerpts as previews in search results, right? Can’t be used as a substitute for the actual book (even MORE so than a low-res preview of an ENTIRE image); the book has already been published; yes, you have to make a ‘full’ replication in order to supply the keywords-in-context three-sentence excerpts, but it’s just what you have to do in order to display the appropriate three-sentence excerpts for a search. And the market for buying the books themselves (in print or even digitally) does not seem to be affected by displaying three-sentence in-context excerpts, which may even increase awareness that the books are available for purchase.
At any rate, it’s clearly all a mess, but I definitely question the Google person’s suggestion that, oh, of COURSE there’s no copyright issue with the n-grams because “the books themselves or even sections of them cannot be read.” It is hardly that simple.