The digital storehouse, which comprises words and short phrases as well as a year-by-year count of how often they appear, represents the first time a data set of this magnitude and searching tools are at the disposal of Ph.D.’s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words that are contained in books published between 1800 and 2000 in English, French, Spanish, German, Chinese, Russian and Hebrew.
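The storehouse described above is essentially a mapping from (phrase, year) to an occurrence count. As a rough illustration only (this is not Google's actual pipeline; the tiny corpus and function names below are invented for the example), building such year-by-year n-gram counts might look like:

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-word phrases from a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def count_by_year(books, n=2):
    """books: iterable of (year, text) pairs.
    Returns a Counter mapping (phrase, year) -> occurrence count."""
    counts = Counter()
    for year, text in books:
        for phrase in ngrams(text.lower().split(), n):
            counts[(phrase, year)] += 1
    return counts

# Made-up miniature corpus, not real data:
books = [
    (1900, "the quick brown fox"),
    (1900, "the quick dog"),
    (2000, "the quick brown fox"),
]
counts = count_by_year(books, n=2)
# counts[("the quick", 1900)] is 2; counts[("brown fox", 2000)] is 1
```

Note that producing these counts still requires scanning the full text of every book, which is the crux of the copyright point below.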
A side issue
From that same article:
Google says the culturomics project raises no copyright issue because the books themselves or even sections of them cannot be read.
That would make some sense if it were true, but the publishers suing Google would probably beg to differ. Copyright says you can’t make a copy without the rights-holder’s permission. It doesn’t say you can make a copy as long as you don’t share it publicly. (Although the penalties might be worse if you do.)
If it weren’t for this, the couple-sentence excerpts Google showed for arbitrary books in Google Book Search would almost certainly, in each individual case, be fair use. What gave the publishers a case was that Google had to copy the whole thing to provide that service. As they did to build the ‘culturomics’ database.