Scholars Elicit a ‘Cultural Genome’ From 5.2 Million Google-Digitized Books
By Marc Parry
The English language is going through a time of huge growth. Humanity is forgetting its history more rapidly each year. And celebrities are losing their fame faster than in the past.
Those are some of the findings in a paper published on Thursday in the journal Science by a Harvard-led team of researchers. The scholars quantified cultural trends by investigating the frequency with which words appeared over time in a database of about 5.2 million books, roughly 4 percent of all volumes ever published, according to Harvard’s announcement.
The research team, headed by Jean-Baptiste Michel and Erez Lieberman Aiden, culled that digital “fossil record” from more than 15 million books digitized by Google and its university partners. Google is giving the public a glimpse of the researchers’ data through an online interface that lets users key in words or phrases and plot how their usage has evolved. The paper’s authors bill this as “the largest data release in the history of the humanities.”
Scholars have explored quantitative approaches to the humanities for years. What’s novel here is the volume of material. According to a Google spokeswoman, the data set of 5.2 million books includes both in- and out-of-copyright titles in several languages from 1500 to 2008. Its more than 500 billion words amount to a sequence of letters 1,000 times as long as the human genome. This “cultural genome” would stretch to the moon and back 10 times over if arranged in a straight line.
Counting on Google Books
By Geoffrey Nunberg
Humanities scholars may someday count as a watershed the paper that appeared on Wednesday in Science, titled “Quantitative Analysis of Culture Using Millions of Digitized Books.” But they’ll have certain things to get past before they can appreciate that.
The paper describes some examples of quantitative analysis performed on what is by far the largest corpus ever assembled for humanities and social-science research.
The authors call this approach “culturomics.” That’s culturomics with a long o, with the implication that the object of study is the “culturome,” presumably the mass of structured information that characterizes a culture. The point of comparison might be biological models of evolution or simply the idea that culture, like the genome, can be “cracked” via massive distributed (that is, “high-throughput”) processing.
The inspiration for the Science paper came from two young Harvard researchers, Jean-Baptiste Michel and Erez Lieberman-Aiden, with backgrounds in genomics and mathematics. And almost all of the paper’s 12 authors (11 individuals plus “the Google Books team”) are mathematicians, scientists, or engineers, some at Google, the rest mostly at Harvard or the Massachusetts Institute of Technology. The very fact that the paper was submitted to Science suggests that the authors are more interested in winning the ear of their scientific colleagues than in reaching the scholars who will be the primary beneficiaries of this new approach.
Scholars can’t download the entire corpus right now, but the impediments are legal and commercial rather than technological. (Google could make available a corpus of all the public-domain works published through 1922 without raising any copyright issues, but it has decided not to do that.) In the meantime, scholars have access to the corpus via the Web sites Ngrams.GoogleLabs.com and culturomics.org.