Google Database to Become the 'Genome' of Culture

Article  by  Samuel MILLER  •  Published 23.12.2010  •  Updated 21.01.2011
[NEWS] A team of researchers from Harvard has, with the help of Google Books, produced the world’s largest corpus of lexical information. ‘Culturomics’ may provide fresh insight into a huge range of cultural trends throughout history.
Six years and 15 million volumes after it began its mission to ‘digitally scan every book in the world’, Google has made a huge linguistic database – 500 billion words taken from 5.2 million books – freely available for anyone to download or search online.
The database consists of all the words, or “n-grams”, that appear in works published between 1800 and 2000. About 72% of it is in English, but there are sizeable contributions from French, Spanish, German, Chinese, and Russian. The corpus, which is thousands of times larger than any existing body, is aimed at scholars who wish to devise their own analytical tools to explore it. However, Google has made its own online tool, which allows users to chart over time a string of up to five words, accessible to anyone with a computer.
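An n-gram is simply a contiguous run of words. As a rough illustration of the unit the corpus counts (the function below is purely illustrative and is not Google’s code), here is how a sentence breaks into the 1- to 5-grams the dataset stores:

```python
def ngrams(text, n):
    """Return every contiguous n-word sequence in a text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "fame arrives earlier and quicker than before"
for n in range(1, 6):  # the corpus stores 1-grams up to 5-grams
    print(n, ngrams(sentence, n))
```

Each book is reduced to counts of such fragments per year, which is what makes it possible to chart a phrase’s frequency over two centuries without exposing the copyrighted text itself.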
To coincide with the release, the journal Science has published the findings of the first explorations of the dataset (available with free registration). The article, mainly written by the two Harvard researchers who developed the search system and worked with Google on the assembly of the dataset, Dr Erez Lieberman Aiden and Dr Jean-Baptiste Michel, offers but a taste of the rich opportunities for research now open to all domains of the humanities and social sciences.
Their study is based on 5 million (about 4% of all the books ever printed) of the 15 million volumes digitised thus far, the reduction a result of including only books with reliable bibliographical data – in particular, the date and place of publication. In it, they outline what they have dubbed “culturomics”, the appellation a nod to the resemblance the corpus bears to the genome of a living organism. The discipline consists of the various ways that huge databases of “hard” facts, which are not always readily available to those who study culture, can be used to throw light on major questions in the humanities.
According to the abstract of the research paper, “this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology” and “extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.”
Amongst the more remarkable results is the fact that the English lexicon appears to consist of 52% “dark matter” – words that together make up the majority of written text but that do not appear in standard references. This is to be expected when one considers the balance between comprehensiveness and practicality that every dictionary must strike. Although dictionaries are updated regularly, the analysis showed that “there was a lag between lexicographers and the lexicon”.
The team also found that, by tracking the appearance of names in print, they could chart the changing nature of fame. They found that, between the early nineteenth century and the mid-twentieth, the average age of initial celebrity fell from 43 to 29 years, and that the doubling time in the lead-up to the peak of fame fell from 8.1 to 3.3 years. The post-peak half-life of fame, as indicated by appearances of the subjects’ names in print, “dropped from 120 to 71 years during the nineteenth century”. In other words, fame arrives earlier, quicker, and with far more intensity than in the past, but famous people are now being forgotten more rapidly.
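The doubling times and half-lives quoted above are the parameters of ordinary exponential growth and decay. A minimal sketch of what those figures imply, using only the numbers reported in the study (the helper functions are illustrative, not the paper’s own code):

```python
import math

def growth_rate(doubling_time):
    # Exponential rate r such that f(t) = f0 * e**(r*t) doubles every `doubling_time` years.
    return math.log(2) / doubling_time

def fraction_remaining(half_life, years):
    # Fraction of peak mention frequency left after `years`, given a half-life in years.
    return 0.5 ** (years / half_life)

# Rise to peak fame: doubling time fell from 8.1 to 3.3 years.
print(f"growth rate: {growth_rate(8.1):.3f} -> {growth_rate(3.3):.3f} per year")

# Decline after the peak: half-life fell from 120 to 71 years.
print(f"mentions left after 50 years: {fraction_remaining(120, 50):.2f} vs {fraction_remaining(71, 50):.2f}")
```

The shorter doubling time means roughly two-and-a-half times faster exponential growth on the way up, while the shorter half-life means a markedly steeper decline once the peak has passed.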
Censorship and suppression were also found to have left their mark on the lexicological record. The researchers found that when they compared the occurrence of the name “Marc Chagall”, the artist, in English and German, there was a rapid increase starting in 1910 in both languages. With the onset of the 1930s, however, Chagall’s name in German reached its lowest point (his name only appearing once between 1936 and 1944), while in English it continued to rise as before.
Finding that they could accurately track the suppression of particular groups during the Third Reich, the team tested “whether one could identify victims of Nazi repression de novo”. They created a list of all the people mentioned in the period, calculating for each a “suppression index”. According to an article published in Science, over 80% of the names identified by the suppression index were known to have been suppressed, indicating that the method had worked. The most compelling finding, therefore, is that the remaining 20% may consist of victims of suppression formerly unknown to history.
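One plausible way to read the “suppression index” is as a ratio: how often a name appears in print during the repressive period compared with a baseline period. The sketch below is an assumption-laden illustration of that idea, not the statistic actually defined in the paper, and the frequencies are hypothetical:

```python
def suppression_index(freq_during, freq_baseline):
    """
    Illustrative suppression index: the ratio of a name's printed frequency
    during a repressive period to its frequency in a baseline period.
    Values well below 1 would suggest suppression. The paper's actual
    statistic may differ in detail.
    """
    if freq_baseline == 0:
        return None  # no baseline signal to compare against
    return freq_during / freq_baseline

# Hypothetical per-million-word frequencies for an author's name
print(suppression_index(freq_during=0.2, freq_baseline=2.0))  # → 0.1, heavily suppressed
```

On such a measure, a name like Chagall’s – near-absent from German books of 1936–1944 despite a strong earlier presence – would score very low, which is exactly the pattern the researchers used to flag candidates for suppression.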
Speaking to the Guardian last week about the unprecedented nature of the Google dataset, Michel, whose background is in systems biology and applied mathematics, said that “[i]nterest in computational approaches to the humanities and social sciences dates back to the 1950s…[b]ut attempts to introduce quantitative methods into the study of culture have been hampered by the lack of suitable data. We now have a massive dataset, available through an interface that is user-friendly and freely available to anyone.”
Not everyone is convinced by this encroachment of quantitative analysis on the humanities and social sciences, which are commonly seen as the search for meaning, or, in other words, as something fundamentally unquantifiable. According to Harvard English professor Louis Menand, quoted in a New York Times article, “[i]n general it’s a great thing to have … [but] obviously some of the claims are a little exaggerated”. He goes on to say that “there’s not even a historian of the book connected to the project”.
According to Aiden, whose expertise is in applied mathematics and genomics, quoted in the same NYT article, “I don’t want humanists to accept any specific claims – we’re just throwing a lot of interesting pieces on the table … the question is: Are you willing to examine this data?”. Taken this way, the wealth of lexicological information furnished by the Google corpus can be mined for corollary evidence to reinforce traditional methods of cultural enquiry or be used as a starting point. Culturomics, as such, simply provides information, and will always require careful analysis.
Of course, there is more to culture than what makes it into published books. The additional ability to analyse words in their original context would allow the tracking of semantic changes, as well as the changing attitudes of writers towards the words they employed. But context is exactly what the n-gram system disallows; many of the books included in the data are still under copyright, especially those from the twentieth century, and therefore cannot be released for open scrutiny without first being broken up into n-grams. It is likely that additional analytical capabilities will first be added to information taken from works produced in the nineteenth century.
Google Books, despite having reached unprecedented agreements with many publishers, has been facing legal battles pertaining to copyright and compensation, due to its plan to digitise material still legally protected by copyright.    