Sunday, December 19, 2010
Tracing history in the texts we write
The Google Books service is one we use a lot, especially when we're on the road. It's like having 24/7 access to a massive library of surprisingly complete-enough works. Recently, a consortium of researchers at Google, Harvard University, the Massachusetts Institute of Technology and the Encyclopedia Britannica unveiled a searchable "database of two billion words and phrases drawn from 5.2 million books in Google's digital library published during the past 200 years" (source). From the abstract of an article published by this group:
We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
For example, the database can be queried to show the effects of changes in language (e.g., the shift from 'nt' verb endings to 'ed' endings, as in burnt to burned and learnt to learned, or the addition of new words to English or other languages, such as "pizza" or "sushi"), or to graph the mention of women in published texts (which shifted dramatically in the 1960s as feminist movements took hold).
In addition to English, the database includes texts written in Chinese, French, German, Russian and Spanish.
The research group has developed a graphical interface for everyday users to play around with. This interface--the Ngram viewer--lets you adjust the year span and enter up to five keywords you'd like graphed, and it provides complementary links to related or relevant Google books. Below, for example, is an Ngram graph of the instances of the term "new literacies" in the database from 1960 onwards.
Wired Magazine has a range of additional interesting example graphs. And if you're registered with the New York Times, you can read more about the database and examples of findings here. Cultural evolutionists and linguists are all hailing the development of this database as a significant contribution to being able to quantify cultural trends in interesting ways, although everyone agrees that analysis will need to move beyond counts and graphs and the like to include more sophisticated qualitative analysis in order to be truly useful.
The entire database itself is even available for download if you'd like to develop your own sorting tools to use with it.
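If you do want to tinker with the downloaded files, here is a minimal sketch in Python of the kind of sorting tool we have in mind. It assumes each line of a data file is tab-separated, with the word or phrase in the first column, the year in the second and the number of occurrences in the third (any further columns are ignored). That layout is our assumption, so check it against the files you actually download. The function name `yearly_counts` and the sample rows are invented purely for illustration and are not real figures from the database.

```python
import csv
import io
from collections import defaultdict


def yearly_counts(lines, terms):
    """Sum occurrence counts per year for each requested term.

    Assumes tab-separated rows: ngram, year, match_count, ...
    (an assumed layout; verify it against the real files).
    Matching is case-insensitive, so "Burnt" and "burnt" are pooled.
    """
    wanted = {term.lower() for term in terms}
    totals = defaultdict(lambda: defaultdict(int))
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        key = row[0].lower()
        if key in wanted:
            totals[key][int(row[1])] += int(row[2])
    # Convert to plain dicts with the years in ascending order.
    return {term: dict(sorted(years.items())) for term, years in totals.items()}


# Made-up rows for illustration only (not real counts from the database).
SAMPLE = (
    "burnt\t1900\t120\t100\t40\n"
    "burned\t1900\t95\t90\t35\n"
    "Burnt\t1900\t5\t5\t3\n"
    "burnt\t2000\t60\t55\t30\n"
    "burned\t2000\t400\t350\t90\n"
    "learnt\t2000\t10\t10\t8\n"
)

result = yearly_counts(io.StringIO(SAMPLE), ["burnt", "burned"])
```

With a real file you would pass an open file handle instead of `io.StringIO(SAMPLE)`. Because the function reads one row at a time, it should cope with large files without loading them fully into memory.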