A reading and review by Spencer Roberts, written for HIST 5V71 in Fall 2011

Culturomics is the brainchild of a team of computer scientists and humanists from various academic organisations who joined forces with Google to analyse millions of books. As the name suggests, the project was built to initiate a study of culture through extensive analysis of large data sets drawn from books, magazines, newspapers, journal articles, and the like. A paper discussing the potential of the software was published in Science in late 2010; in it, the authors highlighted the fields that culturomics might inform, such as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. The project was published online on December 12, 2010, and was also the topic of Patricia Cohen’s article in the New York Times on the same day. Cohen’s article prompted both commendations and objections, many of which are addressed in the extensive FAQs now posted on the Culturomics website.

One of the most aggressive criticisms of the project comes from an English professor, quoted in Cohen’s article, who questions the lack of humanists involved in the project. In their FAQ, the Culturomics team responds, “That’s incorrect,” and lists the humanities credentials held by a number of its members. They also point out, “What matters is the quality of the data and the analyses in the paper and what it means for how we think about a great variety of phenomena - not the degrees we happen to hold or not to hold. If what we seek is a serious conversation about this work, we shouldn’t exclude anyone who has something significant and thoughtful to say. That would be a shame.”

The significance of the Culturomics project is, I believe, well expressed in the team’s extensive and conscientious responses to its critics. They suggest “quantitative methods can be a great source of ideas that can then be explored further by studying primary texts.” The project is not meant as an affront to existing methods, but as a supplemental source of insight and inspiration. To those who might say we cannot quantify culture, they respond that quantification can still offer insight into culture, and that interpretation remains absolutely necessary when using such methods.

The Culturomics project, then, is an attempt to make accessible for analysis the millions of texts being digitised by Google, not because the texts have been ignored previously, but because they can now be read collectively and comprehensively in discrete, quantifiable strings (n-grams).
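For readers unfamiliar with the term, an n-gram is simply a run of n consecutive words, and counting n-grams is the basic operation behind this kind of analysis. The following is a minimal sketch in Python, not the Culturomics team's own code: the function names and the sample sentence are my own illustrations, and splitting on whitespace is a simplifying assumption that a real corpus would not tolerate.

```python
from collections import Counter


def ngrams(text, n):
    """Return the n-grams of a text as space-joined strings.

    Assumption: tokens are lower-cased and split on whitespace only,
    which ignores punctuation and other real-world complications.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(texts, n):
    """Count how often each n-gram occurs across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    return counts


# Hypothetical sample text, for illustration only.
sample_texts = ["the evolution of grammar and the evolution of culture"]

unigram_counts = ngram_counts(sample_texts, 1)
bigram_counts = ngram_counts(sample_texts, 2)

print(unigram_counts["evolution"])      # occurrences of the 1-gram "evolution"
print(bigram_counts["evolution of"])    # occurrences of the 2-gram "evolution of"
```

Counting over many texts at once is what allows them to be read "collectively": the same tally works whether the collection holds one sentence or millions of books, though at that scale storage and processing become problems of their own.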

Compared with other digital humanities projects, Culturomics resembles the early work of Father Roberto Busa and text projects such as those initiated by Willard McCarty. The main difference is that this new project incorporates an unprecedented volume of data and has certain characteristics coded into its structure that are necessary for coping with that volume. Text analysis is a strong tradition in the digital humanities, and arguably the longest; this project takes the principles of text encoding and analysis to a level appropriate for the amount of data made available by an increasingly digital world.

Other criticisms of the project concern the quality of the data provided and the implications for conclusions drawn from that data. The Culturomics team responds by noting that all data has a certain level of error, and that their work is not meant to replace critical evaluation and interpretation, nor to be accepted without them. The results published in Science were based on trends found in the dataset, but were interpreted by the team in order to give significance to the findings. The research that informed their results was cross-referenced with other databases and sources, as any proper humanities research would be.

The potential of the Culturomics project is difficult to predict, both because less than a year has passed since its publication and because the nature of prototype or trial digital humanities projects is to develop a framework upon which others expand. Because the work has been made public through the Google Ngram Viewer and linked to Google Books, a significant portion of its user base is composed of enthusiasts and amateurs. The public will make use of the tools as they see fit, and the humanities would be wise to do likewise, a vision shared by the authors of the project.

An important aspect of the project FAQ is the inclusion of recommendations for improved searches and interpretation of results. Choosing more specific n-grams will filter out alternate meanings; the example given by the FAQ is the difference between “Chad” and “Republic of Chad”. The authors also suggest that results are best interpreted by eliminating possible explanations for a trend and narrowing down to the most likely. For instance, random noise (spikes in the chart that depart from the trend) can be discounted by looking closely at statistical significance, by examining the trajectories within the outlying data points, or by comparing with similar searches. The authors also point out that problems due to poor optical character recognition (OCR) are not likely to create significant deviations, and that misdating of books would likewise cause little interference.
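To make the point about noise concrete, the sketch below shows one simple way a reader might screen a yearly series for isolated spikes before interpreting it. This is my own illustrative heuristic in Python, not the procedure described in the FAQ: the function names, the window size, the threshold factor, and all of the counts are assumptions invented for the example.

```python
from statistics import median


def relative_frequency(match_counts, total_counts):
    """Divide each year's count for an n-gram by that year's total count."""
    return {year: match_counts[year] / total_counts[year] for year in match_counts}


def isolated_spikes(series, window=2, factor=3.0):
    """Return the years whose value exceeds `factor` times the median of
    up to `window` neighbouring years on each side.

    Assumption: this neighbour comparison is a crude stand-in for a
    proper test of statistical significance.
    """
    years = sorted(series)
    flagged = []
    for i, year in enumerate(years):
        neighbour_years = years[max(0, i - window):i] + years[i + 1:i + 1 + window]
        neighbours = [series[y] for y in neighbour_years]
        if neighbours and series[year] > factor * median(neighbours):
            flagged.append(year)
    return flagged


# Hypothetical yearly counts for a single n-gram, for illustration only.
match_counts = {1900: 10, 1901: 12, 1902: 11, 1903: 60, 1904: 12, 1905: 13, 1906: 14}
total_counts = {year: 1_000_000 for year in match_counts}

frequencies = relative_frequency(match_counts, total_counts)
spike_years = isolated_spikes(frequencies)
print(spike_years)  # years that stand far above their neighbours
```

A flagged year is only a prompt for closer reading. Whether it reflects a misdated volume, an OCR problem, or a genuine event still has to be settled by returning to the sources and by comparing with similar searches, as the authors recommend.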

Further research into the trends shown by the Ngram Viewer is necessary, say the authors, and that research must use the critical analysis, interpretation, and synthesis found in traditional research projects. Culturomics analysis does not eliminate the scholarly work done by humanists, but offers an alternative method that draws on more sources than was previously possible.

A final criticism of the project questions its ties to Google and laments the commercialisation or monopolisation of scholarly research tools and knowledge. While those concerns are important, the Culturomics team has clearly distinguished their work from that of Google, and should not be unduly criticised for cooperating with a corporation that had access to the data they required. Google’s intentions for the data and analysis tools cannot be predicted with any certainty; there is no reason, however, to suspect that Google would attempt to limit access (other than for copyright reasons). Many of the company’s projects have been open-access and operated under the assumption that sharing leads to improvement and innovation; while I would hesitate to guess at Google’s future decisions, I suggest humanists make use of the tools while they are free and open. Even if access is limited in the future, the groundwork for such voluminous data analysis has been laid for humanists to build upon.

