Distance Reading

From Brock University's Digital Humanities Compendium




Distance Reading Articles

Cohen "Analyzing Literature by Words and Numbers"

Cohen "In 500 Billion Words..."

Crane "What Do You Do with a Million Books"

Moretti Graphs, Maps, Trees (in case some of you can't find the text, these articles are similar)

Google Ngram Viewer and Other Supplementary Material

Google Ngram Viewer

Google Datasets

Quantitative Analysis of Culture Using Millions of Digitized Books

"Culturomics" Defined

Also, for a very good understanding of the uses proposed by the Culturomics team, see this article.

Readings

Cohen "Analyzing Literature by Words and Numbers" Review and Notes


In her concise and balanced article, Patricia Cohen is clearly enamoured with the possibilities of digital tools and resources to enhance and supplement humanities research, yet she has evident concerns about strictly statistical and quantitative information replacing the interpretive and critical aspects of the humanities. Highlighting this concern, she notes several case studies in which a strictly statistical examination misconstrues the presence or absence of a word, term, or phrase in a text. Despite any personal misgivings on her part, however, she gives credence and credit to the benefits such research promises. Her respect for the work of Dan Cohen and Fred Gibbs comes through in her presentation of their research as a means of complementing and reconsidering traditional historical and literary scholarship. Despite her interest, she is nonetheless quick to reiterate the concerns of several scholars who argue that digital resources, as they develop, stand not only to “reduce literature and history to a series of numbers” but also “to shape the kinds of questions” scholars will ask. Beyond the methodological commentary and concerns that digital tools raise for humanities research, Patricia Cohen also notes the rising role that corporations like Google play in framing resources, literature, and scholarship online. On this matter Cohen highlights the traditional, yet valuable, critiques most often raised against monopolistic information ownership: control, cost, and access.

Notes on Article

Statistical analysis, with texts being “electronically scoured for key words and phrases that might offer fresh insight into the minds of Victorians”

New research as a result of new digital tools and databases

Technology is “transforming the study of literature, philosophy, and other humanistic fields that haven’t necessarily embraced large-scale quantitative analysis”

Notes that Dan Cohen and Fred Gibbs are using this sort of digital technology to search and graph the frequency of specific words and word groupings (e.g., God, love, work, science, and industrial) from 1789 until 1914

Cohen and Gibbs are using the results of this data to reaffirm or question traditional historical and literary assumptions and interpretation regarding the Victorian age

Author notes that while there is interest in this sort of research there is also some hesitation by those concerned that such resources would “reduce literature and history to a series of numbers, squeezing out important subjects that cannot be easily quantified.”

Scholar Matthew Bevis raised the comment that these sorts of programs are "not just tool[s]" but that they actually have the potential to "shap[e] the kind of questions someone in literature might even ask."

The author is quick to note that Cohen and Gibbs's project, "Reframing the Victorians," is just one of a dozen to be recognized by Google, itself a major corporation involved in the digitization of texts.

With the mention of Google and its growing role in and control of digital texts and online libraries, Patricia Cohen observes that such plans have many concerned about the potential for one organization to control this many resources. Control, as Cohen notes, often translates into higher costs as well as questions of availability and access for individuals.

Patricia Cohen highlights that many scholars believe digital tools will let them conduct more comprehensive studies of certain ideas, eras, and cultures since they are now able to search a much larger number of texts and books.

However, to this enthusiasm, Dan Cohen is quick to acknowledge that much of the early statistical analysis of texts "is anything but clear" in its results. Software programmed to find specific terms inherently avoids related words and phrases, and has the potential to miss the larger context in which words are used (e.g., syntax).

The point being that, on a certain level, fewer or more references to a specific term do not necessarily equate to focus or disuse.

Cohen concludes by asking whether statistical analysis of texts will "overshadow meaning and interpretation" or whether it will serve to "highlight the importance and value of close reading ... [and] ... heightened engagement with words, paragraphs and lines of verse."
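The kind of word-frequency search Cohen attributes to Dan Cohen and Fred Gibbs can be sketched in a few lines. Everything below is illustrative: the "corpus" is three invented sentences standing in for digitized Victorian books, and the function simply counts tracked terms per year. The sketch also makes the context problem visible, since the counts say nothing about how each word is actually used.

```python
from collections import Counter

# Hypothetical mini-corpus: year -> text. In the real project the corpus
# is thousands of digitized Victorian books; these snippets are invented
# stand-ins for illustration only.
corpus = {
    1840: "god and work were spoken of with reverence and hope",
    1870: "science and industry promised progress though doubt of god grew",
    1900: "the age of science had arrived and work was transformed",
}

terms = ["god", "science", "work"]

def term_frequencies(corpus, terms):
    """Count how often each tracked term appears in each year's text."""
    freqs = {}
    for year, text in corpus.items():
        counts = Counter(text.lower().split())
        freqs[year] = {t: counts.get(t, 0) for t in terms}
    return freqs

freqs = term_frequencies(corpus, terms)
for year in sorted(freqs):
    print(year, freqs[year])
```

A raw count like this is exactly the kind of result Dan Cohen calls "anything but clear": the counts alone cannot distinguish, say, reverent from skeptical uses of "god."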

Cohen "In 500 Billion Words..." Review and Notes


Patricia Cohen’s article “In 500 Billion Words, New Window on Culture” aptly examines the release of Google’s new database, the Ngram Viewer, and its potential implications for humanities research. This digital tool was designed by Lieberman Aiden and Jean-Baptiste Michel and was “culled from nearly 5.2 million digitized books” published between 1500 and 2008. The Ngram Viewer accepts searches of up to five words and produces “a graph that charts the phrase’s use over time.” According to its designers, the goals of this resource are to understand human behaviour, word usage, and cultural change while providing humanities researchers with results that are “more convincing and more complete.” Among the insights the project yields are analyses of the duration of cultural trends, such as the rate of invention and the duration of fame, as well as the growth and development of the English lexicon. For her own part, Patricia Cohen runs a number of searches, comparing references to women and men, to Mickey Mouse, and to Marilyn Monroe, and discovers that “Tiananmen Square” has more coverage in English than in Chinese texts. However, while Cohen admits that this online tool is interestingly addictive and provides insightful material for consideration, she is nonetheless concerned about the implications of what Lieberman Aiden terms “culturomics.”

In reflecting on the implications of this type of research, Cohen contrasts Steven Pinker, a Harvard linguist who argues that the output of such database searches “makes results more convincing and more complete,” with Louis Menand, an English professor who acknowledges the benefits of such resources but also observes that not a single humanist was involved in the Culturomics project. Columbia professor Alan Brinkley is similarly referenced as dubiously wondering what is being done with such statistics. In response to such traditional humanist hesitation about their work, Cohen notes that both Michel and Lieberman Aiden readily acknowledge that “cultural references tend to appear in print much less frequently than everyday words” and that their project “simply provided information” in the hopes that scholars would be “willing to examine the data.”

Notes on Article

Article discussing Google’s “mammoth database culled from nearly 5.2 million digitized books” which is now available and which stands to open a vast range of “possibilities for research and education in the humanities”

The focus of Google’s resource, created from books published between 1500 and 2008, is to serve as a database that allows anyone to input searches of up to five words and review “a graph that charts the phrase’s use over time”

Cohen admits that this sort of mental exercise is interestingly addictive, as well as providing insightful material for consideration.

For her own part, Patricia Cohen runs a number of searches, comparing references to women and men, to Mickey Mouse, and to Marilyn Monroe, and discovering that “Tiananmen Square” has more coverage in English than in Chinese texts.

This resource, designed by Mr. Lieberman Aiden and Jean-Baptiste Michel, seeks to utilize digital resources in order to “transform our understanding of language, culture and the flow of ideas.”

Lieberman Aiden, a mathematician and genomics scholar, set as his goal to demonstrate “what becomes possible when you apply very high-turbo data analysis to questions in the humanities.” He has called his method “culturomics,” which Wikipedia defines as “a form of computational lexicology that studies human behaviour and cultural trends through the quantitative analysis of digitized texts.”

The goal of this resource is to demonstrate the research possibilities which digital tools bring to humanities scholarship and which have traditionally been avoided by literary and history academics.

Some of the insights which this project gives way to are an analysis of the duration of cultural trends such as the rate of invention, duration of fame, as well as the growth and development of the English lexicon.

Cohen quotes Steven Pinker, a Harvard linguist who argues that the output of such database searches “makes results more convincing and more complete”

Pinker believes that despite resistance in the humanities to such research, the results of these sorts of databases will make their usage more “mainstream”

Patricia Cohen then turns to Louis Menand, a Harvard English professor who, while acknowledging the benefits of such resources, also notes that some such claims can be “exaggerated”.

Menand also comments that among those advocating Culturomics there was not a single humanist involved in the project.

Similarly, Columbia American studies professor Alan Brinkley is noted as dubiously wondering what exactly researchers were trying to do with these statistics.

In response to such traditional humanist hesitation about their work, both Michel and Lieberman Aiden emphasize that their project “simply provided information” in the hopes that scholars would be “willing to examine the data.”

Both readily acknowledge that “cultural references tend to appear in print much less frequently than everyday words.”

Cohen notes that the main impetus of Lieberman Aiden and Michel’s research into Culturomics arose from the 18 months they spent scrutinizing Anglo-Saxon texts for irregular verbs. When they read about Google’s plans to digitize texts and create an online library, they sought to collaborate with Peter Norvig at the company.

Cohen also notes that despite ongoing legal actions against Google for copyright infringement, the Culturomics project is exempt as neither “the books themselves, nor even sections of them” can be read.
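The Ngram Viewer's core measurement, a phrase's share of all words printed in a given year, can be sketched as follows. This is a toy version under loud assumptions: the two sentences stand in for entire publication years (the real dataset spans roughly 5.2 million books), and only single words rather than five-word phrases are handled.

```python
# Toy corpus: year -> all "published text" for that year (invented data).
corpus_by_year = {
    1950: "the atom age began and science filled the papers",
    1980: "computers entered homes and science fiction flourished",
}

def relative_frequency(corpus_by_year, word):
    """Return the word's frequency per year, normalized by the year's
    total word count, as the Ngram Viewer does for each n-gram."""
    series = {}
    for year, text in corpus_by_year.items():
        tokens = text.lower().split()
        series[year] = tokens.count(word) / len(tokens)
    return series

series = relative_frequency(corpus_by_year, "science")
```

Normalizing by the year's total output matters: raw counts would rise for nearly every word simply because more books are printed over time.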

Crane "What Do You Do with a Million Books" Review and Notes


Gregory Crane seeks to address the issues surrounding the digitization of massive numbers of books and libraries with regard to document and textual analysis. Crane points out that we possess effectively unlimited space (on the internet) and, in some instances, large amounts of funding (e.g., through Google's projects), but there are several other issues to consider in the shift from simply digitizing documents to analyzing them. Crane raises the issue that since the computer is now able to perform document analysis apart from human intellect, the possibilities for analysis open up, in some cases too much. The article also points out six ways in which massive digitization could change our existing digital libraries: scale, heterogeneity of content, granularity of objects, noise, audience, and collections and distributors. A change in these dimensions would change how digital libraries are used and how we support them. I don't believe that Crane is arguing against the expansion of digitized materials; he is simply pointing out that large-scale projects would result in many changes, and that we have to be aware of these changes in order to adapt to the digitized world.


- 21st-century libraries allow unlimited use across the world; this allows not only text sharing but also other forms of media
- books are already "reading themselves," then providing documents of comparison to readers
- larger documents are broken down into smaller ones
- as a digitized library grows, documents are already learning from one another and updating language models as well
- documents also learn from users: they can analyze who is reading them and from where, creating another set of usage data
- the ability to decompose information into smaller sizes, learn from a changing environment, and accept feedback from users is not as easily done in the print world
- issues with traditional ways of analyzing documents: limited scope, grammar and language not easily translated, no creation of databases
- the Million Book Project at Carnegie Mellon had, by November 2005, already scanned more than 600,000 books
- Google partnered with the Harvard Library System, which contains over 15,000,000 items
- Yahoo and Microsoft are developing collections of millions of books online
- the European Union is also creating a digital system, for fear that there was too much North American influence on these digital libraries
- these examples demonstrate that there is money available for digitization
- one issue is that these libraries might not be able to support the revenue needed because they are free
- there is also the issue of legal copyright for material published online; Google, Yahoo, and Microsoft are working with the Open Content Alliance on rights agreements
- "In the end, we should remember that copyright is an instrument to generate wealth and produce content."
- there are also issues for human intellectual life: if everything is being done for us with regard to document/text analysis, how does this hinder human intellect?
- this could impact writing as well, since digital collections can now analyze themselves
- changes in scale and substance:
- scale: a completed Google Library could contain over 10 million items, which could affect the quality of the texts produced; i.e., you could end up with analysis that includes work not of the same quality as your research
- heterogeneity of content: the organizing principles of texts could be changed, especially across languages, resulting in an increase in the complexity of analysis available
- granularity of objects: there is no standard internal structure for these online collections (relates to last week's discussion on HTML); a large expansion of tags in analysis could also open too many doors, as hundreds of tags could turn into thousands
- noise: since these new proposed digital libraries are going to be open access to the public, the issue of errors arises
- audience: the shift from academics using the sources to everyone means demand for different things; striving to serve the public library audience could overshadow the academic world, especially since most of these large-scale planned libraries come from public companies, i.e., Google
- collections and distributors: large-scale libraries such as the Google Project could replace smaller online collections because they would offer a single point of entry for all materials; in a sense, the library could be too broad
- all six of the above factors could increase, and thus our digital world could change as well
- the GALE system for inputting text has reduced mining, analysis, and searching to three functions: converting analog to text, translating one language to another, and transforming raw text into data
- analog to text: using speech-analysis methods to look at character sets within texts in supported languages (e.g., English, French, German); the problem extends beyond transcription, since we need to analyze page layouts to understand structural patterns, and characters on a page have meaning we need to be able to determine
- machine translation: rapid progress in translation through technology; translation could also help unlock primary source material that before could not be properly translated
- information extraction: identification of references to people, places, and things, and the creation of methods for higher-level storing and recording of many references
- document analysis, multilingual technology, and information extraction serve purposes and particular needs depending on the domain
- enhancements of these could help break down barriers in the study of history and culture caused by primary sources not being studied properly and with the proper tools
- we have now shifted to a digital world that is concerned not just with making material accessible but also with analysis that is relevant to our studies
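Of Crane's three functions, information extraction is the easiest to glimpse in miniature. The sketch below is only a naive stand-in: real systems use trained named-entity models, whereas this regular expression merely pulls out runs of capitalized words as candidate proper names, and the sample sentence is invented for illustration.

```python
import re

# Sample sentence assembled from names mentioned in Crane's article,
# purely for demonstration.
text = ("Gregory Crane describes the Million Book Project at "
        "Carnegie Mellon, while Google partnered with the Harvard "
        "Library System.")

def candidate_entities(text):
    """Find runs of two or more capitalized words: a crude first pass
    at identifying references to people, places, and things."""
    return re.findall(r"(?:[A-Z][a-z]+\s)+[A-Z][a-z]+", text)

entities = candidate_entities(text)
```

The crudeness is instructive: single-word names ("Google") and lowercase references slip through, which is why Crane stresses that layout, context, and language all have to be analyzed, not just characters.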

Moretti "Graphs, Maps, Trees" Review and Notes


I found this text to be a little confusing and slightly dense in comparison to the other readings this week, hope I'm not the only confused party in 5V71!

My understanding of this text on literary studies is that Moretti is arguing that literary analysis should not be based solely on reading a text; scholars should start counting data within texts to create series of graphs, maps, and trees. He argues that proper literary analysis of texts throughout history should be based on multiple surrounding factors, such as where a text was written, its language use, etc. He then suggests that this data should be compiled to reveal patterns, so that we can properly understand the culture at the time a text was written and the factors determining why it was written the way it was. From the collection of this data, Moretti argues, we will be able to develop a better understanding of texts not just at the individual level but also at a group level. What I mean by this is that Moretti suggests taking massive amounts of data from massive numbers of works and plotting them as graphs, maps, or trees in order to discover patterns in literary history. Moretti argues at the beginning of his book that texts aren't individual cases but a collective set. He also argues that through this method scholars will be able to share data better, because graphs, maps, and trees are easier to read and pass along. Personally, I felt very confused by Moretti's graphs and maps, but the trees were much more useful for proving his point because they contained actual words, and I was able to see his connections between series of novels. Moretti does an effective job of proving his overall point; however, I would have liked to see him describe how he input his data into a computer, to better understand his digital process.
As someone with a tiny digital humanities background, I was able to make the connection between his argument and data collection, but I am not sure whether a traditional literary analyst would a) be able to make the connection, and b) like the fact that Moretti is attempting to revolutionize the way literature has been analyzed, basing it on large data sets rather than content. On the other hand, the fact that the digital humanities is proposing a change in the way humanists do research opens up multiple avenues for analysis, meaning that fields can expand beyond what we could comprehend only a few short years ago.

Notes (I have a lot of notes for this text however I shall only include the most relevant ones)

- proposing a change in the discourse of literary study
- no more concrete individual analysis of texts; calls for a collective set of data for proper analysis of written texts
- uses maps, graphs, and trees to plot these new data sets in order to reduce and abstract
- borrows these three tools from other avenues of study: graphs from quantitative history, maps from geography, trees from evolutionary theory (suggesting the interconnectivity of the humanities/social sciences?)
- literary studies focuses on individual word placement or sentences, but what about moving beyond that in a more general sense?
- only a small percentage of texts written actually get published; something to consider?
- the massive field of books means that we can't analyze all of them but need to develop collective data sets in order to better understand literary history
- the first set of graphs compares novels printed in specific geographic locations; from this data we can start to ask why, opening more possible avenues for study
- important to note that these graphs provide data, not interpretation; there is still room for interpretation in literary studies, as this process just opens up more opportunities
- examines genres that occur in novels from particular periods when something historically significant happens
- "books survive if they are read and disappear if they aren't" (p. 20); for some reason I found this particularly interesting
- plotting genres on a graph by year allows us to condense the study of genres and their changes
- quantitative data is useful due to its independence from interpretation; it can then challenge interpretation and allow more questions to be asked to aid interpretation
- in Moretti's theory we look not just at the novel but at novelistic forms
- maps do what cannot be done with words: they provide a circular theory of literature and narrative
- his example demonstrates spatial patterns, which helps with analysis of what is going on within a text
- maps do not explain or interpret but show what needs to be explained or interpreted within a text (a first step!)
- argues that it is interesting to look at geography within a novel: new possibilities for discovering what is going on within it
- maps reduce the text to a few elements and create artificial objects within the text
- Moretti is very focused on patterns within a series of texts, not just one
- mapping national processes changes how the structure of the narrative is analyzed and reveals a direct relationship between social conflict and literary form
- maps are a diagram of the forces that shape how/why/where a novel was written and produced
- trees allow for the correlation between history and form
- graphs = quantitative diagrams, maps = spatial diagrams, trees = morphological diagrams
- literary theories are blind to history and historical work is blind to form; Moretti argues morphology and history are two dimensions of the same tree
- asks: if language evolves by diverging, why not literature too?
- the branches of the trees demonstrate intuitive forces
- assumes that morphology is the key factor of literary history
- devices and genres shape literary history, not texts
- comparative literature could be world literature and comparative morphology: study the reasons for transformations, not just the transformations themselves
- theories are nets and should be evaluated, but they are not ends in themselves
- Moretti no longer believes in a single framework for explaining literary history
- the aim of the book was to open new conceptual possibilities for literary studies instead of justifying them
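Moretti's graphs start from a simple aggregation: how many novels of each genre appear in each period. The sketch below shows only that counting step; the titles, genres, and dates are invented placeholders, since the point is the aggregation that precedes the graph, not the data itself.

```python
from collections import defaultdict

# Invented (genre, publication year) pairs standing in for a
# bibliography of novels.
novels = [
    ("gothic", 1794), ("gothic", 1796), ("epistolary", 1795),
    ("gothic", 1806), ("historical", 1814), ("historical", 1819),
]

def genre_series(novels):
    """Aggregate novel counts by (genre, decade), the data series one
    would then plot as a Moretti-style graph of genres over time."""
    series = defaultdict(int)
    for genre, year in novels:
        decade = (year // 10) * 10
        series[(genre, decade)] += 1
    return dict(series)

series = genre_series(novels)
```

As the notes stress, the output is data rather than interpretation: the rise and fall of a genre still has to be explained, the counts only show what needs explaining.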

Distance Reading Discussion Questions

Discussion Question 1:

To start out this week's debate we thought we'd begin with a couple of general questions. With the advent of the digital environment a much larger number of texts and resources have been made available. Practically speaking, is there a point at which there is simply too much available to be able to sort through - in other words, is there ever a point in which there is so much information, or the scale of it is so large, that it in fact impedes effective research?

Discussion Question 2:

While the digital environment has given way to a range of new and useful means of textual analysis how does a statistical critique of literature and resources fit within traditional approaches? Do the possibilities for digital texts surpass or fall short of traditional approaches to printed materials?

Discussion Question 3:

Crane's article "What Do You Do with a Million Books" discusses the possibilities for digital technologies to perform document analysis, information extraction, multilingual translation, and textual evaluation. Do you think it is practical for programs to be created to evaluate texts? At what point do such projects cease assisting scholars in locating information and begin determining for academics what information is relevant? Aren't evaluation, critique, and translation essential aspects of the humanist undertaking? Can evaluation and critique be reduced to quantification, or are they more than the end sum of an equation?

Discussion Question 4:

Cohen, in "Analyzing Literature by Words and Numbers," concludes by quoting a scholar who argues that "large-scale, quantitative research is likely to highlight 'the importance and value of close reading; the detailed, imaginative, heightened engagement with words, paragraphs and lines of verse.'" However, little mention has generally been made in our readings of how to engage with and critically utilize these resources. In a similar sense, Cohen's article "In 500 Billion Words, New Window on Culture" quotes Steven Pinker, a Harvard linguist who argues that the output of database searches of digitized texts “makes results more convincing and more complete”. With a rising focus in academic institutions on quantitative analysis and results, do you think that DH will reassert humanist interpretation or allow "statistical measures" to "overshadow" meaning?

Discussion Question 5:

For the final online question, we were curious about the methodological issues that stand to arise from Moretti's considerations in Graphs, Maps, Trees. In the sciences, when ecological modelling was undertaken, the result was not necessarily an improved understanding of environmental issues but rather an increased focus on methodology and the structure of the statistical models. As such, what the natural sciences now have is effectively a new field of debate on modelling methodology rather than new insights into their own subject areas. While methodology and results are clearly interrelated, the question necessarily arises as to whether digital resources will provide new insights in humanities research or merely a new subset of discourse regarding what methods were used and how models were constructed. Please comment.

Distance Reading Responses

     ** Let it Begin **

Ryan

Well I'll kick things off this week. I'll start by addressing the first question.

The massive catalogues of books will radically change acceptable practices within history. I have no doubt of this. Where a theory could once be proposed from the reading of 30 books, it might require 300 now to hold the same weight. While this seems ludicrous, I don't think it will amount to a whole lot more work for the historian in the long run. This can only be the case, of course, if we refine our methods of research and adapt. New methods to pull out the relevant points and skim over the useless bits must be developed hand in hand with the massive catalogues of digitized books. And of course there are other aspects that must be taken into account: digital archivists will be very important in this endeavour.

So I'll say no. I don't think that the scale of information will necessarily impede research. It COULD, but I don't think it has to.


Melanie

I'll go next... I think the first question is very interesting. I can understand how millions and millions of books at one's disposal could be seen as problematic for research, with just the sheer volume of material one would have to go through. I think what is important to keep in mind, though, is that as these digital libraries grow, evolve and change, so too will the tools used to search them. I think with effective search tools, a library of millions of digital books will be kept fairly manageable, as only certain books would be brought forward from a search, or only sections of an article would be examined rather than the entire thing. Some of these techniques are addressed in the Crane article. I know this opens the questions of how searches are structured, how items are catalogued or tagged, etc., but those are things that require their own examination altogether, and aren't issues I think I could address adequately in this space. The fact of the matter is, the same amount of material is still out there regardless of whether it is digitized or not; digitizing it just makes it more easily available to more people. I think Cohen's "Analyzing Literature..." article is correct when it states that digital research will offer a new kind of comprehensiveness that previous research was lacking. As for the second question, I think the statistical analysis of literature can be used as a tool within traditional approaches, such as the example in the Cohen article of the work done on Victorian mindsets towards progress and science. No one is forcing anyone to use the digital data that is being made available; it is the researcher's choice whether to use digital resources in their research or not. In the Victorian example, digital tools provided some information that would've been very difficult to gather otherwise. The same article also states, however, that these tools aren't just tools but are changing the kinds of questions being asked by humanists.
I don't necessarily feel that is a bad thing, though; it is opening another field of inquiry. Just because a new field opens does not mean that another, more traditional one has to close. Melanie

Dave

I like the 3rd question because Crane is discussing the changing nature of the texts being digitized and the wealth of new possibilities and problems that converting every book into a digital format presents. The issue of translation alone is something he identifies when considering books related to Classics, for example. This is a discipline that houses many different languages, and the ability of translator programs to provide an accurate, contextually sound translation is one issue, along with the need to make all of these databases available in both their original format and in English. Crane also points out that initial text capture was done as a series of PDFs and other large single-object transfers. Today, the ability (and the need) to identify ALL of the words (or objects) in the texts is crucial to allow for deep research. If the file is a PDF without the ability to explore within it (except by actually reading it), then transferring this information into the digital realm defeats the purpose. I also appreciated how Crane and others identified the troubling fact that Google, which is after all a commercial enterprise, is positioning itself to be the single point of entry for all human knowledge. This is somewhat troubling to me, and not because of copyright or other rights issues; instead, it seems to me that the endeavour should be undertaken by an agency with a transparency that I don't see with Google. I heartily suspect that there will be fees coming very soon..... The ability of academics to use these resources to assess and develop theories can only be assisted by the wealth of information that is rapidly coming online. I have no love of quantitative history, and this is, I suspect, always going to present an issue for humanists whenever the massive amount of computerized information is applied to their inquiries.
It is always up to us to frame our own questions and direct the path of our own research; the digitized texts and the massive online libraries, search engines, pattern-matching bots, and other programs designed to help us filter the mass of raw data will only help us become better historians. Dave

Sean

I thought I would quickly jump in and comment on a few of the points being made here. First off, Ryan and Melanie, I entirely agree with both of you that much of this debate is a matter of methodology. As I've mentioned in my blog and in class, I've got a number of concerns about which methods scholars use when adapting to and integrating digital resources and results into traditional humanist research. That said, I'm not entirely convinced that the amount of information becoming available necessarily translates into better scholarship, or that academics will be able to find the appropriate resources. Still, Dave's point stands: finding resources has always come down to the scholar. In this manner he's right; whether the material is digitized or remains available only in print, libraries primarily help improve access to resources. However, there is something to be said about the relationship between ease of access and which resources get used..... This is of course a matter we can take up in class this week, as it certainly deserves more attention. Sean

Thanks to everyone who has posted and discussed the material so far.

[edit] Grant

I will address the fourth question. I do not believe that the Digital Humanities will lead to an overshadowing of meaning by "statistical measures." Yes, based on Cohen's article "Analyzing Literature by Words and Numbers," the quantitative analysis of texts is gaining attention throughout the academic world, but scholars still have to define the meaning of the data. Cohen gives the example of the decline in the use of "God" in the 19th century, a change whose meaning correlates with the rise of skepticism at the time. The quantitative data are there, but we still need to interpret the meaning of the results. The interpretations made by humanists will be supported by statistics rather than overshadowed by them. For example, if a historian argued that liberty was a major theme in the literature of colonial America, they could collect all the instances of "liberty" or "freedom" in books from the period to support their thesis. However, this data would still need to be interpreted and given further meaning. Why was liberty so important to Americans in the 18th century? What events led to this increased interest in freedom? Meaning still plays a prominent role in the humanities, even with the advent of statistical data. As well, certain aspects of academic study in the humanities cannot be done with quantitative data. Tracing the history of a certain region or person cannot rely on statistics, but rather on a historian's interpretation of the primary sources. Archaeology is a field in which the analysis and interpretation of findings play the primary role. Statistics may be used to record the number of a certain artefact, such as arrowheads, but an archaeologist must then interpret what the arrows were used for, who used them, and why they were found in that region. Those are questions that statistics cannot immediately answer.
Thus, I believe that statistical data will work alongside meaning in the humanities, rather than overshadow it. Grant

[edit] Ryan

Grant, I agree with you that what is important for us as humanists is the quest for meaning. Even so, it seems to me that using statistical data could be misleading for research purposes. As Cohen points out in "Analyzing Literature by Words and Numbers," these searches are not always what we think they are: "Syntax" and "Prosody" weren't terms related to grammar, they were the names of horses. There are other pitfalls in these sorts of searches. When you search for freedom and liberty, how many of the results are used in a context applicable to your research? And how many are just horses? If a search returns hundreds of thousands of results, it simply isn't realistic to go through the texts to figure it out. Of course you'll still have some idea about the trends, but this type of searching might make lazy and complacent scholars of us. How much weight should we really place on it as companion data in our quest for meaning? Just some food for thought.


[edit] Val

These are some great questions, Sean, and I agree with most of what’s been said so far. As Melanie rightly pointed out, the materials are out there whether we digitize them or not. But I think another important question is how these documents will be made available, or more specifically, what methods will be used for securing them and for accessing them. I agree with Dave: the fact that Google is seeking to gain a monopoly over so much knowledge is troubling. The Crane article certainly makes me worry about the future of an author’s right to her or his own work. But then again, Crane made an important point: there is nothing stopping us from developing new approaches to copyright law and to protecting intellectual property. Anyway, to return to the original question, mass digitization does not have to impede effective research, but I think it certainly can if we do not change the nature of the questions we are asking and the methods we are using. What I am getting at here is related to Moretti’s research: he seems to suggest that we need to start asking different types of questions.

As for question number 2: Dave seems to be a little less than enamoured with quantitative analysis, but after having spent a lot of time on the Old Bailey website for my DH project, I’ve come to believe that digitizing materials and using them for statistical or quantitative analysis can be very useful. The ability to efficiently use statistical tools to identify larger patterns can add considerable insight to any research. I particularly like the idea of combining a traditional qualitative research approach with a more novel quantitative one. Of course, such a method does not come without faults. For instance, tools like culturomics, Wordle, and Ngrams depend largely on the parameters we as researchers set for them, and thus can very easily miss important elements. In other words, they are useful tools, but it is the way in which we apply them that really matters. Needless to say, the historians Cohen interviewed in her article on analyzing Victorian literature had legitimate concerns about this new approach. But then again, should we ignore these new methods, we risk remaining stagnant in a pool of unchallenged knowledge. Val

[edit] Dave

I would defend my position vis-à-vis quantitative analysis, though I too have seen the value in being able to crunch huge amounts of data and present some very interesting conclusions. As an unreconstructed historian (as a prof would often remind me), I am always wary of the claims made by the new gurus of the digital marvel. The dangers inherent in combining data and skewing the results without the fine qualitative work that historians have traditionally done do indeed concern me; when I think of quantitative analysis I am always reminded of the quote popularized by Mark Twain: "There are lies, damned lies and statistics." I am always cautious when I see the types of tables presented in Cohen's article on the analysis of Victorian literature. When a particular variable (in this case a keyword) begins to disappear, the computer will simply record that instance and develop a trend line. Without an examination of the context or an understanding of the society and the changes occurring within it, this information could lead the researcher down a rabbit hole to a very odd wonderland. I wholeheartedly agree with Val's assertion that the digitized records of the Old Bailey are incredibly useful and valuable resources, but if we were to look for a particular crime/punishment pattern, we might be fooled into believing that a particular crime simply ceased to occur, when in fact it was no longer prosecuted. Society (and laws) change, but the raw data will not always reflect that fact. After this screed it may seem that I am opposed to the digitization of records and texts, but I can assure you that I am entirely for it; the ability to find virtually any record and to work with research data through the computer is an incredible advantage we all possess today. My caveat is to never forget that it is simply a tool, and like any tool it can be either useful or potentially dangerous. Dave

[edit] Melanie

I think everyone has had some good insights into the topics. However, I was struck by question 4 and by Ryan and Sean's responses to it. I definitely agree with Ryan's point, in that I do not believe this kind of statistical analysis will ever overshadow meaning (unless your objective is to study a purely quantitative phenomenon, in which case you'd be right on point). Moretti makes that point well in his first chapter, when explaining how graphs show information but still need to be interpreted by someone. However, I take issue with Ryan and Sean's concerns that this kind of quantitative data does not make historical research "better" or more useful. I understand, for example, Ryan's point about "Syntax" and "Prosody" being horses rather than terms related to grammar and writing structure. But is that not precisely the place of the historian/researcher? Isn't that their job: to go through all the available material and find what is useful, to look at as much information as possible and pick out what is relevant from what isn't? I would argue that the amount of data makes for more comprehensive research, because you have a more complete look at what is available, and the historian or researcher is expected to put in that much work. If people went through volumes of information back in the day with parchment and ink, why can't we do it now with a computer? And, as I stated in my previous post, I believe the search tools to accompany this data will develop alongside it, making the information more user-friendly for research. It may just lag behind a bit. Melanie

[edit] Val

Dave, you bring up a number of valid points, and I agree with what you’re saying. As I mentioned earlier, if we are to reap any benefits from quantitative analysis, it needs to be accompanied by good qualitative questioning. In this way, quantitative and statistical analysis does not necessarily make research any easier or any less of an effort for humanists. In fact, I would probably spend more time questioning and looking for gaps as a result of statistical research. Still, I’m interested in seeing what kinds of questions and answers such forms of data collection and analysis can lead to. But in all honesty, I think I may in fact agree with you, Dave, more than I have previously admitted. Let’s consider question number five. Sean rightly asks about Moretti’s methodology in Graphs, Maps, Trees, which I’m still not convinced is completely sound. Moretti’s model depends largely on constructed concepts often taken for granted as “natural” or “unchanging.” In this sense, and this is related to question four, I did feel that Moretti’s model “overshadowed” the meaning of such “evolution” among literary genres. In the end, I’m not averse to these new forms of research, nor even to Moretti’s method, but I think they require much critical evaluation, and perhaps standardization, before I can wholeheartedly welcome them. Val

[edit] Heidi

Wow! What a great discussion. While it is difficult to jump in at this point and add a new perspective, I will throw in a few comments. I think Melanie hit the nail on the head for me. Cohen's "In 500 Billion Words" clearly states that "culturomics simply provides information... interpretation remains essential," which to me is the core of this debate. Does digitization add a ton of new information to sift through? Sure. Does this mean that we will get completely bogged down in the sheer magnitude of sources? Not necessarily. As Melanie mentioned, this adds to what we as historians have at our disposal. I personally do not perceive this as being much different from sitting in a library and sifting through what is useful to me and what is not. These tools will make it faster to find connections, and hopefully allow us to see connections we might not have otherwise. I was absolutely fascinated by Cohen's example of the terms "men" and "women" being compared only after women's liberation. These are things that we see in social history, but now we have the ability to see them in a more quantitative way. The fact that you can potentially tailor the search tools to your particular needs is interesting, but I think that comes back to the debate on whether or not humanists need to be familiar with the nitty-gritty of DH.

To sum up my answers to questions 1 and 3: I do not believe that we can ever have "too much" information at our fingertips. A skill that will become even more necessary for humanists than it is currently is the ability to find relevant material in a sea of information (as mentioned above). That is one of the fantastically fun challenges that we as historians face. As for question 3, I believe that while programs of textual evaluation are certainly interesting and potentially wonderful tools, we must avoid dictating what WE believe they should do, and see them for what they are. That way we can perhaps suggest improvements to the tools without condemning them for being inadequate for our own uses. Looking forward to continuing this discussion tomorrow. Heidi

[edit] Ryan

Melanie and Heidi, you both note that it will still be the researcher's job to analyse the results of their research, and I wouldn't dream of discounting this, even though I'm not above making a fool of myself. Both of you seem to be saying that this is just the same ol' run-of-the-mill historian work. This is where I'm inclined to disagree. I'd make the claim that the mass of information available to us radically changes the nature of research and analysis for the historian. Previously it was acceptable and manageable for a historian to sift through volumes in parchment and ink. We are now talking about compiling data on MILLIONS of books. It doesn't seem reasonable to sift through billions of words to ensure that the data returned by your search is actually what you think it is. How do we find the horses here? Sure, you can make general conclusions, but where is the line? How definite a claim can we as historians make if it's not humanly possible to corroborate our findings? I have no doubt that new tools will have to be developed to help us sift through the information, but to my knowledge they don't exist yet, and it seems they would require a rather complex set of tools. (I came across a tweet today about metonymy in history which might fit in here.) On the other side of the coin, if we now have access to this massive amount of information, are we not obligated to work it into our research? I think I'm just having trouble finding the line between what is potential and what is reasonable when it comes to research and analysis in the face of the tsunami of information descending upon us. Perhaps this is something to be decided by some all-knowing committee somewhere? (They are probably in charge of SSHRC grants as well.)


[edit] Rob

Hum diddy humdiddy hum. It seems I'm rather late in adding something to the discussion, but here goes. I agree with Ryan that it is not humanly possible to verify all of the results found by such 'word searches,' if you will. There just might be 'too much' information at our fingertips. However, this is in keeping with our own culture, which very much values quantity over quality. I wonder if at some point we'll even need to bother reading books or articles for evidence, rather than just relying on computers to do it for us; in fact, most of the younger set already do this with those handheld gadgets they are always looking at. But I digress. I can see the merits of tools such as those developed by Cohen and Gibbs; however, I see the results they produce as complementary to bona fide research rather than as research in and of themselves. They are very much like circumstantial evidence in court: they merely lend credence to the more concrete evidence that has been presented. Some of you have also mentioned Google. I think it's a very interesting idea to have one access point to all human knowledge, but that's a lot of bloody knowledge! Further, what DOES happen to the writing profession if everything gets digitized and made accessible for free? When I was in China I realized I would never want to be an author there, because once you write a book it gets photocopied and sold illegally, and the poor author is just that: poor. While this may not be quite such a concern for academics, it certainly is for fiction and non-fiction writers who win their bread solely by the content they produce and its royalties. Lastly, though most e-collections are heavily curated, it is important that they remain that way; the democratization of media is a double-edged sword. How do we separate the wheat from the chaff? This is where a human cannot be replaced - unless you work for Cyberdyne Systems!!! (Sorry, couldn't resist the Terminator reference.)


[edit] Spencer

Firstly, I think it might be useful in the future to break up the responses using header tags, to make both reading and referencing easier. That said, let's go:

Question One seems to be a knee-jerk reaction to the flood of data being sent our way when we're so accustomed to carefully choosing our sources. Were it possible for a human to read every text related to their research topic, they would surely do so, and many make valiant efforts. More data does not impede research; inadequate methods will damage research by failing to accommodate the data.

The sheer volume of data now available, however, transcends the abilities of traditional methods. So what do we do, Question Two asks. Well, we rethink what we do and begin to imagine new methods. When Mario gains a tail from eating a leaf, he doesn't lose the ability to jump on Goombas, but now he has two ways to deal with his opponents. Level up!

It is logical, then, to assume that new methods will also require critical evaluation of digital tools, and a reinventing process through which those tools are made to work for their users. Louis Menand criticises the Culturomics team because they aren't humanists, as though they've made a new toy and locked him and his friends out of their clubhouse. Digital tools are rarely made to be hoarded and leveraged; they are shared with everyone who might find them useful, interesting, or simply a good waste of time. If historians want to use these tools (and I would suggest that they should), they need to engage with them.

There is, of course, the possibility that these tools could be used for ill by the uninformed, producing purely statistic-driven histories that offend the very nature of our discipline (and our sensibilities). Constant vigilance, I say. As Cohen (quoted by Cohen) says, the meaning of these figures is hardly clear. Rather than fear the worst-case scenario, we should take an active role in shaping this data and the tools with which we analyse it.

Finally, to answer question five, and to reflect on the responses posted above, I think it's quite clear that a discussion on analysis of millions of books is not truly concerned about whether we should do this, or the negative implications of this technology, but is focused on the methodology that is required to make good use of the tools and the possibilities afforded. Ryan, Melanie, Grant, Valerie, and Heidi all seem optimistic about the potential of data analysis on this scale, and while Sean, Dave, and Robert have their doubts, the common theme is one of taking steps forward with a wary eye on the ground we tread, alert for pitfalls and misleading paths.

Though I expose my hand with the following, I believe that historians (digital or otherwise) are primarily doers. Whether active in producing the histories of our time or building digital tools that manipulate texts, we are required to be active. I don't dismiss the concerns levelled toward digital humanities, but when it comes down to it, excessively weighing the negative and positive outcomes of taking action can hobble our main purpose, to make.


[edit] Terry

A Texan in Canada: better really late than never... I was in the wrong place at the right time. Very interesting stuff. Just a couple of notes from me.

From "In 500 Billion Words, New Window on Culture": "Google says the culturomics project raises no copyright issue because the books themselves, or even sections of them, cannot be read." Does anyone else think this is really funny?

From "Analyzing Literature by Words and Numbers": neat. "This is not just a tool; this is actually shaping the kind of questions someone in literature might even ask." Does anyone else see this as prophetic?

On Franco Moretti’s Graphs, Maps, Trees: in quantum physics everything (I mean everything) can be reduced to three (four?) geometric shapes. I found the illustrations to be similar to roads, veins... it all depends on which view you have and where.

[edit] Closing Remarks

We just wanted to thank everyone for a wonderful week of discussion online. For class, please bring your laptops and MRP presentation notes, and try to think of 5-10 good research terms related to your project.

[edit] Post-Presentation Distance Reading Wiki Notes

Our discussion this week centred on distance reading; more specifically, we discussed what we can now do, in terms of analysis, with texts that have been digitized. In other words, we had a lively academic discussion on text/document analysis done at the computer level in order to understand trends in historical research. A number of important issues were addressed, such as: do we end up with too much information if we allow thousands or even millions of texts to be digitized and input into our research databases? Also, what is the value to our research of the trends produced by tools such as Google Ngram? At the core of our discussion was the problem of how the tools developed by digital humanists such as Moretti and Crane would fit into our style, or I should say the assumed style, of how we study history.

The discussion began with a brief overview of the study of trends and how trends have affected the texts produced. It then shifted to an analysis of how these resources can be used for specific purposes and how we, as historians, should be able to sift through the sources found; in other words, the development of our own ability to do research from digital sources. This sparked the idea that through these new data sets we can ask more questions in our research. It also raised the issue of whether books are simply texts, or whether they can also act as other forms of sources or as insights into a historical context. This led to the question of whether we simply have too much information at our fingertips; as Ryan put it, "do we as historians then only publish one thing in the course of our careers?"

We then returned to the idea of how research is conducted by the individual and what sources he/she is willing to use, in other words drawing our own lines for our own research purposes. Moving on from the tools/abilities of historians, we started to look at the contexts in which this newly available data would change our research, in both positive and negative ways. Spencer brought up a good point that this data, or distance reading, does not need a specific context because it is meant to plot trends; therefore we need to address the question of "what do we want from the tool?"

The last part of the conversation shifted to a more personal level, where we began to look at how research is conducted at the individual level: do we get lost in a sea of information? How do we wade through it all? Are we then overlooking "the gem" source? We were then reeled in a bit by Kevin, who gave us a mini history lesson on Truth and on which particular period of historical study we are currently in. He made it clear that we are at a moment where we have millions of resources at our fingertips, and that books can no longer be viewed just as texts but can be sources of other sets of data. To sum up our conversation, we looked at which specific groups of people these data sets are for, and addressed the problem of being stuck on being scholars and how these methods could or could not be useful to our studies.

I think the experiment we did with Google Ngram towards the end of the class helped to further the overall point of distance reading, because it led all of us to start interpreting the trends that appeared on our graphs. In conclusion, this was a great conversation in which we addressed many different angles on the issues surrounding the Digital Humanities!
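(An aside for anyone who wants to go beyond the Ngram Viewer's web interface: Google also publishes the raw n-gram counts as downloadable tab-separated files, so the kind of trend we plotted in class can be recreated offline. The sketch below is a hypothetical illustration, assuming the public format of one `ngram TAB year TAB match_count TAB volume_count` line per n-gram per year; the sample lines and counts are invented, not real data.)

```python
from collections import defaultdict

def parse_ngram_lines(lines):
    """Parse lines in the (assumed) Google Books Ngram export format:
    ngram <TAB> year <TAB> match_count <TAB> volume_count."""
    counts = defaultdict(dict)
    for line in lines:
        ngram, year, match_count, _volumes = line.rstrip("\n").split("\t")
        counts[ngram][int(year)] = int(match_count)
    return dict(counts)

def trend(counts, ngram, start, end):
    """Return (year, count) pairs over a year range, with 0 where the
    n-gram does not appear -- the series an Ngram-style graph would plot."""
    series = counts.get(ngram, {})
    return [(year, series.get(year, 0)) for year in range(start, end + 1)]

# Invented sample data, echoing Grant's "liberty"/"freedom" example above.
sample = [
    "liberty\t1770\t1200\t300",
    "liberty\t1771\t1500\t320",
    "freedom\t1770\t900\t250",
]
counts = parse_ngram_lines(sample)
liberty_trend = trend(counts, "liberty", 1770, 1772)
```

Of course, as the discussion above stresses, the resulting trend line is only raw counts; it is still up to the researcher to interpret what (if anything) it means.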
