Big Data meets the Bard

By John Sunyer


Here’s some advice for bibliophiles with teetering piles of books and not enough hours in the day: don’t read them. Instead, feed the books into a computer program and make graphs, maps and charts: it is the best way to get to grips with the vastness of literature. That, at least, is the recommendation of Franco Moretti, a 63-year-old professor of English at Stanford University and unofficial leader of a band of academics bringing a science-fiction thrill to the science of fiction.

For centuries, the basic task of literary scholarship has been close reading of texts. But for digitally savvy academics such as Moretti, literary study doesn’t always require scholars actually to read books. This new approach to literature depends on computers to crunch “big data”, or stores of massive amounts of information, to produce new insights.

Who, for example, would have guessed that, according to a 2011 Harvard study of four per cent (that is, five million) of all the books printed in English, fewer than half of the words used appear in dictionaries, the rest being “lexical dark matter”? Or that, as a recent study of the same database by the universities of Bristol, Sheffield and Durham reveals, “American English has become decidedly more ‘emotional’ than British English in the last half-century”?

Not everyone is convinced by this approach. In n+1, a New York-based journal of culture and politics, the writer Elif Batuman summarises the ambivalence to Moretti’s work: “[His] concepts have all the irresistible magnetism of the diabolical.” For Moretti, however, “The use of technology to study literature is only radical when you consider it in the context of the humanities – the most backward discipline in the academy. Mining texts for data makes it possible to look at the bigger picture – to understand the context in which a writer worked on a scale we haven’t seen before.”

Moretti’s Distant Reading, a collection of his essays published this month, brings together more than 10 years of research and marks a significant departure from the traditional study of novels. As Moretti writes in “Conjectures on World Literature” (a 2000 article reprinted in Distant Reading): “At bottom … [literary study is] a theological exercise – very solemn treatment of very few texts taken very seriously – whereas what we really need is a little pact with the devil: we know how to read texts, now let’s learn how not to read them.”

Thus, in “Style, Inc”, Moretti takes 7,000 British novels published between 1740 and 1850 and feeds them into a computer. The results reveal that books with long titles became drastically less common during this period. What happened, he wonders, to books with titles such as: The Capacity and Extent of Human Understanding; Exemplified in the Extraordinary Case of Automathes: a Young Nobleman; Who was Accidentally Left in his Infancy, Upon a Desolate Island, and Continued Nineteen Years in that Solitary State, Separate From All Human Society. A Narrative Abounding With Many Surprising Occurrences, Both Useful and Entertaining to the Reader?

There are, insists Moretti, interesting questions to be asked about the short titles that took their place. For example, why are adjectives so common in titles about mothers and fathers, but absent in titles about vampires and pirates? “By becoming short,” according to Moretti, “[titles] adopted a signifying strategy that made readers look for a unity in the narrative structure.” This is an important stylistic development – “a perceptual shift which has persisted for 200 years”.
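For readers curious what "feeding titles into a computer" might look like in practice, here is a minimal sketch in Python. It is not Moretti's code or data: the function name, the sample records and the choice to measure length in words per decade are all assumptions made purely for illustration.

```python
from collections import defaultdict

# Invented placeholder records (year, title); NOT taken from Moretti's corpus.
SAMPLE_TITLES = [
    (1741, "A Long Invented Title Concerning a Young Nobleman Left Upon an Island"),
    (1748, "Another Invented Narrative, Both Useful and Entertaining to the Reader"),
    (1843, "Invented Title"),
    (1849, "Placeholder"),
]


def mean_title_length_by_decade(records):
    """Return {decade: average number of words per title} for (year, title) pairs."""
    lengths = defaultdict(list)
    for year, title in records:
        decade = (year // 10) * 10          # e.g. 1748 -> 1740
        lengths[decade].append(len(title.split()))
    return {decade: sum(v) / len(v) for decade, v in sorted(lengths.items())}


if __name__ == "__main__":
    for decade, mean_len in mean_title_length_by_decade(SAMPLE_TITLES).items():
        print(f"{decade}s: {mean_len:.1f} words per title")
```

Run over a real catalogue of dated titles, a falling average from one decade to the next would be the kind of trend described above; the interpretation of that trend is still left to the scholar.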

. . .

Moretti was born in Sondrio, a small town in northern Italy, in 1950. He left the University of Rome in 1972 with a doctorate in modern literature and taught at various Italian universities. But it wasn’t until the 1990s, when he moved to America to teach in the English department at Columbia University, New York, that he became interested in the idea of “distant reading”. In 2000, he moved to California for a teaching post at Stanford, a private university recognised as one of the world’s leading research institutions. Ten years later he co-founded the Stanford Literary Lab. “And from that moment big data was no longer only something geeks did in science labs,” he says with a big laugh.

One day about four weeks ago, Moretti invited me to attend a Stanford Literary Lab seminar via Skype. The lab, with three full-time staff and about 30 students and faculty members, aims to “pursue literary research of a digital and quantitative nature”.

There wasn’t much glitz on show in the small, cramped room. In fact, there was little to suggest that this was, in effect, the office of the world’s most elite group of data-diggers in the humanities, other than some algorithms on a white board and the ubiquitous laptop computers. I didn’t spot any books but then, perhaps, that’s what one might expect. Ryan Heuser, 27-year-old associate director for research at the Literary Lab, tells me he can’t remember the last time he read a novel. “It was probably a few years ago and it was probably a sci-fi. But I don’t think I’ve read any fiction since I’ve been involved with the lab.”

The seminar was to consider Augustine’s Confessions, written in the fourth century and often called the first western autobiography. The lab members gave the sort of slick presentation you might expect from analysts in an investment bank. The language they used – algorithms, z-scores, principal component analysis, clustering coefficients, and so on – would have been familiar to an internet software engineer or mathematician.

Matthew Jockers, a 46-year-old professor of English, tech whizz and co-founder of the Literary Lab, was also in attendance on Skype. Later he told me, “We are reaching a tipping point. Today’s student of literature must be adept at gathering evidence from individual texts and equally adept at mining digital text repositories.”

Jockers spent more than a decade at Stanford before moving last year to the University of Nebraska in Lincoln. He holds the distinction of being the first English professor to assign more than 1,200 novels in one class. “Luckily for the students, they didn’t have to read them,” he says.

In his recent book Macroanalysis: Digital Methods & Literary History (2013), Jockers publishes a list of the most influential writers of the 19th century. The study is based on an analysis of 3,592 works published from 1780 to 1900, he explains. It took a lot of digging, and a computer did it by cross-checking about 700 variables across the sample, including, for example, word frequencies and the absence or presence of themes such as death.

“Literary history would tell you to expect Charles Dickens, Thomas Hardy and Mark Twain to be at the top of the list,” says Jockers. But the data revealed that Sir Walter Scott and Jane Austen had the greatest effect on other authors, in terms of writing style and themes.
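To give a rough sense of how word frequencies can be turned into a measure of stylistic closeness, here is a small Python sketch. It is an assumption-laden simplification, not Jockers's method: his study drew on about 700 variables, whereas this toy example uses only relative word frequencies and cosine similarity, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word in the text to its share of all words (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two frequency profiles.

    Returns 0.0 when either profile is empty or no words are shared.
    """
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Invented snippets standing in for whole novels.
    a = relative_frequencies("the sea was calm and the ship was still")
    b = relative_frequencies("the ship was calm upon the sea")
    print(f"similarity: {cosine_similarity(a, b):.2f}")
```

Comparing every text with every later text in this way would give one crude signal of who wrote like whom; a claim about influence would need many more features and far more care than this sketch offers.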


The idea of graphing and mapping texts isn’t new. In 1946, when computers were enormous and the internet wasn’t even an idea, a young Italian Jesuit priest, Father Busa, started work on software that could perform text searches within the vast corpus of Thomas Aquinas, the 13th-century philosopher-saint. Three years later he persuaded Thomas J Watson, the founder of IBM, to sponsor his research. Index Thomisticus, a machine-generated concordance, was completed in the late 1970s.

Scholars have also long been interested in the quantitative analysis of language – albeit without the help of computers. For example, Russian formalism, which signalled a more practical, scientific spirit to literary criticism, flourished in the 1920s.

In Player Piano (1952), the US writer and satirist Kurt Vonnegut predicted a dystopia in which giant computers have taken over brain work. He had earlier proposed, tongue-in-cheek, that a character’s ups and downs could be graphed to reveal a novel’s wider plot. A grainy YouTube video shows Vonnegut demonstrating the “shapes of stories” using nothing more than chalk and a blackboard: “There’s no reason why the simple shapes of stories can’t be fed into computers,” he says in a deadpan way.

The big breakthrough came in 2004, when Google developed an electronic scanner capable of digitising books. No longer did researchers interested in tracking cultural and linguistic trends have to endure the laborious process of inspecting volumes one by one. Soon after Google’s digital archive went online, five of the largest libraries in the world signed on as partners. And, more or less just like that, literature had the potential to become data on an unprecedented scale.

“There are hundreds of digital projects in the humanities taking place,” Andrew Prescott, head of Digital Humanities at King’s College London, tells me. The emerging field is, he says, “best understood as an umbrella term covering a wide range of activities, from online preservation and digital mapping to data mining.”

In To Save Everything, Click Here (2013), technology writer Evgeny Morozov notes that Amazon is sitting on vast amounts of data collected from its Kindle devices about what part of a book people are most likely to give up reading. In the not-too-distant future, Morozov speculates, Amazon could build a system that uses this aggregated reading data to write novels automatically that are tailored to readers’ tastes. Will there be a point where writers and readers will admit defeat, acknowledging that the computers championed by Moretti know best?

“My impression is that Moretti is a passionate and astute scholar,” the novelist Jonathan Franzen tells me. “I doubt it is his aim to put novelists and novel readers out of business.” Though new technology does not sit well with Franzen (he once admitted gluing up his Ethernet port, saying, “It’s doubtful that anyone with an internet connection at his workplace is writing good fiction”), he is a fan of Moretti’s work. “The canon is necessarily restrictive. So what you get is generation after generation of scholarship struggling to say anything new. There are only so many ways you can keep saying Proust is great.

“It can be dismaying to see Kafka or Conrad or Brontë read not for pleasure but as cultural artefacts,” he continues. “To use new technology to look at literature as a whole, which has never really been done before, rather than focusing on complex and singular works, is a good direction for cultural criticism to move in. Paradoxically, it may even liberate the canonical works to be read more in the spirit in which they were written.”

If Franzen could wind back the clock, would he choose to study in a literary lab? “It might have been tempting but I feel lucky not to have had the choice,” he says.

Melissa Terras, 38, who since 2003 has been working in University College London’s Centre for Digital Humanities, says: “Even big data patterns need someone to understand them. And to understand the question to ask of the data requires insight into cultures and history … The big threat is that most work in the digital humanities isn’t done by individuals. The past 200 years of humanities has been the lone scholar. But for work in the digital humanities, you need a programmer, an interface expert, and so on.”

Not all the traditionalists are going quietly into the night. Harold Bloom, 82, an American critic and Sterling Professor of the Humanities at Yale University, once described Moretti’s theory of distant reading as an “absurdity … I am interested in reading. That’s all I’m interested in.” (Speaking in 2007, Bloom claimed that in his prime he could read 1,000 pages an hour, enabling him to digest Leo Tolstoy’s Anna Karenina over lunch, if he so wished. Moretti might retort that a computer could do this in microseconds.)

Moretti is used to defending his work. “I’ve received so much shit for the quantitative stuff,” he admits. “But this new and many-sided discipline hasn’t yet completely expressed itself. There is resistance because, for generations, the study of literature has been organised according to different principles. Quantitative analysis wasn’t [previously] considered worthy of study.”

As Jockers says: “Literary scholars have traditionally had to defend their worth against those working in the sciences. Yet now that literature is beginning to reek of science, there’s a knee-jerk reaction against it. We can’t win. There’s an endless battle between the disciplines. I’m still repeatedly accused of ‘taking the human out of humanities’.”

Still, as the data revolution progresses, more universities are finding clever ways to aggregate and analyse massive amounts of information. Distant reading remains “a complex, thorny issue,” says Moretti. “Will we succeed? Who knows. But in the next few years, people will use this data in ways we can’t imagine yet. For me, that’s the most exciting development.”


