Culturomics and Sentiment Mining

Some Ground Truth for a Blue-Sky Notion

December 2011. In September, BBC News headlined a story "Supercomputer Predicts Revolution." [1] On his Web site, Kalev H. Leetaru, the University of Illinois researcher behind this claim, describes his work with equal modesty: he says that he "forecast the Arab Spring." [2] He used software to sweep through English-language news archives, counting words of negative emotional tone and plotting the time series of a metric derived from their frequency of occurrence in articles about Egypt, Tunisia and Libya. He published some details of this work in the online journal First Monday. [3] The principle he seeks to demonstrate is that this type of analysis, which he calls "sentiment mining," can predict political conflict. Here's his graph for Egypt: [4]

It's arguable that this plot doesn't usefully forecast anything. After all, the final negative spike that the author claims as a predictor of the Egyptian uprising occurs only about two weeks ahead of the event.
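
For readers who want a concrete picture of what counting tone words involves, here is a minimal sketch in Python. It is not Leetaru's method, whose details he doesn't publish; the word lists, the function names and the sample sentences are all invented for illustration, and the score is simply positive words minus negative words as a percentage of all words.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicons, purely for illustration; a real tone
# dictionary would contain thousands of entries.
NEGATIVE = {"riot", "crisis", "violence", "fear", "corrupt"}
POSITIVE = {"peace", "growth", "hope", "stable", "reform"}

def tone_score(text):
    """Return (positive - negative) word count as a percentage of all words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 100.0 * (pos - neg) / len(words)

def monthly_tone(articles):
    """Average tone per month; articles is an iterable of (YYYY-MM, text)."""
    buckets = defaultdict(list)
    for month, text in articles:
        buckets[month].append(tone_score(text))
    # Sort by month so the result reads as a time series.
    return {m: sum(v) / len(v) for m, v in sorted(buckets.items())}

# Made-up sentences, not real news text.
sample = [
    ("2011-01", "Fear of violence grows as the crisis deepens"),
    ("2011-01", "Officials promise reform and hope for peace"),
    ("2010-12", "Markets stable amid steady growth"),
]
series = monthly_tone(sample)
```

Plotting the values of `series` against the months gives a curve of the same general kind as the one above, though any resemblance to the published metric is an assumption.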

He also applies the same technique to show that the trend in the tone of the news media overall has been toward the negative. One data set he uses is the entire contents of the New York Times from 1945 to 2005, whose negativity score he has plotted in the following graph: [5]

As it happens, I have some data of my own that have a direct bearing on the interpretation of this graph. I gathered them years ago, in the pre-Internet days, when the only way to acquire time series data like this was by reading microfilm of archived newspapers on a hand-cranked viewer in a library.

At the time, I was studying a change in the New York Times's format that took place around 1970. Among other changes, the number of articles per issue declined, while the number of pages in the paper remained approximately the same. I plotted the number of articles in the Times's front section, per week (specifically, Monday through Saturday), at various points in the years during which this transition seemed to be taking place (horizontal axis numbers refer to years 1963-73):

When this graph is rescaled so that it can be aligned with Leetaru's on the time axis, the combined plot shows that the decline in story count, occurring around 1969, coincides with a sort of phase change in his metric:

Before 1969, his metric varies widely around a positive value with a hint of cyclicity. After 1969 the short-term variance narrows around a trend that seems almost to be controlled by a steady hand, descending to a minimum in 1973 and then rising linearly to a kind of plateau beginning around 1978, interrupted only by a negative spike in 1990-91 and then a prolonged depression starting in 2001.
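
One way to put a number on this apparent change would be to compare the mean and the spread of the metric on either side of a candidate break year. The following Python sketch shows the idea under loud assumptions: the function name `split_stats` is hypothetical, and the yearly values are placeholders chosen only to make the code run, not values read from either graph.

```python
from statistics import mean, pstdev

def split_stats(series, break_year):
    """Return ((mean, std dev) before break_year, (mean, std dev) from it on)."""
    before = [v for y, v in series.items() if y < break_year]
    after = [v for y, v in series.items() if y >= break_year]
    return (mean(before), pstdev(before)), (mean(after), pstdev(after))

# Placeholder yearly tone values for illustration only (NOT real data).
tone = {
    1965: 0.4, 1966: -0.2, 1967: 0.6, 1968: 0.0,
    1969: -0.3, 1970: -0.4, 1971: -0.5, 1972: -0.6,
}

pre, post = split_stats(tone, 1969)
print("before 1969: mean %.2f, std dev %.2f" % pre)
print("from 1969:   mean %.2f, std dev %.2f" % post)
```

With real values, a markedly smaller standard deviation after the break year would be consistent with the narrowing described above, though it would not by itself establish the cause.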

In 1969-70 the Times moved toward fewer, longer articles dealing with issues in depth. Inevitably, the subject matter of these articles involved problems and conflict, presumably increasing the usage of words of negative tone. Therefore, the peculiar, structured shape of this graph might be the outcome of a change in editorial focus that can be isolated to a short time span between 1969 and 1971.

The exact nature of Leetaru's metric isn't spelled out in his paper or in his cited references. But when it is applied to recent news about Egypt, the interpretation is easy: the country is headed for a rough patch, as seen through the eyes of the English-speaking press. On the other hand, when it's applied to the entire contents of the New York Times, it doesn't just mean that the Times is trending toward a more negative tone. Instead, it reveals something more subtle and interesting, something that begs for a more fine-grained analysis.

Footnotes

[1]: Supercomputer Predicts Revolution, BBC News, Sept. 9, 2011
[2]: http://www.kalevleetaru.com/
[3]: Leetaru, Kalev H., Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space, First Monday, Volume 16, Number 9, 5 September 2011
[4]: Leetaru, fig. 2
[5]: Leetaru, fig. 10

Charles Packer mailbox@cpacker.org