Mining the web to predict the future: what about ‘long data’?

A week ago Wired had an interesting opinion piece by Samuel Arbesman, an applied mathematician and network scientist, on why we shouldn’t ignore the value of long data in the era of big data. I recommend reading the entire piece, but I have included some highlights in this post.

On what he means by long data:

But no matter how big that data is or what insights we glean from it, it is still just a snapshot: a moment in time. That’s why I think we need to stop getting stuck only on big data and start thinking about long data. By “long” data, I mean datasets that have massive historical sweep — taking you from the dawn of civilization to the present day. The kinds of datasets you see in Michael Kremer’s “Population growth and technological change: one million BC to 1990,”

On the value of long data:

So we need to add long data to our big data toolkit. But don’t assume that long data is solely for analyzing “slow” changes. Fast changes should be seen through this lens, too — because long data provides context. Of course, big datasets provide some context too. We know for example if something is an aberration or is expected only after we understand the frequency distribution; doing that analysis well requires massive numbers of datapoints.

Big data puts slices of knowledge in context. But to really understand the big picture, we need to place a phenomenon in its longer, more historical context.

I like the idea of adding context to big data by placing more current datasets within larger historic ones. It suits the general understanding that we can only think of the future if we understand our past.
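To make the point about context concrete, here is a minimal sketch, assuming a made-up yearly series and a plain z-score as the measure of "how unusual is this?". The numbers and the names `z_score`, `long_history` and `recent_window` are invented for illustration and do not come from Arbesman's piece.

```python
from statistics import mean, stdev

def z_score(value, history):
    """Number of sample standard deviations `value` lies from the mean of `history`."""
    return (value - mean(history)) / stdev(history)

# Hypothetical yearly measurements, invented for illustration only.
long_history = [10, 12, 30, 11, 9, 28, 10, 13, 31, 12, 11, 10]
recent_window = long_history[-3:]  # only the last three years
new_value = 29

short_view = z_score(new_value, recent_window)  # judged against the recent snapshot
long_view = z_score(new_value, long_history)    # judged against the full history

print(f"short window z-score: {short_view:.1f}")
print(f"long history z-score: {long_view:.1f}")
```

Judged against the last three years alone, the new value looks like an extreme outlier. Judged against the full series, which already contains a few similar spikes, it looks far less surprising. That is the kind of context a longer dataset can add.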

The idea of long data actually underlies some recent developments involving the New York Times. Last week it was announced that researchers from Microsoft and the Technion-Israel Institute of Technology are creating software that analyzes 22 years of New York Times archives, Wikipedia and about 90 other web resources to predict future disease outbreaks, riots and deaths, and maybe even help prevent them. I am aware that 22 years is not a look back at the ‘dawn of civilization’. Still, 22 years of data has great historical value and is ‘longer’, in terms of historical context, than the datasets we usually consider ‘big’ data.

Eric Horvitz of Microsoft Research and Kira Radinsky of the Technion-Israel Institute also published a research paper titled “Mining the Web to Predict Future Events” (PDF). One example from the project examined the way that news about natural disasters like storms and droughts could be used to predict cholera outbreaks in Angola. Following those weather events, “alerts about a downstream risk of cholera could have been issued nearly a year in advance”.

(more on the project at GigaOm & Technology Review)
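As a rough illustration of the underlying idea (not the method from the paper, which is considerably more sophisticated), the sketch below counts how often one kind of reported event is followed by another within a time window. The timeline, the event labels and the function name `follow_rate` are all hypothetical and invented for this example.

```python
from datetime import date, timedelta

def follow_rate(events, trigger, outcome, window_days):
    """Fraction of `trigger` events followed by an `outcome` event within `window_days`."""
    window = timedelta(days=window_days)
    trigger_dates = [d for d, kind in events if kind == trigger]
    outcome_dates = [d for d, kind in events if kind == outcome]
    if not trigger_dates:
        return 0.0
    # Count triggers that have at least one outcome strictly after them, inside the window.
    followed = sum(
        any(timedelta(0) < o - t <= window for o in outcome_dates)
        for t in trigger_dates
    )
    return followed / len(trigger_dates)

# Invented timeline for illustration only; this is not real reporting.
events = [
    (date(2001, 3, 1), "drought"),
    (date(2001, 11, 15), "outbreak"),
    (date(2004, 6, 1), "drought"),
    (date(2006, 9, 1), "outbreak"),
    (date(2008, 2, 1), "drought"),
    (date(2008, 12, 1), "outbreak"),
]

rate = follow_rate(events, "drought", "outbreak", window_days=365)
print(f"droughts followed by an outbreak within a year: {rate:.0%}")
```

The longer the archive, the more trigger events there are to count, and the more such a rate can be trusted. With only a few years of data, a pattern like this would be hard to tell apart from coincidence.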

The researchers also describe the advantages of letting software handle this type of research: software can learn patterns, research tirelessly, access far more news than a person could, and work without bias (this last one is actually up for debate, if you ask me).

Learning from the past to predict the future is what predictive analytics is all about. What should ‘long data’ look like from a business perspective?

