Data isn't always the answer

Big data gets relentlessly hyped as the answer to any number of problems. But more data isn't always the needed tool.

Gordon Haff

Gordon Haff is Red Hat's cloud evangelist, although the opinions expressed here are strictly his own. He's focused on enterprise IT, especially cloud computing. However, Gordon writes about a wide range of topics, whether they relate to the way-too-many hours he spends traveling or his longtime interest in photography.



July 10, 2012 3:50 p.m. PT


"Big Data" promises to turn terabytes, petabytes, and exabytes (with, presumably, zettabytes and yottabytes to come) of what's often ambient digital detritus into useful results. That promise often seems to come with an implicit assumption: with enough data and the tools to crunch it, useful insights will follow. Those insights, the thinking goes, can be used to make businesses more efficient, tailor everything from medicine to advertising to individuals, and employ instrumentation and automation on larger and more complex physical systems than ever before.

For example, we're in the early days of what sometimes goes by the name of the "Internet of Things," the idea that we'll have pervasive meshes of sensors recording everything and integrated into feedback loops that optimize the system as a whole. IBM, with rather more marketing dollars than the academics who first coined the term, talks about this idea under an expansive "Smarter Planet" vision.

Some of this smart-systems talk leads the reality by a (long) way, to be sure. But no one really disputes that instrumentation can be used to optimize behavior at the level of an overall system. It's pretty standard command-and-control system dynamics stuff that's done all the time. The only thing that's really new is the scale of the systems, the sensor net, and the feedback controls.

There are also examples of success, even if some are incremental and tactical. Although the Netflix Prize for improving movie recommendations didn't achieve any particular breakthrough, the workaday efforts of Netflix engineers continue to improve movie recommendations across a number of fronts. And those improvements are both based on data and tied into improving business outcomes -- in this case, retaining subscribers. Other anecdotes, from the Obama re-election campaign's e-mail targeting to Target's "pregnancy prediction" scores, suggest there's at least some value in using the results of data analysis to affect consumer behavior in a specific way.

Another recent announcement is bigdata@CSAIL, which brings together the work of more than 25 MIT professors and researchers with the Intel Science and Technology Center for Big Data at CSAIL (Computer Science and Artificial Intelligence Laboratory); it will focus on areas such as finance, medicine, social media, and security.

It's hard to argue that larger volumes of data, increasingly available at nearly the instant it's generated, won't play a bigger and bigger part in any number of applications -- both for good and ill.

However, as Big Data hype accelerates, it's also useful to maintain an appropriate level of skepticism. While data can indeed lead to better results, this won't always be the case. The numbers don't always speak for themselves, and sometimes the underlying science needed to apply data, however plentiful, in a useful way just doesn't exist.

For example, there's a widespread assumption that personalized advertising is more effective advertising. But a reader's comment on Michael Wolff's "The Facebook Fallacy" nicely summarizes why this might not be the case.

"There is not now, nor is there anything on the horizon, that is a scalable, automated means of exploiting people-generated data to extract actionable marketing information and sales knowledge. A well-known dirty little secret in the advertising world is that, even after millennia of advertising efforts, not a single copywriter can tell you with any confidence beyond a coin flip whether any given advertisement is going to succeed. The entire "industry" is based on wild-assed guesses and the media equivalent of tossing noodles against the kitchen wall to see what might stick, if anything."

Peter Fader, co-director of the Wharton Customer Analytics Initiative at the University of Pennsylvania, talks of a "data fetish" that is leading to predictions of vast profits from mining data associated with online activity. However, he goes on to note that neither more data nor data from mobile devices necessarily leads to better results. One reason is that "there is very little real science in what we call 'data science,' and that's a big problem."

We'll no doubt keep seeing more stories about great results achieved by applying data to some problem in a novel way. Especially when there's solid underlying science, with algorithms and models limited only by the quality or quantity of the inputs, more and different types of data can indeed lead to impressive results and outcomes.

But this doesn't mean that bigger data will always hold the key. Sometimes data is just data -- noise, really. Not information. It doesn't matter how much you store or how hard you process it.