Data vs. models at the Strata Conference

Ever bigger data makes it increasingly possible to tackle some thorny problems. But it takes the right algorithms, and the intelligent selection of data too.

SANTA CLARA, Calif.--That this week's O'Reilly Strata data conference was sold out says a lot about this corner of tech. It's hot. Like cloud computing, big data is all the rage, even if, like cloud computing, it's not so much a single thing as an intersection of technologies, market needs, and critical mass.

One of several themes that kept popping up this week was data vs. models.

In 2008, Wired's Chris Anderson wrote a provocative article titled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete." His thesis was that we have historically relied on models in large part because we had no other choice. However, "with enough data, the numbers speak for themselves." The counterargument is that useful insights don't just pop out of data. You have to ask the right questions.

The contrast between these two approaches came up in a lot of presentations. Overall, the speakers mostly sided with algorithms and models over just throwing more data into the mix. As Xavier Amatriain of Netflix put it, "data without a sound approach becomes noise." Yet Amatriain also gave insight into how finding the best results requires blending many different approaches, including bringing in other types of data as appropriate.

The algorithms stemming from the much-ballyhooed Netflix Prize are actually a small piece of Netflix's overall movie recommendation process. There are a couple of reasons. The first is that the winning algorithms turned out to be very computationally intensive, in addition to being inflexible in other ways. The more important reason, though, is that predicting how customers would rate a movie, the objective of the Netflix Prize, was never the ultimate objective. The ultimate objective was to deliver better recommendations and thereby, presumably, increase the likelihood that customers would remain Netflix subscribers. It turned out that marginally improving ratings prediction only went so far in improving recommendations overall.

Netflix therefore combines personalization, a wide range of algorithms, a huge amount of A/B testing (whereby different approaches are tried with different customer groups and the results evaluated), data from external sources, and even some randomness for serendipity. Data certainly plays a role, in fact a very central role, but it's far more complicated than feeding in the biggest possible datasets and letting the machine learning algorithms churn.
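To make the A/B testing idea concrete, here is a minimal sketch in Python of how the results from two customer groups might be compared. It is not Netflix's method, which the talk did not detail. The function name, the choice of a two-proportion z-test, and the retention numbers are all assumptions made up for illustration.

```python
import math

def two_proportion_z_test(retained_a, total_a, retained_b, total_b):
    """Return (z, two-sided p-value) for the difference in retention rates.

    Hypothetical helper: group A sees the current recommendations,
    group B sees the new approach being tested.
    """
    rate_a = retained_a / total_a
    rate_b = retained_b / total_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (retained_a + retained_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up numbers: 5,000 customers per group, 90.0% vs. 91.6% retained.
z, p = two_proportion_z_test(4500, 5000, 4580, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the groups is unlikely to be chance alone. It says nothing about whether the difference is large enough to matter to the business, which is a judgment the numbers can't make on their own.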

(That said, Amatriain noted that certain types of problems, such as natural language recognition, use so-called "low bias models" that benefit from a lot of training data.)

Other examples come from the talk given by Hal Varian, Google's chief economist, who showed off Google Correlate. This tool lets you explore how search trends relate to other data, such as economic time series. That opens up possibilities such as finding leading indicators in search data for various types of economic activity.

Google Correlate obviously depends on access to Google's vast database of search terms. However, Varian's talk also touched on many of the complexities of interpreting correlations. For some purposes, it makes sense to seasonally adjust data, and for others it doesn't. You have to choose search patterns intelligently. You need to use appropriate statistical techniques to interpret the results.
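The leading-indicator idea can be sketched in a few lines of Python. This is not how Google Correlate works internally, which the talk did not describe. The function names and both series are hypothetical, and the sketch deliberately skips seasonal adjustment and significance testing, which are exactly the judgment calls Varian highlighted.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlations(search, economic, max_lag):
    """Correlate search[t] with economic[t + lag] for lag = 0..max_lag."""
    results = {}
    for lag in range(max_lag + 1):
        # Drop the tail of the search series and the head of the economic
        # series so each search value is paired with a later economic value.
        results[lag] = pearson(search[:len(search) - lag], economic[lag:])
    return results

# Made-up data: the economic series echoes the search series two periods later.
search = [3, 5, 2, 8, 6, 9, 4, 7, 5, 10, 6, 8]
economic = [1, 4] + search[:-2]

results = lagged_correlations(search, economic, max_lag=4)
best_lag = max(results, key=results.get)
print(best_lag, round(results[best_lag], 3))
```

With real data, a high correlation at some lag is only a starting point. Many search terms will correlate with an economic series by coincidence, so choosing which terms to test, and how to validate them, still takes a model of what is plausibly related.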

These two examples, as well as others, nicely sum up the data vs. models question. There's a wealth of data both within and outside of organizations that has the potential to improve business results. But most insights won't come simply. They'll come through intelligent questions, intelligent algorithms, and intelligent selection of data sets. And, ultimately, the insights will improve the business only if they're then put into action.
