Data mining's adult challenges

The tools to analyze disparate data sets are getting better and cheaper. But the practice will increasingly bump against the boundaries of privacy comfort zones.

Gordon Haff
Gordon Haff is Red Hat's cloud evangelist, although the opinions expressed here are strictly his own. He's focused on enterprise IT, especially cloud computing. However, Gordon writes about a wide range of topics, whether they relate to the way too many hours he spends traveling or his longtime interest in photography.
Probably no data-mining legend has been more pervasive than the "beer and diapers" story, which apparently dates back to an early 1990s project that data-warehousing pioneer Teradata (then part of NCR) conducted for the Osco Drug retail chain.

As the story goes, the project's analysts discovered that beer and diapers frequently appeared together in shopping baskets on certain days; the presumed explanation was that fathers picking up diapers bought a six-pack while they were out anyway. This correlation was then used to optimize displays and pricing in the stores.

That's the story anyway. The reality, as best anyone can determine, is more muddled. The evidence suggests that the project indeed existed. However, the beer-diapers correlation may or may not have been supported by the data. And, in any case, Osco seems not to have made any subsequent changes taking advantage of the purported relationship. That the story has lasted so long says more about the dearth of compelling success stories than anything else.
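To make concrete what it would mean for a beer-diapers correlation to be "supported by the data," here is a minimal sketch of the usual market-basket measures: support, confidence, and lift. The baskets are invented purely for illustration and have nothing to do with the actual Osco data; the function names are my own, not any particular product's API.

```python
# Toy baskets, made up for illustration only (NOT data from the Osco project).
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "wipes"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "wipes"},
    {"milk", "chips"},
]

def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`.

    Assumes `antecedent` appears in at least one basket (otherwise this divides by zero).
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """How much more often the items co-occur than if they were independent.

    A lift near 1.0 means no real association; well above 1.0 suggests one.
    """
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

print("support(beer, diapers):", support({"beer", "diapers"}, baskets))
print("confidence(diapers -> beer):", confidence({"diapers"}, {"beer"}, baskets))
print("lift(diapers -> beer):", lift({"diapers"}, {"beer"}, baskets))
```

In this toy data, diapers buyers pick up beer more often than shoppers overall, so the lift comes out above 1. Whether anything like that held in the real store data is exactly what remains unclear.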

This isn't to suggest that data mining has never delivered any value. But I think it's fair to say that the gap between vendor marketing claims and genuinely useful insights has been considerable. Data mining might tell Home Depot that it sells more snow shovels in the north than in the south and in winter than in summer--but the Home Depot store manager in Minneapolis doesn't need a sophisticated computer system to tell him that. (Though, as I'll get to, more has probably been going on behind the scenes than is generally known.)

But I'm starting to see evidence that this is changing. At least a bit. A lot of hard problems remain. A presentation by Paul Lamere and Oscar Celma (PDF) does a nice job of laying out the challenges with music recommendation, for example. But I'm also seeing enough "real world" data-mining anecdotes that it's hard not to take notice.

For example, Sasha Issenberg wrote in Slate earlier this month that "as part of a project code-named Narwhal, Obama's [re-election campaign] team is working to link once completely separate repositories of information so that every fact gathered about a voter is available to every arm of the campaign. Such information-sharing would allow the person who crafts a provocative e-mail about contraception to send it only to women with whom canvassers have personally discussed reproductive views or whom data-mining targeters have pinpointed as likely to be friendly to Obama's views on the issue." This contrasts with past practice, in which e-mails were sent shotgun-style and, as a result, stuck to relatively safe and unprovocative topics.

In a recent New York Times article, Charles Duhigg wrote about how Target statistician Andrew Pole "was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a 'pregnancy prediction' score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy." Duhigg then goes on to tell a story about how, in one case, Target apparently knew about a high schooler's pregnancy before her father did.

As it turns out, the events recounted in Duhigg's story are not especially recent; Pole did his initial work in 2002. But it's not an area of its business Target wants to discuss. In part, this is doubtless because it views what it does with data mining as a trade secret. However, I'm sure it also stems from the reality that a lot of people find this sort of analysis at least a little bit "creepy" (to use the most common word being tossed around the Internet about this story).

More and more disparate data sets are available online, and the tools to analyze them are getting both better and cheaper. Distributed server farms, public cloud-computing resources, and open-source software, including large-scale distributed file systems and Hadoop, are just some of the tools that are starting to make this sort of analysis more mainstream (although many of the data sets are still proprietary and expensive).

But the challenges ahead won't just be technical. They'll be about which types of mining are considered right and proper and which aren't. As the Times noted in its article, "someone pointed out that some of those women might be a little upset if they received an advertisement making it obvious Target was studying their reproductive status."