Which 'big data' are you talking about?

"Big data" is now a mainstream buzzword. But does it refer to data storage, data analytics, or both?

John Webster Special to CNET News
John, a senior partner at Evaluator Group, has 30 years of experience in enterprise IT storage, spanning mainframe and open systems environments. He has served as principal IT adviser at Illuminata and has held analyst positions at IDC and Yankee Group Research. He also co-authored the book "Inescapable Data: Harnessing the Power of Convergence."

Late last year I posted a blog item about big data and whether, and when, it would present opportunities for storage vendors. I concluded by saying that, while it was a bit early for next-year prognostications, I expected the number of storage devices aimed at analytics applications to blossom in 2011, with more storage vendors pursuing the opportunity.

It's now 2011 and I stand by that prediction. However, at least three definitions of big data have blossomed since that posting:

  • Big-data storage: systems that store really big (as in humongous) amounts of data
  • Big-data analytics: systems that use new analytics processes to crunch really big amounts of data from multiple sources and deliver information in real or near real time
  • Big-data storage that supports big-data analytics

To understand what big-data storage is from the vendor point of view, one need look no further than EMC's positioning of its Isilon acquisition. EMC has written "big data" all over this one. But when you parse the text, big data here refers mainly to applications that use and produce humongous amounts of data, stored not only on disk but on tape as well. High-definition video processing applications used by media and entertainment moguls figure prominently here. The processing of genomic sequences is another big-data storage example.

Big-data analytics is very different. Interestingly enough, we can look to the CTO of another EMC acquisition for guidance. Luke Lonergan, CTO of EMC/Greenplum, defines big-data analytics in the context of EMC's Data Computing Division. At a conference for analysts last week, Lonergan defined big-data analytics as "using and leveraging data that is streaming in from all angles that makes businesses work better."

However, during the same presentation, Lonergan hinted at a third meaning of big data--big-data storage that supports big-data analytics--when he spoke of the possibility that EMC/Isilon scale-out NAS could be connected to EMC/Greenplum data analytics. That would be interesting because EMC/Greenplum's database architecture is defined as "shared-nothing," meaning its nodes share no resources, not even storage, while Isilon scale-out NAS is a shared storage system.
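For readers who want the shared-nothing versus shared-storage distinction made concrete, here is a minimal Python sketch. It is a toy model, not how EMC/Greenplum or Isilon actually work; the class names, the `owner` helper, and the hash-based placement are all assumptions made for illustration.

```python
import zlib


def owner(key: str, num_nodes: int) -> int:
    """Hypothetical placement rule: hash the key to pick the one node that owns it."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


class SharedNothingCluster:
    """Toy model: every node has private storage that no other node can read."""

    def __init__(self, num_nodes: int):
        # One private "disk" per node.
        self.local_disks = [dict() for _ in range(num_nodes)]

    def write(self, key: str, value: int) -> None:
        # A record lands on exactly one node's private disk.
        self.local_disks[owner(key, len(self.local_disks))][key] = value

    def total(self) -> int:
        # Scatter-gather: each node sums only its own data,
        # then a coordinator combines the partial results.
        partials = [sum(disk.values()) for disk in self.local_disks]
        return sum(partials)


class SharedStorageCluster:
    """Toy model: all nodes read and write one common storage pool."""

    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.shared_pool = {}

    def write(self, key: str, value: int) -> None:
        # Any node can write any record; there is no per-node ownership.
        self.shared_pool[key] = value

    def total(self) -> int:
        # Work is still divided among nodes, but every node
        # pulls its slice from the same shared pool.
        keys = sorted(self.shared_pool)
        partials = [
            sum(self.shared_pool[k] for k in keys[i::self.num_nodes])
            for i in range(self.num_nodes)
        ]
        return sum(partials)
```

Both toy clusters return the same answer to the same query; what differs is where the data lives. Connecting a shared storage system to a shared-nothing database means reconciling those two placement models, which is why the pairing is worth watching.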

So when you hear vendors talk about big data, be sure to ask: do they mean big-data storage, big-data analytics, or big-data storage that supports big-data analytics? I know that's a mouthful, but clarity here is everything. IBM is also big in big data (it calls it Smarter Planet) and has a scale-out NAS (SoNAS) system as well. HP could make big-data announcements that include its IBRIX-based X9000 NAS system, too.

And if it turns out that the vendor speaks of shared storage in what is typically a shared-nothing storage environment, take notes. In fact, if you don't mind, cc me. I'm cataloging these. Calpont's InfiniDB is an example of one, and I'm on the lookout for others.