Last week I attended the GigaOM Structure Big Data conference in New York City. Although my resume says I'm a storage analyst of long standing, this was not a storage conference. However, my e-mail inbox reminds me daily that storage vendors think "big data" spells big opportunity, so I went to see how, and how much, they can really contribute to the advancement of big-data analytics.
This conference only confirmed a suspicion that's been building for the last few months as I've been following the big-data wave: Big-data practitioners are generally hostile to shared storage. They like direct-attached storage (DAS) in various forms, from solid-state disk (SSD) to high-capacity SATA disk buried inside parallel processing nodes. SANs (storage area networks) need not apply.
Assembled at this event were some of the best and brightest minds in big data. But in numerous presentations, I saw an avoidance of SAN and NAS (network-attached storage). One large systems vendor that also sells shared storage even touted the fact that its analytics database platform didn't require a SAN.
Why? There are two interrelated reasons. First, most if not all of the attendees here would include real- or near-real-time information delivery as one of the defining characteristics of big-data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good. Data on spinning disk at the other end of a SAN connection is not, unless perhaps it's a secondary copy of data. (I'll get to that in a minute.) Second, while some here believed that it was theoretically possible to get high-performance shared storage to stand up to the low-latency requirement, the cost of such a SAN at the scale these people need was seen to be prohibitive.
I learned many things at Structure Big Data. I learned, for example, that data clouds are real, although they're referred to as data markets (not to be confused with data marts). Think of a data market this way: The cloud says that you don't have to own computing infrastructure to reap all of IT's benefits. You don't even have to own the applications running on it. Now we have data markets that say you don't even have to own data. Just rent it. Pay for data generated by someone else as you need it, all in the cloud.
I also learned that there is a long list of start-ups and database structures in this space that I encountered for the first time. But as a storage analyst bound to more traditional computing models, I admit my radar has blind spots. I learned more about what a data scientist does. And I learned that West Coast big-data vendors are way ahead of their East Coast brethren.
What I didn't learn, beyond what I already knew going into this event, was how shared storage can actually play in the big-data amusement park. I know of and have written about NAS storage playing the role of data protector and archivist. These are important functions, no doubt, but I don't believe that many big-data users buy the argument that their DAS-resident data is unprotected. As one user pointed out, the DAS can be RAID storage, and data copies can be distributed across processing nodes for redundancy. I sensed that I could have argued until I turned blue that there was still exposure to data loss, so I didn't.
At this point, most production-quality storage platforms built for the enterprise data center just aren't appreciated here. Storage presented as a shared data service with all the attendant bells and whistles isn't what the majority of big-data application developers are looking for. They want simple, fast, and cheap. Yet I believe there is a case to be made for shared storage, certainly as a secondary storage facility that offers data protection and archive services, and possibly more. But storage vendors and the storage community in general have yet to make that case to the people I met at the Structure Big Data event.