EMC wants to distribute data

EMC has unveiled a vision for globally federating data: essentially, a cache for storage that spans wide-area communication links.

Gordon Haff

Gordon Haff is Red Hat's cloud evangelist, although the opinions expressed here are strictly his own. He's focused on enterprise IT, especially cloud computing. However, Gordon writes about a wide range of topics, whether they relate to the way too many hours he spends traveling or to his longtime interest in photography.

March 12, 2010 12:25 p.m. PT

Virtualization first hit the big time because it let users consolidate many system images onto a single physical server, thereby reducing the amount of hardware they needed to buy. As time has passed, however, the mobility of virtual machines (the ability to move them from one server to another at the click of a button) has come to be seen as a big win as well. Known by names such as VMotion (in VMware's case), this mobility enables system maintenance and workload balancing without interrupting users. Ultimately, it underpins visions of more dynamic computing environments where workloads move transparently and automatically within and even between data centers in response to changes in demand, service outages, and even power costs.

Virtual machines can be transparently moved, in part, because they're not very big. They're a block of memory--a few gigabytes at most. The big chunk of data associated with a system sits on disk and, generally speaking, virtual machines transfer between physical servers within a single shared storage pool. In other words, the virtual machines move but the data stays put.

There are good reasons why things are done this way. Moving large amounts of data, especially over long distances, takes time; the "pipes" are only so big. All other things being equal, accessing data that is far away also takes longer than accessing data that is close. And many applications have to ensure they're working with the most recent copy of data, so everything has to be constantly synced up. Yet restricting movement to within a local storage network is limiting.
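A back-of-the-envelope calculation makes the size gap concrete. The sketch below is illustrative only: the 4 GB memory image, the 10 TB data set, and the 1 Gbit/s wide-area link are assumed figures, not numbers from EMC, and the formula ignores latency and protocol overhead.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Idealized time to push size_bytes over a link (no latency, no overhead)."""
    return size_bytes * 8 / link_bits_per_second


GB = 10**9
TB = 10**12

# All three figures below are assumptions chosen for illustration.
vm_memory = 4 * GB      # memory image of one virtual machine
disk_data = 10 * TB     # data set sitting on disk
wan_link = 1 * 10**9    # 1 Gbit/s wide-area link

vm_time = transfer_seconds(vm_memory, wan_link)
data_time = transfer_seconds(disk_data, wan_link)

print(f"VM memory: {vm_time:.0f} seconds")
print(f"Disk data: {data_time / 3600:.1f} hours")
```

Under these assumptions the memory image crosses the link in about half a minute, while the disk data needs most of a day.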

Dealing with that limitation was the topic of Thursday's session by EMC President of Information Infrastructure Products Pat Gelsinger in Hopkinton, Mass. Using the rubric "virtual storage," Gelsinger described a "distributed cache coherence" concept that would "create the illusion of having petabytes [of geographically distributed data] local."

Caches are a familiar concept from computer system design. On the scale of a processor's compute cycles, the dynamic random access memory used for computer systems' primary storage is relatively slow and often makes the processor wait for data. One of the workarounds for this problem is to add memory, typically of a faster type, on or near the processor where it can be accessed more quickly. This "cache" memory contains a subset of main memory's contents.

Cache design is a complicated subject. A system typically has several levels of cache and the whole hierarchy has to present a coherent view of the contents of memory. But in a nutshell: Because data tends to be accessed in certain predictable patterns, a relatively small amount of nearby storage can present the appearance of a much larger quantity while delivering performance that can be a high percentage of what could be achieved if all of the storage were nearby and equally fast.
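The effect of predictable access patterns can be seen in a toy simulation. This is a minimal sketch, not a model of EMC's technology: it assumes a least-recently-used (LRU) eviction policy and a made-up workload in which a small "hot" set of blocks receives most of the reads. The class and function names (`LRUCache`, `simulate`) and all parameters are hypothetical.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used block."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id: int) -> None:
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)    # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1                     # would go to distant storage
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def simulate(total_blocks, cache_blocks, reads, hot_fraction, hot_share, seed=0):
    """Return the hit rate when hot_share of reads target a small hot set."""
    rng = random.Random(seed)
    hot_count = max(1, int(total_blocks * hot_fraction))
    cache = LRUCache(cache_blocks)
    for _ in range(reads):
        if rng.random() < hot_share:
            block = rng.randrange(hot_count)     # read from the hot set
        else:
            block = rng.randrange(total_blocks)  # read from anywhere
        cache.read(block)
    return cache.hit_rate()


# The cache holds 10% of all blocks in both runs.
skewed = simulate(10_000, 1_000, 100_000, hot_fraction=0.05, hot_share=0.9)
uniform = simulate(10_000, 1_000, 100_000, hot_fraction=0.05, hot_share=0.0)
print(f"skewed workload hit rate:  {skewed:.2f}")
print(f"uniform workload hit rate: {uniform:.2f}")
```

With the skewed workload, most reads are served from the small cache; with uniformly random reads, the same cache helps far less. That gap is why the behavior of real applications matters so much to any caching scheme.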

EMC's virtual storage will be partially based on intellectual property that it acquired when YottaYotta shut down in 2008. Robin Harris described the company's technology this way: "The YottaYotta system was a network-based RAID controller. The controller's backplane was a network - Infiniband or GigE - so the controller could be physically distributed. The coordination of the distributed controller boards through wide-area cluster software is the company's key IP." EMC Global Marketing CTO Chuck Hollis notes, however, that what's being discussed here isn't literally YottaYotta's product: "We did get a nice piece of cool IP from them around distributed cache coherence algorithms as a starting point, but you'd be incorrect in thinking this is the same technology that they were selling a while back."

Gelsinger went on to describe how this "globally federated" storage could be the foundation for a wide range of functions managed through global policies at the middleware, application, and virtual machine levels. For example, an application could specify that it needs low latency storage or that it needs storage specifically optimized for read performance. A security policy might specify that a particular virtual machine never leaves a given data center.
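EMC did not show what such policies would look like in practice, so the following is purely a hypothetical sketch of the idea: a policy is a set of constraints, and the storage layer filters the available storage locations against it. Every name here (`StorageTier`, `Policy`, `eligible`, the site names, the latency figures) is invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StorageTier:
    """A hypothetical storage location and its characteristics."""
    site: str
    latency_ms: float
    read_optimized: bool


@dataclass(frozen=True)
class Policy:
    """Hypothetical constraints an application or VM might declare."""
    max_latency_ms: Optional[float] = None
    needs_read_optimized: bool = False
    pinned_site: Optional[str] = None  # e.g. "never leave this data center"


def eligible(policy: Policy, tiers: List[StorageTier]) -> List[StorageTier]:
    """Return the tiers that satisfy every constraint in the policy."""
    result = []
    for tier in tiers:
        if policy.pinned_site is not None and tier.site != policy.pinned_site:
            continue
        if policy.max_latency_ms is not None and tier.latency_ms > policy.max_latency_ms:
            continue
        if policy.needs_read_optimized and not tier.read_optimized:
            continue
        result.append(tier)
    return result


tiers = [
    StorageTier("boston", 2.0, False),
    StorageTier("boston", 8.0, True),
    StorageTier("london", 90.0, True),
]

low_latency = eligible(Policy(max_latency_ms=5.0), tiers)
read_heavy = eligible(Policy(needs_read_optimized=True), tiers)
pinned = eligible(Policy(pinned_site="boston"), tiers)
```

Here the low-latency policy matches only the fastest Boston tier, the read-optimized policy matches one tier in each city, and the pinned policy rules out London entirely.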

Yesterday's session was explicitly about the concept and not a specific product. Gelsinger did go so far as to say that the first instantiation of the technology would take the form of a bundled storage appliance, but it's basically software technology that EMC will bring to market in a variety of forms over time.

Time frames? None were given, although Hollis writes: "We wouldn't be sharing a vision without real products coming to back them up. As Pat mentioned, this stuff is in use today in real-world customer environments, and they're pretty excited about it." So the initial appliance seems likely to make an appearance within, say, months.

However, whatever the exact timing of Version 1.0, this is the sort of concept that will roll out over years. And it won't be until we start to see it in use over distributed communication links with real applications and workloads that we'll really be able to judge its effectiveness. After all, the difficulty in designing caches isn't really in making them work (not that that's especially easy either) but in making them always work with high performance. But if EMC cracks the code on the tough problem of distributing data over large distances--while maintaining good performance--this is a development to watch.