IBM goes for really, really, really big data

Big Blue's latest invention is a 120-petabyte data repository that seems big now, but won't in a few years.

Dave Rosenberg Co-founder, MuleSource
Dave Rosenberg has more than 15 years of technology and marketing experience that spans from Bell Labs to startup IPOs to open-source and cloud software companies. He is CEO and founder of Nodeable, co-founder of MuleSoft, and managing director for Hardy Way. He is an adviser to DataStax, IT Database, and Puppet Labs.

According to an article in this week's MIT Technology Review, IBM researchers are working on a new 120-petabyte data repository made up of 200,000 conventional hard disk drives working together. The giant data container is expected to store around 1 trillion files and should provide the space needed for more powerful simulations of complex systems, like those used to model weather and climate.

The new system benefits from a file system known as General Parallel File System (GPFS), which was developed at IBM Almaden to give supercomputers faster data access. It spreads individual files across multiple disks so that many parts of a file can be read or written at the same time.
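To make the striping idea concrete, here is a minimal Python sketch. It is not GPFS's actual algorithm or API; the block size, the round-robin placement, and the function names `stripe` and `read_striped` are all assumptions made for illustration, and the "disks" are just in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # bytes per block; tiny and purely illustrative (assumed value)


def stripe(data, num_disks, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and deal them round-robin across disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % num_disks].append(data[offset:offset + block_size])
    return disks


def read_striped(disks):
    """Fetch each disk's blocks concurrently, then interleave them back into order."""
    with ThreadPoolExecutor(max_workers=len(disks)) as pool:
        # Each worker stands in for one disk serving its own blocks in parallel.
        per_disk = list(pool.map(list, disks))
    ordered = []
    longest = max(len(blocks) for blocks in per_disk)
    for row in range(longest):
        for blocks in per_disk:
            if row < len(blocks):
                ordered.append(blocks[row])
    return b"".join(ordered)


if __name__ == "__main__":
    payload = b"abcdefghijklmnopqrstuvwxyz"
    layout = stripe(payload, num_disks=3)
    print(layout)
    print(read_striped(layout) == payload)
```

Because consecutive blocks land on different disks, a large read can keep every disk busy at once instead of waiting on a single drive, which is the gist of the speedup described above.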

GPFS uses a cluster architecture to speed up access to file data, which it automatically spreads across multiple storage devices to make the best use of available storage and deliver high performance. GPFS is also the storage engine for IBM's Watson, which could easily beat me at Jeopardy.

Here's the interesting part: 120 petabytes equals roughly 24 billion 5-megabyte MP3 files, which sounds like a lot. But contrast it with the enormous volume of data being amassed by sites such as Facebook, which in 2009 was already storing 25 terabytes of logs a day, and you see that the repository could hold only about 4,800 days' worth of those logs, or roughly 13 years.
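For anyone who wants to check the back-of-the-envelope math, here is a short Python sketch. The constant and function names are my own, and the result for the log comparison depends on an assumption about units: counting 1,000 terabytes per petabyte gives 4,800 days, while counting 1,024 gives about 4,915.

```python
# Decimal (SI) storage units, in bytes.
MB = 10**6
TB = 10**12
PB = 10**15

CAPACITY_PB = 120        # size of the repository, from the article
MP3_SIZE_MB = 5          # size of one MP3 file, from the article
LOGS_TB_PER_DAY = 25     # Facebook's 2009 daily log volume, from the article


def mp3_files(capacity_pb=CAPACITY_PB, mp3_mb=MP3_SIZE_MB):
    """Number of MP3 files of the given size that fit in the repository."""
    return (capacity_pb * PB) // (mp3_mb * MB)


def days_of_logs(capacity_pb=CAPACITY_PB, tb_per_day=LOGS_TB_PER_DAY, tb_per_pb=1000):
    """Days of logs that fit; tb_per_pb is 1000 (decimal) or 1024 (binary)."""
    return capacity_pb * tb_per_pb / tb_per_day


if __name__ == "__main__":
    print(f"MP3 files: {mp3_files():,}")
    print(f"Days of logs (1 PB = 1,000 TB): {days_of_logs():,.0f}")
    print(f"Days of logs (1 PB = 1,024 TB): {days_of_logs(tb_per_pb=1024):,.1f}")
    print(f"Years, decimal units: {days_of_logs() / 365:.1f}")
```

Either way, the answer works out to roughly 13 years of 2009-era Facebook logs.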

With the volume of data online and offline growing exponentially, I have a feeling that 120 petabytes won't sound so crazy in five years or less. It also goes to show that there's room for innovation around storage and file systems, despite the maturity of the market.