According to an article in this week's MIT Technology Review, IBM researchers are working on a new 120 petabyte data repository made up of 200,000 conventional hard disk drives working together. The giant data container is expected to store around 1 trillion files and should provide the space needed to allow more powerful simulations of complex systems, like those used to model weather and climate.
The new system benefits from a file system known as General Parallel File System (GPFS), developed at IBM Almaden to give supercomputers faster data access. It spreads individual files across multiple disks so that many parts of a file can be read or written at the same time.
GPFS uses a cluster architecture to provide quicker access to file data: data is automatically distributed across multiple storage devices, making optimal use of the available storage to deliver high performance. It's also the storage engine behind IBM's Watson, which could easily beat me at Jeopardy.
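To picture what "spreads individual files across multiple disks" means, here is a toy round-robin striping sketch. This is not GPFS's actual on-disk layout (the block size and disk count below are made up for illustration); it just shows the basic idea: block i of a file lands on disk i mod N, so up to N blocks can be read or written concurrently.

```python
from itertools import chain, zip_longest

BLOCK_SIZE = 4   # bytes per block; real file systems use much larger blocks
NUM_DISKS = 3    # the IBM system stripes across 200,000 drives

def stripe(data: bytes, num_disks: int = NUM_DISKS):
    """Split data into fixed-size blocks; block i goes to disk i % num_disks."""
    disks = [[] for _ in range(num_disks)]
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for i, block in enumerate(blocks):
        disks[i % num_disks].append(block)
    return disks

def reassemble(disks):
    """Read one block from each disk per round (round-robin) to rebuild the file."""
    rounds = zip_longest(*disks, fillvalue=b"")
    return b"".join(chain.from_iterable(rounds))

data = b"the quick brown fox jumps over the lazy dog"
disks = stripe(data)
assert reassemble(disks) == data
```

In a real parallel file system the per-disk reads in `reassemble` would be issued concurrently, which is where the speedup comes from.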
Here's the interesting part: 120 petabytes equals roughly 24 billion 5-megabyte MP3 files, which sounds like a lot. But compare it with the enormous volume of data being amassed by sites such as Facebook, which in 2009 was already storing 25 terabytes of logs a day, and you see that the system could hold only about 4,915 days' worth.
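The arithmetic behind those two figures works out as follows (a quick sketch; note that the article's numbers appear to mix unit conventions: "24 billion MP3s" assumes decimal petabytes, while "4,915 days" assumes binary petabytes; with decimal units throughout it would be 4,800 days):

```python
CAPACITY_PB = 120       # total capacity in petabytes
MP3_MB = 5              # assumed size of one MP3 file, in megabytes
LOGS_TB_PER_DAY = 25    # Facebook's reported 2009 log volume, in terabytes

# Decimal convention: 1 PB = 10**15 bytes, 1 MB = 10**6 bytes, 1 TB = 10**12 bytes
mp3s_decimal = CAPACITY_PB * 10**15 // (MP3_MB * 10**6)
days_decimal = CAPACITY_PB * 10**15 / (LOGS_TB_PER_DAY * 10**12)

# Binary convention: 1 PB = 2**50 bytes, 1 TB = 2**40 bytes
days_binary = CAPACITY_PB * 2**50 / (LOGS_TB_PER_DAY * 2**40)

print(mp3s_decimal)  # 24000000000 -> "roughly 24 billion" files
print(days_decimal)  # 4800.0
print(days_binary)   # 4915.2 -> the article's "4,915 days"
```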
With the volume of data online and offline growing exponentially, I have a feeling that 120 petabytes won't sound so crazy in five years or less. It also goes to show that there's room for innovation around storage and file systems, despite the maturity of the market.