IBM announced Thursday that it is working with the British Library on a project that will preserve and analyze terabytes of information on the Web before it is lost forever.
Recent research estimates the average life expectancy of a Web site is 44 to 75 days. Every six months, for example, roughly 10 percent of Web pages on the U.K. domain are lost.
For most personal sites, this is no big loss. But for organizations attempting to archive and chronicle elections, news, media, and video, this steady loss of data presents massive challenges. And even if you have the data, the question remains whether it will be usable, or even in a recognizable format.
The new analytics software project, called IBM BigSheets, helps extract, annotate, and visually analyze vast amounts of Web information using a Web browser. The British Library is using a prototype of the software to archive and preserve massive numbers of Web pages to ensure the data doesn't disappear over time.
And this is no small task. The British Library receives a copy of every physical publication produced in the U.K. and Ireland, amounting to more than 150 million maps, manuscripts, musical scores, newspapers, and magazines that it must archive.
Beyond its physical assets, the British Library has been archiving selected Web pages from the U.K. domain since 2004. According to David Boloker, CTO of Emerging Technologies at IBM, BigSheets will let future users of the library access vast archives of historic Web sites and easily research, analyze, and visualize the results of their queries.
Boloker also told me via e-mail that the BigSheets software is built on top of several open-source components:
- Hadoop--an open-source framework for reliable, scalable, distributed computing and data storage.
- Nutch--an open-source Web-search project that builds on Lucene Java, adding Web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.
- Pig--an open-source platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
Boloker explained that BigSheets is a private cloud service running parallel MapReduce jobs on all of the library's machines. And while it's a private cloud (take note--private cloud spotted in the wild), the British Library will make the data and services available for people to access.
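For readers unfamiliar with the MapReduce model Boloker mentions, here is a minimal single-process sketch in Python of its three phases: map, shuffle, and reduce. It is not BigSheets code and does not use Hadoop. The sample records, the function names (`map_page`, `shuffle`, `reduce_counts`, `run_job`), and the choice of counting archived pages per host are all assumptions made purely for illustration. On a real cluster, the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical sample of archived page records: (url, capture year).
ARCHIVED_PAGES = [
    ("http://www.example.co.uk/news/1", 2004),
    ("http://www.example.co.uk/news/2", 2005),
    ("http://blog.example.org.uk/post", 2005),
    ("http://www.example.co.uk/about", 2006),
]


def map_page(record):
    """Map step: emit one (host, 1) pair for a single archived page."""
    url, _year = record
    yield urlparse(url).netloc, 1


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one host."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence (a cluster would parallelize)."""
    mapped = (pair for record in records for pair in map_page(record))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(ARCHIVED_PAGES))
```

Because each call to `map_page` looks at one record and each call to `reduce_counts` looks at one key, the work can be split across machines without the pieces needing to coordinate, which is what makes the model a good fit for very large archives.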
There is no shortage of data to analyze these days, and more and more government agencies and large corporations will find themselves in search of these types of solutions. What's nice to see is that open-source software, and perhaps more importantly Apache-licensed open-source software, is what next-generation analytics tools are being built on.
Updated 8:40 a.m. PDT on February 25 to reflect that IBM's announcement had officially taken place.