Why 'big data' is a magnet for startups
So-called "big data" opens up new ways to mine the Web and social media for consumers and businesses. Data gurus say these apps require a rethink in how computing systems are built.
CAMBRIDGE, Mass.--Armies of entrepreneurs are trying to make money sifting through mountains of data from the Web and other sources, but one of the biggest challenges is simply getting control of the data in the first place.
Entrepreneurs at an event here this week said that the trend of "big data," or collecting and analyzing reams of information from varied sources, threatens incumbent technology providers and enables applications once considered impossible. Startups are harnessing massive amounts of data to generate personalized entertainment ideas, predict how media coverage will affect company stock prices, or analyze genomes in search of new medical treatments.
As these applications begin to crop up, the computer scientists behind them are realizing that they require a completely different technical foundation, according to people at the event, which was organized by the Massachusetts Technology Leadership Council. The prevailing view among speakers was that current database systems are simply not cut out for the speed, variety, or volume of data now available online.
"The thing that's really fun to me is that [big data] is majorly disruptive to the legacy vendors...and that will be good for startups," said Michael Stonebreaker, an MIT professor and long-time entrepreneur behind a series of new database systems.
Hadoop, a distributed file system originally designed for indexing the Web, and other alternatives to 1980s-style SQL databases are the foundation for big-data apps. These give developers more flexibility in how data is formatted and, because they are often open source, put pressure on incumbent providers.
More significantly, these database systems make it easier to process information from the Web, social media, and other new data sources, such as sensors in cars or utility networks.
Companies have collected and analyzed piles of information for decades in corporate data warehouses. Now application developers are treating the Web and social media as a big database in its own right.
"What's different about big data is that it's driven by the Web and the Internet," said Kelly Stirman, the vice president of customer solutions at Hadapt, which is making data-querying tools for Hadoop. "All the Web companies tried to use Oracle [databases] to solve their problems but eventually gave up."
Everyone's a quant
The startup Goby, which was acquired by GPS navigation company Telenav, crawls thousands of local online information sources to generate recommendations for what people can do in their free time. The fact that so much information is available online makes this sort of application possible, said CEO Mark Watkins.
"We're trying to distill a fire hose of information in a way that's personal and meaningful," he said. "The key is you need lots of data to make this stuff statistically significant."
Scouring the Web for information can give everyday business people the same analytical horsepower that cutting-edge Wall Street analysts, nicknamed "quants," had a few years ago.
The startup Recorded Future collects 100,000 to 300,000 documents an hour from the Web and analyzes them in real time to gain insight into global trends. The software is being used to monitor riots in South America on Google Earth and also to predict the volatility of stock prices based on media coverage.
These sorts of far-reaching analytical systems have been used by intelligence agencies for years, but they can now be built using open-source software and applications hosted in the cloud, said Recorded Future CEO Christopher Ahlberg.
"By organizing the world, I can ask questions...and think analytically about the world in ways that were not possible before," he said. "The Web is an amazing source of analytics. I think it will provide some of the most predictive data sources if properly dealt with."
The hardware for processing massive amounts of varied data in real time is improving all the time. IBM intends to deploy the Watson supercomputer, which famously beat two "Jeopardy" game show champions, for new uses, such as generating better diagnoses or analyzing social media to understand consumer sentiment, said Deepak Advani, vice president of business analytics at IBM.
But because these applications are so different from corporate data warehouses of old, data scientists and engineers need to rearchitect their systems and develop new products, said Andy Palmer, who co-founded database company Vertica Systems and was CIO of Infinity Pharmaceuticals. "We're in the early days of the process of figuring which [database] engines match best to which workloads."