Flu predictions get more accurate
Google's annual Flu Trends metric was wildly off last year. But scientists were able to leverage the company's data to design a far more successful system.
Google Flu Trends, the search giant's metric for estimating the percentage of a country's population suffering from influenza, is exactly the kind of tool we imagine the company to be capable of building. By monitoring the frequency and geographical spread of flu-related search queries, the tool is supposed to provide an accurate estimate of the virus's spread.
The only problem: it was wildly wrong last year, estimating that 11 percent of the population had the flu, nearly double what the Centers for Disease Control and Prevention (CDC) ended up estimating.
Now doubling back on last year's flu outbreak, scientists at Columbia University's Mailman School of Public Health managed to design a system that, for the first time, predicted the timing of the 2012-2013 flu season's peak up to nine weeks in advance, according to results published in Nature Communications Tuesday. When the team applied its system to 108 US cities, its forecasts were correct in 63 percent of cities two to three weeks before the virus peaked and hit an unprecedented 73 percent accuracy at the peak itself.
So how exactly did a group of scientists design a system more accurate than the mounds of data produced by countless queries through the world's most trafficked search engine? By combining the most illuminating aspects of Google Flu Trends data with region-specific reports from the CDC, explained lead author Jeffrey Shaman, an assistant professor of Environmental Health Sciences at Columbia.
"Google Flu Trends is not a predicting tool. It's a surveillance tool," Shaman said. "What we're doing is we're using a measuring of influenza incidence and using it together with a mathematical model that describes propagation through a population and then forecasting the flu," he added. The team saw Google Flu Trends' strength in its local, city-level monitoring. By multiplying its data by CDC data that homed in on concrete flu cases, and not simply on people displaying flu-like symptoms, they were able to get a much cleaner outlook.
That combined metric is far more accurate than any competing system out there, Shaman said, but it still has its flaws. He noted that the system fared poorly on Chicago flu predictions and overall did better in smaller geographic areas. But these errors help illuminate ways to improve flu-prediction systems down the line.
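As a rough illustration of the combined metric described above, the multiplication can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the Columbia team's code: the function name, the idea that both inputs are fractions between 0 and 1, and all of the weekly numbers are hypothetical, and the mathematical model that turns the combined signal into a forecast is not shown.

```python
def combined_flu_signal(flu_like_fraction, confirmed_fraction):
    """Scale a flu-like-illness estimate by the share of tested cases confirmed as flu.

    Hypothetical helper: both arguments are assumed to be fractions between 0 and 1.
    """
    for value in (flu_like_fraction, confirmed_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be fractions between 0 and 1")
    return flu_like_fraction * confirmed_fraction


# Made-up weekly readings for one city:
# (search-based flu-like-illness estimate, share of lab tests confirmed as flu).
weekly_readings = [(0.02, 0.10), (0.04, 0.25), (0.06, 0.40)]
weekly_signal = [combined_flu_signal(s, c) for s, c in weekly_readings]

for week, signal in enumerate(weekly_signal, start=1):
    print(f"week {week}: combined flu signal {signal:.2%}")
```

The point of the multiplication is that a spike in searches only counts for as much as the confirmed-case data supports, which is one way to dampen a surge driven by worry or news coverage rather than by actual infections.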
"Population density may also be important. It suggests that in a city like New York, we may need to predict at a finer granularity, perhaps at the borough level. In a big sprawling city like Los Angeles, we may need to predict influenza at the level of individual neighborhoods," Shaman outlined.
How exactly did Google's data go "off the rails," as the Columbia team described it?
"We hadn't had a strong strain of seasonal flu for probably about nine years. It was an early outbreak and it was a virulent strain," Shaman said. That meant not only that more and more people were going online, vigilantly researching their symptoms, but also that the media was latching onto flu stories to a higher degree and playing a potentially major role in throwing off Google's algorithm.
In October, Google announced that it had altered its Flu Trends algorithm to adjust for these kinds of errors. Regardless, Shaman is grateful for the vast number of tools at scientists' disposal, and said he could not have formulated his system without Google.
"I think what Google is doing is fantastic. There are going to be problems along the way," he said. It's simply one tool, and "should not be used in isolation."
"It's a version of Big Data and it has its utility like anything else, but it has to be used judiciously and intelligently," he added. "You can't assume it's correct and you have to keep track of any errors, like any system."
Shaman and his team's system will be put to use as soon as this year's flu season kicks into high gear, which typically happens around December. The data will be made publicly available on a Web site hosted by Columbia that is expected to go live in the coming weeks.