Tracking the flu by tracking the tweets

Computer scientists find that monitoring microblogging services such as Twitter is more effective and less expensive than the old-fashioned "syndromic surveillance" approach in predicting outbreaks.

Elizabeth Armstrong Moore
Elizabeth Armstrong Moore is based in Portland, Oregon, and has written for Wired, The Christian Science Monitor, and public radio. Her semi-obscure hobbies include climbing, billiards, board games that take up a lot of space, and piano.

Let's face it: the typical tweet in the Twitosphere (if you need help with the vernacular, consult the Twictionary) is about as revelatory as the words going into the cell phone of the girl sitting behind me on the bus last night. The vast majority are meaningless to strangers--and probably even to close friends.

The joy of irony: Twitter's addictive nature may help researchers track health trends. carrotcreative/Flickr

But the sheer volume of Twitter activity (the site is "over capacity" as I type this) turns otherwise banal tweets into telling trends, when scrutinized in the aggregate, and health trends are no exception.

"A microblogging service such as Twitter is a promising new data source for Internet-based surveillance because of the volume of messages, [and] their frequency and public availability," according to Aron Culotta, assistant professor of computer science at Southeastern Louisiana University, who in recent months analyzed 500 million tweets to track the flu.

Culotta and two student assistants collected these messages using Twitter's application programming interface. Only a handful of keywords were required both to track rates of influenza-related messages and to predict future rates and outbreaks.

"This approach is much cheaper and faster than having thousands of hospitals and health care providers fill out forms each week," Culotta says. "Once the program is running, it's actually neither time-consuming nor expensive--it's entirely automated because we're running software that samples each day's messages, analyzes them, and produces an estimate of the current proportion of people with the flu."
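For readers curious what that kind of daily estimate might look like, here is a minimal sketch in Python. It is not the team's software: the keyword list, the function name, and the (date, text) input format are all assumptions made for illustration, and a raw share of matching messages is only a rough proxy, not an estimate of how many people actually have the flu.

```python
import re
from collections import defaultdict

# Hypothetical keyword list; the article does not say which terms the team used.
FLU_KEYWORDS = ("flu", "cough", "sore throat", "headache")

# Match any keyword as a whole word or phrase, ignoring case.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in FLU_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def daily_flu_fraction(messages):
    """Return {day: share of that day's messages mentioning a keyword}.

    `messages` is an iterable of (day, text) pairs, an assumed input format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for day, text in messages:
        totals[day] += 1
        if _PATTERN.search(text):
            hits[day] += 1
    return {day: hits[day] / totals[day] for day in totals}
```

Fed one day's sample of messages at a time, a function like this yields a single number per day that can then be lined up against official figures.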

It's also, much like Google Flu Trends, less accurate than traditional surveillance. But not by much. The team found a 95 percent correlation with the national health statistics collected by the Centers for Disease Control and Prevention.
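A "95 percent correlation" presumably refers to a correlation coefficient of about 0.95 between the Twitter-based series and the official one, though the article does not spell out the exact measure. As a rough sketch under that assumption, the standard Pearson coefficient for two equal-length weekly series can be computed like this (the function name is ours, not the team's):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A value near 1 means the two series rise and fall together; it says nothing about whether their absolute levels agree.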

Culotta says analyzing tweets has an advantage over Google Flu Trends because of the sheer volume and frequency of tweets. (Twitter has reported having more than 190 million users posting a cumulative 65 million messages a day, with about 300,000 new users getting added daily.)

Culotta's next goal is to track messages that include location-specific data so he can segment reporting information by regions and post trending data in real time.

The team announced its findings after presenting them at the 2010 Workshop on Social Media Analytics at the Conference on Knowledge Discovery and Data Mining in Washington, D.C., in July. Its updated paper, "Detecting Influenza Epidemics by Analyzing Twitter Messages," can be downloaded as a PDF.