AOL, Netflix and the end of open access to research data

First the AOL search logs last year, and now the Netflix database. With these two incidents, it is highly unlikely that any company will ever again share data with researchers.

Chris Soghoian

Christopher Soghoian delves into the areas of security, privacy, technology policy and cyber-law. He is a student fellow at Harvard University's Berkman Center for Internet and Society, and is a PhD candidate at Indiana University's School of Informatics. His academic work and contact information can be found by visiting www.dubfire.net/chris/.

Dec. 1, 2007 12:14 p.m. PT

Correction: The authors of the Netflix de-anonymization study contacted me to point out that they originally published a draft of their results a mere two weeks after Netflix released its dataset. Netflix has known about their study for over a year.

Over the past year, there have been a number of high-profile incidents in which sensitive user data was accidentally revealed to the Internet at large. As a result, I believe that high-tech companies will never again share anonymized data on their users with academic researchers, at least not without requiring contracts and nondisclosure agreements. For users and privacy advocates, this is probably a good thing. However, for researchers, the scientific community, and Internet users who want cool new technologies, this is almost certainly a change for the worse.

In 2006, Netflix released over 100 million movie ratings made by 500,000 subscribers to its online DVD rental service. The company then offered $1 million to anyone who could improve its DVD recommendation system. To protect its customers' privacy, Netflix anonymized the data set by removing any personal details.

Researchers announced this week that they were able to de-anonymize the data by comparing the Netflix data against publicly available ratings on the Internet Movie Database (IMDb). Whoops.

For Internet privacy geeks, this Netflix incident is just another version of an all-too-familiar tale: A well-meaning company releases a large set of user data, which it has scrubbed to remove any identifying information. Armed with this data set, researchers are able to trace backward and match names to the profiles and their online behavior.
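For readers who want to see why scrubbing names is not enough, here is a minimal sketch of the general idea behind matching an "anonymized" profile against a public one. It is not the researchers' actual method. The data, the subscriber labels, and the function names `overlap_score` and `best_match` are all made up for illustration, and the matching rule (count identical ratings, require a clear winner) is a deliberately simple assumption.

```python
from typing import Dict, Optional

# A profile maps a movie title to a star rating (1-5).
Ratings = Dict[str, int]


def overlap_score(anon: Ratings, public: Ratings) -> int:
    """Count movies that both profiles rated with exactly the same score."""
    return sum(1 for movie, stars in public.items() if anon.get(movie) == stars)


def best_match(
    public: Ratings,
    anonymized: Dict[str, Ratings],
    min_score: int = 3,
) -> Optional[str]:
    """Return the anonymized ID that best matches a public profile.

    Returns None unless one candidate reaches min_score and strictly
    beats every other candidate (a toy rule, chosen for illustration).
    """
    scored = sorted(
        ((overlap_score(ratings, public), anon_id)
         for anon_id, ratings in anonymized.items()),
        reverse=True,
    )
    if not scored:
        return None
    top_score, top_id = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0
    if top_score >= min_score and top_score > runner_up:
        return top_id
    return None


# Entirely made-up "anonymized" release: no names, only opaque IDs.
anonymized = {
    "subscriber_1": {"Movie A": 5, "Movie B": 3, "Movie C": 4, "Movie D": 1},
    "subscriber_2": {"Movie A": 4, "Movie E": 2, "Movie F": 5, "Movie G": 3},
    "subscriber_3": {"Movie B": 3, "Movie E": 2, "Movie H": 4},
}

# Made-up ratings that a named person posted publicly elsewhere.
public_profile = {"Movie E": 2, "Movie F": 5, "Movie G": 3}

match = best_match(public_profile, anonymized)
print(match)
```

In this toy example, three matching ratings are enough to single out one record. Once that link is made, everything else in that record (here, the rating for "Movie A", which the person never posted publicly) is tied to a name.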

The same thing happened back in 2006 when AOL released the search records of 500,000 of its users. Within days of the database's release, journalists from the New York Times had revealed the identity of user number 4417749 to be Thelma Arnold, a 62-year-old widow from Lilburn, Ga. Over 300 of the woman's searches were traced back to her, ranging from "60 single men" to "dog that urinates on everything."

The fallout from the AOL incident was devastating, both for the company and the industry as a whole. The CTO of the company and the researchers responsible for sharing the data were all fired. In addition to pulling the data set, AOL took the entire Web presence of its research division offline. More than a year later, the AOL Research group still does not have a working homepage.

The shockwaves spread to the entire search engine industry. Google's CEO Eric Schmidt spoke to journalists shortly after AOL posted the data. After calling the data release "a terrible thing," he assured the public that "this kind of thing could not happen at Google."

The end result was that no search engine would ever again release anonymized log data to the research community.

~~The announcement by researchers of their Netflix project is so recent that it has yet to be seen how the company will respond. The data has been public for over a year, and~~ With a $1 million prize, the release almost certainly required sign-off from executives (and so the company cannot blame rogue researchers as AOL did). While search engine logs are obviously extremely sensitive, video rental records are also very private, enough so that Congress has given them a higher level of protection than almost any other form of personal data (this was prompted by the worry that the politicians' own rental records could be published by journalists).

Companies do not make money by giving researchers access to data. They do it to promote and encourage research in the field. Based on the AOL and Netflix incidents, I suspect that we will see a major chill hit the industry. No high-tech company with large amounts of user data will ever again risk making it available to researchers without first requiring them to sign a lengthy contract. The risk of the data being de-anonymized (and the resulting public relations and legal trouble) is simply not worth it.

So, what if companies require researchers to sign agreements before the firms hand over anonymized user data? Isn't that a good way to protect users, yet still enable researchers to do their thing? Unfortunately, research is rarely respected by the community when the data comes with strings. It is for good reason that people are dubious when drug companies sponsor research into the safety of one of their drugs. When a company holds the keys to the data, it can stop the publication of anything that will make it look bad.

As a privacy advocate and end user, I think the shift against sharing anonymized data is probably a good thing. After all, I don't want some random student browsing through my search history, anonymized or not. However, if I take the end-user hat off, and put on my PhD student hat, then this is a really bad thing. Researchers depend on accurate data in order to do their work. Without the data, we don't get new exciting research, and thus no new cool technologies. For the research community, this Netflix incident will be the final nail in the coffin of information sharing from the dot-coms.