Data about human beings can tell you countless helpful things. How crowded is a hospital waiting room on a typical Saturday? What's the traffic like for your morning commute? But collecting all this data comes with the risk of exposing private information about individuals. When you want to know how busy a hospital is, you don't need – or want – to know who was in the emergency room last Saturday.
That's exactly the dilemma you face with big data.
It's with this problem in mind that Google on Thursday introduced a group of open-source software tools that focuses on differential privacy. It's a concept that sets limits on how much you can learn about specific people in big data sets, something the tech industry is drowning in. Google has built many of its own data-analysis products on top of the tools, and the company envisions everyone from academics to large tech companies using the suite of software programs.
"The aim of this is to provide a library of primary algorithms that you could build any type of differential privacy solution on top of," Bryant Gipson, an engineering manager at Google, said in an interview on Wednesday.
Google's release of the tools shows the company addressing privacy concerns at a time when consumers are increasingly worried that the tech industry is abusing their data. Along with similar projects Google has announced this year, Thursday's announcement points to a clear strategy by Google – keep on crunching mind-boggling amounts of user data, but put limits on how that data could affect individuals. In March, the company open-sourced its TensorFlow Federated software, which lets machine learning algorithms analyze data on users' devices instead of extracting it and storing it on external servers. In August, Google announced it was developing a "privacy sandbox" that will let advertisers display targeted ads while restricting tracking technology.
Differential privacy has the potential to protect your data in settings far beyond Google's products. Academics can use it to protect the privacy of study participants, and city planners can use it to protect your data as they seek to understand traffic patterns and service usage.
Even the US Census Bureau is concerned about keeping US residents' data private, so much so that it's planning to release slightly less accurate data to keep outliers from standing out – one side effect of using differential privacy.
Differential privacy is necessary because simply removing a user's name from their data isn't enough to make it anonymous. It's all too easy to re-identify someone in a data set using mathematical tricks. The process is similar to breaking a complex code, and the more data you have about an individual in a data set, the faster you can re-identify them, Gipson said.
Differential privacy counters this with mathematical maneuvers of its own: it adds carefully calibrated statistical noise to aggregate results, putting a provable limit on how much any analysis can reveal about a single individual in a data set.
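The noise-adding idea can be sketched with the classic Laplace mechanism, the textbook building block behind tools like Google's library. This is an illustrative example, not code from Google's release: a counting query gets noise drawn from a Laplace distribution whose scale is set by the privacy parameter epsilon (smaller epsilon means more noise and stronger privacy).

```python
import math
import random

def laplace_noise(scale):
    """Sample from a Laplace(0, scale) distribution via inverse-CDF sampling."""
    u = random.random() - 0.5  # uniform on (-0.5, 0.5)
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def dp_count(true_count, epsilon):
    """Release a count with epsilon-differential privacy.

    A counting query changes by at most 1 when one person is added or
    removed (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for the released value.
    """
    return true_count + laplace_noise(1.0 / epsilon)

# Example: how many patients visited the ER on Saturday?
noisy = dp_count(true_count=137, epsilon=1.0)
```

The released number is close to the truth on average, but no single patient's presence or absence can be confidently inferred from it – which is exactly the trade-off the Census Bureau accepts when it publishes slightly less accurate figures.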
In one infamous example of anonymous data gone wrong, Netflix released data about 500,000 anonymous users in 2006 for anyone to analyze. In an academic paper, data scientists showed how they could tie the data to public user information on IMDb and re-identify a significant number of the Netflix users, revealing information about their political beliefs along the way. A similar problem emerged the same year when New York Times reporters were able to identify and speak with a specific user from among an anonymous set of AOL user searches.
Gipson emphasized that differential privacy by itself won't keep user data safe, and people handling user data need to use a wide range of privacy strategies.
Still, the technique is difficult to implement correctly, and Google hopes that, as more people use its tools, a shared understanding of differential privacy will emerge and evolve.
"This is the beginning of a conversation," Gipson said, "not the ending."