Privacy's random answer

No need to lie about your age online--a math technique will keep your personal data covered, says News.com's Michael Kanellos.

Michael Kanellos Staff Writer, CNET News.com
Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas.
If IBM is right, corporate databases in the future might record your age as 157 and your income as the square root of two.

Big Blue is experimenting with an idea for customer databases called data randomization. The technique will, conceivably, preserve consumer privacy by masking data such as income, age, past purchases or medical information through mathematical calculations that can't be unwound.

For instance, if a customer gives their age as 38 when registering at an online shopping site, a randomizing plug-in in the browser adds a random number between minus 25 and 112 to that age and sends only the sum to the server.
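To make that step concrete, here is a minimal Python sketch of what such a plug-in might do. The offset range comes from the example above; the function name `randomize_age` and the choice of a uniform random integer are assumptions made for illustration, not details IBM has described.

```python
import random

# Offset range from the article's example. The uniform distribution and the
# function name are assumptions for illustration only.
OFFSET_LOW, OFFSET_HIGH = -25, 112


def randomize_age(true_age, rng=random):
    """Return the true age plus a random integer offset.

    Only this sum would leave the browser; the true age stays local.
    """
    return true_age + rng.randint(OFFSET_LOW, OFFSET_HIGH)


rng = random.Random(0)  # seeded so the demo is repeatable
sent = [randomize_age(38, rng) for _ in range(5)]
print(sent)  # five different values the server might see for a 38-year-old
```

Every value the server receives for a 38-year-old falls somewhere between 13 and 150, so a single record says very little about the person behind it.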

The wrinkle is that, at the back end, computers then apply a barrage of calculations to the scrambled data to discern patterns among all customers. The 38-year-old individual's true age can never be recovered, but an online business can figure out, with reasonable accuracy, how popular it is with 38-year-olds. Unscrambled data collected by the company--such as how much a person paid for a car and on what date--could subsequently be randomized too, for additional privacy.

"The basic notion, in some sense, is kind of heresy in computer science. The normal notion is, in order to do a good job, you need to have accurate information," said Rakesh Agrawal, a senior fellow at IBM who is leading the research. "And here we are saying, 'You have good information, and we are going to perturb it or put errors into it to protect people's privacy.'"

A boon to privacy?
I find data randomization appealing on two levels. First, it's a healthy reminder of why we have big companies in the first place. They exist to hire the math geniuses and chemistry whizzes of the world, who in turn build the society of tomorrow. Without them, the Wheelo would stand as the apex of scientific achievement.

Second, it represents an opportunity to defuse the ugly conflict over privacy. A large--and seemingly growing--number of consumers are furious about how companies and institutions collect, trade and transmit their data.

In reality, most of the harvested data is never exploited for nefarious purposes. Using an ATM card does create an electronic trail of your life, but it's not as if FBI agents are sitting around right now looking at your file and thinking, "He's eaten at Carl's Jr. three times in the last month. Wanna bet he goes there again in five days?"

Still, consumers resent the practice, and the Federal Trade Commission has made protecting consumer privacy a high priority.

To foil data harvesting, people often lie, but that doesn't actually work: companies can still reconstruct the basic data patterns. "It turns out that people are not very good at lying," Agrawal said. "Essentially people leave tell-tale signs."

The randomization system relies on determining the relationship between different values through Bayesian probability. Consumers fill in their true data, which then gets randomized before being sent over.

At the corporate end, servers then try to determine what type of randomizing calculations were applied to scramble the original values.

"We basically ask the following question: 'What could have generated this distribution?'" Agrawal said.

If the computer can come up with the likely randomizing technique that was employed--adding a random number between 15 and 87, or subtracting one between 8 and 32, for example--it can then draw a chart that accurately simulates what the customer base looks like. In several contained trials, the reconstructed curve differed from the curve plotted by the original data by two to three percent.
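The reconstruction step can be sketched with an iterative Bayes update. This is an illustration under stated assumptions, not IBM's code: it assumes the server already knows the noise range (the minus 25 to 112 used in the earlier example), that the noise is uniform, and that true ages fall between 18 and 90. The names `reconstruct` and `noise_prob` are made up for the example.

```python
import random
from collections import Counter

NOISE_LOW, NOISE_HIGH = -25, 112   # offset range from the article's example
AGES = list(range(18, 91))         # assumed range of possible true ages


def noise_prob(offset):
    """Probability of an integer offset under uniform noise (assumed known)."""
    if NOISE_LOW <= offset <= NOISE_HIGH:
        return 1.0 / (NOISE_HIGH - NOISE_LOW + 1)
    return 0.0


def reconstruct(perturbed, iterations=30):
    """Estimate the distribution of true ages from perturbed values only."""
    counts = Counter(perturbed)                 # group identical observations
    est = {a: 1.0 / len(AGES) for a in AGES}    # start from a flat guess
    for _ in range(iterations):
        new = dict.fromkeys(AGES, 0.0)
        for w, count in counts.items():
            # Bayes' rule: how likely is each true age, given the observed sum w?
            weights = {a: noise_prob(w - a) * est[a] for a in AGES}
            total = sum(weights.values())
            if total == 0.0:
                continue  # observation inconsistent with the assumed ranges
            for a in AGES:
                new[a] += count * weights[a] / total
        norm = sum(new.values())
        est = {a: v / norm for a, v in new.items()}
    return est


# Synthetic demo data: made-up ages clustered around 38, then perturbed.
rng = random.Random(1)
true_ages = [min(90, max(18, round(rng.gauss(38, 8)))) for _ in range(2000)]
perturbed = [a + rng.randint(NOISE_LOW, NOISE_HIGH) for a in true_ages]

estimate = reconstruct(perturbed)
est_mean = sum(a * p for a, p in estimate.items())
true_mean = sum(true_ages) / len(true_ages)
print(f"true mean age: {true_mean:.1f}, estimated mean age: {est_mean:.1f}")
```

Only the perturbed list is passed to `reconstruct`; the individual true ages are never used. How close the estimate comes depends on the sample size and on how wide the noise is, so the numbers from this toy run should not be compared with the two to three percent reported in IBM's trials.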

"It comes back to the true distribution, always. This is the beauty of math, fortunately or unfortunately," Agrawal said. "I think the key insight was that you don't have to have access to precise information to build good models."

IBM continues to conduct trials with the technology, but Agrawal already sees some areas where it could bring benefits. Large businesses such as rental car companies could pool their data without the risk of disclosing customer lists. Hospitals could give access to records about a hepatitis outbreak without being sued. Network break-ins would become potentially less dangerous.

And when filling out a customer questionnaire at Home Depot, you won't feel compelled to claim you have 16 kitchens.