Baking privacy into personal data

Jeff Jonas, an IBM distinguished engineer, says his technology strikes the balance between personal privacy and sophisticated data analysis.

It's not often that a high-school dropout becomes a distinguished engineer at IBM. But Jeff Jonas, the company's chief scientist of entity analytics, has done just that by developing ways to mine personal data while maintaining privacy.

Jonas first developed "identity resolution" technology more than 20 years ago. He claims the software is now mature enough to let government agencies and corporations do sophisticated analysis on the reams of data they have--even sensitive personal information.

Jonas started his company, SRD, in 1983 to help corporate customers spot fraud. He then moved to Las Vegas to help law enforcement find bad guys who use multiple pseudonyms to elude police.

SRD received funding from the CIA's venture investing arm, In-Q-Tel, to expand the use of the product, helping intelligence agencies correlate disparate pieces of information to track terrorists and other criminals.

IBM bought privately held SRD in January of this year, in part by selling Jonas on the prospect of ramping up the use of his software in many industries.

Jonas has said that using the right techniques, government agencies and businesses can gather and analyze information on people without violating privacy. He spoke to CNET to explain.

Q: Casinos were some of your first customers. Can you tell me how you used your technology?
Jonas: That's where we really cut our teeth on learning and understanding how identities change through time...Vegas is quite a sweet target to deploy some rather sophisticated attacks that have been going on for years. People will create four or five different identities. There's one person I know that has 30 different names. We spent time with the gaming industry helping them to know who they were doing business with. And they wanted to understand whether the people transacting with them were on their bad guy list.

So you were able to create this correlation between all these different false names?
Well, there are cases where you can create an identity that is so pure that it is non-matchable. But bad guys have to remember an identity package. And there are some things they do that compromise one identity with another that allow you to put them together.

How did you apply that identity resolution technology in other areas?
The Las Vegas work was before our anonymization work. The anonymization is now called "relationship resolution." That class of technology is designed for organizations that already have the data. They want to bring more meaning out of the data.

For example?
They want to find out if the vendor (and) the accounts payable manager are related. They want to figure out that these 16 customers are really all the same. The anonymization technology is used by organizations that share data and match it, but they are tense over the loss of it because it's sensitive.
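As a rough illustration of the "these 16 customers are really all the same" idea, the sketch below groups records that share any attribute value, using a small union-find structure. It is not SRD's or IBM's algorithm; the `resolve` function, the record IDs and the attribute strings are all hypothetical, and real identity resolution would need far more careful matching rules.

```python
from collections import defaultdict

def resolve(records):
    """Group record ids that share at least one attribute value.

    `records` maps a record id to a set of attribute strings
    (hypothetical format, e.g. "phone:555-0100").
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    seen = {}  # attribute value -> first record id that carried it
    for rid, attrs in records.items():
        for value in attrs:
            if value in seen:
                # Shared attribute: merge the two clusters.
                parent[find(rid)] = find(seen[value])
            else:
                seen[value] = rid

    groups = defaultdict(set)
    for rid in records:
        groups[find(rid)].add(rid)
    return sorted(sorted(group) for group in groups.values())

# Made-up sample data for illustration only.
records = {
    "cust-1": {"phone:555-0100", "addr:12 Elm St"},
    "cust-2": {"phone:555-0100", "addr:9 Oak Ave"},
    "cust-3": {"addr:9 Oak Ave"},
    "vendor-7": {"phone:555-0199"},
}
clusters = resolve(records)
print(clusters)
```

Here the three customer records end up in one cluster because they are chained together by a shared phone number and a shared address, while the vendor stays separate.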

So that allows them to analyze their information on people without having to be explicit about identity?
Yeah. So let me give you an example. I met with a financial services company and they were estimating they were losing $10 million a year around a certain kind of fraud. That's a very small part of their total fraud, but they figured if they could put their customers, their employees and the people that they already knew were fraudsters into one database, they figured that that would help them save, reduce that fraud, maybe by $10 million.

But the risk of putting all of your customers and all of your employees in one database, the risk of having that escape and get away--if somebody hacks a system or an employee goes bad--the risk of that to their brand was more than $10 million. So they chose not to do it.

So this (anonymization) technique would allow them to analyze that data that they have all those rights legally to analyze. But to do it in a way that reduces the risk of unintended disclosure, such as somebody hacking the system and stealing information.

Many people are concerned about the amount of information on them that's collected. Can this technology be used beyond law enforcement or fraud detection?
My view of how relevant this technology is has been changing over time. I've come to this new conclusion: that is, if a company is sharing its sensitive data--like its customers or employees--and if you told the company they could share it in an anonymized form and get a materially similar result, why would a company want to share it any other way? And I've been passing that around to see what people think about that and I'm getting a lot of agreement. I'm starting to think that what has been created here is bigger than I was originally thinking.

It is relevant to health care. It is relevant to how the companies share data for marketing purposes. It is relevant to how government would share information. I'm starting to see more and more places where I'm like, "Wow, you could apply it there as well."

How does this anonymization technology work? How is it different from encryption?
That's a good question. So with encryption, I would normally encrypt my data and send it to you and you would decrypt it to use it. The technique that we've developed is: I encrypt and you encrypt and all of the analysis is done only using the encrypted data. It's not decrypted first. Historically, you first decrypt it to analyze it. We figured out how to do deep analytics while it is encrypted.
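One simple way to picture "I encrypt and you encrypt" is for both parties to turn their identifiers into keyed one-way hashes and compare only those tokens. The sketch below uses HMAC-SHA-256 from Python's standard library as a stand-in; it is an assumption for illustration, not the technique Jonas describes, and the shared key, the `anonymize` helper and the sample addresses are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret agreed on by both parties out of band.
SHARED_KEY = b"demo-key-agreed-out-of-band"

def anonymize(identifiers, key=SHARED_KEY):
    """Return the set of keyed one-way tokens for a list of identifiers."""
    tokens = set()
    for value in identifiers:
        cleaned = value.strip().lower()  # light cleanup before hashing
        mac = hmac.new(key, cleaned.encode("utf-8"), hashlib.sha256)
        tokens.add(mac.hexdigest())
    return tokens

# Each party tokenizes its own list locally; raw values are never exchanged.
party_a = ["alice@example.com", "bob@example.com"]
party_b = ["Bob@Example.com ", "carol@example.com"]

tokens_a = anonymize(party_a)
tokens_b = anonymize(party_b)

# The matching step sees only tokens, never the underlying identifiers.
overlap = tokens_a & tokens_b
print(len(overlap))
```

The comparison finds the one record the two lists have in common without either side revealing the rest of its list in readable form. A production system would need much more than this, including protection against guessing attacks on the tokens.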

Privacy, it seems, has at least as much to do with the right practices as the right technology. Do you coordinate with IBM's chief privacy officer (Harriet Pearson)?
Absolutely. How could I possibly be wearing a privacy hat without a regular dialogue?

I also have a privacy strategist that reports to me. This is rather unusual for most companies. But we've got a guy named John Bliss that works with me on privacy strategies. It kind of highlights the importance with which we see baking privacy into the creations. I don't invent something for our group, pass it off to engineering and later figure out how to make it protect privacy--or think about how we're going to message it to be privacy-protective.

John Bliss and I worked very closely together and with others in IBM, including IBM Research--they have some incredibly smart people doing work in the privacy areas in Almaden, Zurich and Watson. So by the time I bake something up for engineering, it has the best notion of, at that time, what would be a responsible way to deploy it.

Since you've been at IBM, have you found more awareness of this identity resolution technology by corporate customers?
They've just got a growing awareness. I'll tell you a funny thing. The first year that I invented it, I was generally told that it was impossible.

How come?
Well, one-way hashes are infinitely sensitive. "Bob" and "Bob " (with a space) produce entirely different hashes. So, the number of times that identity data is exactly the same on the left hand and the right hand is almost never, because one's got a period after their middle initial and the other one doesn't. So it was believed that it wouldn't have any real practical use, because it was believed that it would be too sensitive.
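The sensitivity Jonas describes is easy to reproduce with any cryptographic hash. The sketch below uses SHA-256 from Python's standard library to show that "Bob" and "Bob " hash differently, and then shows one commonly used mitigation, normalizing values before hashing. The `normalize` rules are an assumption made for this example and are not a description of Jonas's method.

```python
import hashlib
import re

def digest(value: str) -> str:
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def normalize(value: str) -> str:
    """Hypothetical cleanup: fold case, drop punctuation, collapse spaces."""
    value = value.casefold()
    value = re.sub(r"[^\w\s]", "", value)
    return " ".join(value.split())

# A single trailing space yields a completely different digest.
raw_match = digest("Bob") == digest("Bob ")

# After normalization, small formatting differences no longer matter.
normalized_match = (
    digest(normalize("Robert J. Smith")) == digest(normalize("robert j smith "))
)

print(raw_match, normalized_match)
```

Normalization only handles formatting noise such as the period after a middle initial; it does nothing for nicknames, misspellings or deliberately altered identities.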

But now that I've been presenting the techniques and we have some trade secrets, some pending patents, we're encouraged that others are going after this as well. We think it'll be a real market. We moved from a year of, "That's not even possible, you're lying" to "OK, OK, you can do it; it'll work."