
Deciphering Google's Wi-Fi headache (FAQ)

Life at the Googleplex has grown a little more tense following the stunning admission that Google had been collecting personal data from Wi-Fi networks for years. How did this happen?

Tom Krazit Former Staff writer, CNET News

How did Google's Wi-Fi spying debacle get to this point?

As Google prepares to defend itself against allegations of Wi-Fi spying, it has said very little about exactly what kind of personal data it gathered as part of its Street View project. Last week, Google also declined to make executives available to speak on the record about how one of the most monumental oversights in its history occurred: the inadvertent gathering of "payload" data by Wi-Fi sniffers mapping hotspots while recording street scenes for Google Street View.

But Google finally did confirm a few additional details about the type of scanning procedure it used as well as the nature of the code first written by Google engineers back in 2006. It first took responsibility for the gaffe--which only came to light after detailed inquiries from German authorities--in a blog post on May 14, and ever since then, Google critics have delighted at the opportunity the incident has provided, with lawsuits and Congressional inquiries pending.

Let's take a look at what Google has said and some of the technology issues in question to get some more perspective on Google's Wi-Fi scanning problem.

What data does Google have?
Google admitted on May 14 that it had been "mistakenly collecting samples of payload data from open (i.e. non-password-protected) Wi-Fi networks" for three years. Payload data is distinct from a "header," which contains mostly benign information about the network itself: the payload is the actual data being transmitted over the network.
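The header/payload split can be sketched in a few lines of code. A basic 802.11 data frame carries a 24-byte MAC header (frame control, duration, addresses, sequence control) followed by the payload; the layout below is deliberately simplified, and the example frame is fabricated for illustration.

```python
HEADER_LEN = 24  # basic 802.11 data-frame MAC header (no QoS/HT extensions)

def split_frame(frame: bytes):
    """Split a captured frame into its benign header and its payload."""
    return frame[:HEADER_LEN], frame[HEADER_LEN:]

# Fabricated frame: 24 header bytes followed by application data.
frame = bytes(24) + b"GET /index.html HTTP/1.1"
header, payload = split_frame(frame)
print(len(header))  # 24
print(payload)      # b'GET /index.html HTTP/1.1'
```

A scanner that maps hotspots only needs the header half; logging the second half is where Google got into trouble.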

That sounds bad. Theoretically, it means that a Street View car stopped at a red light outside a coffee shop could have been sniffing its unsecured wireless access point and collecting data as it traveled over that unsecured network.

However, Google's store of personal data might not be quite the treasure trove it seems. Data sent to and from encrypted Web sites (password logins, online banking, credit-card transactions, or anything with https:// in the URL) would not be collected. Mobile workers signed into VPNs would also not be affected.

In addition, it's not totally clear how much data Google would be able to capture with a Street View car moving at about 25 miles per hour along the streets of cities and towns around the world. Google said the data was "fragmented," implying that piecing together any coherent image from that data would be difficult.

A company with the algorithmic and computing resources of Google could theoretically make some sense of the 600GB of fragmented data collected over the last three years. Google already knows a great deal about your online life if you're one of the two-thirds of Americans who regularly use its search engine, but data willingly provided to the company is different from data snatched out of thin air.

How did Google get the data?
Google confirmed it was using "passive" scanning techniques to discover Wi-Fi hotspots. That means there was the wireless equivalent of a big ear on the Street View cars that listened for any and all wireless signals. There's nothing inherently wrong with passive scanning, but most passive scanners are set to not record payload data.

Google Street View car
Google Street View cars mapped Wi-Fi hotspots in addition to taking pictures, but went a little too far. (Credit: Google)

To avoid any possibility of collecting payload data, some other wireless mapping companies, such as Skyhook (which has gotten no shortage of free publicity from Google's screwup), use active scanning. This means Skyhook's scanning equipment sends out a probe signal to determine whether any access points are in range, and access points recognize that signal and return their own message that basically says "here I am, here's how to find me, and here's how fast I can send you the Internet." This is also how your computer or phone finds an available Wi-Fi network.
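The probe-and-response exchange can be simulated in a few lines. This is a toy model of the handshake described above; the class and field names are invented for illustration and don't correspond to any real driver or Skyhook API.

```python
class AccessPoint:
    """Toy model of an access point that answers probe requests."""
    def __init__(self, ssid, bssid, channel, rate_mbps):
        self.ssid, self.bssid = ssid, bssid
        self.channel, self.rate_mbps = channel, rate_mbps

    def handle_probe(self, probe_ssid):
        # An AP answers a broadcast probe ("") or one naming its SSID with,
        # in effect: "here I am, here's how to find me, here's my speed."
        if probe_ssid in ("", self.ssid):
            return {"ssid": self.ssid, "bssid": self.bssid,
                    "channel": self.channel, "rate_mbps": self.rate_mbps}
        return None

def active_scan(access_points):
    """Send a broadcast probe and collect every response in range."""
    responses = []
    for ap in access_points:
        reply = ap.handle_probe("")  # broadcast probe request
        if reply is not None:
            responses.append(reply)
    return responses

aps = [AccessPoint("CoffeeShop", "aa:bb:cc:dd:ee:ff", 6, 54)]
found = active_scan(aps)
print(found[0]["ssid"])  # CoffeeShop
```

Note that the scanner only ever receives what access points volunteer about themselves; no user traffic passes through it, which is exactly why active scanning avoids the payload problem.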

Active scanning is said to scale better, but passive scanning is more comprehensive and can't be detected by the network access point.

Scanning public Wi-Fi networks has been a hobby for wireless enthusiasts and criminal hackers for years. Back in the days when Wi-Fi was just getting off the ground, "wardrivers" would locate and map public hotspots as a service, while those bent on criminal activity could do the same thing to steal data or borrow a network for illegal ends.

All other issues aside, the incident is yet another reminder that operating an unsecured wireless access point is like leaving your front door wide open with your jewelry on the doormat.

How could Google have let this happen?
We don't really know.

Google confirmed it uses the open-source Kismet wireless scanning software as the base of its Wi-Fi mapping program. But additional code was written by Google engineers to discard any encrypted payload data captured as part of the scanning, Google said Friday.

That additional code is what is giving Google executives a headache. Without having any inside information, Lauren Weinstein, a longtime networking expert and co-founder of People for Internet Responsibility, believes this is the heart of the debacle: Someone at Google forgot to modify the software before it left a testing environment and entered a production one.

Inside the friendly confines of the Googleplex, logging all publicly available wireless data--including payload data--would be a normal way to test whether the system will function normally as data streams into the application, Weinstein said. "You want to make sure you're not going to crash things. When you're in your own environment, it's your data; you can do what you want with it," he said.

And discarding the encrypted data makes sense in that environment, because encrypted payloads are recorded as gibberish that can't be used to run network diagnostics.

However, if this is what happened, the code should not have been allowed out of the labs without being modified to discard all payload data gathered, not just the encrypted portion. "A procedural breakdown of this sort shouldn't occur," Weinstein said.
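If Weinstein's theory is right, the failure mode would look something like the sketch below. The frame fields and function names are invented for illustration; this is not Google's actual code.

```python
def log_frame_lab(frame, log):
    # Lab/test behaviour: keep every unencrypted payload to stress-test the
    # pipeline; encrypted payloads decode as gibberish, so drop those.
    if not frame["encrypted"]:
        log.append(frame["payload"])

def log_frame_production(frame, log):
    # What should have shipped: record header metadata only,
    # and discard every payload, encrypted or not.
    log.append({"bssid": frame["bssid"]})

frames = [
    {"bssid": "aa:bb:cc:dd:ee:ff", "encrypted": False,
     "payload": b"private email"},
    {"bssid": "11:22:33:44:55:66", "encrypted": True,
     "payload": b"\x8f\x1a\x07"},
]

lab_log, prod_log = [], []
for f in frames:
    log_frame_lab(f, lab_log)        # the filter that allegedly shipped
    log_frame_production(f, prod_log)

print(lab_log)   # [b'private email']  <- the payload that should never be kept
print(prod_log)  # header metadata only
```

The bug, on this theory, is not in either function but in the release process: the lab filter left the building unchanged, so three years of unencrypted payloads ended up on disk.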

Was it really a simple mistake?
Your answer to that question depends on whether you trust Google.

Those who follow the Internet industry have been noticing a troubling trend over the past several years: one in which Internet companies push the boundaries of user privacy and data collection and apologize once they're found out or the backlash can't be ignored, only to start pushing once again after the hubbub dies down.

Likewise, Google has been willing to push in areas of law that haven't necessarily anticipated the effects of the Internet and digital technology, as it did when it decided to scan copyright-protected books in the belief that it had a fair-use right to do so, a right that copyright law in that situation neither explicitly grants nor explicitly bars.

It's not illegal to inadvertently capture public wireless data under federal electronic privacy laws, but it is illegal to intentionally do so. All of Google's public statements to this point have characterized the data gathering process as accidental. The developer of Kismet appeared to find such a basic error entirely plausible and human, and posted a playfully chastising blog item to that effect last week, pointing out how easy it would have been to change the code to make sure the software didn't log payload data.

But as the late Ronald Reagan liked to say, "trust, but verify." Google could go a long way toward clearing up any confusion by publishing a much more detailed technical explanation of how this came to be, and by publicly allowing a third party to review the code and data as promised in its May 14 blog post.

Most of this will probably come out in court hearings and congressional testimony anyway. Until then, some will think Google looks like it has something to hide.