Google, others dig deep--maybe too deep

Search engines rummaging the Internet for interesting Web pages are increasingly stumbling upon passwords, credit card numbers, classified documents and even computer vulnerabilities.

Paul Festa, Staff Writer, CNET News.com
Search-engine spiders crawling the Web are increasingly stumbling upon passwords, credit card numbers, classified documents and even computer vulnerabilities that can be exploited by hackers.

The problem is not new, security analysts say: Ever since search robots began indexing the Web years ago, Web site administrators have found pages not meant for public consumption exposed in search results.

But a new tool built into the Google search engine, designed to find a variety of file types in addition to traditional Web documents, is highlighting and in some cases exacerbating the problem. With Google's new file-type search tool, a wide array of files formerly overlooked by basic search engine queries is now just a few clicks from the average surfer--or the novice hacker.

The file types include Adobe PostScript; Lotus 1-2-3 and WordPro; MacWrite; Microsoft Excel, PowerPoint, Word, Works and Write; and the Rich Text Format.

"The overall problem is worse than it was in the early days, when you could do AltaVista searches on the word password and up come hundreds of password files," said Christopher Klaus, founder and chief technology officer of Internet Security Systems, a provider of information-security systems. "What's happening with search engines like Google adding this functionality is that there are a lot more targets to go after."

Since Google's new tool launched earlier this month, surprised Web site owners have been busy pulling down or securing sensitive pages that have turned up in Google results.
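A site owner checking for this kind of exposure might start by listing which files under the Web server's document root fall into the newly searchable types. The following is a minimal sketch of such an audit, not a tool described by anyone quoted here. The extension-to-type mapping, the function name and the directory layout are all assumptions made for illustration.

```python
import os

# Extensions commonly associated with the file types named in the article.
# This mapping is an assumption for illustration; adjust it for your own site.
SUSPECT_EXTENSIONS = {
    ".ps": "Adobe PostScript",
    ".wk1": "Lotus 1-2-3",
    ".lwp": "Lotus WordPro",
    ".mw": "MacWrite",
    ".xls": "Microsoft Excel",
    ".ppt": "Microsoft PowerPoint",
    ".doc": "Microsoft Word",
    ".wks": "Microsoft Works",
    ".wri": "Microsoft Write",
    ".rtf": "Rich Text Format",
}


def find_exposed_documents(document_root):
    """Return sorted (relative_path, file_type) pairs for matching files."""
    matches = []
    for folder, _subdirs, filenames in os.walk(document_root):
        for name in filenames:
            # Compare extensions case-insensitively so BUDGET.XLS is caught too.
            extension = os.path.splitext(name)[1].lower()
            if extension in SUSPECT_EXTENSIONS:
                full_path = os.path.join(folder, name)
                relative = os.path.relpath(full_path, document_root)
                matches.append((relative, SUSPECT_EXTENSIONS[extension]))
    return sorted(matches)
```

Calling `find_exposed_documents` on a document root returns the files worth reviewing by hand. The sketch only finds candidates; deciding whether a given file is sensitive, and whether to remove it or put it behind a login, is still up to the administrator.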

Google disavows responsibility for the security problem. But at the same time, the company has begun devising ways to catch sensitive pages before they wind up exposed to public view.

"Our specialty is discovering, crawling and indexing publicly available information," said Google spokesman David Krane. "We define public as anything placed on the public Internet and not blocked to search engines in any way. The primary burden falls to the people who are incorrectly exposing this information. But at the same time, we're certainly aware of the problem, and our development team is exploring different solutions behind the scenes."

Viral threats
In addition to giving malicious hackers a handy tool for scouting out sensitive information or vulnerable computers, Google's file-type search could pose a risk to searchers who click on file types that are more susceptible than Web pages to viruses and other hostile code.

"The security issue was a top thing I thought of when the new types were released," Danny Sullivan, editor of SearchEngineWatch.com, wrote in an e-mail interview. "It's great to have the additional coverage, but people might not realize when they click on a link that they could expose themselves to viruses. It's not something we've encountered with search engines before because HTML files are pretty safe," though JavaScript can be used in some exploits.

Google searchers concerned about viral threats can select a "View HTML" version of non-HTML file types. That option would render useless malicious code written for applications such as Microsoft's Word and Excel.

Search engines already go to some pains not to crawl where they are unwelcome. Web site administrators can add to their sites a simple "robots.txt" file that asks crawling bots to stay away.

Google also maintains a site for Webmasters giving them several options for curtailing or turning away search crawlers.

But the consent-based option has its share of loopholes. Asking Web crawlers not to index a page does not make it inaccessible to the outside world. A robots.txt file can only succeed in turning away compliant search bots, leaving the door wide open to malicious crawlers.

In addition, the robots.txt "keep out" sign could serve as an advertisement to hackers that valuable or sensitive information lies behind it.
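Both points can be seen in a short sketch using the robots.txt parser in Python's standard library. The file contents, the paths, the bot name and the example.com URLs below are invented for illustration. The first half shows what a compliant crawler does; the second shows that anyone who reads the file learns exactly which paths the site would rather keep quiet.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /admin/backup/
"""


def build_parser(robots_text):
    """Parse robots.txt text the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def disallowed_paths(robots_text):
    """Return every path named in a Disallow line, as any reader of the file could."""
    paths = []
    for line in robots_text.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks first and skips whatever is disallowed.
allowed_home = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
allowed_private = parser.can_fetch(
    "ExampleBot", "http://www.example.com/private/passwords.txt"
)

# A crawler that never asks is not stopped by anything here, and the file
# itself hands over a list of the paths the site wanted left alone.
hints = disallowed_paths(ROBOTS_TXT)
```

Nothing in the sketch blocks a request. The check in `can_fetch` runs inside the crawler, so it protects a page only when the crawler chooses to run it.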

Security analysts concerned about the use of search engines for bad ends point to two problems. One is the exposure of sensitive, unsecured information such as passwords and credit card numbers. The second is the use of search engines to find Web sites running programs, such as CGI (Common Gateway Interface) scripts, with known vulnerabilities.

Hackers find a way
Still, analysts are quick to say that even without Google and its peers, hackers have tools at their disposal for crawling the Web. Recent Internet worms such as Code Red and Nimda prove that massive, automated hacking exploits have no need of search engines to find vulnerable computers.

"Intruders have their own search engines that bypass the robot-ignore feature and would still find the same sensitive documents with passwords or known flawed CGI script or what have you," said Internet Security Systems' Klaus. "And a robots.txt file could be a flag for intruders to say, this must be interesting if robots are being told not to look at it.

"The underlying issue is that the infrastructure of all these Web sites aren't protected."

Webmasters queried about the search engine problem said precautions against overzealous search bots are of fundamental concern.

"Webmasters should know how to protect their files before they even start writing a Web site," wrote James Reno, chief executive of Amelia, Ohio-based ByteHosting Internet Services. "Standard Apache Password Protection handles most of the search engine problems--search engines can't crack it. Pretty much all that it does is use standard HTTP/1.0 Basic Authentication and checks the username based on the password stored in a MySQL Database."
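The check behind HTTP Basic Authentication can be sketched in a few lines. This is an illustration of the general scheme, not Apache's code or ByteHosting's setup. The in-memory `USERS` table stands in for the database Reno mentions, and every function name is hypothetical. A crawler that sends no credentials gets a 401 response rather than the page, which is why the protected content never reaches a search index.

```python
import base64
import hmac

# Hypothetical credential store standing in for a database lookup. A real
# site would store salted password hashes rather than plaintext passwords.
USERS = {"alice": "s3cret"}


def parse_basic_header(header_value):
    """Return (username, password) from an Authorization header, or None."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return None
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and bytes that are not valid UTF-8.
        return None
    username, separator, password = decoded.partition(":")
    if not separator:
        return None
    return username, password


def is_authorized(header_value):
    """Check the supplied credentials against the stored ones."""
    if header_value is None:
        return False
    credentials = parse_basic_header(header_value)
    if credentials is None:
        return False
    username, password = credentials
    expected = USERS.get(username)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(password.encode("utf-8"), expected.encode("utf-8"))


def status_for_request(headers):
    """Return the HTTP status a protected page would answer with."""
    return 200 if is_authorized(headers.get("Authorization")) else 401
```

Basic Authentication only encodes the username and password in base64; it does not encrypt them. It keeps crawlers out, but the credentials can be read by anyone able to observe the traffic unless the connection itself is encrypted.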

But other critics said Google bears its share of the blame.

"We have a problem, and that is that people don't design software to behave itself," said Gary McGraw, chief technology officer of software risk-management company Cigital, and author of a new book on writing secure software.

"The guys at Google thought, 'How cool that we can offer this to our users,' without thinking about security. If you want to do this right, you have to think about security from the beginning and have a very solid approach to software design and software development that is based on what bad guys might possibly do to cause your program grief."