
Feds use robots.txt files to stay invisible online. Lame.

Some federal government Web sites, including the Office of the Director of National Intelligence, are trying to remain hidden online by blocking search engines from indexing them. Not only is this lame, but it's a good reason to ignore their robots.txt files.


I noticed, when writing a story on Thursday about the bizarre claims by National Intelligence Director Mike McConnell, that the DNI is trying to hide from search engines. Its robots.txt file says, simply:

User-agent: *
Disallow: /

That blocks all search engines, including Google, MSN, Yahoo, and so on, from indexing any files at the Office of the Director of National Intelligence's Web site. (Here's some background on the Robots Exclusion Protocol if you're rusty.)

So I figured it would be interesting to see which other fedgov sites do the same. I wrote a quick Perl program to connect to federal government Web sites, check for the presence of a broad robots.txt exclusion, and report the results. By way of disclaimer, the list of sites comes from the same database I used for an article in early 2006, so it's probably a bit out of date.
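The gist of the check is simple. Here's a minimal sketch in Perl -- not the original program; the site list is illustrative and the fetching is done with LWP::Simple:

#!/usr/bin/perl
# Sketch: fetch each site's robots.txt and flag blanket bans.
use strict;
use warnings;
use LWP::Simple qw(get);

# Illustrative list; the real run walked a database of federal sites.
my @sites = ('http://www.dni.gov', 'http://thomas.loc.gov');

for my $site (@sites) {
    my $robots = get("$site/robots.txt");
    next unless defined $robots;    # no robots.txt, or the fetch failed

    # A blanket ban pairs "User-agent: *" with "Disallow: /".
    # (Simplified: a strict parser would confirm the two lines
    # sit in the same record.)
    if ($robots =~ m{^User-agent:\s*\*\s*$}mi
        && $robots =~ m{^Disallow:\s*/\s*$}mi) {
        print "$site marks itself entirely off-limits\n";
    }
}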

The government sites that mark themselves as entirely off-limits via robots.txt:

http://www.dni.gov/robots.txt
https://gits-sec.treas.gov/robots.txt
http://thomas.loc.gov/robots.txt
http://www.erl.noaa.gov/robots.txt
http://www.nwd.usace.army.mil/robots.txt
http://www.tricare.mil/robots.txt

Some government sites favor one search engine over another. Customs and Border Protection bans all non-governmental search engines except Google; one Army Corps of Engineers site bans Alexa's spider; the Ginnie Mae agency bans Google's image-search bot but not, say, AltaVista's; the Minority Business Development Agency bans all crawlers but Google's; and one Bureau of Reclamation site bans Googlebot/2.1 but allows MSN's bot:

http://cbp.gov/robots.txt
http://www.nad.usace.army.mil/robots.txt
http://www.ginniemae.gov/robots.txt
http://www.mbda.gov/robots.txt
http://www.mp.usbr.gov/robots.txt
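In robots.txt terms, a Google-only policy like the MBDA's would look something like this (a sketch of the pattern, not the agency's actual file):

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

An empty Disallow line means nothing is off-limits to Googlebot; the catch-all record below it shuts out everyone else.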

And here are some sites that seem to have had trouble with misbehaving Web crawlers in the past:

http://www.cdc.gov/robots.txt
http://www.glerl.noaa.gov/robots.txt
http://www.usbr.gov/robots.txt
http://www.onr.navy.mil/robots.txt
http://www.senate.gov/robots.txt
http://www.usdoj.gov/robots.txt
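Those files tend to ban specific offenders by name rather than everyone at once. The pattern looks like this (the bot names here are common e-mail harvesters, not necessarily the ones these agencies block):

User-agent: EmailSiphon
Disallow: /

User-agent: WebCopier
Disallow: /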

Now, I'm the last person to suggest that using robots.txt to cordon off subsets of your Web site is somehow evil. At News.com, we use it to tell search engines not to index our "email story" pages, for instance, and on my own Web site I use it as well. Blocking misbehaving Web crawlers is important and necessary. And robots.txt may be appropriate when a Web site's address changes, which seems to have happened in the case of the National Oceanic and Atmospheric Administration's site in the first chunk of examples above, or when it becomes defunct, which seems to have happened with the Treasury Department's "gits-sec" Web site above.
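Expressed in robots.txt, that kind of targeted blocking is a single record; here's a sketch with a hypothetical path:

User-agent: *
Disallow: /email-story/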

But why should entire federal offices like the Director of National Intelligence's want to remain invisible online? I can think of two reasons: (a) avoiding the embarrassment of having Google surface a report the agency would rather forget, and (b) letting the Feds quietly modify a file such as a transcript without anyone noticing. (There have been allegations of the Bush administration altering, or at least creatively interpreting, transcripts before. And I've documented how a transcript of a public meeting was surreptitiously deleted -- and then restored.)

Neither situation benefits the public. In fact, I'd say it calls for a friendly amendment to the Robots Exclusion Protocol: Search engines should ignore robots.txt when a government agency is trying to use it to keep its entire Web site hidden from the public.
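In crawler terms, the amendment would be a one-line carve-out in the politeness check. A sketch in the same vein as the script above (the .gov/.mil test is my own simplification of which sites count as "government"):

# Honor robots.txt as usual, except when a government site
# uses a blanket ban to hide everything.
sub should_obey_robots {
    my ($host, $robots) = @_;

    # Same simplified blanket-ban check as before.
    my $blanket_ban = $robots =~ m{^User-agent:\s*\*\s*$}mi
                   && $robots =~ m{^Disallow:\s*/\s*$}mi;
    my $government  = $host =~ /\.(?:gov|mil)$/i;

    return !($blanket_ban && $government);
}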