Until a few hours ago, the Web site of National Intelligence Director Mike McConnell had been invisible in Google, MSN and Yahoo searches. That's because dni.gov's robots.txt file told search engines to stay away.
Now it's been fixed. DNI spokesman Ross Feinstein told me, apologetically, a moment ago: "When we saw your story posted, I asked our developers to look into it...We certainly appreciate you bringing it to our attention. It's a public Web site. We want it to be indexed. We're not even sure how (the robots.txt file) got there."
The robots.txt file can't force search engines to ignore certain Web sites or sections of Web sites, but most indexing bots will abide by its requests. When dealing with government sites, that deference is a mistake, but more on this below.
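For readers who have never looked at one, here is a rough sketch of how a well-behaved crawler reads these files, using the robots.txt parser in Python's standard library. The two sample files, the "ExampleBot" name and the example.gov URLs are made up for illustration; I am not reproducing the actual dni.gov file here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file that asks every crawler to stay away from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical file that cordons off a single section and leaves the rest open.
BLOCK_ONE_SECTION = """\
User-agent: *
Disallow: /email-story/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.gov/reports/a.html"
    print(is_allowed(BLOCK_ALL, "ExampleBot", page))
    print(is_allowed(BLOCK_ONE_SECTION, "ExampleBot", page))
```

Under the first file, a compliant bot skips every page on the site. Under the second, it skips only the one directory. Either way the file is a request: nothing in it stops a crawler that chooses not to ask.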
By way of background, I wrote a blog post on August 24 pointing out the invisible dni.gov Web site (and a handful of other .gov and .mil sites). Then I wrote this morning about the White House's Web site blocking Iraq documents via robots.txt, and then lifting the ban after we spoke on the phone this week.
DNI spokesman Feinstein said that the robots.txt file had initially been fixed on Monday, but when the site was updated on Tuesday with a media advisory, the original, prohibitory version of robots.txt was restored. Now it's presumably permanently fixed.
Now, I'm the last person to suggest that using robots.txt to cordon off subsets of your Web site is somehow evil. At CNET News.com, we use it to tell search engines not to index our "e-mail story" pages, for instance, and on my own personal Web site I use it as well. Blocking misbehaving Web crawlers is important and necessary.
But why should a public federal Web site be entirely marked as off-limits to search engines? There's no good reason. I can think of two bad reasons: (a) avoiding the situation of posting a report that turned out to be embarrassing and was cached by Google and Archive.org and (b) letting the feds modify a file such as a transcript without anyone noticing. (The White House has quietly altered photo captions before, and I've documented how a transcript of a public meeting was surreptitiously deleted--and then restored.)
I don't know why DNI chose to be invisible in searches. Their explanation of a simple mistake, like the one the White House gave me earlier this week, is certainly plausible. But this is why, I'll say once again, we need a modest revision to the Robots Exclusion Protocol: Search engines should ignore robots.txt when a government agency is using it to keep public documents hidden from the public.