
Recovery.gov blocked search engine tracking

Update: After Google seemingly ignored the restrictive search engine-blocking code built into the Obama administration's new stimulus-related site, the robots.txt code has been removed.

Chris Soghoian
Christopher Soghoian delves into the areas of security, privacy, technology policy and cyber-law. He is a student fellow at Harvard University's Berkman Center for Internet and Society, and is a PhD candidate at Indiana University's School of Informatics. His academic work and contact information can be found by visiting www.dubfire.net/chris/.

Recovery.gov

Update: As of 8 a.m. PST, within three hours of this story first going live, it appears that President Obama's Web team has (silently) pulled the robots.txt file from the Recovery.gov Web site. The site is now open to Web crawlers of all kinds.

The Obama administration has apparently opted to forbid Google and other search engines from indexing any content on the newly launched Recovery.gov.

Is this even more evidence that the administration's much-publicized commitment to transparency is simply hype?

Recovery.gov, which went live Tuesday, is set to act as a central clearinghouse for information related to the newly signed American Recovery and Reinvestment Act. The legislation is designed to stimulate the flagging U.S. economy.

In a video message, available on YouTube and embedded into the new site, President Obama states that the "size and scale of (the stimulus) plan demands unprecedented efforts to root out waste, inefficiency, and unnecessary spending. Recovery.gov will be the online portal for these efforts." He adds that the new site will be used to publish information on how the stimulus funds will be spent in a "timely, targeted, and transparent manner."

Although the site is advertised as proof of the president's commitment to transparency, its technical design seems to betray that spirit. Most importantly, the site currently blocks all requests by search engines, which would ordinarily download and index each page to make the information more accessible to the Web-searching public.

The site's robots.txt file has just a few lines of text:

# Deny all search bots, web spiders
User-agent: *
Disallow: /
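
By contrast, a robots.txt that welcomes crawlers either omits the Disallow rule or leaves its value blank. A minimal, illustrative example (not taken from the site) would look like this:

# Allow all search bots, web spiders
User-agent: *
Disallow: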

Although the White House Web team did not immediately respond to a request for comment, the single-line comment at the top of the file indicates that the blocking of search engines is no accident but rather a statement of policy.

Many sites use a robots.txt file to communicate, in machine-readable terms, the Web pages that they do and don't wish to be indexed by search engines. While the files don't carry much, if any, legal weight, most search engines act as good Internet citizens and honor the requests.
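
For readers curious how that honoring works in practice, here is a short, illustrative Python sketch of the check a well-behaved crawler performs before fetching a page. It uses only the standard library's urllib.robotparser, and the Recovery.gov URLs appear purely as an example:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt, as a polite crawler would.
parser = RobotFileParser()
parser.set_url("http://www.recovery.gov/robots.txt")
parser.read()

# With "User-agent: *" and "Disallow: /", this returns False for every page,
# telling the crawler to skip the entire site.
print(parser.can_fetch("Googlebot", "http://www.recovery.gov/"))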

Luckily for the millions of Americans who might wish to find out how their money is going to be spent, it seems that Google has opted to ignore the administration's restrictive robots.txt on the stimulus-related site. It is unclear whether this is due to an error or to a manual override by someone at Google, but a quick search turns up more than 60 Web pages on Recovery.gov that have been indexed by the search engine's Web crawlers in just the past three days.

Also, the stimulus bill requires that the site be run by the new Recovery Accountability and Transparency Board, but it seems to currently be under the control of the White House Web team--the same folks who revamped Whitehouse.gov and who later expanded that site's robots.txt search engine-blocking code after bloggers initially praised it for its openness.

It is this blogger's hope that, with a bit of gentle prodding by members of the pro-transparency community, Recovery.gov's administrators will correct the "unintentional oversight" that was made in launching the site with such a restrictive robots.txt file.