
White House expands use of search-blocking code

Whitehouse.gov's administrators silently triple the number of Web pages that the site forbids Google and other search engines from accessing. Is this a bad omen or much ado about nothing?

Chris Soghoian
Christopher Soghoian delves into the areas of security, privacy, technology policy and cyber-law. He is a student fellow at Harvard University's Berkman Center for Internet and Society, and is a PhD candidate at Indiana University's School of Informatics. His academic work and contact information can be found by visiting www.dubfire.net/chris/.

The White House has silently tripled the number of Web pages that it forbids Google and other search engines from accessing. Is this a bad omen or much ado about nothing?

Within hours of Barack Obama being sworn in as president, bloggers and tech journalists began to closely examine the new White House Web site for hidden indicators as to how he would shape future tech policy.

While I focused my efforts on the White House privacy policy, others looked to the new administration's robots.txt file, which lays out boundaries that search engines like Google should follow when scraping the site.
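
For the curious, this is roughly how the mechanism works: a well-behaved crawler fetches robots.txt from the root of a site and skips any path the file disallows. The sketch below uses Python's standard-library urllib.robotparser module; the page URL being tested is made up purely for illustration.

# Minimal sketch: how a polite crawler consults a site's robots.txt
# before fetching a page. Uses only the Python standard library.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.whitehouse.gov/robots.txt")
parser.read()  # downloads and parses the live file

# Illustrative page URL, not taken from the article.
page = "https://www.whitehouse.gov/blog/sample-post/"
if parser.can_fetch("*", page):
    print("Allowed to crawl:", page)
else:
    print("robots.txt disallows:", page)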

When the new Obama geek team posted its sparse robots.txt to the Web, tech pundits soon hailed it as a sign of the president's commitment to openness and transparency, and as proof that someone tech-savvy was finally running the show.

Blogger Jason Kottke hailed the move, writing that it was "a small and nerdy measure of the huge change in the executive branch of the U.S. government today." Another blogger, Ben Orenstein, compared the new Obama robots.txt file to the 2,400-line file used by the Bush White House, writing: "I think you've got a lovely little microcosm; one that points to a hopeful and open future."

The big fuss?

These digerati were excited by the fact that the new White House robots.txt file contained just two lines:

User-agent: *
Disallow: /includes/

Fast-forward one week, and the White House has silently started to expand its use of the robots.txt search engine-blocking mechanism. As of Friday morning, the file now contains the following text:

User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
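
To see what those three rules actually do, you can feed the file's text to a parser and test a few paths. Here is a rough sketch using Python's standard-library urllib.robotparser; the test URLs are invented for illustration.

from urllib.robotparser import RobotFileParser

# The three rules quoted above, fed to the parser as raw text.
robots_txt = """User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Illustrative URLs, not taken from the article.
for url in (
    "https://www.whitehouse.gov/blog/",
    "https://www.whitehouse.gov/search/?q=transparency",
    "https://www.whitehouse.gov/omb/search/results",
):
    allowed = parser.can_fetch("Googlebot", url)
    print("allowed" if allowed else "blocked", url)

Run it and only the first URL comes back allowed: the rules block the site's internal search pages and nothing else.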

While it would be accurate to say that the White House has, in one day, tripled the number of directories it excludes from Google's crawlers, it is also important to note that this is not a big deal. In fact, it doesn't matter at all.

For the most part, the Bush White House's use of robots.txt was totally legitimate, as Kevin Fox, an engineer at FriendFeed, told the folks at Google Blogoscoped:

This is a bit silly. The old robots.txt excludes internal search result pages and redundant text versions of HTML pages. This is exactly what robots.txt is for. Google's Webmaster Guidelines state "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."

It's understandable that the robots.txt of an 8-year-old site is longer than that of a 1-day-old site, and it's not as if '/secrets/top' or '/katrina/response/' were put in the robots file.

Fun as it may be, this is a nonstory.

Those bloggers drunk on hope who desperately wanted to see proof of Obama's commitment to his campaign promises of transparency and Google Government now face a difficult choice. They can accept that robots.txt files are not a set of digital tea leaves through which to read the new administration. Or, if robots.txt really does carry weight, they must find a way to explain a 200 percent increase in the number of directories blocked by Obama's Web team as anything but Cheney-esque secrecy.

Simply put, the robots.txt file was created and managed by engineers, not lawyers or policy makers. It is not the place to judge the president on tech policy issues.

The president's tech policy should instead be judged on real issues: how many former RIAA and MPAA lawyers will be given positions of power in the administration, who ends up working at the FTC and FCC, and who will be named the new cybersecurity czar.

As for the president's commitment to transparency, he has already violated his pledge to post all nonemergency bills on the Whitehouse.gov Web site for five days before signing them. The text of the Lilly Ledbetter Fair Pay Act of 2009, which was signed into law yesterday, was certainly not posted to Whitehouse.gov for anywhere near five days.

Obama's broken commitment to transparency remains advertised on the White House blog:

One significant addition to WhiteHouse.gov reflects a campaign promise from the president: we will publish all nonemergency legislation to the Web site for five days, and allow the public to review and comment before the president signs it.

It is by looking to these kinds of concrete issues that we can judge the president, not robots.txt.