
Analyze, create robots.txt files in Google

Google adds a feature to Webmaster Central that helps webmasters create a robots.txt file for their sites.

Brian R. Brown

Brian Brown is a Consultant & Natural Search Marketing Strategist for Netconcepts. Brian assists leading retail clients with their natural search needs, analyzing their sites for creative optimization and link building opportunities to maximize the value of their natural search program. Prior to entering the online world, Brian served in various sales, product management, and new product development roles within divisions of Newell Rubbermaid. He made the dramatic shift from consumer packaged goods with the launch of his own web presence development company, where he served diverse clients, from small startups to large corporate divisions. He brings not only strong SEO skills to client engagements, but also a technical background in standards-based web design, including table-less XHTML & CSS.


March 31, 2008 9:34 a.m. PT


Google's Webmaster Central has become a very important resource for anyone who has a Web site, works on a Web site, or, like SEO practitioners, helps others with their Web sites.

Google continues to roll out new features and improve existing ones, and it has now done a bit of both with the addition of its Generate robots.txt function.

Google had previously added a robots.txt analyzer, which at this point is still the more useful of the two tools. For those who aren't aware, the robots exclusion protocol tells search engines how to interact with a Web site. There are a number of directives available, but the main purpose of the robots.txt file is to tell search engines which content a site owner doesn't want their robots to crawl.

Why in the world would you not want search engines to crawl any of your content? You may have content that, for whatever reason, you don't want others to find through search results. Note, however, that this is not the same as secure information that requires authentication through a log-in.

Your site may have its own search function that creates "search results" pages. Search engines generally do not want to include search results within their own search results, so this content may not be returned for searches on the engines anyway. Keeping crawlers out of it lets them spend their time on the rest of your site.

Or you may have duplicate content issues that you could use robots.txt to filter out. This is especially common with a content management system (CMS) that creates a separate printer-friendly page.
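As a rough illustration of the two cases above, the sketch below uses Python's standard-library robots.txt parser to check which URLs a rule set would block. The paths /search and /print/ and the example.com URLs are assumptions made up for this example; your own site or CMS will almost certainly use different ones. Python's parser does simple prefix matching, so treat the results as an approximation of how any particular engine will behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths: /search for on-site search results, /print/ for
# printer-friendly duplicates. Substitute whatever your site actually uses.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /print/
"""


def is_allowed(robots_txt, url, agent="*"):
    """Return True if `agent` may crawl `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/search?q=shoes",
        "https://www.example.com/print/article-1",
        "https://www.example.com/articles/article-1",
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```

Only the last of the three sample URLs falls outside both disallowed prefixes, so it is the only one a compliant crawler would fetch.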

Regardless of your specific needs, having a robots.txt file can be important to a site. Rarely is there a site that can't benefit from disallowing at least some content. Even if you have nothing to disallow, you may want to take advantage of the auto-discovery feature for your XML sitemap. Finally, depending on your server log system or analytics package, not having a robots.txt file can be problematic if it inflates your "404 File Not Found" error reporting, which can happen because search engine spiders will request the robots.txt file automatically when they come to your site.
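To make the sitemap auto-discovery point concrete, here is a minimal sketch of a robots.txt file that blocks nothing but still advertises an XML sitemap, along with a check that the Sitemap line can be read back. The sitemap URL is a placeholder, and the site_maps() method assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Placeholder sitemap URL; nothing is disallowed in this file.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""


def sitemap_urls(robots_txt):
    """Return the list of Sitemap URLs declared in the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(sitemap_urls(ROBOTS_TXT))
```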

Right now, the robots.txt generator is rather basic and I hope that Google will add more features to it going forward. Currently, site owners have to paste in URLs and URL patterns to build the file. It would be great if it would provide a list of URLs or patterns extracted from a site to help automate the procedure for anyone not familiar with the protocol.
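For readers who would rather script the file than paste patterns into a form, a basic generator is only a few lines. The sketch below is not how Google's tool works; it is simply one way to turn a list of paths into a robots.txt file, and the function name and sample paths are invented for illustration.

```python
def build_robots_txt(disallow_paths, user_agent="*", sitemap_url=None):
    """Build robots.txt text from a list of path prefixes to disallow."""
    lines = [f"User-agent: {user_agent}"]
    if disallow_paths:
        lines.extend(f"Disallow: {path}" for path in disallow_paths)
    else:
        # An empty Disallow value means nothing is blocked.
        lines.append("Disallow:")
    if sitemap_url:
        lines.append("")
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical paths for on-site search and printer-friendly pages.
    print(build_robots_txt(["/search", "/print/"]))
```

Whatever produces the file, it is worth running the result through an analyzer before uploading it to the root of your site.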

There is more information about the protocol, though a bit more on the technical side, at the robotstxt.org site, and you can find more engine-specific information on crawling and robots.txt from Google, Yahoo, MSN, and Ask.com.

One important tip is that the following directive tells all spiders they are allowed to go anywhere:

User-agent: *
Disallow:

More important is the following directive, which I sometimes see when I think people really wanted the one above:

User-agent: *
Disallow: /

The latter tells the spiders to stay out of the entire site. These are clearly two very different results, so be sure you understand which does what.
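If you are ever unsure which of the two you have written, a quick check with Python's standard-library parser makes the difference visible. This is only a sketch: the example.com URL is a placeholder, and the parser reflects Python's reading of the protocol, not any specific search engine's.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # a single slash: everything is blocked


def can_crawl(robots_txt, url, agent="*"):
    """Return True if `agent` may crawl `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/any/page.html"  # placeholder URL
    print("Disallow:   ->", can_crawl(ALLOW_ALL, url))
    print("Disallow: / ->", can_crawl(BLOCK_ALL, url))
```

One character separates a fully open site from a fully closed one, which is why this mistake is so easy to make.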