Google's Webmaster Central has become a very important resource for anyone who has a Web site, works on a Web site, or, like SEO practitioners, helps others with their Web sites.
Google continues to roll out new features and to improve existing ones, and it has just done a little of both by adding a Generate robots.txt function.
Google had previously added a robots.txt analyzer, which at this point is still the more useful of the two tools. For those who aren't aware, the robots exclusion protocol tells search engines how to interact with a Web site. There are a number of directives available, but the main purpose of the robots.txt file is to tell search engines which content a site owner doesn't want robots to crawl.
Why in the world would you not want search engines to crawl any of your content? You may have content that, for whatever reason, you don't want others to find through search results. Note, however, that this is not the same as secure information that requires authentication through a log-in.
Your site may have its own search function that creates "search results" for your site. Search engines generally do not want to include search results within their own search results, so this content may not be returned for searches on the engines anyway. Excluding it lets you focus the crawlers elsewhere for greater crawl efficiency.
Or you may have duplicate content issues that you could use robots.txt to filter out. This is especially common with a content management system (CMS) that creates a separate printer-friendly page.
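As a rough illustration of the two cases above, here is a minimal sketch that uses Python's standard urllib.robotparser module to check a small robots.txt file. The /search and /print/ paths, the example.com URLs, and the ExampleBot user agent are all assumptions made up for the example, so substitute whatever your own site or CMS uses. Note that this parser does simple prefix matching and may not behave exactly like any particular search engine's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths: /search for internal search results and /print/ for
# printer-friendly duplicates. Neither comes from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /print/
"""


def is_crawlable(robots_txt, url, agent="ExampleBot"):
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/search?q=robots",
        "https://www.example.com/print/article-42",
        "https://www.example.com/articles/article-42",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

With these rules, the internal search page and the printer-friendly copy are off limits, while the regular article stays crawlable.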
Regardless of your specific needs, having a robots.txt file can be important to a site. Rarely is there a site that can't benefit from disallowing at least some content. Even if you have nothing to disallow, you may want to take advantage of the auto-discovery feature for your XML sitemap. Finally, depending on your server log system or analytics package, not having a robots.txt file can be problematic if it inflates your "404 File Not Found" error reporting, which can happen because search engine spiders will request the robots.txt file automatically when they come to your site.
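To make the sitemap point concrete, the sketch below shows a robots.txt file that disallows nothing but still advertises an XML sitemap through a Sitemap line, and reads that line back with Python's standard urllib.robotparser module. The sitemap URL is a placeholder, and the site_maps() method assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# A file that blocks nothing but still points crawlers at the sitemap.
# The sitemap URL is a made-up placeholder.
SITEMAP_ONLY = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""


def read_sitemaps(robots_txt):
    """Return the sitemap URLs listed in a robots.txt body (may be empty)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(read_sitemaps(SITEMAP_ONLY))
```

Even a file this small also means that spiders requesting robots.txt get a real response instead of a 404.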
Right now, the robots.txt generator is rather basic, and I hope that Google will add more features to it going forward. Currently, site owners have to paste in URLs and URL patterns to build the file. It would be great if it provided a list of URLs or patterns extracted from a site to help automate the procedure for anyone not familiar with the protocol.
There is more information about the protocol, though a bit more on the technical side, at the robotstxt.org site, and you can find more engine-specific information on crawling and robots.txt from Google, Yahoo, MSN, and Ask.com.
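Until a tool offers something like that, a site owner could approximate it by hand. The following is only a sketch of the idea and not anything the Google generator does: it groups a list of URLs by their first path segment and reports the segments that recur, as candidates to review for Disallow rules. The function name, the min_count threshold, and the sample URLs are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def candidate_prefixes(urls, min_count=2):
    """Count URLs by first path segment and return the recurring ones.

    Returns (prefix, count) pairs, most frequent first. These are only
    candidates for review, not rules to paste in blindly.
    """
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        if segments:
            counts["/" + segments[0]] += 1
    return [(prefix, n) for prefix, n in counts.most_common() if n >= min_count]


# Made-up sample of URLs, e.g. taken from a crawl or a server log.
SAMPLE_URLS = [
    "https://www.example.com/print/article-1",
    "https://www.example.com/print/article-2",
    "https://www.example.com/search?q=robots",
    "https://www.example.com/search?q=sitemap",
    "https://www.example.com/search?q=seo",
    "https://www.example.com/about",
]

if __name__ == "__main__":
    for prefix, n in candidate_prefixes(SAMPLE_URLS):
        print(f"{prefix}  ({n} URLs)")
```

A prefix such as /print would also match a path like /printing, so each suggestion still needs a human to decide whether it belongs in the file.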
One important tip is that the following directive tells all spiders they are allowed to go anywhere:
    User-agent: *
    Disallow:
More importantly, the following directive, which I sometimes see when I think people really wanted the one above, does the opposite:
    User-agent: *
    Disallow: /
The latter tells the spiders to stay out of the entire site. These are clearly two very different results, so be sure you understand which does what.
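If you are ever unsure which is which, the difference between an empty Disallow value and a Disallow value of / can be checked before a file goes live. This is a minimal sketch using Python's standard urllib.robotparser module. The ExampleBot user agent and the example.com URL are placeholders, and a real search engine's parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value blocks nothing.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
# A single slash blocks the entire site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def can_crawl(robots_txt, url, agent="ExampleBot"):
    """Return True if the given robots.txt body lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/products/widget"
    print("empty Disallow:", can_crawl(ALLOW_ALL, url))
    print("Disallow: /   :", can_crawl(BLOCK_ALL, url))
```

One character separates the two files, which is why this mistake is so easy to make.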