Robots.txt deluxe – the Google-supported extensions

Johannes Beus
The job of the so-called robots.txt (a plain-text file named robots.txt that is placed in the domain's root directory) is to keep search engines, at least those that heed the file, away from certain areas of the website. Even though the standard version, which is supported by most search engines, already causes some confusion from time to time, Google has taught its parser a few extensions that often come in handy. First, Google supports the wildcard (“*”) not only in the User-agent line but also in paths. The second extension is the end-of-line anchor “$”. Below are a few possible uses, along with mistakes to avoid. Since these extensions are only supported by Google at the moment, you should always split the robots.txt into one section for Google and one for the remaining search engines:

# All other crawlers: one rule per directory
User-agent: *
Disallow: /blog/member
Disallow: /forum/member
Disallow: /upload/member

# Googlebot: a single wildcard rule covers all three
User-agent: Googlebot
Disallow: /*/member


It is important to start every Disallow rule with a slash (“/”). While most search engines will add it automatically if it is missing, its absence can still trip up some of them. Conventional search engines will ignore all pages whose URL path starts with the string listed in the Disallow rule. As the example shows, blocking same-named files or directories under different parent directories therefore requires a separate rule for each one, which can quickly become confusing. The wildcard noticeably reduces this complexity.
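
To make the difference between plain prefix matching and the wildcard concrete, here is a minimal Python sketch. It is not Google's actual parser; the helper names rule_to_regex and is_blocked are hypothetical, and the sketch assumes that “*” stands for any run of characters and that a rule always matches from the start of the URL path.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow path into a regex; '*' matches any run of characters."""
    # Escape the literal pieces so that '.' or '?' in a rule are not treated as regex syntax.
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts))

def is_blocked(path: str, rule: str) -> bool:
    # re.match anchors only at the start, so the rule behaves like a prefix match.
    return rule_to_regex(rule).match(path) is not None

paths = ["/blog/member", "/forum/member/42", "/upload/member", "/blog/post"]

# One wildcard rule covers what took three separate prefix rules above.
blocked = [p for p in paths if is_blocked(p, "/*/member")]
print(blocked)
```

A rule without a wildcard, such as /blog/member, passes through the same function unchanged and behaves like the conventional prefix match.
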
With the end-of-line anchor we can bar entire file types from being crawled with relative ease. The following rules, for example, block all .txt and .pdf files, which are potential causes of duplicate content:

User-agent: Googlebot
Disallow: /*.pdf$
Disallow: /*.txt$


In this case it is important to add the end-of-line anchor, because otherwise the search engine would also ignore any URL that merely contains “.pdf” somewhere in its path (/nice-downloads-with-.pdf-files.html, for example).
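
The effect of the anchor can be checked with a small Python sketch. This is an illustration under stated assumptions, not Google's implementation: the function name google_rule_matches is hypothetical, “*” is assumed to match any run of characters, and a trailing “$” is assumed to require that the URL path ends exactly there.

```python
import re

def google_rule_matches(path: str, rule: str) -> bool:
    """Return True if the URL path would be covered by the Disallow rule."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # '*' becomes '.*'; every other character is matched literally.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        # With '$' the whole path has to match, not just a prefix of it.
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None

html_page = "/nice-downloads-with-.pdf-files.html"
pdf_file = "/docs/manual.pdf"

print(google_rule_matches(html_page, "/*.pdf"))   # unanchored rule
print(google_rule_matches(html_page, "/*.pdf$"))  # anchored rule
print(google_rule_matches(pdf_file, "/*.pdf$"))
```

Under these assumptions the unanchored rule catches the HTML page as well, while the anchored rule only catches paths that really end in .pdf.
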
Johannes Beus - on Tue (06/05/2007) at 07:30 AM
