Search Engines & SEO Blog
Controlling search-engine indexing
Johannes Beus
I am currently at SEMSEO in Hannover, giving a talk about which pages should be made accessible to search engines, the reasons for doing so, and how to use the available technology to make it happen. For those attending, and for those who could not make it to Hannover, here is a short summary.

Sadly, there is no generic answer to the question of which pages should be allowed into the search-engine index and which should be off-limits. Websites are too individual and diverse for generic tips to be particularly helpful here. In any case, you should start by thinking about meta-pages such as the legal notice, the privacy policy and the standard terms and conditions. You should also keep out the types of pages that produce duplicate content, internally or externally. Search-result pages (it has not been long since we had the issue of SERPs-in-SERPs) are also a hot candidate for exclusion. Everyone will have to examine their own web project critically to arrive at a sensible choice.

For the technical implementation there are only three basic options. Everyone who has dealt with search engines should be familiar with the first one, the “robots.txt”. It is a simple text file that sits in the root directory of a website and contains, in a simple format, prohibitions for search-engine crawlers. The advantage is that it is quick and easy to implement and everything can be managed centrally in one file. Sadly, it is also very rigid in the kinds of bans it can express, and from time to time a search engine fails to keep to the rules. The robots.txt works exceptionally well if you want to block whole directories, for example.

The second option is the robots-metatag in the HTML head of a page. Nearly all search-engine operators have agreed to support this method.
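To make the robots.txt approach more concrete, here is a minimal sketch that checks paths against a set of rules using Python's standard `urllib.robotparser` module. The rules and the directory names (`/search/`, `/print/`) are hypothetical and only serve as an illustration; which directories you block depends entirely on your own project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the blocked directories are made up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /print/
"""


def is_crawlable(path: str, user_agent: str = "*") -> bool:
    """Return True if the given path may be crawled under ROBOTS_TXT."""
    parser = RobotFileParser()
    # parse() takes the lines of the file, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/products/shoes", "/search/?q=shoes", "/print/article-1"):
        print(path, is_crawlable(path))
```

Because the rules match on path prefixes, a single `Disallow` line is enough to cover a whole directory, which is where this method is at its best.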
The nice thing about the robots-metatag is that, unlike with the robots.txt, pages are not just handled in a binary way (in/out): with a specification like “noindex, follow” you can keep a page out of the search index while its links are still followed and strengthen your internal linking. A drawback is that the implementation can become relatively complex, depending on the underlying system.

A variation on this option has been available for a few months now: you can send the directives of a robots-metatag in the HTTP header of a page. This is especially useful for file types that have no HTML head, such as PDFs, downloads, etc.

[update] I have finally gotten around to putting the presentation online (14 slides).
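As a rough sketch of the robots-metatag and its HTTP-header variant described above, the following Python helpers build both forms from the same directives. The function names are hypothetical. The header name `X-Robots-Tag` is the one commonly used for this purpose, but it is not named in the text above, so treat it as an assumption and check the documentation of the search engines you care about.

```python
def robots_directives(index: bool, follow: bool) -> str:
    """Build a directive string such as "noindex, follow"."""
    index_part = "index" if index else "noindex"
    follow_part = "follow" if follow else "nofollow"
    return f"{index_part}, {follow_part}"


def robots_meta_tag(index: bool, follow: bool) -> str:
    """Robots-metatag to place in the <head> of an HTML page."""
    return f'<meta name="robots" content="{robots_directives(index, follow)}">'


def robots_http_header(index: bool, follow: bool) -> tuple:
    """Same directives as an HTTP response header (assumed name: X-Robots-Tag),
    for files without an HTML head such as PDFs."""
    return ("X-Robots-Tag", robots_directives(index, follow))


if __name__ == "__main__":
    # Keep the page out of the index but let crawlers follow its links.
    print(robots_meta_tag(index=False, follow=True))
    print(robots_http_header(index=False, follow=True))
```

How the tag or header actually gets into the response depends on the underlying system, for example a template in a CMS or a rule in the web-server configuration.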