Controlling search engine indexing

Johannes Beus
I am currently at SEMSEO in Hannover, giving a talk on which pages should be made accessible to search engines, why, and how to use the available technology to make that happen. For those attending, and for those who could not make it to Hannover, here is a brief summary of the contents.

A long, long time ago Google added every page to the index that the Googlebot could get its hands on. On the one hand, the web was still fairly manageable and the content on offer was usually good enough to be indexed. On the other hand, the quality of a search engine was, back then, mostly measured by the number of pages it had indexed: when Yahoo published a slightly larger figure, you could be sure that Google would quickly follow suit and fire up its internal number generator. This approach became a problem when the number of pages grew faster than Google could put up new servers. Scripting languages combined with the then newly released Amazon API, CSV product lists like those of Zanox and other possibilities did the rest. Google then decided to limit the maximum number of indexable pages per domain and to separate the important from the unimportant by introducing the supplemental index.

The problem is that machines make mistakes. If Google were to decide that it would rather index the PDF print version of a product description than the actual HTML page with the order form, the online shop would suddenly have a serious problem. One possible solution is to forbid the Googlebot from indexing the PDFs, and thereby relieve it of the decision about which page to take into the index.

Sadly, there is no generic answer to the question of which pages should be allowed into the search engine index and which should be off-limits. Websites are too specialized and diverse for generic tips to be particularly helpful here. In any case, you should start by thinking about meta pages such as the legal notice, the privacy policy and the standard terms and conditions. In addition, you should keep out the types of pages that produce duplicate content, internally or externally. Search results pages are also a hot candidate for exclusion (it has not been long since we had the issue of SERPs in SERPs). Everyone will have to take a critical look at their own web project to arrive at a sensible choice.

For the technical implementation there are only three basic options. Anyone who has dealt with search engines should be familiar with the first one, which goes through the “robots.txt”. This is a simple text file that sits in the root directory of a website and contains, in a simple format, prohibitions for search engine crawlers. The advantage is that it is quick and easy to implement and that everything can be managed centrally in one file. Sadly, it is also very rigid in the kinds of bans it can express, and now and then a search engine fails to follow the rules. The robots.txt works exceptionally well if you want to block entire directories, for example.
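To make the directory-blocking case concrete, here is a minimal sketch in Python. The robots.txt content and the paths in it (`/search/`, `/print/`) are made up for illustration and do not come from a real site; the check uses the standard library's `urllib.robotparser`, which only approximates how a well-behaved crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the blocked directories are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /print/
"""


def is_allowed(url_path: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the sample robots.txt lets user_agent fetch url_path."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    for path in ("/products/shoe.html", "/print/shoe.pdf", "/search/results"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

The rigidity mentioned above shows here as well: each rule is a plain path prefix, so anything finer than "this directory, yes or no" is hard to express.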

The second option is the robots meta tag in the HTML head of a page. Nearly all search engine operators have agreed to support this method. The nice thing about this solution is that, unlike with the robots.txt, you are not limited to a binary in/out decision: with a directive like “noindex, follow” you can keep a page out of the search index while its links are still followed and strengthen your internal linking. A drawback is that the implementation can become relatively complex, depending on the underlying system. A variation of this option has been available for a few months now: you can send the directives of a robots meta tag in the HTTP header of a page. This is especially useful for file types that have no HTML head, such as PDFs, downloads, etc.
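As a rough sketch of both variants, the Python helpers below build the meta tag for HTML pages and the equivalent HTTP header for files without an HTML head. The function names are hypothetical and chosen only for this example; `X-Robots-Tag` is the header name Google documents for this purpose, which the text above does not name explicitly.

```python
def robots_directive(index: bool, follow: bool) -> str:
    """Build the directive string, e.g. 'noindex, follow'."""
    index_part = "index" if index else "noindex"
    follow_part = "follow" if follow else "nofollow"
    return f"{index_part}, {follow_part}"


def robots_meta_tag(index: bool, follow: bool) -> str:
    """Meta tag to place inside the HTML head of a page."""
    return f'<meta name="robots" content="{robots_directive(index, follow)}">'


def robots_http_header(index: bool, follow: bool) -> tuple[str, str]:
    """Header name and value for responses without an HTML head (PDFs etc.)."""
    return ("X-Robots-Tag", robots_directive(index, follow))


if __name__ == "__main__":
    # Keep a page out of the index but let its links be followed.
    print(robots_meta_tag(index=False, follow=True))
    # Same directive for a PDF download, sent as an HTTP response header.
    name, value = robots_http_header(index=False, follow=True)
    print(f"{name}: {value}")
```

How the tag or header actually gets into the response depends on the CMS, shop system or web server in use, which is where the complexity mentioned above tends to come from.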

[update] I have finally gotten around to putting the presentation online.
Johannes Beus - on Fri (04/25/2008) at 10:30 AM
