Search Engines & SEO Blog
Protect the XML sitemap
Johannes Beus
Ever since the sensible extension of the Sitemaps.org standard, which lets you state the path to your sitemap file in robots.txt so that you no longer have to register, authenticate and submit it separately with every search engine, the danger that scrapers and other assorted web scum help themselves to it has risen, too. This makes things extremely easy, especially for content thieves: the site's entire architecture is laid out in front of them and they do not have to laboriously crawl for it.

I have started to allow only the established search engines access to the sitemap files for most of our projects. This is admittedly not entirely fair, since it puts small search engines at a disadvantage, but considering the benefits I decided that it is worth it. The access control is implemented through cloaking: Google, Yahoo, Microsoft and Ask.com receive the sitemap, all other clients get an error message. Even though cloaking is usually frowned upon, it is used legitimately here, since the goal is not to deceive users or the search engines.

<?php

The PHP code first checks whether the client's user agent contains any indication that the bot belongs to one of the four big search engines. It then verifies that the bot is "genuine" through a DNS/reverse DNS check, a form of authentication that all large search engines now support. If both checks pass, the code delivers the sitemap file; otherwise it returns an error message. You can take a look at it on this domain, for example: sitemap.xml
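
The two checks described above can be sketched roughly as follows. This is not the original listing from this post, only a minimal sketch under assumptions: the function names (`crawlerSuffixes`, `endsWithAny`, `isVerifiedCrawler`), the user-agent tokens, the hostname suffixes, the 403 status and the file name `sitemap-real.xml` are all illustrative and should be checked against each search engine's current documentation. It also assumes the web server routes requests for sitemap.xml to this script, and it only handles IPv4, because `gethostbynamel()` returns IPv4 addresses only.

```php
<?php
// Hypothetical sketch: tokens and suffixes below are assumptions to verify.
// Maps a user-agent substring to the hostname suffixes a genuine bot
// is expected to resolve to.
const CRAWLER_HOSTS = [
    'googlebot' => ['.googlebot.com', '.google.com'],
    'slurp'     => ['.crawl.yahoo.net'],
    'msnbot'    => ['.search.msn.com'],
    'teoma'     => ['.ask.com'],
];

// Step 1: does the user agent hint at one of the big search engines?
function crawlerSuffixes(string $userAgent): array
{
    foreach (CRAWLER_HOSTS as $token => $suffixes) {
        if (stripos($userAgent, $token) !== false) {
            return $suffixes;
        }
    }
    return [];
}

function endsWithAny(string $host, array $suffixes): bool
{
    $host = strtolower($host);
    foreach ($suffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Step 2: reverse lookup of the IP, then forward lookup of the hostname.
// The resolvers are injectable so the logic can be tried without real DNS.
function isVerifiedCrawler(
    string $userAgent,
    string $ip,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $suffixes = crawlerSuffixes($userAgent);
    if ($suffixes === []) {
        return false;
    }

    // gethostbyaddr() returns the IP unchanged when the lookup fails.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip || !endsWithAny($host, $suffixes)) {
        return false;
    }

    // The hostname must resolve back to the very same IP.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}

// Only runs for real web requests, not on the command line.
if (isset($_SERVER['REMOTE_ADDR'])) {
    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
    if (isVerifiedCrawler($ua, $_SERVER['REMOTE_ADDR'])) {
        header('Content-Type: application/xml; charset=utf-8');
        readfile(__DIR__ . '/sitemap-real.xml'); // hypothetical file name
    } else {
        http_response_code(403); // assumed status; the post only says "error message"
        echo 'Forbidden';
    }
}
```

The leading dot in each suffix matters: it keeps a host such as `evilgooglebot.com` from passing, and the forward lookup keeps someone who merely controls the reverse DNS entry of their own IP from passing.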