Search Engines & SEO Blog
How do search engines deal with XML sitemaps?
Johannes Beus
Ever since Google, Yahoo and Microsoft agreed on the XML sitemap format in 2006, there have been regular discussions about the (correct) use of this option. I would like to summarize a recent publication by a current Google employee because it could bring some clarity to the subject. In “Sitemaps: Above and Beyond the Crawl of Duty”, Narayanan Shivakumar and Uri Schonfeld list some considerations that I think are used in this or a similar form at Google.

The publication states that by the end of 2008, more than 35 million domains with several billion URLs were using sitemaps. The authors introduce different methods of organizing sitemaps by showing three domains that use different strategies: Amazon.com generates a new sitemap every day that includes new and changed URLs. CNN has sitemaps for URLs that were changed today, this week and this month, and updates those three regularly. Pubmed, a site for medical publications, has no real system but a lot of sitemaps that are combined in an index sitemap which is updated daily.

If we look at the crawling process, search engine crawlers encounter two major problems. The first problem is that they have to cover as many pages of a domain as possible. If these pages are hidden behind forms or Ajax, this can be quite tricky. To check the efficiency of sitemaps for this problem, the authors use two values: a) how many pages are covered and b) how many “meaningful” pages of the domain are covered through the sitemap. One method of checking the latter is to compare how much of a domain's PageRank is covered by the URLs in the sitemap. This second value is especially interesting since it makes classical duplicate content visible, that is, content that can be reached through more than one URL and which can force crawlers to fetch all variants while trying to figure out the “correct” version.
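To make the PageRank comparison more concrete, here is a minimal sketch in Python. It is not taken from the paper: the function name `pagerank_coverage`, the example.com URLs and the scores are all made up for illustration, and the real evaluation at Google surely looks different.

```python
def pagerank_coverage(pagerank_by_url, sitemap_urls):
    """Share of a domain's total PageRank carried by URLs listed in the sitemap."""
    total = sum(pagerank_by_url.values())
    if total == 0:
        return 0.0
    listed = set(sitemap_urls)
    covered = sum(score for url, score in pagerank_by_url.items() if url in listed)
    return covered / total


# Hypothetical scores for a tiny domain; the two session URLs are
# duplicates of the product page and are not listed in the sitemap.
scores = {
    "https://example.com/": 50,
    "https://example.com/product": 30,
    "https://example.com/product?sessionid=1": 10,
    "https://example.com/product?sessionid=2": 10,
}
sitemap = ["https://example.com/", "https://example.com/product"]

coverage = pagerank_coverage(scores, sitemap)
url_share = len(set(sitemap) & set(scores)) / len(scores)
print(f"URLs listed: {url_share:.0%}, PageRank covered: {coverage:.0%}")
```

In this made-up example the sitemap lists only half of the known URLs but covers 80% of the PageRank, because the URLs it leaves out are low-value duplicates.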
For the Pubmed site, they came to the conclusion that the normal crawling process had an efficiency of 63%, whereas crawling through the sitemap had an efficiency of 99%. Further evaluations showed similar results, even though not quite so extreme.

The second problem of search engine crawlers is that the index needs to be kept fresh. This means that new pages as well as updated ones need to be recognized and crawled quickly. For this, the authors compare the number of pages that were found by the “normal” web crawler with the number of pages that were found through sitemaps. In a test of more than 5 billion URLs which could be reached both ways, they found that 78% of the URLs were first discovered by the sitemap method. A comparison between Pubmed and CNN.com shows that the share of documents discovered by the normal web crawler is higher on smaller (in number of pages) and more important sites, because the robot crawls these sites much more frequently than it does archives.
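The "first discovered by" comparison can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `sitemap_first_share`, the data layout (one pair of first-seen times per URL) and the sample values are hypothetical and do not come from the paper.

```python
def sitemap_first_share(first_seen):
    """Share of URLs that the sitemap reported before the normal crawl found them.

    first_seen maps a URL to (crawl_time, sitemap_time). Only URLs seen
    both ways are counted; a missing time is given as None.
    """
    both = [(crawl, sitemap) for crawl, sitemap in first_seen.values()
            if crawl is not None and sitemap is not None]
    if not both:
        return 0.0
    sitemap_first = sum(1 for crawl, sitemap in both if sitemap < crawl)
    return sitemap_first / len(both)


# Hypothetical first-seen times in hours since the start of the test.
observations = {
    "https://example.com/news/1": (12, 1),
    "https://example.com/news/2": (30, 2),
    "https://example.com/archive/1": (400, 5),
    "https://example.com/": (1, 3),          # the crawler was faster here
    "https://example.com/hidden": (None, 4), # never found by the normal crawl
}

share = sitemap_first_share(observations)
print(f"Seen by the sitemap first: {share:.0%}")
```

Here three of the four URLs that were reachable both ways were seen through the sitemap first, while the frequently crawled homepage was picked up by the normal crawler first, which matches the pattern described for smaller and more important sites.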
They finish the paper by telling search engine operators not to trust the information in sitemaps (doh) and by showing how to weigh URLs from the normal web crawl against URLs from sitemaps.