Johannes Beus

In his blog, Adam Doppelt compiled some interesting data comparing the crawling behavior of Google's and Yahoo's bots. To do so, he analyzed the log files of urbanspoon.com, a restaurant guide for the USA. While Googlebot crawls the site at a nicely consistent rate, Yahoo causes extreme peaks. Another surprising finding is that Google requested only 1.4 percent of all pages twice, compared with 38 percent for Yahoo. I fully agree with Doppelt's conclusion that Yahoo still has a long way to go to catch up with the market leader.

When I scan the log files of some of our larger projects for the “behavior” of Yahoo's Slurp, I find plenty of oddities. What irritates me most is that Yahoo generally drops trailing slashes, regardless of how the site is linked. Instead of requesting /directory/subdirectory/ it first requests /directory/subdirectory. If the server configuration and all mod_rewrite rules are correct, the server answers with a 301 redirect to the correct directory URL; otherwise the result is duplicate content or an error page. Crawler behavior is one of the few areas in which Google's 90% market dominance can be seen as beneficial: you do not really have to worry much about the flawed programming of the other search engines' crawlers.
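
To make the trailing-slash case concrete, here is a minimal sketch in Python of what a correctly configured server does with such a request. It is not how Apache or mod_rewrite are implemented internally; the function name `resolve` and the set `DIRECTORY_PATHS` are made up for illustration, and the sketch assumes that a permanent redirect is the desired answer.

```python
from typing import Optional, Tuple

# Hypothetical list of directory URLs that exist on the site (assumption).
DIRECTORY_PATHS = {"/directory/", "/directory/subdirectory/"}


def resolve(path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, redirect target) for a requested path."""
    if path in DIRECTORY_PATHS:
        # Canonical URL with trailing slash: serve the page directly.
        return 200, None
    if not path.endswith("/") and path + "/" in DIRECTORY_PATHS:
        # The slash was dropped: point the crawler to the canonical URL
        # with a permanent redirect instead of serving a second copy.
        return 301, path + "/"
    # Anything else is unknown to this sketch.
    return 404, None


print(resolve("/directory/subdirectory"))
print(resolve("/directory/subdirectory/"))
```

If the second branch is missing, the request without the slash either returns the same content under a second URL (duplicate content) or falls through to the error page, which are exactly the two failure cases described above.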
A nice find on the side: in its article on Internet jargon, Spiegel Online singled out the following sentence as typical of German Internet jargon: “The hijacking problem could be easily avoided with a 301 header redirect.”