Who measures much, measures much muck.

Diligent readers of this blog will have noticed that I like to give bytes wandering aimlessly around the Internet a home on my hard drive and then try to draw reasonably meaningful conclusions from them. When Seomoz announced a few days ago that they had a link index of more than 30 billion pages, it sounded great at first: Yahoo and Google sit on their data or publish only rubbish, and Microsoft has had the good grace not to attempt publication at all, so another source was very welcome. One thing did bother me, though: the people at Seomoz, who are usually very open, did not want to divulge the source of their data. But since crawling 30 billion pages cannot go unnoticed, I set out on the crawler's trail.

The basic idea: the crawler must have visited every page that the Linkscape report cites as a link source, so it has to show up in the web server logfiles of those pages. I compiled reports for some of my own domains that have (very) few incoming links, all of which I set myself from sites whose logfiles I can access. I then compared the logs of the linking sites over the last few months, looking for user agents that appear in all of them. In every comparison, the one left over was the “Dotbot” from Seattle-based dotnetdotcom.org.
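The comparison described above boils down to a set intersection over the user agents seen in each linking site's log. Here is a minimal sketch in Python, assuming the logs are in the common "combined" format (user agent as the last quoted field). The function names and the sample log lines are made up for illustration and are not taken from my actual logfiles.

```python
import re

# Assumption: "combined" log format, where each line ends with
# "referer" "user-agent". Escaped quotes inside fields are not handled.
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def user_agents(log_lines):
    """Collect the distinct user-agent strings found in one logfile."""
    agents = set()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            agents.add(match.group(1))
    return agents


def common_user_agents(logs_per_site):
    """Return the user agents that appear in every site's log."""
    agent_sets = [user_agents(lines) for lines in logs_per_site]
    if not agent_sets:
        return set()
    return set.intersection(*agent_sets)


# Hypothetical sample data: two linking sites, three invented user agents.
site_a = [
    '1.2.3.4 - - [01/Oct/2008:10:00:00 +0200] "GET /a HTTP/1.1" 200 512 "-" "crawler-a/1.0"',
    '1.2.3.5 - - [01/Oct/2008:10:05:00 +0200] "GET /a HTTP/1.1" 200 512 "-" "browser-x"',
]
site_b = [
    '5.6.7.8 - - [02/Oct/2008:11:00:00 +0200] "GET /b HTTP/1.1" 200 256 "-" "crawler-a/1.0"',
    '5.6.7.9 - - [02/Oct/2008:11:30:00 +0200] "GET /b HTTP/1.1" 200 256 "-" "crawler-b/2.0"',
]

candidates = common_user_agents([site_a, site_b])
```

With real logs the intersection will still contain the big search engine crawlers, so those have to be ruled out by hand. The fewer incoming links a domain has, the shorter the list of remaining candidates.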

A lively little fellow with a large appetite, and one that (to get back to the title of this post) had caught my attention before: just as the Yahoo crawler did a while ago, the Dotbot leaves off the trailing slash after directory names. It would request this blog from the server as “/news” instead of “/news/”. Usually this is not a problem, because for directories that really exist the web server notices the error and sends the crawler to the correct URL with a 301 redirect. For poorly programmed dynamic websites, though, it can become a problem. It is also the reason why, in the “Dotbot” evaluation of the HTTP status codes it found, a whopping 15% of URLs returned redirects.
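To check the effect on your own server, you can count how many of a crawler's requests ended in a redirect and how many of those redirects did nothing more than append the missing slash. The following is a rough sketch in Python. Both helper names are invented for this example, and it assumes you have already extracted the status codes, request paths and `Location` targets from your logs or responses.

```python
from urllib.parse import urlsplit


def adds_only_trailing_slash(requested_path, location):
    """True if the redirect target is the requested path plus a final '/'."""
    target_path = urlsplit(location).path
    return target_path == requested_path + "/"


def redirect_share(status_codes):
    """Fraction of responses with a 3xx status code (0.0 for empty input)."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    redirects = sum(1 for code in codes if 300 <= code < 400)
    return redirects / len(codes)


# Hypothetical example: four requests, one of them answered with a 301.
share = redirect_share([200, 200, 301, 404])
slash_fix = adds_only_trailing_slash("/news", "http://example.com/news/")
```

A high redirect share combined with many slash-only redirects points to a crawler that drops the trailing slash rather than to a problem with the site itself.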
Johannes Beus - on Wed (10/15/2008) at 13:50
