Johannes Beus
While going through my log files today, I noticed that Google, just as Yahoo has been doing for a while, is apparently requesting nonexistent pages on purpose to see what a site returns as its 404 error page. I assume the goal is to detect sites that deliver their error page with a 200 status code instead of a 404, so that Google can deal with them accordingly. The log entries look like this:
66.249.65.145 - - [01/Jan/2007:15:09:36 +0100] "GET /mppwafdgqpulx.html HTTP/1.1" 404 216 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The requests always target randomly named files in the root directory. Each name is a string of lower-case letters followed by the file extension .html.
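If you want to check your own logs for these requests, something like the following Python sketch may help. It assumes the Apache "combined" log format shown in the sample line above. The function name find_probes and the minimum name length of 8 letters are my own choices for illustration, not anything Google has documented, so adjust them to your logs.

```python
import re

# Apache "combined" log format, matching the sample line above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Assumption: a probe is a root-level file whose name consists only of
# lower-case letters plus ".html". The minimum of 8 letters is a guess
# meant to skip short real names such as /index.html.
PROBE_PATH_RE = re.compile(r'^/[a-z]{8,}\.html$')


def find_probes(lines):
    """Return (path, status) for each Googlebot request that looks like a probe."""
    probes = []
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match is None:
            continue  # not a combined-format line
        if "Googlebot" not in match.group("agent"):
            continue
        if PROBE_PATH_RE.match(match.group("path")):
            probes.append((match.group("path"), int(match.group("status"))))
    return probes


sample = (
    '66.249.65.145 - - [01/Jan/2007:15:09:36 +0100] '
    '"GET /mppwafdgqpulx.html HTTP/1.1" 404 216 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
print(find_probes([sample]))  # prints [('/mppwafdgqpulx.html', 404)]
```

Any probe that comes back with a status other than 404 would suggest that the server answers requests for missing pages with the wrong code. Keep in mind that the pattern will also match real pages with long lower-case names, and that the user agent string can be faked, so treat the results as hints only.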