The user agent is freely configurable for HTTP(S) access, which means there aren’t just “real” Googlebots out there on the Internet, but also third parties who hope to benefit from calling their crawlers Googlebot.
In the past, the only way to determine the authenticity of a Googlebot access was a reverse DNS lookup of the accessing IP address, followed by a forward DNS lookup of the resulting hostname. Here is a current example from our log files:
126.96.36.199 [10/Nov/2021:10:59:29 +0100] "GET /news/ HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
In order to determine whether this access was made by a real Googlebot, you must first determine the so-called reverse DNS entry for the accessing IP address:
% host 188.8.131.52
184.108.40.206.in-addr.arpa domain name pointer crawl-66-249-66-67.googlebot.com.
This hostname is then resolved again to obtain its IP address:
% host crawl-66-249-66-67.googlebot.com
crawl-66-249-66-67.googlebot.com has address 220.127.116.11
If you end up with the same IP address (as in this example), the access is authentic. In this case it was really Google and not someone just pretending to be Google. The opposite case is just as easy to spot, as this log entry shows:
18.104.22.168 [10/Nov/2021:11:00:42 +0100] "GET /ask-sistrix/ HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The same user agent as before, but a different result with the reverse lookup of the IP:
% host 22.214.171.124
Host 126.96.36.199.in-addr.arpa not found: 2(SERVFAIL)
The reverse lookup returns no DNS entry for this IP address, and if you do a little more research, it becomes clear that the Russian provider that uses this IP address is not quite as reputable as Google. A clear case of a fake Googlebot.
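The two lookups described above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the function name `is_real_googlebot`, the `GOOGLE_SUFFIXES` tuple and the explicit check that the hostname ends in `.googlebot.com` are assumptions added for the example (the suffix is taken from the hostname shown above). The resolver functions are passed in as parameters so the logic can be tried out without live DNS queries.

```python
import socket

# Hostname suffix a genuine Googlebot is assumed to resolve to, taken from
# the example above. Extend this tuple if you trust further domains.
GOOGLE_SUFFIXES = (".googlebot.com",)


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True if ip passes the reverse-then-forward DNS check.

    Note: socket.gethostbyname_ex only returns IPv4 addresses, so this
    sketch covers IPv4 access only.
    """
    # Step 1: reverse lookup (IP address -> hostname).
    try:
        hostname = reverse(ip)[0]
    except OSError:
        # No reverse entry or a failed lookup: treat as not authentic.
        return False

    # Step 2: the hostname must belong to an expected Google domain.
    if not hostname.rstrip(".").lower().endswith(GOOGLE_SUFFIXES):
        return False

    # Step 3: forward lookup (hostname -> list of IP addresses).
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False

    # Authentic only if the forward lookup leads back to the same IP.
    return ip in addresses
```

Calling `is_real_googlebot(ip)` with an address from a log line performs two live DNS queries, so for larger log files it is probably worth caching the result per IP address.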
To help solve the problem of identifying fake Googlebots, Google is now providing a list of legitimate IP addresses. In a JSON file, Google lists all the IPs that the Googlebot is currently using.
This makes verification much easier: you can fetch and store this list regularly and, when a supposed Googlebot accesses your site, quickly check whether the IP address of the access matches the list. Thanks Google!
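A check against such a stored list might look like the following sketch. It assumes the JSON file contains a `prefixes` array whose entries carry either an `ipv4Prefix` or an `ipv6Prefix` key in CIDR notation. That structure is an assumption here, so please verify it against the file you actually download. The sample data uses reserved documentation ranges as placeholders, not real Googlebot addresses, and the function names are hypothetical.

```python
import ipaddress
import json

# Placeholder data in the assumed file format. The prefixes are reserved
# documentation ranges, NOT real Googlebot addresses.
SAMPLE_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(json_text):
    """Parse the JSON text into a list of ipaddress network objects."""
    data = json.loads(json_text)
    networks = []
    for entry in data.get("prefixes", []):
        # Each entry is assumed to hold exactly one of the two keys.
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def ip_in_list(ip, networks):
    """Return True if ip falls inside any of the listed networks."""
    address = ipaddress.ip_address(ip)
    # Comparing an IPv4 address with an IPv6 network (or vice versa)
    # simply yields False, so mixed lists are safe to scan.
    return any(address in network for network in networks)
```

In practice you would replace `SAMPLE_JSON` with the contents of the regularly refreshed file and call `ip_in_list` with the IP address from each log line that claims to be Googlebot.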