Johannes Beus
One current problem that has apparently not yet received enough publicity for the large search engines to act on it concerns the taking-over of content, much like the 302-hijack problem. In a number of countries, China or the United Arab Emirates for example, there is no free Internet access: all traffic that crosses the border is routed through state-owned routers and censored. The Internet would not be what it is if there were no technical way of evading this forced censorship. Besides elaborate options such as the anonymity service “Tor” or a foreign VPN gateway, a technically much simpler method has prevailed. Small scripts, mostly written in PHP or Perl, can be installed on any webspace account that supports one of these scripting languages and then offer a simple, non-transparent proxy service. The user enters the address of the site he wishes to visit, and the server hosting the script fetches that site and displays it. To see why this is a problem from the search engines' point of view, consider the typical procedure when the Google crawler fetches a site:

The Googlebot accesses the site directly, receives the content and adds it to its index. Since these proxy scripts run on a normal webspace account, they are also reachable by the Googlebot through its normal crawling. In the following case the homepage would be accessible through www.example.com but also through www.proxy.com/proxy/www.example.com, for example.

Since the proxy mirrors the content of the homepage one-to-one and, importantly, delivers it under its own URL, we get the well-known problem of duplicate content: Google sees the same content more than once and has to decide which copy is the original, to be added to the index, and which copies are the duplicates. Google has made considerable progress in this field since it introduced a new architecture, internally called “Big-Daddy”, but the detection is by no means perfect yet. It is not uncommon for Google to treat the proxy as the original, which leads to the actual homepage being dropped from the index and the proxy being added in its place. The effects are not limited to the single page the proxy is hijacking: imagine your website losing its main page, which is usually also its main link hub.
The solution that is usually tried first is to block the particular proxy. Either the user agent or the IP address of the proxy is blocked on the server that hosts the homepage, so that the proxy can no longer fetch the site. Sadly, the number of these web proxies has by now become nearly impossible to keep track of, and new ones are added every day. In addition, the operators are quite often rather resourceful when it comes to naming their user agents. On top of that, these proxies are not only offered by altruistic operators who want to help the people affected by censorship in the aforementioned countries, but partly also by crooks who deliberately hijack pages through these proxy methods in order to superimpose their own advertisements on them. Especially in these cases the cover-up tactics are so sophisticated that it is hard to get at them.
The second method takes a rather unconventional approach, which nevertheless promises a considerably higher chance of success. The basic idea is that pages are delivered to all visitors with a “Noindex” meta tag. Only when a verified search engine bot requests the page is this meta tag left out or an “Index,Follow” tag sent instead. As a result, the Googlebot would directly receive permission to index the page, while the proxy version would carry the Noindex instruction, since the proxy cannot identify itself as a genuine search engine bot. That way the search engine will have an easy time deciding which page to take into the index. Thankfully, the four large search engines (Google, Yahoo, Microsoft, Ask.com) established a uniform mechanism for verifying their bots last year: you resolve the IP address of the request to its hostname, which has to belong to one of the search engines, and then resolve that hostname back to an IP address, which has to match the original one. This lets you determine reliably whether the request came from a real Googlebot or from one of the many fake ones. An implementation of this solution in PHP could look like this, for example:
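As a rough illustration of this blocking approach, a minimal PHP sketch might look like the following. The IP addresses and user-agent fragments are made-up placeholders and the function name isBlockedProxy is hypothetical; a real list would have to be compiled from your own server logs and kept up to date by hand, which is exactly the weakness described above.

```php
<?php
// Placeholder blocklist entries (assumptions for illustration only).
const BLOCKED_IPS = ['192.0.2.10', '198.51.100.7'];
const BLOCKED_AGENT_PARTS = ['ExampleProxy', 'phpproxy'];

/**
 * Returns true if the request comes from a blocked IP address or
 * carries a user agent containing one of the blocked fragments.
 */
function isBlockedProxy(string $ip, string $userAgent): bool
{
    if (in_array($ip, BLOCKED_IPS, true)) {
        return true;
    }
    foreach (BLOCKED_AGENT_PARTS as $part) {
        // Case-insensitive substring match on the user agent.
        if (stripos($userAgent, $part) !== false) {
            return true;
        }
    }
    return false;
}

// At the top of a page script: refuse to serve blocked clients.
$ip = $_SERVER['REMOTE_ADDR'] ?? '';
$agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (isBlockedProxy($ip, $agent)) {
    http_response_code(403);
    exit('Forbidden');
}
```

A proxy that changes its IP address or sends a browser-like user agent passes this check untouched, so the sketch only helps against proxies you have already identified.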
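<?php
$ip = $_SERVER['REMOTE_ADDR'];
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$verified = false;
// Only do the DNS lookups when the user agent claims to be a known crawler.
if (preg_match('/(Googlebot|Slurp|Jeeves|msnbot)/', $agent)) {
    $host = gethostbyaddr($ip); // reverse lookup: IP address to hostname
    // The hostname must belong to a search engine and resolve back to the same IP.
    $verified = is_string($host)
        && preg_match('/(\.googlebot\.com|\.yahoo\.net|\.inktomisearch\.com|\.ask\.com|\.live\.com)$/', $host)
        && gethostbyname($host) === $ip;
}
echo '<meta name="robots" content="' . ($verified ? 'index,follow' : 'noindex,nofollow') . '">';
?>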
Naturally, I cannot conceal the fact that this solution has drawbacks. For one, you are locking out all search engines other than those listed in the script. This can make life hard for the crawlers of other, especially smaller, search engines that do not (yet) support verification through DNS and reverse DNS resolution or that have not yet been entered. Furthermore, every time a page is fetched by a client that presents one of these user agents, two DNS resolutions are needed, which can noticeably increase response times depending on the speed of the DNS server being queried. For larger sites, a solution with intelligent caching of these lookups is strongly advisable. We also cannot rule out the possibility that proxy operators will simply rewrite the meta tags or remove them altogether, which would render the method useless. All in all, this is probably the best method currently available for countering the problem of proxy hijacking.
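The caching mentioned above could be sketched roughly as follows. This is only an illustration under several assumptions: the function name isVerifiedBot, the cache layout and the default lifetime of one day are all hypothetical, and the plain array stands in for whatever persistent store (a file or a shared-memory cache, for instance) is available on your server. The hostname suffixes are the ones from the script above. The two lookup functions can be passed in, which makes the logic testable without real DNS queries.

```php
<?php
/**
 * Verifies a crawler IP by reverse and forward DNS lookup and
 * remembers the result in $cache for $ttl seconds.
 * $reverse and $forward default to gethostbyaddr and gethostbyname.
 */
function isVerifiedBot(
    string $ip,
    array &$cache,
    int $ttl = 86400,
    ?callable $reverse = null,
    ?callable $forward = null,
    ?int $now = null
): bool {
    $now = $now ?? time();

    // Serve a cached answer while it is still fresh.
    if (isset($cache[$ip]) && $cache[$ip]['expires'] > $now) {
        return $cache[$ip]['ok'];
    }

    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbyname';

    $host = $reverse($ip); // IP address to hostname
    $ok = is_string($host)
        && preg_match('/(\.googlebot\.com|\.yahoo\.net|\.inktomisearch\.com|\.ask\.com|\.live\.com)$/', $host) === 1
        && $forward($host) === $ip; // hostname back to the same IP

    // Cache negative results as well, so fake bots do not trigger lookups on every hit.
    $cache[$ip] = ['ok' => $ok, 'expires' => $now + $ttl];
    return $ok;
}
```

With a cache like this, the two DNS resolutions are only needed once per IP address and lifetime, rather than on every request that claims to be a crawler.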