Johannes Beus
The level of complexity is rising: the first part dealt with on-page factors, which are easy to manipulate, and the second part with off-page factors, meaning the link network. While the latter is already considerably harder and more complex to “recreate”, we now reach another level of difficulty: how these two signals develop over time.
Conveniently, a few weeks ago Yahoo was awarded a patent titled “Using exceptional changes in webgraph snapshots over time for internet entity marking”, which gives us some clues. In short, the patent is about discovering suspicious websites that try to improve their ranking artificially. To achieve this, “snapshots” of the web and of the links within it are saved at various points in time. Comparing these snapshots makes it possible to identify extreme changes that stand out clearly from normal ones and to mark them as suspicious. The comparison is not limited to individual pages but can also be applied to hosts or whole domains. The example Yahoo gives is a website that had ten outgoing links at the time of one snapshot but more than a thousand a week later. Since natural growth should have yielded only five links, the site was marked as suspicious. The patent also addresses the problem that such growth can be perfectly normal, for example when a subject is fresh in the news and is therefore generating links. For such cases there could be a whitelist of sites and subjects that the search engine operator knows to be subject to frequent, heavy fluctuations in growth that are not ranking manipulation. The patent lists three possible reactions to the discovery of a suspicious website: complete removal from the rankings, demotion by a certain number of positions (does this not sound familiar?), and inspection by a human.
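To make the snapshot comparison more concrete, here is a minimal sketch in Python. It is not the method from the patent, only an illustration of the idea under stated assumptions: the function name `flag_suspicious`, the `tolerance` factor of 10, and the example domains are all hypothetical, and a real system would estimate the expected growth from the data instead of taking it as a fixed number.

```python
def flag_suspicious(old_links, new_links, expected_growth,
                    tolerance=10.0, whitelist=frozenset()):
    """Return entities whose link growth between two snapshots is exceptional.

    old_links / new_links map an entity (page, host, or domain) to its
    link count in the older and newer snapshot. An entity is flagged when
    its growth exceeds expected_growth * tolerance. The tolerance factor
    is an assumption made for this sketch, not a value from the patent.
    """
    limit = expected_growth * tolerance
    flagged = []
    for entity, new_count in new_links.items():
        if entity in whitelist:
            continue  # known to fluctuate heavily, e.g. news sites
        growth = new_count - old_links.get(entity, 0)
        if growth > limit:
            flagged.append(entity)
    return sorted(flagged)


# Two hypothetical snapshots taken a week apart: entity -> number of links.
week_1 = {"spammy.example": 10, "news.example": 10, "quiet.example": 10}
week_2 = {"spammy.example": 1010, "news.example": 1500, "quiet.example": 14}

suspicious = flag_suspicious(week_1, week_2, expected_growth=5,
                             whitelist={"news.example"})
print(suspicious)
```

With these made-up numbers only `spammy.example` is flagged: it grew by 1000 links where about five were expected, `news.example` grew even faster but is whitelisted, and `quiet.example` stayed within the normal range.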
Even though this patent is from Yahoo, we can be fairly sure that Google is having similar thoughts and is possibly already using comparable mechanisms. Combining different detection methods in particular should considerably lower the rate of “false positives”, that is, sites wrongly labeled as spam. In this context it is interesting that, for detecting web spam, they are apparently trying approaches similar to those that have been used against e-mail spam for some time. The Yahoo patent also makes it sound as though the detection mechanisms can be self-learning; the parallel to the usual Bayesian spam filter for e-mail is striking. The method of examining and rating a multitude of clues and then checking whether the combined “score” of all tests is lower than a set value also sounds familiar.
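The score-and-threshold idea can be sketched in a few lines of Python. This is only an illustration of the general principle known from e-mail spam filters, not how Yahoo or Google actually rate sites: the clues, their weights, the threshold of 2.5, and all names (`TESTS`, `trust_score`, `is_suspicious`) are assumptions invented for this example.

```python
# Each test is a (check, weight) pair. A site earns the weight when it
# passes the check. Clues and weights are hypothetical.
TESTS = [
    (lambda s: s["link_growth_ratio"] <= 10, 2.0),     # no exceptional link growth
    (lambda s: s["domain_age_days"] >= 365, 1.0),      # domain is not brand new
    (lambda s: s["identical_anchor_share"] <= 0.5, 1.5),  # varied anchor texts
]
THRESHOLD = 2.5  # assumed cut-off: a lower total marks the site as suspicious


def trust_score(signals, tests=TESTS):
    """Sum the weights of all tests that the site passes."""
    return sum(weight for check, weight in tests if check(signals))


def is_suspicious(signals, tests=TESTS, threshold=THRESHOLD):
    """A site is suspicious when its combined score is below the threshold."""
    return trust_score(signals, tests) < threshold


new_site = {"link_growth_ratio": 200, "domain_age_days": 30,
            "identical_anchor_share": 0.9}
old_site = {"link_growth_ratio": 1.2, "domain_age_days": 2000,
            "identical_anchor_share": 0.2}

print(trust_score(new_site), is_suspicious(new_site))
print(trust_score(old_site), is_suspicious(old_site))
```

Because no single clue decides the outcome, one unusual signal alone does not push a site below the threshold, which is how combining several tests can lower the rate of false positives. A self-learning variant would adjust the weights from examples of known spam and known good sites instead of hard-coding them.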
Is this still normal? – part I
Is this still normal? – part II
Is this still normal? – part III