Google Caffeine: SEO-Campixx presentation
I just finished giving a presentation on Google Caffeine on the second day of SEO-Campixx here in Berlin. For all of you who were unable to attend the conference, I want to summarize its most important points here on the blog. The whole presentation can be downloaded as a PDF file at seo.at.
When Marco was putting together the program for Campixx last year, he asked me around Christmas which topic he could put me down for. Since Matt Cutts had previously announced that Google would not roll out Caffeine before Christmas, I was naive enough to assume that it would be rolled out in January 2010, which would give me a nice subject to talk about. Now we are halfway through March and Caffeine still has not been rolled out. This means that the following text is more of an “educated guess” and will not necessarily hold true.
We can simplify the technology behind a search engine into three areas: crawler, index and search frontend. The last part is where the magic that SEOs call ranking happens: Google looks at the keyword, queries some indices, consults its algorithm, ranks the results accordingly and then constructs the SERPs. While most people think that this is where Caffeine will bring improvements, it is actually the index, or the “search infrastructure” as Google likes to call it, that is being improved. This is the part of the search engine that is built on software which Google designed and wrote early on. MapReduce spreads the processing of large amounts of data across many machines, BigTable is a little more powerful than a normal Excel sheet, and the Google File System (GFS) makes sure that all the files are where they are needed.
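To make the MapReduce idea a little more concrete, here is a minimal single-machine sketch in Python that builds a tiny inverted index. It only illustrates the map, shuffle and reduce steps; it is not Google's code, and the corpus and all function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
DOCS = {
    "doc1": "google caffeine index",
    "doc2": "caffeine realtime index",
}

def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in a document."""
    for word in sorted(set(text.lower().split())):
        yield word, doc_id

def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Collapse all document ids for one word into a sorted posting list."""
    return word, sorted(set(doc_ids))

def build_index(docs):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())

INDEX = build_index(DOCS)
```

In a real cluster the map and reduce calls would run on many machines in parallel, which is the whole point of the model; here they simply run one after another.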
On the one hand, GFS was developed on a tight schedule because other components are built on top of it; on the other hand, Google made some design decisions at the time that are more of a hindrance now. Principles like “high throughput is more important than low latency” do not reflect the demands that search engines have to deal with today. This is supposed to get better with GFS2, through which Google is preparing its technology for the years to come. Google is implementing this, and probably many other changes, under the name of Caffeine. All of them have the goal of preparing Google's foundations for new requirements, which probably also includes the algorithm's ability to use increasingly current signals in the future.
So what changes with Caffeine? With this new infrastructure, Google will be able to make great strides towards a realtime index. Even though many articles are already found quickly today, the current infrastructure sets limits on what can be implemented, and those limits will be gone after the update. Being able to process data more quickly will make many algorithm signals more current, or make them useful in the first place. Today, Google makes do with domain-wide signals like “trust”, but in the future we can probably expect more precise rankings for different types of documents.
Caffeine is also the basis for including more pages in the index: Ajax pages, parts of the notorious deep web and similar types of pages will find their way in. And because the software can work more efficiently with Caffeine, without the constraints of the old infrastructure, results will be returned more quickly.
Asking how to prepare for Caffeine is a bit like consulting a crystal ball: maybe things will happen this way, maybe they won't. So please don't change concepts that work well for you at the moment just because of the following lines. If the Google algorithm gains new signals after Caffeine, we can assume that they will primarily deal with realtime search. This means that it cannot hurt to establish a solid presence on services like Twitter and Facebook now. Take care to build ties to meaningful contacts and networks, and remember that, for various reasons, fake networks are much easier to expose on social networks than link networks are.
Realtime search will also require different ways for Google to acquire data. Classical crawling of the web is simply too slow for Google to reach its goals. Instead, you will notify Google about new content through new APIs (PuSH, short for PubSubHubbub, being the buzzword), where you might even pass along the actual data, too.
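As a rough illustration of such a notification, the sketch below builds the “new content” ping that a publisher sends to a PubSubHubbub hub: a form-encoded POST with hub.mode=publish and the URL of the updated feed. The hub and feed addresses are placeholders, the helper name is made up, and whether Google will accept pings in exactly this form is an assumption on my part.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_publish_ping(hub_url, topic_url):
    """Build (but do not send) a PubSubHubbub publish notification."""
    body = urlencode({
        "hub.mode": "publish",  # tells the hub that the topic has new content
        "hub.url": topic_url,   # the feed (topic) that was updated
    }).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

# Hypothetical hub and feed addresses, for illustration only.
PING = build_publish_ping("https://hub.example.com/",
                          "https://blog.example.com/feed.xml")
```

Passing PING to urllib.request.urlopen would actually send the request; the hub then fetches the feed itself and pushes the new entries on to its subscribers.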
When Google increases the size of the index through extensive crawling, this will not happen without a deeper crawl of the websites that are already in the index. Google has recently shown that AJAX is no problem for it and that the bot also likes to fill out and submit forms, and there is no clear word on whether it might downgrade robots.txt entries from binding rules to mere recommendations. This means that you should take a close look at what Google is crawling on your site over the next few weeks or months, and at what ends up in its index.
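One simple way to keep an eye on this is to count which URLs Googlebot requests according to your server's access log. The following is a minimal sketch that assumes the common “combined” log format; the sample lines, the IP addresses and the function name are invented, and since a user agent can be faked, the result is only a first indication.

```python
import re
from collections import Counter

# Assumes the "combined" log format: request line, status, size,
# referrer and user agent, with the last two in double quotes.
LINE_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_paths(lines):
    """Count requested paths for lines whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("path")] += 1
    return counts

# Invented sample lines, for illustration only.
SAMPLE = [
    '66.249.66.1 - - [15/Mar/2010:10:00:00 +0100] "GET /search?q=test HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [15/Mar/2010:10:00:05 +0100] "GET /archive/page/7 HTTP/1.1" '
    '200 128 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [15/Mar/2010:10:00:09 +0100] "GET /index.html HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

COUNTS = googlebot_paths(SAMPLE)
```

Paths with query strings or form targets that show up in COUNTS are good candidates for a closer look, because they hint at the bot going deeper than your plain links suggest.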