How Google searches the Deep Web

Although it has been observable in the wild for over a year, Google's ability to analyze and fill out forms is still not widely known. For VLDB 2008 (the Very Large Data Bases conference), Google has published a paper that describes the background and the considerations behind it in detail.

The “Deep Web” is the part of the Internet that the common search engines do not index, because its content is generated dynamically and shown only after a form has been filled out and submitted. Since the “Deep Web” is assumed to hold noticeably more content than the “normal” web, search engines like Google naturally have a strong interest in getting this content into their index.

There are different ways of indexing the Deep Web; Google has chosen to fill out the forms and then crawl the resulting GET URLs. This has two advantages: nothing has to be changed in the actual search engine mechanism (it is still indexing quite regular URLs), and the approach is not limited to specific sites or file types. The real problem is how to create the URLs that are to be crawled, and this is where Google shows some interesting approaches. I do not want to go into too much detail (those who are interested can download the PDF), but Google has basically developed a mechanism that reaches as many different results (data) as possible with as few URLs as possible. Text fields in the forms are filled with words that belong to the respective site and from which Google expects good results; drop-down menus and similar fields are used as well. According to the paper, this method has been used to fill out a few million forms, and the results have been in the Google index for a while already.
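
To make the idea of "filling out a form and crawling the resulting GET URLs" more concrete, here is a minimal sketch in Python. It is not Google's algorithm: the function name `candidate_urls`, the form at `https://example.com/search`, its field names and the keyword list are all assumptions made up for illustration. The sketch simply combines keywords for one text field with the options of the drop-down menus and stops after a fixed number of URLs, which is a very crude stand-in for the "few URLs, many results" selection described in the paper.

```python
from itertools import islice, product
from urllib.parse import urlencode


def candidate_urls(action, select_fields, text_field, keywords, limit=10):
    """Build at most `limit` GET URLs for one (hypothetical) search form.

    action        -- the form's action URL
    select_fields -- dict mapping drop-down names to their option values
    text_field    -- name of the single free-text input
    keywords      -- words taken from the site itself (assumed to be given)
    """
    names = list(select_fields)
    # Every combination of one keyword with one option per drop-down.
    combos = product(keywords, *(select_fields[name] for name in names))
    urls = []
    # The hard cap is a naive placeholder for choosing informative URLs.
    for keyword, *choices in islice(combos, limit):
        params = {text_field: keyword, **dict(zip(names, choices))}
        urls.append(f"{action}?{urlencode(params)}")
    return urls


urls = candidate_urls(
    "https://example.com/search",      # hypothetical form action
    {"category": ["books", "music"]},  # hypothetical drop-down menu
    "q",                               # hypothetical text field
    ["deep web", "crawler"],           # hypothetical site keywords
    limit=3,
)
for url in urls:
    print(url)
```

The resulting URLs look like any other link to a crawler, which is the point of the approach: once they are generated, they can be fetched and indexed like regular pages.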

From a technological point of view, it is highly interesting to see what Google is doing here; from the point of view of SEOs and site operators, the picture is more mixed. I believe that, at the moment, it makes more sense to present the contents of your database to Googlebot through your own measures than to rely on Google's algorithm to find the fitting words and search options.
Johannes Beus - on Mon (02/02/2009) at 15:14 PM
