OpenLinkGraph: the SISTRIX Link-Index

It has been nearly two years since we started gathering ideas and writing first drafts, and now we can finally show the first fruits of our labor: the SISTRIX OpenLinkGraph private beta went live this weekend, and we have already received valuable feedback from users. The deciding factor in developing this tool was the realisation that only our own index, which we crawl and process ourselves, could deliver the results we expect. In addition, since Microsoft took over Yahoo's search operations, Yahoo has decided to give up its own crawling ambitions. This means that the main source of link data has disappeared, which made developing our own index unavoidable.

What might sound simple at first glance turned out to be hugely challenging: billions of websites need to be prioritised, crawled and processed, and the database needs to return results within seconds. With the number of servers needed to support such a system, some of them break down every day, so the system has to cushion the impact of these failures. As one can imagine, this makes for enough complexity to make it a lot of fun.

The result of our work is a platform that can handle our current ideas and applications and is prepared for future requirements: neither the index size nor the evaluation methods will push the system to any discernible limits, which means we will be able to enjoy it for quite a while. Since an introduction to the OpenLinkGraph would be far too long for a single blog post, I will take the next few days to preview the different parts of the system. Those of you coming to Dmexco this week can stop by our booth D-69 for a live preview of the tool and take home a beta invitation.

