Crawling Errors in the Optimizer

Modified: 23.12.2020

There can be times when the SISTRIX Crawler cannot completely capture all content on a page. Here, we take a look at the most common causes and show you solutions to these problems.

The SISTRIX crawler

All access related to the SISTRIX Toolbox is carried out by the SISTRIX Crawler. This crawler can be identified by two distinct traits. The first is the user-agent, which is submitted every time a page is accessed. By default, the user-agent is:

Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/)

The second is the reverse DNS entry: all IP addresses of the SISTRIX Crawler resolve to a hostname under the domain “sistrix.net”. Our crawler on the IP 136.243.92.8, for example, returns the reverse DNS entry 136-243-92-8.crawler.sistrix.net.
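
The two traits described above can be checked from your own server logs. The following is a minimal Python sketch, not an official SISTRIX tool. The function names are hypothetical, the hostname pattern is generalised from the single example above, and the forward lookup at the end is a common extra safeguard that this article does not describe.

```python
import socket

# Suffix taken from the article: crawler hostnames belong to "sistrix.net".
CRAWLER_DOMAIN = ".sistrix.net"


def expected_crawler_hostname(ip: str) -> str:
    """Build the hostname form seen in the example (assumed to generalise)."""
    return ip.replace(".", "-") + ".crawler.sistrix.net"


def is_sistrix_hostname(hostname: str) -> bool:
    """True if the hostname is a subdomain of sistrix.net."""
    return hostname.lower().rstrip(".").endswith(CRAWLER_DOMAIN)


def verify_crawler_ip(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup.

    Needs working DNS; returns False if any lookup fails.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not is_sistrix_hostname(hostname):
        return False
    try:
        _name, _aliases, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in forward_ips
```

Calling verify_crawler_ip("136.243.92.8") performs live DNS lookups, so the result depends on your resolver and on the current DNS records.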

The SISTRIX Crawler continuously monitors the loading speed of the pages it visits and adjusts the rate at which it requests new pages accordingly. This way, we can ensure that we do not overload the web server. More information is available at crawler.sistrix.net.

In the Optimizer you also have the ability to control the user-agent and the crawl-intensity of the Optimizer Crawler. You will find these settings in each project under “Project-Management > Crawler” in the boxes “Crawling Settings” and “Crawling Speed”.

robots.txt

Before first accessing a website, our crawler requests a file named “robots.txt” from the root directory of each hostname of the domain. If the crawler finds this file, it analyses it and strictly observes the rules and restrictions it contains. Rules addressed specifically to “sistrix” are honoured, as are general rules with the identifier “*”. If you use a robots.txt file, please check its contents to make sure that the SISTRIX Crawler has not been restricted accidentally.

If you refer to a sitemap in the robots.txt, our crawler will access it as a crawl base.
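
To check a robots.txt file for accidental restrictions, you can test it offline. The sketch below uses Python's standard urllib.robotparser module with a made-up robots.txt and the placeholder host www.example.com. Python's matching rules may differ in detail from those of the SISTRIX Crawler, so treat the result as a first indication only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a general rule set, a rule set that blocks
# "sistrix" completely, and a sitemap reference.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: sistrix
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
"""


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already available as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(SAMPLE_ROBOTS_TXT)

# The specific "sistrix" group applies to the token "sistrix".
sistrix_allowed = parser.can_fetch("sistrix", "https://www.example.com/")

# Any other bot falls back to the general "*" group.
other_allowed = parser.can_fetch("examplebot", "https://www.example.com/")

# Sitemap lines are collected separately (Python 3.8 or newer).
sitemaps = parser.site_maps()

print(sistrix_allowed, other_allowed, sitemaps)
```

In this sample the “sistrix” group blocks the whole site while other bots are only kept out of /private/, which is the kind of accidental restriction worth looking for.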

Cookies

The SISTRIX Crawler will not save cookies while checking a page. Please ensure that our crawler can access all parts of a page without having to accept cookies. You will find the IP of our crawler inside the “Project-Management” under “Crawler-Settings”.

JavaScript

Our crawler does not execute JavaScript. Please ensure that all pages are accessible as static HTML pages so that our crawler can analyse them.
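
A quick way to see what a crawler without JavaScript can find is to extract the links from the raw HTML only. The following Python sketch uses the standard html.parser module. The sample page and the names LinkCollector and static_links are made up for illustration, and the SISTRIX Crawler's own parser may behave differently.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags present in the static HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html: str) -> list[str]:
    """Return the links visible without running any JavaScript."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page: one static link, one link that only exists after
# the script has run in a browser.
SAMPLE_HTML = """
<html><body>
  <a href="/products/">Products</a>
  <div id="menu"></div>
  <script>
    document.getElementById("menu").innerHTML = '<a href="/offers/">Offers</a>';
  </script>
</body></html>
"""

print(static_links(SAMPLE_HTML))
```

Only the static link is found here. The link that the script would insert stays invisible, because the script content is never executed.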

Server side restrictions

The SISTRIX Crawler can also be blocked on the server side. In this case, our crawler receives an error message with the HTTP status code 403 (Forbidden) when first accessing a page and is then unable to access any pages on this server. Such a server-side restriction may be put in place at different system levels. A good starting point is to check the “.htaccess” file if you use the Apache web server. If you find no clues there, you should contact your provider or host. Unfortunately, we are not able to deactivate these restrictions ourselves.
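
As a first diagnostic, you can request one of your own pages with the default user-agent string of the SISTRIX Crawler and look at the status code. The Python sketch below does this with the standard library; the function names are hypothetical. It can only reveal restrictions that depend on the user-agent. A block based on the crawler's IP address will not show up, because the request comes from your own machine.

```python
import urllib.error
import urllib.request

# Default user-agent string quoted earlier in this article.
SISTRIX_USER_AGENT = (
    "Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/)"
)


def build_request(url: str, user_agent: str = SISTRIX_USER_AGENT) -> urllib.request.Request:
    """Create a GET request that sends the given user-agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url: str, user_agent: str = SISTRIX_USER_AGENT, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for this user-agent.

    Redirects are followed automatically by urlopen. Connection problems
    (DNS failure, timeout) are not handled here and raise an exception.
    """
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; 403 means blocked.
        return error.code
```

If fetch_status returns 403 with this user-agent but 200 with a browser user-agent, a server rule that targets the user-agent is a likely cause.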

Examples of common restrictions

robots.txt restrictions

If the robots.txt restricts our Optimizer crawler, you will get a “robots.txt blocks crawling” error. Please check whether there are general (User-Agent: *) or specific (User-Agent: Sistrix) restrictions in your robots.txt. If you changed the user-agent in the crawler settings of your project, please also check for rules that apply to that user-agent.

Only a small number or no pages were crawled

There are several reasons why our crawler may have crawled only a small number of pages or none at all. In the Optimizer project, go to “Analyse > Expert Mode”. There you will find an extensive list of all crawled HTML documents on the domain. You can find the status code by scrolling a little to the right in the table. This should tell you why not all pages associated with this domain have been crawled.

200: If the status code is 200 but no other pages have been crawled, the reason is often one of the following:
- Missing internal links: Our crawler follows all internal links that are not blocked for the crawler. Please check that there are internal links on the starting page and whether the target pages might be blocked for our crawler by either the robots.txt or the crawler settings.
- Geo-IP settings: Some websites check the visitor’s IP address for its country of origin in order to present the website in the matching language. All of our crawlers are based in Germany, so you need to whitelist our crawler IP if you want it to access all language versions available behind a Geo-IP barrier.
301 / 302: If the status code 301 or 302 appears, please check whether the link leads to a different domain, for example sistrix.at, which leads to sistrix.de via a 301 redirect. The Optimizer crawler always stays on the domain (or the host or directory) entered in the project settings. If you create a project for sistrix.at, our crawler recognises the 301 redirect and shows it in the Expert Mode, but does not follow the redirect to sistrix.de, as this is a different domain.
403: If the status code 403 is delivered instantly, or if after a few crawlable pages (status code 200) only 403 codes are shown, you should check why the server restricts our crawler from requesting the pages. Please refer to the entry for “Server side restrictions”.
5xx: If a 5xx status code (for example 500) is shown in the status code field, the server was not able to process our request due to a server error. In this case, you should wait a few minutes and then use the “Restart Crawler” button in the “Project-Management” menu. If the 5xx status code keeps showing up, check why the server is overloaded and unable to deliver the pages.
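
The status codes above can be summarised as a small lookup. The following Python sketch is only an illustration of the list; the function names and hint texts are made up. The redirect check compares hostnames only, which is a simplification of “domain (or host or directory)”.

```python
from urllib.parse import urlsplit


def leaves_project_host(project_url: str, redirect_target: str) -> bool:
    """True if a redirect points to a different hostname than the project.

    Simplified: only hostnames are compared, not domains or directories.
    """
    return urlsplit(project_url).hostname != urlsplit(redirect_target).hostname


def status_hint(status: int) -> str:
    """Map a status code from the Expert Mode table to a first thing to check."""
    if status == 200:
        return "Check internal links on the start page and Geo-IP settings."
    if status in (301, 302):
        return "Check whether the redirect leaves the project domain."
    if status == 403:
        return "Check server side restrictions such as .htaccess rules."
    if 500 <= status <= 599:
        return "Server error: wait a few minutes, then restart the crawler."
    return "Not covered by this article."


# Example from the text: a project for sistrix.at that redirects to sistrix.de.
print(status_hint(301))
print(leaves_project_host("https://sistrix.at/", "https://sistrix.de/"))
```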

Why does Google find other/more content than SISTRIX?

Our crawler always begins with the starting page of the project, though more start pages may be added in the crawler settings. From this point on, we follow all internal links that are not blocked. On these linked pages, we again follow all internal links, and we continue until there are no internal links left that we have not yet requested.

What can happen is that, for example, AdWords landing pages that aren’t linked internally do not appear in the results. Such pages are usually left unlinked so that they do not influence the AdWords tracking. This means that they are invisible to our crawler. Google, of course, is aware of these pages.
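
The link-following process and the effect of unlinked landing pages can be sketched with a small in-memory link graph. This Python example is a simplified model, not the actual SISTRIX implementation: the site structure is made up, and the breadth-first order is an assumption. Only the question of which pages are reachable matters here.

```python
from collections import deque


def crawl(start_pages, links, blocked=frozenset()):
    """Follow internal links from the start pages until nothing new is found.

    links maps each page to the internal links found on it.
    blocked holds pages that must not be requested (for example via robots.txt).
    """
    seen = set()
    queue = deque()
    for page in start_pages:
        if page not in blocked and page not in seen:
            seen.add(page)
            queue.append(page)

    crawled = []
    while queue:
        page = queue.popleft()
        crawled.append(page)
        for target in links.get(page, []):
            if target not in seen and target not in blocked:
                seen.add(target)
                queue.append(target)
    return crawled


# Hypothetical site: the landing page links out, but nothing links to it.
SITE = {
    "/": ["/products/", "/about/"],
    "/products/": ["/products/widget/", "/"],
    "/about/": [],
    "/products/widget/": [],
    "/landing/ads-campaign/": ["/products/"],
}

default_crawl = crawl(["/"], SITE)
extended_crawl = crawl(["/", "/landing/ads-campaign/"], SITE)

print(default_crawl)
print(extended_crawl)
```

With only “/” as the start page, the landing page never appears in the result. Adding it as an extra start page, as the crawler settings allow, makes it part of the crawl.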

If you have submitted a sitemap of your project to Google, it can pay off to link to it inside the robots.txt as well. That way, our crawler can recognise and use it as a crawl base.

Another reason why there may be a difference between the number of indexed pages in the Google search and the number of crawled pages in your Optimizer project is duplicate content in Google’s search index.

From:

SISTRIX Content Team

Published: 19.01.2016