Johannes Beus
Now that the big three search engines have agreed on a common Sitemaps format, the time has come to convert projects to it. The Sitemaps.org XML schema is a further development of Google Sitemaps. A Sitemap is a plain-text file in XML format and has to be UTF-8 encoded. A typical Sitemaps file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.org/</loc>
<lastmod>2006-11-12T13:19:21+01:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
The first line is a standard XML declaration and sets the encoding to UTF-8. Every single URL has to be listed inside the <urlset> element, and each entry is opened with the <url> tag and closed with </url>.
<loc> is the only tag that is mandatory within <url>. It specifies the URL. It is important to know that certain characters within the URL have to be escaped: &, ', ", > and <. They have to be replaced by their entity equivalents (&amp;, &apos;, &quot;, &gt; and &lt;).
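The escaping rule above can be sketched in a few lines of Python. This is only an illustration: the helper name escape_loc is made up for this example, and it relies on the standard library's xml.sax.saxutils.escape, which handles &, < and > by itself and accepts the two quote characters as an extra mapping.

```python
from xml.sax.saxutils import escape

# escape() already replaces &, < and >; the two quote characters
# named in the text are passed in as additional replacements.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_loc(url: str) -> str:
    """Entity-escape a URL so it can be placed inside <loc> (hypothetical helper)."""
    return escape(url, QUOTE_ENTITIES)


print(escape_loc("http://www.example.org/view?id=1&lang=en"))
# prints: http://www.example.org/view?id=1&amp;lang=en
```

The ampersand is replaced first, so the entities that are inserted afterwards are not escaped a second time.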
<lastmod> is the time the file was last changed. This helps search engines skip files that have not changed, which saves traffic for both the search engine and the site operator. The value is formatted according to ISO 8601. If the Sitemap is generated with PHP, the "c" format can be used: echo date('c', filemtime($file));
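For readers who do not generate their Sitemaps with PHP, roughly the same timestamp can be produced in Python. The following is a minimal sketch under two assumptions: the function names are invented for this example, and the time is reported in UTC (offset +00:00) rather than in the server's local time zone as in the sample file above.

```python
import os
from datetime import datetime, timezone


def iso8601_utc(timestamp: float) -> str:
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.isoformat(timespec="seconds")


def lastmod_for(path: str) -> str:
    """Return the modification time of a file as a <lastmod> value."""
    return iso8601_utc(os.path.getmtime(path))


print(iso8601_utc(0))
# prints: 1970-01-01T00:00:00+00:00
```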
<changefreq> can take the following values: always, hourly, daily, weekly, monthly, yearly, never. "always" should be used when the page shows new content every time the crawler visits, for example a page with random quotes. "never" is meant for archived pages. The value is nothing more than a recommendation as to how often the page should be crawled. It is not binding, and each search engine handles it differently.
<priority> is a number between 0.0 and 1.0 that sets the priority of the page relative to the other pages of the same site within the Sitemap. This value has no bearing on the SERPs.
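The four child elements described above can be put together with Python's xml.etree.ElementTree, which escapes the text content while serializing. This is a sketch, not a reference implementation: build_sitemap and the dictionary keys are names chosen for this example, and the validation simply mirrors the value lists given in the text.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly",
                     "monthly", "yearly", "never"}


def build_sitemap(entries) -> bytes:
    """Build a Sitemap from dicts with 'loc' and optional
    'lastmod', 'changefreq' and 'priority' keys (hypothetical layout)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # <loc> is the only mandatory child; ElementTree escapes &, < and >.
        ET.SubElement(url, "loc").text = entry["loc"]
        if "lastmod" in entry:
            ET.SubElement(url, "lastmod").text = entry["lastmod"]
        if "changefreq" in entry:
            if entry["changefreq"] not in CHANGEFREQ_VALUES:
                raise ValueError(f"invalid changefreq: {entry['changefreq']!r}")
            ET.SubElement(url, "changefreq").text = entry["changefreq"]
        if "priority" in entry:
            if not 0.0 <= entry["priority"] <= 1.0:
                raise ValueError("priority must be between 0.0 and 1.0")
            ET.SubElement(url, "priority").text = f"{entry['priority']:.1f}"
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


sitemap = build_sitemap([{
    "loc": "http://www.example.org/",
    "lastmod": "2006-11-12T13:19:21+01:00",
    "changefreq": "monthly",
    "priority": 0.8,
}])
print(sitemap.decode("utf-8"))
```

The result is returned as UTF-8 encoded bytes, so it can be written to a file opened in binary mode without any further encoding step.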
A Sitemaps file may be at most 10 megabytes (10,485,760 bytes) in size and may contain at most 50,000 URLs. The Sitemap may also be compressed with gzip, but the limit of 50,000 URLs still applies. If you have projects that exceed these limits, you can set up a Sitemap index. In it you can refer to up to 1,000 further Sitemaps, each of which can hold 50,000 URLs.
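How a large project might be split up is sketched below. Several things here are assumptions: the function names are invented, the element names sitemapindex and sitemap come from the Sitemaps.org protocol rather than from this article, and the size check is applied to the uncompressed data.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000
MAX_BYTES_PER_SITEMAP = 10_485_760
MAX_SITEMAPS_PER_INDEX = 1_000


def split_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_locations) -> bytes:
    """Build a Sitemap index that points to the individual Sitemaps."""
    if len(sitemap_locations) > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("an index may reference at most 1,000 Sitemaps")
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for location in sitemap_locations:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = location
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


def gzip_sitemap(xml_bytes: bytes) -> bytes:
    """Compress a Sitemap, checking the size limit on the uncompressed data."""
    if len(xml_bytes) > MAX_BYTES_PER_SITEMAP:
        raise ValueError("Sitemap exceeds 10,485,760 bytes")
    return gzip.compress(xml_bytes)


# 120,001 hypothetical URLs end up in three Sitemaps.
urls = [f"http://www.example.org/page{i}" for i in range(120_001)]
print([len(chunk) for chunk in split_urls(urls)])
# prints: [50000, 50000, 20001]
```

Each chunk would then be written to its own file, and the locations of those files passed to build_index.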
To inform search engines about updated Sitemaps, the protocol provides for "pinging" a search engine interface. So far, only Google has implemented this feature. To send the ping, simply request the following address:
www.google.com/webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.example.org%2Fsitemap.xml
Google will return the HTTP status code 200 to confirm that the Sitemap was correctly recognized.
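The ping address can be assembled programmatically, since the Sitemap URL has to be percent-encoded before it is appended as a parameter. The sketch below assumes plain http:// as the scheme (the address in the text is given without one) and uses invented function names. The request itself is only defined, not executed.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint as given in the text; the http:// scheme is an assumption.
PING_ENDPOINT = "http://www.google.com/webmasters/sitemaps/ping"


def ping_url(sitemap_url: str) -> str:
    """Return the ping address for a Sitemap URL."""
    # safe="" makes sure ':' and '/' are percent-encoded as well.
    return f"{PING_ENDPOINT}?sitemap={quote(sitemap_url, safe='')}"


def send_ping(sitemap_url: str) -> int:
    """Request the ping address and return the HTTP status code."""
    with urlopen(ping_url(sitemap_url)) as response:
        return response.status


print(ping_url("http://www.example.org/sitemap.xml"))
# prints: http://www.google.com/webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.example.org%2Fsitemap.xml
```

A return value of 200 from send_ping would correspond to the confirmation described above.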