building xml sitemaps that stand the test of time

not every developer generates xml sitemaps the same way. i often run into situations where large portions of a site’s content are not indexed or the feed is riddled with errors.

although generating an xml sitemap is fairly standard, there are a few requirements i include in my specifications that, although optional, are worth taking the time to complete. also, if you’re curious how a sitemap is actually created, check out my post on generating xml using php first.

please take these as my own opinions based on experience, not hard facts. if you want to add to my list or correct anything i am suggesting here, please contact me at tom@cometton.com or leave a comment below.

timestamps

this tip is more specific to google news sitemaps and rss feeds because both depend on fresh content. if two major news publishers published a story on the same breaking topic, which would have more authority over the other, assuming all on-page elements are evaluated as equal? i would have to assume that google and other search engines would consider “freshness” the deciding factor. in the world of breaking news, “first-to-publish” can be an authority signal for the publisher.

in a google news sitemap, the <publication_date> tag accepts a few different timestamp formats, all of which are based on the w3c datetime specification. in order to provide the exact time an article was published, i choose the most granular format: the complete date plus hours, minutes, and seconds (the spec also allows a decimal fraction of a second).

for example, november 5, 1994, 8:15:30 am, us eastern standard time can be designated as either of the two formats:

  • 1994-11-05T08:15:30-05:00
  • 1994-11-05T13:15:30Z

the benefit, of course, is that google knows the exact date and time a story was published. this can be compared to other timestamps for the same story around the web as a means of validating freshness.
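
as a quick illustration, here’s a minimal php sketch that produces both of those timestamps for a <publication_date> value (the DateTime usage is my own example, not something the news spec prescribes):

<?php
// a minimal sketch: format a publish time in the w3c datetime format used
// by <publication_date> in a google news sitemap.
$published = new DateTime('1994-11-05 08:15:30', new DateTimeZone('America/New_York'));

// DATE_W3C is php's built-in "Y-m-d\TH:i:sP" format string
echo $published->format(DATE_W3C) . "\n";          // 1994-11-05T08:15:30-05:00

// the same moment expressed in utc with a trailing "Z"
$published->setTimezone(new DateTimeZone('UTC'));
echo $published->format('Y-m-d\TH:i:s\Z') . "\n";  // 1994-11-05T13:15:30Z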

exclude optional tags

vanessa fox had a webcast in partnership with o’reilly media on “top technical roadblocks keeping your website from being seen by searchers”. (if you haven’t seen it, i recommend you watch it.)

one tip vanessa mentioned was in regard to the optional tags for xml sitemaps. she recommends focusing on the values in the <loc> tag and not worrying about including the optional tags (starts at 24:48). i would still recommend using the <lastmod> tag above all others in order to state how fresh my content is. like i said, though, that doesn’t mean it’s necessary for every sitemap or feed type.
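
to make that concrete, here’s a small php sketch of the kind of entry i’m describing, with only <loc> and the optional <lastmod> tag populated (the url and date are placeholder values):

<?php
// a minimal sketch: one <url> entry that carries <loc> plus <lastmod>,
// skipping <changefreq> and <priority>. the values are placeholders.
$url          = 'http://example.com/some-article/';
$lastModified = new DateTime('2013-03-20', new DateTimeZone('UTC'));

echo "<url>\n";
echo '  <loc>' . htmlspecialchars($url) . "</loc>\n";
echo '  <lastmod>' . $lastModified->format('Y-m-d') . "</lastmod>\n";
echo "</url>\n";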

from a developer’s standpoint, the question is often what the minimum is that can be done. as an seo, i need to consider the resources a client has and determine what it will actually take to move the needle. it’s good to have this sort of validation on where we can lighten the development load, especially from someone who worked on the sitemaps.org alliance and the google webmaster tools project.

regenerate xml sitemaps

sites that generate massive amounts of content over a short period of time benefit from regenerating their xml sitemaps. since xml sitemaps are limited in the number of urls they can contain (50,000 per file under the sitemaps.org protocol), it’s important to create a new sitemap file when that limit is reached. if not, there is a risk google and other search engines will not index new content.
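
to show what i mean, here’s a rough php sketch that splits a url list into files of at most 50,000 urls and ties them together with a sitemap index (the url list, file names, and domain are placeholders, not a drop-in implementation):

<?php
// a rough sketch: split a large url list into sitemap files of at most
// 50,000 urls each and list the resulting files in a sitemap index.
function write_sitemap_file($filename, array $urls) {
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        $xml .= '  <url><loc>' . htmlspecialchars($url) . "</loc></url>\n";
    }
    $xml .= "</urlset>\n";
    file_put_contents($filename, $xml);
}

$allUrls = array('http://example.com/page-1/', 'http://example.com/page-2/'); // your full url list

$index  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$index .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

foreach (array_chunk($allUrls, 50000) as $i => $chunk) {
    $filename = 'sitemap-' . ($i + 1) . '.xml';
    write_sitemap_file($filename, $chunk);
    $index .= '  <sitemap><loc>http://example.com/' . $filename . "</loc></sitemap>\n";
}

$index .= "</sitemapindex>\n";
file_put_contents('sitemap_index.xml', $index);

running something like this on every publish (or on a schedule) keeps new urls from overflowing the last file.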

[screenshot: google webmaster tools graph showing indexed urls dropping after the sitemap stopped regenerating]

looking at the screenshot above, you will see that at one point google started removing a lot of our content from their index. that was the same time the sitemap stopped regenerating itself, and it wasn’t until five months later that we were able to make the technical changes needed. looking at this graph, i’m fairly confident that not having an up-to-date xml sitemap led to google removing much of our content from their index.

keep in mind that you can also name your sitemap files anything you want as long as the extension is .txt or .xml (besides xml, google accepts text and rss formats as valid sitemaps). for instance, if your ecommerce store has a category dedicated to boots, you could create a sitemap titled boots.xml that will only list urls pointing to your boot pages.

the last item i want to mention is to build your sitemap so it updates as soon as new urls are published. this is especially important for google news sitemaps, which should always be the most recent version, including all new urls. with breaking news, every bit helps to ensure the highest visibility. if only google news supported pubsubhubbub, near real-time updates would be easy.
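
a simple way to approximate this without pubsubhubbub is to hook into your publish event, rebuild the sitemap, and ping google. here’s a sketch; rebuild_news_sitemap() is a hypothetical stand-in for whatever generates your news sitemap:

<?php
// a sketch of a publish hook: regenerate the news sitemap, then ping google
// so the new url is picked up as quickly as possible.
function on_article_published() {
    rebuild_news_sitemap(); // hypothetical: rewrites /news-sitemap.xml on disk

    // google's sitemap ping endpoint (bing offers a similar one at bing.com/ping)
    $ping = 'http://www.google.com/ping?sitemap=' . urlencode('http://example.com/news-sitemap.xml');
    file_get_contents($ping); // requires allow_url_fopen; curl works just as well
}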

wrap html entities in cdata blocks

i’m not sure why the google news sitemap specification doesn’t mention wrapping the <news:title> or <news:caption> tags in cdata blocks or escaping html entities, but it’s important to do so if you are using these tags in your sitemap (google does mention it in the video sitemap specification, however). if not, you can expect to see a “parsing error” message when you try to upload the sitemap in google webmaster tools.

anytime a headline uses ampersands, single quotes, or double quotes, the characters will need to be escaped or the value wrapped in a cdata block. for instance, both values below would satisfy google’s requirements:

  • <news:title><![CDATA[no. 1 louisville rolls past north carolina a&t in ncaa tournament]]></news:title>
  • <news:title>no. 1 louisville rolls past north carolina a&amp;t in ncaa tournament</news:title>
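
if you’re building the feed in php, both options are a one-liner (the headline below is just the example from above):

<?php
// a sketch of the two escaping options for a headline containing "&".
$headline = 'no. 1 louisville rolls past north carolina a&t in ncaa tournament';

// option 1: wrap the raw value in a cdata block
// (if a value could ever contain "]]>" you would need to split it, but that's rare for headlines)
echo '<news:title><![CDATA[' . $headline . ']]></news:title>' . "\n";

// option 2: escape the entities instead (& becomes &amp;, quotes are encoded)
echo '<news:title>' . htmlspecialchars($headline, ENT_QUOTES) . '</news:title>' . "\n";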

check out google’s support page for more information about how special characters and entities should be handled.

image & video hosting

it’s touted as a performance best practice to host content on separate domains in order to increase the number of parallel downloads the browser can make. the most common assets delivered separately from the principal domain are videos and images.

if you intend to generate google image and video sitemaps, it’s important to let google know these domains exist. for starters, verify the hosting domain in google webmaster tools. remove any directives from the hosting domain’s robots.txt file that may accidentally block crawlers from finding your content. if you want to limit access for security reasons, you could either:

  • include a blank index.html file in the directory
  • use a reverse dns lookup to allow only googlebot access to your content

with the second method, you could return a 403 (forbidden) response to all other user-agents accessing the same content.
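
here’s a rough php sketch of that verification flow, based on google’s documented reverse-then-forward dns check (treat it as an outline rather than production code):

<?php
// a rough sketch: verify a visitor claiming to be googlebot via reverse dns,
// then return 403 to everyone else requesting the protected assets.
function is_verified_googlebot($ip) {
    $host = gethostbyaddr($ip); // reverse dns lookup
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;           // hostname doesn't belong to google
    }
    return gethostbyname($host) === $ip; // forward lookup must resolve back to the same ip
}

if (!is_verified_googlebot($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
// ...otherwise serve the image or video file as usual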

noindex non-essential content

you don’t want your xml sitemap showing up in google’s search results. you only want the urls found in your xml sitemap to be indexed.

[screenshot: an xml sitemap file appearing in google’s search results]

for all non-html assets, the easiest way to remove content from the index, or at least stop it from showing up in search results, is the x-robots-tag. the “noindex” value is passed via the http response headers the server sends when an asset is requested by the browser.

for example, the following directive would be placed into your .htaccess file (if you were running an apache server) to set a noindex on all xml files:

<Files ~ "\.xml$">
  # requires apache's mod_headers module
  Header set X-Robots-Tag "noindex"
</Files>
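
if you generate the sitemap dynamically with php instead of serving a static file, the same header can be sent from the script itself (a small sketch, assuming no output has been sent yet):

<?php
// a sketch: send the noindex header before any sitemap output is echoed.
header('X-Robots-Tag: noindex');
header('Content-Type: application/xml; charset=utf-8');

echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
// ...followed by the <urlset> output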

testing sitemaps via google webmaster tools

before pushing your sitemaps live, it’s good to check that they pass an initial sniff test. using google webmaster tools, you can test your sitemap for any errors or warnings that google may come across. this is a great debugging feature courtesy of google and should be included in any sitemap qa workflow.

[screenshot: sitemap errors and warnings reported by the test in google webmaster tools]

one thing to note is that a clean test does not mean your sitemap is 100% validated. there could be other issues that google won’t report on, such as urls that are noindexed, 404ing, or duplicated.

xml sitemap support for bing

while google has added new sitemap formats and protocols to push its agenda, bing has quietly stayed on the sidelines when it comes to setting standards. outside of the sitemaps.org alliance, bing hasn’t been as vocal about supporting new standards.

the good news, however, is that bing does support google’s protocols for its news and video sitemaps. in an interview with duane forrester on search engine watch in january 2013, duane recommends contacting bns@microsoft.com for inclusion. as far as feeds go, duane suggests submitting an rss feed made up of only recently published content for bing news to crawl.

if you’re looking to submit video content, bing will accept a variety of feed protocols, including google video sitemap or mrss feeds. all you have to do is contact bingfeed@microsoft.com to start the inclusion process.

more resources on creating xml sitemaps

i certainly haven’t exhausted what can be said about generating xml sitemaps, but hopefully you can walk away with a few less-than-obvious tips to apply the next time you are developing or validating a sitemap.

for more best practices on generating xml sitemaps, i recommend checking out the following links:
