HowTo RSS Feed State

by Randy Charles Morin, http://www.kbcafe.com

Table of Contents

  1. Introduction
  2. RSS is just XML
  3. Syndication Hints
    1. skipHours and skipDays
    2. TTL
    3. Syndication Module
  4. HTTP is not stateless
    1. Cacheability
    2. Entity Tags
    3. Last Modified
    4. Gzip
    5. Redirect
    6. Gone
  5. Bad Ideas
  6. Glossary

Introduction

Of late, the blogosphere has been alive with claims that RSS doesn't scale. This started when Chad Dickerson, CTO of InfoWorld, wrote an article called "RSS Growing Pains," where he explained that RSS traffic at InfoWorld was out of control. Dare Obasanjo jumped to the rescue of RSS and showed how InfoWorld was using RSS incorrectly. Then, less than two months later, Robert Scoble wrote his famous piece claiming "RSS is broken." Later, Sara Williams admitted that Scoble's claims were exaggerated.

The problem with RSS is its simplicity. Developers can quickly write RSS feeds and publish content in record time. And this is great. This is why RSS is the future of the Web. The problems occur when software developers write bad code to publish and pull RSS feeds from the Web. It would be great if the RSS advisory board had a FAQ section that told us how to properly publish and pull RSS feeds, but this hasn't happened. As such, I've set out to do exactly this: show the busy developer how to properly publish and pull RSS content from the Web.

Don't think that I'm inventing a new wheel here. Everything in this document is already used by some RSS software. Everything has already been specified in the various RSS specifications, notes and HTTP specifications and extensions. Further, this is a living document that will describe the most widely used mechanisms for providing state to RSS feeds. I want the reader to think about this document and respond with his thoughts, especially where my thoughts are not the same as the mainstream thinking. Please send feedback and errata to me.

Who should implement?
RSS readers and RSS publishers.

How useful is this mechanism?
This is a must read for all RSS developers.

It is important to understand that RSS feeds don't always return the same data to two different clients making the same request at the same time. Rather, as an optimization, RSS and HTTP provide state management mechanisms in order to reduce bandwidth costs. This document will try to enumerate these methods. Each section will contain a sidebar indicating who should implement each mechanism and its overall usefulness. An example indicator is shown above.

You know we all use our own terminology and it often differs from geek to geek, so you'll find some common RSS definitions that I use at the end of the document in the glossary.

RSS is just XML

It's very important to remember that RSS is just structured XML, that is, the elements, attributes and their order are defined by a specification. There are three widely used RSS formats: RSS 0.91, RSS 1.0 and RSS 2.0. I suggest that RSS publishers use RSS 2.0, as it's the fastest growing of the formats and the most widely supported by RSS readers. I don't suggest using any versions of RSS outside of the big three. Regardless of which version of RSS you use, it's just XML. Well, some will say RSS 1.0 is RDF, but I'm not one of them.

Because it's just XML, most of this document is about XML feed state and applies equally well to other XML formats, like CDF and Atom. In fact, many of the techniques described herein also apply equally well to non-XML formats. It's not like we're inventing the wheel here; these techniques have been used for years by Web clients and servers to interchange documents.

Syndication Hints

RSS itself specifies several techniques for guiding the RSS reader in pulling RSS feeds over the Web. These techniques are quite often neglected by both RSS publishers and RSS readers. In order for these syndication hints to be effective, both the RSS publisher and the RSS reader must respect them.

skipHours and skipDays

Who should implement?
RSS readers MUST and RSS publishers CAN.

How useful is this mechanism?
Very useful in specific circumstances.

Many of us sleep and during those sleeping hours, we rarely blog. Many of us work and during those working hours, we rarely blog. So, why then are RSS readers pulling our feeds during those down hours? Well, truth is, they don't have to. RSS 2.0 and 0.91 both implement a great syndication hint that tells RSS readers when to avoid reading the RSS feed. By adding these elements to our RSS feeds, we're telling RSS readers to stop polling during these hours or even days. This can have a very positive effect on the bandwidth requirements of your Weblog. The following is an example RSS 2.0 feed that tells the RSS reading client not to poll the RSS feed during the six hours starting at 6AM GMT through 11AM GMT (that is, until noon), nor at any time on Sunday.

<rss version="2.0">
   <channel>
      <description>News and commentary from the cross-platform scripting community.</description>
      <link>http://www.scripting.com/</link>
      <title>Scripting News</title>
      <skipHours>
         <hour>6</hour>
         <hour>7</hour>
         <hour>8</hour>
         <hour>9</hour>
         <hour>10</hour>
         <hour>11</hour>
      </skipHours>
      <skipDays>
         <day>Sunday</day>
      </skipDays>
      <item>
         <title>stuff</title>
      </item>
   </channel>
</rss>

This would reduce the bandwidth required to serve the feed by about one third. Of course, this depends entirely on whether your readers use well-behaved RSS readers and on the times of day they read your blog. And if you blog all days of the week and all hours of the day, then this syndication hint won't be of much help.

A technique I once used to reduce bandwidth during rare blogging hours was to list every second hour in skipHours. This allowed RSS readers to poll my feed every second hour during my non-blogging hours and every hour otherwise. I didn't want a six-hour polling gap, just in case I was awake at 3AM and wanted to get my message out as quickly as possible.
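On the reader side, honoring these hints takes only a few lines. The following is a minimal sketch in Python, assuming an RSS 2.0 feed where hours run from 0 to 23 in GMT; the function name may_poll and the trimmed sample feed are my own illustrations, not part of any RSS library.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Trimmed copy of the example feed above (illustrative only).
FEED = """<rss version="2.0">
   <channel>
      <title>Scripting News</title>
      <skipHours>
         <hour>6</hour><hour>7</hour><hour>8</hour>
         <hour>9</hour><hour>10</hour><hour>11</hour>
      </skipHours>
      <skipDays>
         <day>Sunday</day>
      </skipDays>
   </channel>
</rss>"""

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def may_poll(feed_xml: str, now: datetime) -> bool:
    """Return False when the feed asks readers to skip the current GMT hour or day."""
    channel = ET.fromstring(feed_xml).find("channel")
    now = now.astimezone(timezone.utc)  # the hints are expressed in GMT
    skip_hours = {int(h.text) for h in channel.findall("skipHours/hour")}
    skip_days = {d.text.strip() for d in channel.findall("skipDays/day")}
    return now.hour not in skip_hours and DAY_NAMES[now.weekday()] not in skip_days
```

A reader would call may_poll with the last copy of the feed it fetched and the current time, and simply postpone the request when the answer is False.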

TTL

Who should implement?
RSS readers SHOULD, centralized RSS aggregators MUST and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful.

TTL or time to live is another great syndication hint available in RSS 2.0. It's defined as "a number of minutes that indicates how long a channel can be cached before refreshing from the source." It's a hint telling you how long you can cache the RSS feed. An RSS reader could use this hint to automatically set the polling interval for the RSS feed. The following is an example RSS feed that sets the refresh hint to two hours.

<rss version="2.0">
   <channel>
      <description>News and commentary from the cross-platform scripting community.</description>
      <link>http://www.scripting.com/</link>
      <title>Scripting News</title>
      <ttl>120</ttl>
      <item>
         <title>stuff</title>
      </item>
   </channel>
</rss>

Most RSS readers poll the source RSS feeds once per hour by default. If you don't blog very often and are not concerned with how quickly your message is read by your readers, then a larger TTL value can significantly reduce the bandwidth requirements of your RSS feed. On the other hand, if you want to get your message out there quickly and are not worried about the bandwidth consumption, then a lower TTL can get RSS clients to pull your RSS feed more often.

It's very important to note that nobody is suggesting that an RSS reader shouldn't poll the RSS feed more frequently than the TTL value indicates. Rather, the TTL value is telling the RSS reader that the feed data is good for so many minutes and that it only needs to refresh from source when the TTL is exceeded. This is a very small distinction, but an important one, because there's no contract that says an RSS client can't poll an RSS feed every five minutes regardless of the TTL value.
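A reader that wants to use the TTL as its polling interval could do something like the following. This is only a sketch in Python; the function name refresh_after and the one-hour fallback are my assumptions (the fallback mirrors the hourly default mentioned above), not anything the RSS specification mandates.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Assumed fallback when a feed has no usable <ttl>.
DEFAULT_INTERVAL = timedelta(hours=1)


def refresh_after(feed_xml: str, fetched_at: datetime,
                  default: timedelta = DEFAULT_INTERVAL) -> datetime:
    """Return the time after which the cached copy should be refreshed."""
    ttl_text = ET.fromstring(feed_xml).findtext("channel/ttl")
    try:
        minutes = int(ttl_text)
    except (TypeError, ValueError):
        # No <ttl> element, or its content is not a number.
        return fetched_at + default
    if minutes <= 0:
        return fetched_at + default
    return fetched_at + timedelta(minutes=minutes)


fetched = datetime(2004, 9, 13, 12, 0, tzinfo=timezone.utc)
with_ttl = '<rss version="2.0"><channel><ttl>120</ttl></channel></rss>'
without_ttl = '<rss version="2.0"><channel><title>t</title></channel></rss>'
next_refresh = refresh_after(with_ttl, fetched)
```

Nothing stops the reader from polling sooner; the returned time is just the point at which the cached copy is no longer vouched for by the publisher.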

Syndication Module

Who should implement?
RSS readers SHOULD and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful, but not widely implemented.

RSS 1.0 implements a mechanism similar to TTL called the RDF Site Summary Syndication Module. This module is a bit more flexible than TTL, but rarely used. Again, the technique is not a contract telling RSS readers to limit polling to this interval, but rather a hint to the RSS reader as to how often the feed is generally updated. Although the syndication module is intended for RSS 1.0, the extensibility of RSS 2.0 allows you to use it, but you might find it's not well supported.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
   xmlns="http://purl.org/rss/1.0/">
   <channel rdf:about="http://meerkat.oreillynet.com/?_fl=rss1.0">
      <title>Meerkat</title>
      <link>http://meerkat.oreillynet.com</link>
      <description>Meerkat: An Open Wire Service</description>
      <sy:updatePeriod>hourly</sy:updatePeriod>
      <sy:updateFrequency>2</sy:updateFrequency>
      <sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
   </channel>
</rdf:RDF>

The precision of this RSS extension would normally allow you to control the polling of the RSS feed exactly, but both the RSS publisher and the RSS reader must implement the mechanism, and it is sparsely implemented.
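To show how a reader might turn these elements into a polling interval, here is a small Python sketch. It assumes the module's defaults (a period of daily and a frequency of 1 when the elements are omitted) and treats updateFrequency as the number of updates per period; the function name update_interval and the approximate month and year lengths are my own choices. It ignores updateBase.

```python
import xml.etree.ElementTree as ET
from datetime import timedelta

NS = {
    "rss": "http://purl.org/rss/1.0/",
    "sy": "http://purl.org/rss/1.0/modules/syndication/",
}

# Month and year lengths are approximations chosen for this sketch.
PERIODS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}

FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
   xmlns="http://purl.org/rss/1.0/">
   <channel rdf:about="http://meerkat.oreillynet.com/?_fl=rss1.0">
      <title>Meerkat</title>
      <sy:updatePeriod>hourly</sy:updatePeriod>
      <sy:updateFrequency>2</sy:updateFrequency>
   </channel>
</rdf:RDF>"""


def update_interval(feed_xml: str) -> timedelta:
    """Return the expected time between updates hinted at by the feed."""
    channel = ET.fromstring(feed_xml).find("rss:channel", NS)
    period = (channel.findtext("sy:updatePeriod", "daily", NS) or "daily").strip()
    frequency = int(channel.findtext("sy:updateFrequency", "1", NS) or "1")
    return PERIODS[period] / frequency


interval = update_interval(FEED)  # hourly, twice per period: 30 minutes
```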

HTTP is not stateless

To this point, I've identified a few RSS mechanisms for controlling the state of RSS feeds. The next few sections describe the state of your RSS feed as a Web document. Remember that your RSS feed is really just another object that is transferred over HTTP, the protocol of the Web. As such, it takes on all the attributes of HTTP objects and, contrary to what we'd like to believe, HTTP is not stateless. Whenever you make an HTTP request and whenever you receive an HTTP response, the HTTP message contains a set of headers that are easily extended to provide HTTP with all sorts of state data. Many of these state attributes and other HTTP attributes are described in the following sections.

Cacheability

Who should implement?
RSS readers MUST and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful.

HTTP implements a very elaborate mechanism for increasing performance called response caching. The algorithm is described in the HTTP/1.1 RFC in various sections. The RFC sections on Caching in HTTP and Cache-Control describe most of what you would require in order to implement an HTTP response cache, but the protocol is quite extensive and many HTTP libraries have these mechanisms built in. I suggest using one of the existing libraries.

The advantages of Cacheability in the arena of RSS only arise when you use shared caches, that is, when several RSS readers are behind the same Web proxy. If several RSS readers are behind the same Web proxy, then the feed can be cached by the Web proxy and served to more than one user.

The question always arises of how to implement HTTP/1.0 cacheability. First, any RSS readers or publishers that implement HTTP/1.0 are simply wrong. HTTP/1.1 is widely implemented and every RSS reader should be making HTTP/1.1 requests. That said, there are a lot of RSS readers that have chosen, for some ridiculous reason, to implement HTTP/1.0. I almost feel like telling you to ignore HTTP/1.0 requests, but you might not like this response. Rather, I suggest that all RSS readers make HTTP/1.1 requests, knowing that all RSS publishers have implemented this version of the protocol, and that RSS publishers handle HTTP/1.1 requests as well as possible and HTTP/1.0 requests as minimally as possible. In other words, do the least work possible when handling cacheability and all the other HTTP attributes of HTTP/1.0 requests. Again, that said, if you find your bandwidth is out of control, implementing HTTP/1.0 attributes for cacheability and compression can help, but not much.

If you need more help implementing Cacheability, then Mark Nottingham has a great article on HTTP Caches.
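To give a feel for what a cache decides, here is a deliberately tiny Python sketch that reads only the max-age, no-cache and no-store directives of a Cache-Control header. The function names are mine, and a real cache handles far more (Expires, Age, Vary, quoted values containing commas and so on), which is exactly why an existing library is the better choice.

```python
from typing import Optional


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Return max-age in seconds, 0 for no-cache/no-store, None if unspecified."""
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        directives[name.lower()] = value.strip('"')
    if "no-store" in directives or "no-cache" in directives:
        return 0
    if directives.get("max-age", "").isdigit():
        return int(directives["max-age"])
    return None


def is_fresh(cache_control: str, age_seconds: int) -> bool:
    """True when a cached response of the given age may be served without refetching."""
    limit = max_age_seconds(cache_control)
    return limit is not None and age_seconds < limit
```

With a response header of "public, max-age=3600", a shared proxy could hand the same cached feed to every reader behind it for up to an hour.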

Entity Tags

Who should implement?
RSS readers MUST and RSS publishers SHOULD.

How useful is this mechanism?
Very useful in reducing bandwidth requirements.

Entity tags, or ETags, are cache validators, typically a hash of the response content. The ETag is passed in the HTTP response headers. The client saves the ETag and, the next time it requests the same URL, includes the ETag in the If-None-Match header. If the ETag matches the current representation, then the Web server responds with the HTTP 304 status code and no content. This tells the RSS reader that the content has not changed since the previous request. If the ETag doesn't match the current representation, then the RSS feed is returned in the response content, as usual.
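The publisher side of this exchange can be sketched in a few lines of Python. The function names etag_for and respond are hypothetical, and hashing the body with SHA-1 is just one possible way to produce a tag; the sketch also skips weak validators (the W/ prefix).

```python
import hashlib
from typing import Dict, Optional, Tuple


def etag_for(body: bytes) -> str:
    """Derive a quoted entity tag from the feed bytes (one possible scheme)."""
    return '"' + hashlib.sha1(body).hexdigest() + '"'


def respond(body: bytes,
            if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, content) for a GET request of the feed."""
    etag = etag_for(body)
    headers = {"ETag": etag}
    # If-None-Match may carry several comma-separated tags, or "*".
    candidates = [tag.strip() for tag in (if_none_match or "").split(",")]
    if etag in candidates or "*" in candidates:
        return 304, headers, b""  # unchanged: send no content
    return 200, headers, body


feed = b"<rss version='2.0'><channel><title>stuff</title></channel></rss>"
first_status, first_headers, first_content = respond(feed, None)
# The reader stores first_headers["ETag"] and sends it back next time.
second_status, _, second_content = respond(feed, first_headers["ETag"])
```

The second request costs only headers on the wire, which is where the bandwidth saving comes from.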

Last Modified

Who should implement?
RSS readers MUST and RSS publishers CAN.
RSS publishers SHOULD use Entity Tags.

How useful is this mechanism?
Somewhat useful in reducing bandwidth requirements.

Another form of cache validation is the Last-Modified header. It works similarly to the ETag, except that it's based on time, rather than some sort of content hash. That's not to say that ETags can't also be dates, but ETags are not limited to dates, whereas Last-Modified headers are dates and only dates.

ETags are often referred to as the strong cache validator because they are not based on an artificial hash of the image. Dates, on the other hand, are an artificial hash of an image. To explain, if you have a hit counter on a page, then two simultaneous pulls would produce the same Last-Modified date but two different ETags. It's an insignificant difference, but a difference nonetheless. This is not to say that an ETag must be a strong cache validator; it may also be weak.

Last-Modified works in the same manner as the Entity Tags, except that the Last-Modified header value returned in an HTTP response is passed as the If-Modified-Since HTTP header in future HTTP requests.
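Here is how the date comparison might look on the publisher side, as a Python sketch. The function names are my own, and the sketch assumes the feed's modification time is a timezone-aware datetime; HTTP dates carry whole seconds only, so microseconds are dropped before comparing.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def last_modified_header(modified: datetime) -> str:
    """Format an aware datetime as an HTTP date, e.g. for the Last-Modified header."""
    return format_datetime(modified.astimezone(timezone.utc), usegmt=True)


def is_not_modified(if_modified_since: Optional[str], modified: datetime) -> bool:
    """True when the server may answer 304 to a request carrying If-Modified-Since."""
    if not if_modified_since:
        return False
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # unparseable date: serve the full feed
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    return modified.replace(microsecond=0) <= since


modified = datetime(2004, 9, 13, 12, 0, 0, tzinfo=timezone.utc)
header = last_modified_header(modified)
```

The reader's part is trivial by comparison: store the Last-Modified value it received and echo it back unchanged as If-Modified-Since.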

Gzip

Who should implement?
RSS readers SHOULD and RSS publishers CAN.

How useful is this mechanism?
Very useful in reducing bandwidth requirements.

HTTP also provides a mechanism for compressing the response content. The RSS reader can pass either gzip or compress in the Accept-Encoding HTTP header to tell the Web server that it is capable of understanding compressed responses. Gzip works the same for both HTTP/1.0 and HTTP/1.1.
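A publisher could honor that header roughly as follows. This Python sketch uses a hypothetical function name, encode_body, and takes shortcuts: it only looks for the gzip token and ignores quality values such as gzip;q=0.

```python
import gzip
from typing import Dict, Tuple


def encode_body(body: bytes, accept_encoding: str) -> Tuple[Dict[str, str], bytes]:
    """Return (extra headers, payload), compressing when the client accepts gzip."""
    offered = {token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")}
    if "gzip" in offered:
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return headers, gzip.compress(body)
    return {}, body  # client did not ask for gzip: send the feed as-is


feed = b"<item><title>stuff</title></item>" * 200
headers, payload = encode_body(feed, "gzip, compress")
```

RSS is repetitive markup, so it tends to compress very well; the reader just has to decompress the payload when it sees Content-Encoding: gzip in the response.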

Gzip, ETag and Cacheability are not widely supported by all RSS publishers and readers, but the goal of this document is to change that. RSS software developers, let's get busy.

Redirect

Who should implement?
RSS readers MUST implement 302 redirect.
RSS readers SHOULD implement 301 redirects.
RSS readers SHOULD implement XML redirects.
RSS publishers CAN implement any.

How useful is this mechanism?
Very useful in maintaining subscriptions.

Sometimes you need to move a feed from one URL to another. For example, services like FeedBurner host your RSS on your behalf. After hosting with FeedBurner you might decide to move your feed back to a URL within your own domain. FeedBurner implements an HTTP 301 Permanent Redirect for a ten day period after you decide to move your feed URL.

Most RSS readers currently treat HTTP 301 permanent redirects as temporary redirects and don't update their database. This is a good start, but it causes an extra, redundant network round trip on every request going forward. Rather, RSS clients SHOULD update their database, replacing the original RSS feed URL with the new RSS feed URL returned by the HTTP 301 response.

RSS feed publishers may also temporarily redirect your RSS feed URL by returning an HTTP 302 Temporary Redirect. RSS clients MUST NOT update their database when they receive an HTTP 302 response.

A last type of redirect is an XML-level redirect. These should be treated the same as an HTTP 301 response.

<?xml version="1.0"?>
<redirect>
   <newLocation>http://feeds.feedburner.com/TheRssBlog</newLocation>
</redirect>
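The three redirect cases can be folded into one small decision function on the reader side. The following Python sketch is an illustration only: the function names, the example.com URLs and the idea of returning a pair (URL to fetch next, URL to keep in the subscription database) are my assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def xml_redirect_target(body: str) -> Optional[str]:
    """Return the newLocation of an XML-level redirect document, or None."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None
    if root.tag != "redirect":
        return None
    return (root.findtext("newLocation") or "").strip() or None


def handle_response(subscribed_url: str, status: int,
                    location: Optional[str] = None,
                    body: str = "") -> Tuple[str, str]:
    """Return (URL to fetch next, URL to store in the subscription database)."""
    if status == 301 and location:
        return location, location        # permanent: update the database
    if status == 302 and location:
        return location, subscribed_url  # temporary: keep the original URL
    if status == 200:
        target = xml_redirect_target(body)
        if target:
            return target, target        # XML redirect: treat like a 301
    return subscribed_url, subscribed_url


OLD = "http://example.com/old.xml"  # hypothetical URLs
NEW = "http://example.com/new.xml"
REDIRECT_DOC = ('<?xml version="1.0"?><redirect><newLocation>'
                + NEW + '</newLocation></redirect>')
```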

Gone

Who should implement?
RSS readers MUST and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful.

Finally, when an RSS feed is over, when its life has been served and you no longer want to incur the bandwidth of clients repeatedly requesting a stale RSS feed for the rest of time, how do you tell the client to stop requesting it? This is simple. If you respond with an HTTP 410 status code, then you are "notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed." That's a quote from the HTTP RFC.

But not everybody has enough control over the Web server to respond with a 410. Especially if you are on a shared Web server, this may not be possible at all. In this case, I suggest you respond with the following tidbit of XML, which tells the RSS reader that you are no longer servicing this request.

<?xml version="1.0"?>
<redirect>
   <newLocation/>
</redirect>
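A reader could recognize both signals with a check like the one below. It is a Python sketch with a hypothetical function name, feed_is_gone; what the reader does afterwards (unsubscribe, or flag the feed for the user) is left open.

```python
import xml.etree.ElementTree as ET


def feed_is_gone(status: int, body: str = "") -> bool:
    """True for an HTTP 410, or for a redirect document with an empty newLocation."""
    if status == 410:
        return True
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False  # empty or non-XML body: not a redirect document
    if root.tag != "redirect":
        return False
    location = root.find("newLocation")
    return location is not None and not (location.text or "").strip()


GONE_DOC = '<?xml version="1.0"?><redirect><newLocation/></redirect>'
# Hypothetical URL, used only to show a redirect that is not a "gone" signal.
MOVED_DOC = ('<?xml version="1.0"?><redirect>'
             '<newLocation>http://example.com/feed.xml</newLocation></redirect>')
```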

Bad Ideas

This section is an enumeration of HTTP bandwidth-saving techniques that simply don't work. The reason is almost always that nobody actually supports them.

Glossary

Centralized RSS Aggregator
A Web service that aggregates many RSS feeds on behalf of many RSS subscribers. A type of RSS Aggregator. Examples include Bloglines and Feedster.
Desktop RSS Aggregator
A client application, residing on the user's desktop, that reads many RSS feeds on behalf of one RSS subscriber. A type of RSS aggregator. Examples include Sharpreader and RSSBandit.
RSS Aggregator
A Web service that aggregates many RSS feeds on behalf of one or more RSS subscribers. A type of RSS reader.
RSS Publisher
A Web server that publishes RSS feeds for retrieval via HTTP or other transport. Examples include ScriptingNews and BoingBoing.
RSS Reader
An application that reads many RSS feeds on behalf of one or more RSS subscribers. Most RSS readers are also RSS aggregators, with a few exceptions. Firefox reads RSS files, but does nothing more than compile the items in a menu tree. Therefore, it's an RSS reader, but not an RSS aggregator.
RSS Subscriber
A Web user who reads RSS feeds using a Web browser, an RSS aggregator or an RSS reader.