Rowan's World, et Cetera

Cached Redirects Considered Harmful (and how browsers can fix them)

by Rowan on 9 October, 2011

There are a lot of URLs out there on the web, and a great many of them are either alternative names for something, or old locations that have been superseded. So “redirects” from one URL to another are a common feature of the web, and have been for many years. But recently, the way these redirects behave has been changing, because performance-conscious browser developers have started caching redirects, rather than re-requesting them from the server every time.

In theory, this makes perfect sense, but in practice, it causes web developers like me a lot of pain, because nothing “permanent” is actually that permanent. I’m not saying no browser should ever cache a redirect, but I do have a few suggestions of ways they could be a little more helpful about it.

The Technology

Let’s be clear what we’re talking about – the HTTP/1.1 specification defines a class of status codes grouped as “Redirection” statuses, in the range 3xx. The main candidate for browser caching is status code 301, which the specification labels “Moved Permanently”.[1] The specification explicitly states that “this response is cacheable unless indicated otherwise” – that is, unless the response also includes headers specifically relating to cache control, a User Agent can assume that it’s OK to cache this response and not re-request the original URL. It doesn’t go into any more detail, and neither this nor the next sentence (about “clients with link editing capabilities”) is couched in the standard RFC form of “User Agents MAY …” (let alone “SHOULD”), but the clear message is “this old URI is irrelevant, just use the new one”.

So, newer versions of Firefox, Chrome, and Internet Explorer have all started “obeying” this part of the standard, and opting to cache all HTTP 301 responses if not told otherwise.

The Problem

So, what’s the big problem? Well, the fact that a “resource has been assigned a new permanent URI” doesn’t mean that there won’t be some point in time where someone wants to use the old URI for something else. Worse, people have an unhelpful habit of changing their decisions later.

Consider this scenario:

  1. A company has a product called a “Thingummy™”, with a description at http://example.com/thingummy/.
  2. They decide that the name is too unwieldy, so re-brand it as “Widget™”. Knowing that cool URIs don’t change, the developers permanently redirect the old URL to the new page about the same product, at http://example.com/widget/. Old links remain valid, and direct customers to the information they were looking for.
  3. A couple of years down the line, the company decides to boost sales by releasing a “classic” version of the Widget™ under the old Thingummy™ brand. They put up a new page at http://example.com/thingummy/. Sadly, some customers continue being redirected to http://example.com/widget/ by their browsers’ cache.
  4. With Thingummies™ massively out-selling Widgets™, the company comes full circle, and abandons the new brand. The developers put in a redirect that permanently points http://example.com/widget/ back to http://example.com/thingummy/. At this point, all hell breaks loose. Well, maybe not, but customers trying to access the product page find mysterious messages on their screen about “infinite redirects”. Management are not impressed.
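
To make the failure concrete, here is a rough Python sketch of the scenario above. It is a toy model, not how any real browser is implemented: the CachingBrowser class, the RedirectLoopError exception, and the limit of 20 redirects are all names and numbers I have made up for illustration, and the “server” is just a dictionary mapping a URL to the target of its 301 (a URL with no entry serves a normal page).

```python
MAX_REDIRECTS = 20  # assumed limit; real browsers choose their own


class RedirectLoopError(Exception):
    """Raised when the toy browser gives up following redirects."""


class CachingBrowser:
    """Toy browser that remembers every 301 it has seen, forever."""

    def __init__(self):
        self.redirect_cache = {}  # original URL -> redirect target

    def fetch(self, url, server):
        """Return the URL of the page finally displayed.

        `server` maps a URL to its 301 target; a missing key means
        the server answers with a normal page at that URL.
        """
        for _ in range(MAX_REDIRECTS):
            if url in self.redirect_cache:
                # Cached redirect: the server is never asked.
                url = self.redirect_cache[url]
                continue
            target = server.get(url)
            if target is None:
                return url
            self.redirect_cache[url] = target
            url = target
        raise RedirectLoopError("too many redirects for " + url)


if __name__ == "__main__":
    browser = CachingBrowser()
    # Step 2: the old URL redirects to the new one.
    print(browser.fetch("/thingummy/", {"/thingummy/": "/widget/"}))
    # Step 3: the redirect is gone, but the cache still answers.
    print(browser.fetch("/thingummy/", {}))
    # Step 4: the reverse redirect collides with the cached one.
    try:
        browser.fetch("/thingummy/", {"/widget/": "/thingummy/"})
    except RedirectLoopError as err:
        print("error:", err)
```

In step 3 the server is never contacted for the old URL at all, which is exactly why the developers cannot fix the problem from their end.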

How Permanent is “Permanently”?

The biggest problem in all this is that browser developers seem to have taken the word “permanent” rather too literally. The Random House dictionary[2] lists 4 definitions of permanent; the first is:

existing perpetually; everlasting, especially without significant change

In theory, this seems a reasonable definition, but in reality nothing lasts forever. Nothing physical, nothing electronic, and certainly not the structure of a website. So clearly, when the HTTP specification says “a new permanent URI”, it doesn’t actually expect you to guarantee its perpetual existence. Let’s try the second definition:

intended to exist or function for a long, indefinite period without regard to unforeseeable conditions

Ah, now that’s more like it – by that definition, a 301 status indicates a redirect which will remain valid for “a long, indefinite period”, unless there are “unforeseeable conditions”.

I did a quick test earlier, using Firefox 7 to visit a URL which returned a 301 status and no cache instructions, then inspecting the resulting cache entry using the built-in “about:cache” page. Among the details is this – “expires: No expiration time”. That’s pretty permanent, by the first definition.

Effectively, having once seen that 301 response, the browser is never going to request the original URL again, simply because the web developer, not foreseeing any conditions in which they would want to change the decision, didn’t mention an expiry date. (And I thought 24-hour TTLs on DNS entries were annoying…)

→ Suggestion 1: Cache entries for an unadorned 301 response should have a reasonable default life time, not last forever.

→ Filed as Mozilla Bugzilla Bug 696595
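
As a minimal sketch of what Suggestion 1 might look like, here is a toy redirect cache in Python that gives every entry an expiry time. The RedirectCache class and the one-day default are my own assumptions for illustration; I am not claiming any browser works this way, or that one day is the right number.

```python
import time

DEFAULT_REDIRECT_TTL = 24 * 60 * 60  # assumed default: one day, in seconds


class RedirectCache:
    """Toy cache in which no redirect lives forever."""

    def __init__(self, default_ttl=DEFAULT_REDIRECT_TTL, clock=time.time):
        self._default_ttl = default_ttl
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # URL -> (target, expiry timestamp)

    def store(self, url, target, max_age=None):
        """Remember a 301; `max_age` is the server's lifetime, if it sent one."""
        ttl = self._default_ttl if max_age is None else max_age
        self._entries[url] = (target, self._clock() + ttl)

    def lookup(self, url):
        """Return the cached target, or None if absent or expired."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        target, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[url]  # stale: forget it and ask the server again
            return None
        return target
```

The only difference from an “immortal” cache is the fallback in store(): an unadorned 301 gets the default lifetime instead of no expiry at all, while a server that does send an explicit lifetime is still obeyed.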

Chasing your own Tail

If you’re foolish enough to have created an immortal redirect, or unlucky enough to have inherited one, you might find yourself wanting to put a new redirect pointing back the other way, as in my example above. But if you do, any browser which saw (and cached) the old redirect will simply see the new one as well, and follow both back and forth, back and forth, back and forth … until eventually it decides there’s an infinite loop and throws the user an error.

So instead, you have to resort to all sorts of confusing workarounds to keep everything working.

But the browser knows something is wrong, and it’s not the user’s fault; it’s bad data in the browser cache. (Depending on how you look at it, that’s the web developer’s fault, but a browser that doesn’t cut developers some slack won’t get very far rendering real-world HTML…)

→ Suggestion 2: When an infinite redirect is detected, try skipping the cache.

→ Filed as Mozilla Bugzilla Bug 696646
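
Here is a rough Python sketch of how Suggestion 2 could behave. Everything in it is hypothetical: follow(), fetch_with_fallback(), and the limit of 20 redirects are invented for illustration, the cache and the server are plain dictionaries mapping a URL to its redirect target, and a URL missing from the server dictionary stands for a normal page.

```python
MAX_REDIRECTS = 20  # assumed limit; real browsers choose their own


def follow(url, server, cache, use_cache=True):
    """Follow redirects; return the final URL, or None if the limit is hit."""
    for _ in range(MAX_REDIRECTS):
        target = cache.get(url) if use_cache else None
        if target is None:
            target = server.get(url)
            if target is None:
                # The server says this is a real page: drop any stale redirect.
                cache.pop(url, None)
                return url
            cache[url] = target  # a fresh answer replaces whatever was cached
        url = target
    return None


def fetch_with_fallback(url, server, cache):
    """Try with the cache first; on an apparent loop, retry while bypassing it."""
    final = follow(url, server, cache)
    if final is None:
        final = follow(url, server, cache, use_cache=False)
    if final is None:
        raise RuntimeError("redirect loop persists even without the cache")
    return final


if __name__ == "__main__":
    stale_cache = {"/thingummy/": "/widget/"}   # left over from the old redirect
    live_server = {"/widget/": "/thingummy/"}   # the new, reversed redirect
    print(fetch_with_fallback("/thingummy/", live_server, stale_cache))
    print(stale_cache)
```

A genuine loop on the server still produces an error, so the user is no worse off; the retry only rescues the case where the loop exists purely because of what the cache remembers.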

Escaping the Trap

The first hurdle for developers is that it’s not obvious what the hell is going on, even when they’re just testing out a few different URL schemes in their development copy. I think whatever unacceptable language this developer used on MozillaZine probably sums up the feeling most of us had when first encountering this invisible magic.

But even once you’ve figured it out, it’s not that obvious what to do about it – do you do a deep clean of the browser’s cache every time something’s not quite right? I see a sledgehammer approaching a nut. And if you’ve released your mistake to a non-technical user, perhaps on a preview version of the site, you’re going to have to talk them through this process as well.

Caching is a pain sometimes, but it’s been around for a long time, and we have UIs for dealing with it. The most common of these is the “hard refresh” – hold down Ctrl, or Shift, and hit the reload button or keyboard shortcut, and the cache is by-passed completely and content is reloaded. Brilliant. Oh, but it only reloads the page you’re looking at, not the URI you originally requested, so it’s useless for cached redirects.

→ Suggestion 3: Make a “hard refresh” request the original URI navigated to, not the one currently being viewed.

→ Filed as Mozilla Bugzilla RFE 696650

Doing the Right Thing

Apparently, what we’re all doing wrong, as developers, is not sending appropriate cache headers along with the 301 status code. If you’re writing, say, a PHP script, and using header('Location: foo'), you should probably be doing it in some kind of wrapper function, so you can make up your own default expiry, and make sure you send a whole bunch of control headers whenever you redirect.
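
The wrapper idea is not specific to PHP, so here is a minimal sketch of it in Python instead. The function name redirect_headers(), the one-hour default, and the choice of sending both Cache-Control and Expires are my own assumptions, not a recommendation from any specification; which headers browsers actually honour on a 301 is a separate question.

```python
import time
from email.utils import formatdate

DEFAULT_MAX_AGE = 3600  # assumed default lifetime: one hour, in seconds


def redirect_headers(location, max_age=DEFAULT_MAX_AGE, now=None):
    """Build the status line and headers for a 301 with an explicit lifetime.

    `now` is a Unix timestamp and defaults to the current time; it is a
    parameter only so that the output can be made predictable in tests.
    """
    if max_age < 0:
        raise ValueError("max_age must not be negative")
    if now is None:
        now = time.time()
    headers = [
        ("Location", location),
        ("Cache-Control", "max-age=%d" % max_age),
        # Same lifetime again as an HTTP date, for caches that look at Expires.
        ("Expires", formatdate(now + max_age, usegmt=True)),
    ]
    return "301 Moved Permanently", headers
```

In a WSGI application, the returned pair could be passed straight to start_response(status, headers). The point is simply that every redirect goes through one function, so no 301 leaves the server without an expiry that somebody chose deliberately.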

But a lot of redirects are not written in a rich server-side scripting language; they use specific tools built into the web server, like Apache’s incredibly powerful mod_rewrite. I just checked, and there is no function in the latest version of mod_rewrite that lets you control caching, or send arbitrary HTTP headers, when it generates a 301 response. I’m sure there are ways of stringing together a whole bunch of Apache directives to achieve the desired effect, but it would take me a while to come up with the right combination, and it would look a mess.

→ Suggestion 4: Web server redirect functions, such as mod_rewrite, should build in control over caching headers.

→ Posted to Apache HTTPD dev list

Inconsistent Behaviour

Finally, let’s assume we’re using a tool for our redirects that lets us control the cache headers, and we don’t have any old immortal redirects to avoid infinite loops with. All is fine, and the browsers will do the right thing and save us some network traffic, right?

Well, according to this test result table from someone who is in favour of redirect caching, probably not. Like everything out there in web land, you have to deal with every browser handling things just a little bit differently. Combinations which should work might or might not actually be reliable across browsers.

Even client-side caching of standard resources is, as one comment I just came upon puts it, “a bit of a black art”.

→ Suggestion 5: Somebody[3] needs to work out what the most common scenarios for redirect caching actually are, and how to achieve them.

Conclusion

My gut instinct the first, second, and twelfth times I ran up against redirect caching was that it was stupid, and the browser developers should take it out immediately. But I do see that it makes sense, and could save a lot of unnecessary network traffic and server load. So instead, all I ask is that people don’t brush the problems off with “oh, well, you should have known what permanent means”, and look at how to make redirect caching work in the real world.


Update 2011-10-23: Bugs / RFEs filed against Firefox and Apache. I am less certain of how to test and report other browsers’ behaviour.


Footnotes

  1. On a pedantic note, while that is the heading in the RFC, and the suggested human-readable “Reason Phrase”, it is not actually the definition of the status, as is sometimes implied when discussing it.
  2. by which I mean dictionary.com, because I couldn’t be bothered to go downstairs and find the OED
  3. Sorry, not me – I’m one of the people who needs the crib sheet, not one who can write it!
