Browser caching of Rails assets — for real

Note: This info is out of date as of Rails 3.1 and the asset pipeline. The asset pipeline does a better job of taking care of this for you, although you may still need to do some work in apache to get the headers set the right way. I’m not sure, but it wouldn’t be the exact solution mentioned here.

tldr intro: While Rails adds a ?timestamp query string to the end of asset URLs, under passenger+apache deployment (and I suspect most other deployments), these actually have no effect whatsoever on browser caching.  In some cases this might not matter to you much: if you are able to use the Rails helper :cache=>'filename' argument, or if you don’t have very many asset files, or if you aren’t trying to wring every last bit of performance out of your app.  But sometimes you really do want your Rails assets to be cached for real by the browser, without even an if-modified check.  Below is a kind of hacky way to set up your apache conf to do so.  [btw, wouldn't it be nice if Passenger would do this automatically for Rails, avoiding the need for a hack?]

The basic issue: why do we care?

When you deliver a web page, these days it probably links to lots of external assets the browser needs to download too, such as stylesheets, javascript files, and images.  Worse, with stylesheets and scripts loaded via <script> tags, the browser will generally not display the page until all of them are loaded. This can significantly slow down page load time when you have a bunch of these. The number of such asset files is actually more important than their size, as each requires an http request which takes some time, and most browsers are only willing to make a handful of requests (historically somewhere between 2 and 6) to the same server in parallel.

So one approach is to try to have fewer CSS or JS files, perhaps using the Rails asset helper methods’ :cache=>'filename' argument.  There are some cases where this isn’t convenient, and it has some issues of its own.  And even if you use it, you think, gee, wouldn’t it be nice to have the browser use its cached copy on the second and later requests, if it has one?

It’s certainly less of an issue if you’ve consolidated your js and css down to one file each. Still, you may have noticed Rails adds timestamp query strings to assets. Aren’t those there for a reason?

Rails timestamp in query string: Does nothing at all in typical setup

Rails asset url generation methods (stylesheet_link_tag, javascript_include_tag) will automatically include a numeric timestamp on the end of every asset URL, as a query string:

http://someserver/stylesheets/my_stylesheet.css?1234567890

The idea here is that exact URL can in fact be cached forever by the client, because if the content ever changes, the timestamp on the end will change, and the browser will be sent a different URL.
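Here's a rough Ruby sketch of that idea. It is not the actual Rails source; the helper name `timestamped_asset_path` and its `public_dir` argument are made up for illustration. It appends the file's modification time, in seconds since the epoch, as a bare query string.

```ruby
# Hypothetical helper (an illustration, not Rails' implementation):
# append the asset file's mtime as a bare query string.
def timestamped_asset_path(public_dir, asset_path)
  file = File.join(public_dir, asset_path)
  # If the file isn't on disk there is nothing to timestamp.
  return asset_path unless File.exist?(file)
  "#{asset_path}?#{File.mtime(file).to_i}"
end
```

Editing the file changes its mtime, so the next page render emits a different URL for it.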

Browsers all treat the query string as part of the URL they cache. Apache, however, ignores the query string on such a URL when it’s serving static resources (and a typical passenger deployment has apache serving static resources directly); it just serves up the same resource as if the query string weren’t there.

That’s the idea of these timestamp query strings, but how is the browser to know it’s allowed to cache that resource forever (or for a year, which is the longest the HTTP spec recommends specifying)?  It only knows that if the web server sends appropriate caching headers (such as “Expires:”). The browser can’t know that just because a URL ends in a query string with ten digits, that means it’s Rails’ way of saying it’s cacheable forever. That’s what HTTP headers are for.
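For concreteness, here is a small Ruby sketch of what "appropriate caching headers" for a cache-for-a-year response could look like. The method name `far_future_headers` is hypothetical, and this only builds a Hash of header values; it doesn't hook into any server.

```ruby
require "time"

# Hypothetical helper: headers telling a browser it may reuse the
# response for one year without asking the server again.
def far_future_headers(now = Time.now)
  one_year = 365 * 24 * 60 * 60 # seconds
  {
    "Expires"       => (now + one_year).httpdate,
    "Cache-Control" => "max-age=#{one_year}"
  }
end
```

Without headers along these lines, the ten digits on the end of the URL mean nothing to the browser.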

But under a standard apache+passenger deployment (and I’m betting any other common deployment), the web server doesn’t send any special headers for these URLs, because the web server also doesn’t have any way to know these are cache-forever-able URLs. In fact, by the time apache gets to returning the response which could have expires headers, apache has already ignored the fact that the url even has a query string.

So what will the browser actually do the second time it needs one of these asset files it already requested? Well, it does have them in its cache. So it sends a conditional GET to the server, saying “Only send me this file if it’s changed since the last one I had.”  (Apache, in most typical setups, will also send an “ETag” with the original asset, which provides another way for the browser to ask the server if the content has changed.) And it won’t download the file a second time if it hasn’t changed.
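The server side of that conditional GET can be sketched roughly like this in Ruby. This is a simplified illustration, not Apache's actual logic: the function name is made up, and it only looks at If-Modified-Since, ignoring ETags entirely.

```ruby
require "time"

# Simplified sketch: decide between 304 Not Modified and 200 OK for a
# static file, based only on the If-Modified-Since request header.
def conditional_get_status(request_headers, file_mtime)
  since = request_headers["If-Modified-Since"]
  return 200 if since.nil?
  begin
    # HTTP dates have one-second resolution, so compare whole seconds.
    file_mtime.to_i <= Time.httpdate(since).to_i ? 304 : 200
  rescue ArgumentError
    200 # unparseable date: just send the full file
  end
end
```

Either way, the browser had to make a request and wait for an answer, and that round trip is the cost that matters here.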

But recall that in  most applications, it’s not really the size of the asset downloads that causes a slowdown, it’s the number of them, because of the overhead of making all those HTTP requests. So even sending a conditional GET, the browser is still sending an http request for each asset. And if you’ve got a bunch of assets, that’s still going to slow things down — on my application, it was slowing things down by up to a second, just for all conditional gets!

Yes, one solution would be trying to reduce my number of assets.   But what’s that Rails-generated timestamp for, anyway?  In a typical setup, the timestamp is doing absolutely nothing to affect browser caching behavior: the conditional get described above happens with or without the special Rails timestamp.

Googling around for this, I found a few people mentioning this, but not as much discussion as I expected. I’d think it would be more noteworthy that Rails is spending time calculating and attaching these timestamps, which then have no effect at all.  Maybe my googling skills are just poor. There were a few comments on approaches to address this, which I built on to arrive at a solution satisfactory to me.

Naive Approach

To be fair, there is some discussion of this in the latest Rails documentation. It recommends that you tell apache to send Expires  “access plus 1 year” for all your resources.  Which is an approach I found googling too.

The problem with this is that it will send expires-one-year headers for ALL assets, those with the timestamp/versioning query string on the end, and those without. But it’s the versioning/timestamp query string on the end that really makes the URL cacheable “forever”; that’s what keeps the browser from using a cached version of something that has since changed.

At first you might think (as I did at first), that this is fine, because after all, all Rails urls to assets should have this timestamp.  Ah, but it turns out there are cases where they don’t — primarily when you’re referencing things like background-images in a static CSS file.  (I think Rails3 makes typical css files ERB-able? Which would make it easier to use the rails helper method in a CSS file to generate a timestamped version here).  Or perhaps something else weird, perhaps using Javascript dynamic loading, or perhaps you’ve made a mistake somewhere.

Really, we want the expires-one-year header to be sent ONLY for asset URLs that have the versioning query string appended on the end, and not for those that don’t. Because otherwise, if you send it for all of them, and there’s one single place in your app that causes the browser to request a non-version-query-stringed asset url, the browser will end up caching it forever and not pick up changes you make to that asset (unless the user does a hard refresh), which is pretty unfortunate. So what the Rails docs recommend is actually kind of dangerous unless you’re really careful.
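The rule being argued for can be written down as a tiny Ruby predicate. This is just a sketch to pin down the condition; the method name is made up, and it assumes the Rails timestamp is always exactly ten digits with nothing else in the query string.

```ruby
require "uri"

# Sketch: a URL is safe to cache "forever" only if its whole query
# string is a ten-digit Rails-style timestamp.
def far_future_cacheable?(url)
  query = URI.parse(url).query
  !query.nil? && query.match?(/\A[0-9]{10}\z/)
end
```

A URL like http://someserver/images/bg.png referenced from a static CSS file fails this test, which is exactly the case that blanket expires headers get wrong.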

But the Apache mod_expires module doesn’t accept any conditions; you can’t tell it to only set an expires header for certain URLs.  Well, you sort of can, by placing it inside a specific virtual host, or a <Location>, <LocationMatch>, <Directory>, <FilesMatch> etc directive. But none of those directives let you constrain their applicability based on query string; they all just check the path part of the URL.

mod_rewrite will let you check the query string, but mod_rewrite doesn’t give you any good way to set the expires header to “one year in the future”. It’ll let you hard-code an expires to a certain date, but then you’ve always got to be going back and pushing that hard-coded date into the future, as time marches on.

The Real Solution

So I was pointed to the real solution by this blog post by Stephen Sykes. You’ve got to kind of hackily first use mod_rewrite to change the actual path of any asset with a rails versioning/timestamp query. You create a symlink in your file system so that new path actually points to exactly the same files as the old path. But since it’s a new path, you can use apache directives to set far-future expires only for resources in that path.  Bingo, it’s kind of hacky, but now you have far-future expires headers only on rails assets with the versioning/timestamp query string, and not on those that don’t.

I made a few changes to Stephen’s suggestions though, because:

  • I didn’t want to have to specify a file system <Directory> in my apache conf.  I use passenger, and the exact directory something lives in is determined simply by a symlink in web root to the real application. If I ever changed this, I didn’t want to have to remember to go sync it in the apache conf too. I’d rather use an apache directive like <Location> based on the URI, not <Directory> based on the file system location.
  • My Rails apps are sometimes deployed not at web server root, but at a certain path on the web server, so I needed to take account of that, and make sure I was only doing the path rewriting for things actually in my Rails apps.
  • I prefer to be more conservative with the conditions on the RewriteRule, to only apply this logic to things in one of the actual asset directories, /images, /javascripts, /stylesheets. (Plus /plugin_assets for Rails2 engines).  If I’m too conservative, and realize there’s something else I wanted to cache, I can always expand the condition later. But if I’m too generous, by the time I realize it there might be clients out there caching forever something I don’t want them to, with no way to fix it except somehow getting every such user to do a hard-refresh in their browser.

So here’s what I end up with. First you still need to go into your Rails app’s public directory, and do a:

ln -s . add_expires_header

(Symlinks go into git fine, it looks like, if you have your public directory in git for capistrano deployment or anything else.)

Then add the following to your apache conf, once per application. In this case my application is mounted at /demo, so this goes alongside the passenger directive:  RailsBaseURI /demo

# Force assets to be cached forever -- but only if they have the Rails-added
# query string on them, which will change if the content changes, so the
# exact URL can be cached forever. Since we have so many JS/CSS files,
# making sure they're all cached leads to perf gains. Without
# this FF at least checks modified-since for each one, which
# takes non-trivial time even if it doesn't download.
# This trick requires your rails public directory to have:
#    cd /path/to/rails_app/public
#    ln -s . add_expires_header
# Idea from: http://www.stephensykes.com/blog_perm.html?157

 RewriteEngine on

 # only match query strings exactly 10 digits, Rails default timestamp
 RewriteCond %{QUERY_STRING} ^[0-9]{10}$

 # Only match in our application's asset directories; use PT to make sure
 # the new URL will be picked up by the LocationMatch directive. This is an
 # internal rewrite to our symlinked dir that will force long expires.
 RewriteRule ^/demo(/(plugin_assets|images|javascripts|stylesheets).*) /demo/add_expires_header$1 [PT]

 # Now make sure the URLs rewritten to the symlinked dir really do get
 # the expires header.
 <LocationMatch "^/demo/add_expires_header">
   ExpiresActive On
   ExpiresDefault "access plus 1 year"
 </LocationMatch>

Yeah, you need one of these per application, and you need to remember to go and fix the /demo path prefix if you change your mount-point.  But at least you don’t need to change anything just because you change the file system location, if you don’t change the URL prefix.
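If you want to sanity-check which requests the RewriteCond/RewriteRule pair above would touch before deploying it, here is a rough Ruby model of it. This is only an approximation of mod_rewrite's behavior for this one rule, with a made-up function name, and it hard-codes the same /demo mount point.

```ruby
# Rough model of the RewriteCond + RewriteRule above: returns the path
# Apache would serve internally for a given request path and query.
def rewrite_for_expires(path, query)
  # RewriteCond: the query string must be exactly ten digits.
  return path unless query.to_s.match?(/\A[0-9]{10}\z/)
  # RewriteRule: only paths inside the app's asset directories.
  m = %r{\A/demo(/(plugin_assets|images|javascripts|stylesheets).*)}.match(path)
  m ? "/demo/add_expires_header#{m[1]}" : path
end
```

Note that an already-rewritten path under /demo/add_expires_header doesn't match the rule again, so the rewrite can't loop.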

Wouldn’t it be nice if…

I kind of wish Passenger would just do this for us.  With a config directive, perhaps even turned on by default. Would go with the general “just do the right thing” goals of Passenger.  Rails is creating these timestamps, isn’t the whole point that they be cached on the browser?   I’d like it if Passenger just made it so, which it likely could, since it’s already an apache module.  Although it might be a bit of a pain since usually the middleware stays out of the way of apache serving of static resources entirely.

Alternately, there is some talk that appending a “cache busting” string in the query section of the URI is the wrong approach for other reasons (along with other complaints about Rails’ general strategy, although this issue isn’t mentioned), because not all proxies and other caches pay attention to the query string in caching.  If Rails were putting the “cache buster” in the path portion alone, it would be easier to write the apache directives in a much shorter and simpler way, just using mod_expires without the need to bring mod_rewrite into it.  If you use that Asset Fingerprint plugin, you might find you can make the apache conf simpler.  (Although putting the fingerprint in the path instead of the query requires either making an extra copy of the asset on the server, as the Asset Fingerprint plugin does, or intervening in the server processing with mod_rewrite again.)
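To illustrate what "cache buster in the path" means, here is a minimal Ruby sketch. The naming scheme (an MD5 hex digest inserted before the file extension) and the function name are assumptions for illustration, not necessarily what the Asset Fingerprint plugin actually produces.

```ruby
require "digest/md5"

# Hypothetical naming scheme: put an MD5 of the content into the file
# name itself, so the path (not the query string) changes with content.
def fingerprinted_path(asset_path, content)
  ext  = File.extname(asset_path)
  base = asset_path.chomp(ext)
  "#{base}-#{Digest::MD5.hexdigest(content)}#{ext}"
end
```

With paths like this, a plain path-based directive can apply far-future expires without any query string inspection.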

My own complex situation, if you’re curious

What makes this a more complicated issue for me is the particular way an application using Blacklight (BL) generates links to CSS and Javascript assets.

In order to provide for modular code, where extra add-ons or local customizations can provide their own CSS or JS that integrates cleanly with the core BL,  BL builds up its list of JS or CSS assets dynamically, per-action.  That makes it quite hard to use the standard :cache=>'key' rails asset helper argument, without accidentally caching two different collections of assets under the same key, with disastrous consequences.

In addition, with this architecture, you end up with a large number of stylesheets and javascript files, as each component adds its own in.  Which is a problem for application performance without the appropriate caching strategy.  I actually like this architecture, I don’t want to change it, I want to provide an appropriate caching strategy instead.

Idea for even better solution

So the above solution with apache conf is definitely good enough for now, improving page loading times (when assets are cached) by up to 1 second or more in my application. But on the first request (or first request after assets have changed on the server), the client will still need to download a bunch of separate assets using a bunch of HTTP calls, which is unfortunate.

Taking the principles of the Asset Fingerprint plugin, but combining them with the goals of the standard Rails helper methods :cache=>key argument to aggregate resources, here’s my idea for a caching strategy fit for my situation.

Blacklight already keeps a list of stylesheets or javascripts to include in an array, and renders the tags to include them in a single method.  So there’s a hook to put the caching strategy in.

The list of assets to render is actually a list of array-arguments to the rails helper methods, which sometimes include the Rails2 Engines :plugin argument too.

So first take this list, and normalize it to a string in a sorted way (basically, sort all the elements, and make sure the serialized hashes have sorted keys and any nested arrays are sorted too etc).  This should hopefully be a fairly quick thing to do, because we’re going to do it on every page render.
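The normalization step could look something like this in Ruby. It's a sketch under assumptions: the method names are made up, the asset list is assumed to be an array of helper-argument arrays (strings, symbols, and option hashes), and symbols and strings are deliberately treated as the same thing.

```ruby
# Sketch: turn a nested structure into a canonical string, sorting
# array elements and hash entries so ordering differences disappear.
def canonical(value)
  case value
  when Hash
    "{" + value.map { |k, v| "#{canonical(k)}=>#{canonical(v)}" }.sort.join(",") + "}"
  when Array
    "[" + value.map { |v| canonical(v) }.sort.join(",") + "]"
  else
    value.to_s.inspect # :plugin and "plugin" serialize identically
  end
end

# Hypothetical entry point: the key for one page's list of assets.
def asset_list_key(asset_args)
  canonical(asset_args)
end
```

Two pages that list the same assets in a different order, or with option hashes built in a different key order, end up with the same key.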

Now check an application-wide hash, with that normalized serialization as a key, to see if we already have an aggregated asset filename generated for this unique combination of asset files. If so, we just generate the proper html tag to include it.

If not, we combine all these assets into one file (with internal comments as to where each subsidiary file starts, to aid in debugging if necessary).  We MD5 hash the whole file, and construct a filename that includes the MD5 hash (meaning the filename is unique to its exact contents).  (These steps are done only once per unique-combination-of-assets per application instance run, so it’s okay if they are expensive).  We put the file in a public/aggregated_assets directory (not bothering to write it if there’s already a file with the exact same name, and thus necessarily the same content; maybe another instance already created it, no big deal), and save the filename in the app-wide Hash mentioned above.  And we set up apache with a simple location/expires directive to expire-far-future anything in the /aggregated_assets directory.
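Here's a rough Ruby sketch of the lookup-or-aggregate step just described. Nothing here is existing Blacklight or Rails code: the method name, the `cache` Hash passed in as a parameter, and the exact comment format are all assumptions. A real version would also want a Mutex around the cache if the app runs threaded.

```ruby
require "digest/md5"
require "fileutils"

# cache: an app-wide Hash mapping a normalized asset-list key to the
# URL path of the aggregated file already generated for that list.
def aggregated_asset_path(cache, key, source_files, public_dir, ext)
  cache[key] ||= begin
    # Concatenate the files, marking where each one starts.
    combined = source_files.map { |file|
      "/* ---- begin #{File.basename(file)} ---- */\n#{File.read(file)}\n"
    }.join
    name   = "#{Digest::MD5.hexdigest(combined)}#{ext}"
    dir    = File.join(public_dir, "aggregated_assets")
    target = File.join(dir, name)
    FileUtils.mkdir_p(dir)
    # Same name implies same content, so an existing file is left alone.
    File.write(target, combined) unless File.exist?(target)
    "/aggregated_assets/#{name}"
  end
end
```

On a cache hit the block never runs, so the per-render cost is just the Hash lookup plus computing the key.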

If I ever have time to work on it, or if it ever becomes a priority to squeeze out even a bit more performance from my application, that’s what I’ll try.  Get the benefits of consolidating multiple assets into one; get the benefits of a far-future expires date on a filename which will change if its contents change; and as an added bonus do caching the way the Asset Fingerprint plugin argues is better, based on MD5 fingerprint instead of timestamp and in path instead of query string.


2 Responses to Browser caching of Rails assets — for real

  1. Wow! Quite the post…. I use the REVISION from capistrano as a key, appending it in the :cache => "all_#{REVISION}" argument

    The simple approaches work best especially if you are using a CDN like CloudFront

  2. David says:

    I’m a UI designer/front-end dev working with a startup. The devs are using rails which is great. I myself am pretty clueless about rails but have been googling around to see if adding the timestamp to an image filename would have any bad SEO implications?

    I haven’t been able to find too much info, maybe I’m “googling wrong” but thought you may have more info.

    I realize image names aren’t hugely important for SEO but I run a design gallery site and we get a lot of traffic from people searching google images.

    So would a filename like picture-of-grass.jpg make any SEO difference if it were named picture-of-grass.jpg?12345579 ?

    Thanks,
    David
