Delivery patterns for non-public resources hosted on S3

I work at the Science History Institute on our Digital Collections app (written in Rails), which is kind of a “digital asset management” app combined with a public catalog of our collection.

We store many high-resolution TIFF images that can be 100MB+ each, as well as, currently, a handful of PDFs and audio files. We have around 31,000 digital assets, which make up about 1.8TB. In addition to originals, we have “derivatives” for each file (JPG conversions of a TIFF original at various sizes; MP3 conversions of FLAC originals; etc) — around 295,000 derivatives (~10 per original) taking up around 205GB. Not a huge amount of data compared to some, but big enough to be something to deal with, and we expect it could grow by an order of magnitude in the next couple years.

We store them all — originals and derivatives — in S3, which generally works great.

We currently store them all in public S3 buckets, and when we need an image thumb url for an <img src>, we embed a public S3 URL (as opposed to pre-signed URLs) right in our HTML source. Having the user-agent get the resources directly from S3 is awesome, because our app doesn’t have to worry about handling that portion of the “traffic”, something S3 is quite good at (and there are CDN options which work seamlessly with S3 etc; although our traffic is currently fairly low and we aren’t using a CDN).

But this approach stops working if some of your assets cannot be public and need to be access-controlled with some kind of authorization. And we are about to start hosting a class of assets like that.

Another notable part of our app is that in its current design it can have a LOT of img src thumbs on a page. Maybe 600 small thumbs (one for each scanned page of a book), each of which might use an img srcset to deliver multiple resolutions. We use Javascript lazy load code so the browser doesn’t actually try to load all these img src unless they are put in viewport, but it’s still a lot of URLs generated on the page, and potentially a lot of image loads. While this might be excessive and a design in need of improvement, a 10×10 grid of postage-stamp-sized thumbs on a page (each of which could use a srcset) does not seem unreasonable, right? There can be a lot of URLs on a page in an “asset management” type app, it’s how it is.

As I looked around for advice on this or analysis of the various options, I didn’t find much. So, in my usual verbose style, I’m going to outline my research and analysis of the various options here. None of the options are as magically painless as using public URLs on a public S3 bucket, alas.

All public-read ACLs, Public URLs

What we’re doing now. The S3 bucket is set as public, all files have S3 public-read ACL set, and we use S3 “public” URLs as <img src> in our app. Which might look like https://my-bucket.s3.us-west-2.amazonaws.com/path/to/thumb.jpg .
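Part of why this is so cheap is that a public URL is just string assembly: no signing, no network call. A minimal sketch in plain Ruby, assuming virtual-hosted-style addressing; `public_s3_url` is a hypothetical helper of my own naming, not something from the AWS SDK, and the bucket and region are placeholders:

```ruby
require "erb"

# Hypothetical helper: builds a virtual-hosted-style public S3 URL.
# Nothing is signed and nothing is fetched; this is only string escaping.
def public_s3_url(bucket:, region:, key:)
  # Escape each path segment separately so the "/" separators in the key survive.
  escaped_key = key.split("/").map { |segment| ERB::Util.url_encode(segment) }.join("/")
  "https://#{bucket}.s3.#{region}.amazonaws.com/#{escaped_key}"
end

puts public_s3_url(bucket: "my-bucket", region: "us-west-2", key: "path/to/thumb.jpg")
# prints https://my-bucket.s3.us-west-2.amazonaws.com/path/to/thumb.jpg
```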

For actual downloads, we might still use an S3 presigned URL, not for access control (the object is already public), but to specify a content-disposition response header for S3 to use on the fly.

Pro

  • URLs are persistent and stable and can be bookmarked, or indexed by search engines. (We really want our images in Google Images for visibility.) And since the URLs are permanent and good indefinitely, HTML that includes these URLs in its source can be cached indefinitely (as long as you don’t move your stuff around in your S3 buckets).
  • S3 public URLs are much cheaper to generate than the cryptographically presigned URLs, so it’s less of a problem generating 1200+ of them in a page response. (And can be optimized an order of magnitude further beyond the ruby SDK implementation).
  • S3 can scale to handle a lot of traffic, and Cloudfront or another CDN can easily be used to scale further. Putting a CDN on top of a public bucket is trivial. Our Rails app is entirely uninvolved in delivering the actual images, so we don’t need to use precious Rails workers on delivering images.

Con

  • Some of our materials are still being worked on by staff, and haven’t actually been published yet. But they are still in S3 with a public-read ACL. They have hard-to-guess URLs that shouldn’t be referred to on any publicly viewable web page, but we know that shouldn’t be relied upon for anything truly confidential.
    • That has been an acceptable design so far, as none of these materials are truly confidential, even if not yet published to our site. But this is about to stop being acceptable as we include more truly confidential materials.

All protected ACL, REDIRECT to presigned URL

This is the approach Rails’ ActiveStorage takes in its standard setup/easy path. It assumes all resources will be stored to S3 without a public ACL; a random user can’t access them via S3 without a time-limited presigned URL being supplied by the app.

ActiveStorage’s standard implementation will give you a URL to your Rails app itself when you ask for a URL for an S3-stored asset — a rails URL is what might be in your <img src> urls. That Rails URL will redirect to a unique temporary S3 presigned URL that allows access to the non-public resource.
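To make the shape of this pattern concrete, here is a minimal sketch as a bare Rack-compatible endpoint. This is not ActiveStorage’s code; `PresignedRedirect`, the `/assets/` path prefix, and the `authorizer` and `presigner` callables are all hypothetical stand-ins. In a real app the presigner would wrap the AWS SDK and the authorizer would look at the logged-in user.

```ruby
# A Rack-compatible endpoint (any object responding to #call) that checks
# authorization and then redirects to a short-lived presigned URL.
# `authorizer` and `presigner` are hypothetical callables supplied by the app.
class PresignedRedirect
  def initialize(authorizer:, presigner:)
    @authorizer = authorizer
    @presigner = presigner
  end

  def call(env)
    # Assumed URL layout: /assets/<s3 key>
    asset_key = env["PATH_INFO"].to_s.sub(%r{\A/assets/}, "")
    unless @authorizer.call(env, asset_key)
      return [403, { "content-type" => "text/plain" }, ["Forbidden"]]
    end

    # 302 rather than 301: the target is temporary and differs per request.
    [302, { "location" => @presigner.call(asset_key) }, []]
  end
end

app = PresignedRedirect.new(
  # Stand-in auth check; a real one would inspect the session.
  authorizer: ->(env, _key) { env["HTTP_X_DEMO_USER"] == "staff" },
  # Stand-in presigner; the signature here is obviously fake.
  presigner: ->(key) { "https://example-bucket.s3.amazonaws.com/#{key}?X-Amz-Signature=demo" }
)

status, headers, _body = app.call("PATH_INFO" => "/assets/path/to/thumb.jpg", "HTTP_X_DEMO_USER" => "staff")
puts status
puts headers["location"]
```

The URL that goes in the HTML is the stable `/assets/...` one; only the redirect target is temporary.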

Pro

  • This pattern allows your app to decide, based on the current request/logged-in user and asset, whether to grant access, on a case-by-case basis. (Although it’s not clear to me where the hooks are in ActiveStorage for this; I don’t actually use ActiveStorage, and it’s easy enough to implement this pattern generally, with authorization logic).
  • S3 is still delivering assets directly to your users, so scaling issues are still between S3 and the requestor, and your app doesn’t have to get involved.
  • The URLs that show up in your delivered HTML pages, say as <img src> or <a href> URLs, point to your app, and are still persistent and indefinitely valid, so the HTML is still indefinitely cacheable by any HTTP cache. They will redirect to a unique-per-user and temporary presigned URL, but that’s not what’s in the HTML source.
    • You can even move your images around (to different buckets/locations or entirely different services) without invalidating the cache of the HTML. The URLs in your cached HTML don’t change; where they redirect to does. (This may be ActiveStorage’s motivation for this design?)

Cons

  • Might this interfere with Google Images indexing? While it’s hard (for me) to predict what might affect Google Images indexing, my own current site’s experience seems to say it’s actually fine. Google is willing to index an image “at” a URL that actually HTTP 302 redirects to a presigned S3 URL. Even though on every access the redirect will be to a different URL, Google doesn’t seem to think this is fishy. Seems to be fine.
  • Makes figuring out how to put a CDN in the mix more of a chore. You can’t just put it in front of your S3 bucket, since you only want to CDN/cache the public content, so you may need more sophisticated CDN features, setup, or choices.
  • The asset responses themselves, at presigned URLs, are not cacheable by an HTTP cache, either browser caching or intermediate. (Or at least not for more than a week, the maximum expiry of presigned urls).
  • This is the big one. Let’s say you have 40 <img src> thumbnails on a page, and use this method. Every browser page load will result in an additional 40 requests to your app. This potentially requires you to scale your app much larger to handle the same amount of actual page requests, because your actual page requests are now (eg) 40x.
    • This has been reported as an actual problem by Rails ActiveStorage users. An app can suddenly handle far less traffic because it’s spending so much time doing all these redirects.
    • Therefore, ActiveStorage users/developers then tried to figure out how to get ActiveStorage to instead use the “All public-read ACLs, Public URLs delivered directly” model we listed above. It is now possible to do that with ActiveStorage (some answers in that StackOverflow), which is great, because it’s a great model when all your assets can be publicly available… but that was already easy enough to do without AS. We’re here because that’s not my situation and I need something else!
    • On another platform that isn’t Rails, the performance concerns might be less, but Rails can be, well, slow here. In my app, a response that does nothing but redirect to https://example.com can still take 100ms to return! I think an out-of-the-box Rails app would be a bit faster, I’m not sure what is making mine so slow. But even at 50ms, an extra (eg) 40x50ms == 2000ms of worker time for every page delivery is a price to pay.
    • In my app, where many pages may actually have not 40 but 600+ thumbs on them… this can be really bad. Even if JS lazy-loading is used, it just seems like asking for trouble.
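To put rough numbers on that, here is the same back-of-the-envelope arithmetic as a tiny script. The per-redirect timings are the ones mentioned above (50ms and 100ms); the 600-thumb row is a worst case that assumes every lazy-loaded thumb actually gets requested, which it often won’t.

```ruby
# Extra app-worker time spent purely on redirects for one page view.
def redirect_overhead_ms(thumbs_per_page:, ms_per_redirect:)
  thumbs_per_page * ms_per_redirect
end

[[40, 50], [40, 100], [600, 50]].each do |thumbs, ms|
  total = redirect_overhead_ms(thumbs_per_page: thumbs, ms_per_redirect: ms)
  puts "#{thumbs} thumbs x #{ms}ms = #{total}ms of worker time per page view"
end
# prints:
# 40 thumbs x 50ms = 2000ms of worker time per page view
# 40 thumbs x 100ms = 4000ms of worker time per page view
# 600 thumbs x 50ms = 30000ms of worker time per page view
```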

All protected ACL, PROXY to presigned URL

Okay, just like above, but the app action, instead of redirecting to S3… actually reads the bytes from S3 on the back-end, and delivers them to the user-agent directly, as a sort of proxy.
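A rough sketch of what “proxying” means at the Rack level, assuming a response body object that relays chunks as they arrive. `StreamingProxyBody` and `fetch_chunks` are hypothetical names; in a real app the callable might wrap `Net::HTTP#request_get` with `response.read_body { |chunk| ... }` against a presigned S3 URL, but here a fake source stands in so the example runs without a network.

```ruby
# Rack response body that relays chunks from an upstream source as they arrive.
# `fetch_chunks` is a hypothetical callable that yields byte strings to a block.
class StreamingProxyBody
  def initialize(fetch_chunks)
    @fetch_chunks = fetch_chunks
  end

  # Rack calls #each and writes every yielded string to the client.
  # Note that the app worker stays occupied for the whole transfer.
  def each(&block)
    @fetch_chunks.call(&block)
  end
end

# Fake upstream standing in for S3.
fake_s3 = ->(&on_chunk) { ["chunk-1,", "chunk-2,", "chunk-3"].each { |c| on_chunk.call(c) } }
body = StreamingProxyBody.new(fake_s3)

received = +""
body.each { |chunk| received << chunk }
puts received
# prints chunk-1,chunk-2,chunk-3
```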

The pros/cons are pretty similar to redirect solution, but mostly with a lot of extra cons….

Extra Pro

  • I guess it’s an extra pro that the fact it’s on S3 is completely invisible to the user-agent, so it can’t possibly mess up Google Images indexing or anything like that.

Extra Cons

  • If you were worried about the scaling implications of tying up extra app workers with the redirect solution, this is so much worse, as app workers are now tied up for as long as it takes to proxy all those bytes from S3 (hopefully the nginx or passenger you have in front of your web app means you aren’t worried about slow clients, but that byte shuffling from S3 will still add up).
  • For very large assets, such as I have, this is likely incompatible with a heroku deploy, because of heroku’s 30s request timeout.

One reason I mention this option is that I believe it is basically what a hyrax app (some shared code used in our business domain) does. Hyrax isn’t necessarily using S3, but I believe it does have the Rails app involved in proxying and delivering bytes for all files (including derivatives), including for <img src>. So that approach is working for them well enough, and maybe shouldn’t be totally dismissed. But it doesn’t seem right to me: I really liked the much better scaling curve of our app when we moved it away from sufia (a hyrax predecessor), and got it to stop proxying bytes like this. Plus I think this is probably a barrier to deploying hyrax apps to heroku, and we are interested in investigating heroku with our app.

All protected ACL, have nginx proxy to presigned URL?

OK, like the above “proxy” solution, but with a twist. A Rails app is not the right technology for proxying lots of bytes.

But nginx is. That’s honestly its core use case; it’s literally built for a proxy use case, right? It should be able to handle lots of ’em concurrently with reasonable CPU/memory resources. If we can get nginx doing the proxying, we don’t need to worry about tying up Rails workers doing it.

I got really excited about this for a second… but it’s kind of a confusing mess. What URLs are we actually delivering in <img src> in HTML source? If they are Rails app URLs that then trigger an nginx proxy, using something like nginx x-accel but to a remote (presigned S3) URL instead of a local file, we have all the same downsides as the REDIRECT option above, without any real additional benefit (unless you REALLY want to hide that it’s from S3).

If instead we want to embed URLs in the HTML source that will end up being handled directly by nginx without touching the Rails app… it’s just really confusing to figure out how to set nginx up to proxy non-public content from S3. nginx has to be creating signed requests… but we also want to access-control it somehow, it should only be creating these when the app has given it permission on a per-request basis… there are a variety of nginx third party modules that look like they could maybe be useful to put this together, some more maintained/documented than others… and it just gets really confusing.

PLUS if you want to deploy to heroku (which we are considering), this nginx still couldn’t be running on heroku, because of that 30s limit; it would have to be running on your own non-heroku host somewhere.

I think if I were a larger commercial company with a product involving lots and lots of user-submitted images that I needed to access control and wanted to store on S3…. I might do some more investigation down this path. But for my use case… I think this is just too complicated for us to maintain, if it can be made to work at all.

All Protected ACL, put presigned URLs in HTML source

Protect all your S3 assets with non-public ACLs, so they can only be accessed after your app decides the requester has privileges to see them, via a presigned URL. But instead of using a redirect or proxy, just generate presigned URLs and use them directly in <img src> for thumbs or <a href> for downloads etc.

Pro

  • We can control access at the app level
  • No extra requests for redirects or proxies, we aren’t requiring our app to have a lot more resources to handle an additional request per image thumb loaded.
  • Simple.

Con

  • HTML source now includes limited-time-expiring URLs in <img src> etc, so can’t be cached indefinitely, even for public pages. (Although can be cached for up to a week, the maximum expiry of S3 presigned URLs, which might be good enough).
  • Presigned S3 URLs are really expensive to generate. It’s actually infeasible to include hundreds of them on a page, since it can take almost 1ms per URL generated. This can be optimized somewhat with custom code, but it is still really expensive. This is the main blocker here I think, for what otherwise might be “simplest thing that will work”.
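To show where that cost comes from, here is a minimal sketch of what generating a presigned GET URL involves, written against my understanding of AWS Signature Version 4 query-string signing using only Ruby’s stdlib. `SimplePresigner` is a hypothetical class, the credentials and bucket are placeholders, and I have not verified its output against S3 here, so treat it as an illustration of the hashing work per URL rather than a replacement for the SDK’s `Aws::S3::Presigner`. The one optimization it shows is reusing the derived signing key, which only changes per day.

```ruby
require "openssl"
require "erb"

# Hypothetical hand-rolled SigV4 query presigner for S3 GET requests.
class SimplePresigner
  ALGORITHM = "AWS4-HMAC-SHA256"

  def initialize(access_key_id:, secret_access_key:, region:, bucket:)
    @access_key_id = access_key_id
    @secret_access_key = secret_access_key
    @region = region
    @host = "#{bucket}.s3.#{region}.amazonaws.com"
    @signing_keys = {} # date string => derived key, reused across URLs
  end

  def presigned_get_url(key, expires_in: 3600, now: Time.now)
    now = now.getutc
    date = now.strftime("%Y%m%d")
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    scope = "#{date}/#{@region}/s3/aws4_request"
    path = "/" + key.split("/").map { |s| ERB::Util.url_encode(s) }.join("/")

    # Query parameters must appear sorted by name in the canonical request.
    query = {
      "X-Amz-Algorithm" => ALGORITHM,
      "X-Amz-Credential" => "#{@access_key_id}/#{scope}",
      "X-Amz-Date" => amz_date,
      "X-Amz-Expires" => expires_in.to_s,
      "X-Amz-SignedHeaders" => "host"
    }.sort.map { |name, value| "#{name}=#{ERB::Util.url_encode(value)}" }.join("&")

    canonical_request = ["GET", path, query, "host:#{@host}\n", "host", "UNSIGNED-PAYLOAD"].join("\n")
    string_to_sign = [ALGORITHM, amz_date, scope,
                      OpenSSL::Digest::SHA256.hexdigest(canonical_request)].join("\n")
    signature = OpenSSL::HMAC.hexdigest("SHA256", signing_key(date), string_to_sign)

    "https://#{@host}#{path}?#{query}&X-Amz-Signature=#{signature}"
  end

  private

  # Four chained HMACs; cached because the result depends only on the date.
  def signing_key(date)
    @signing_keys[date] ||= [date, @region, "s3", "aws4_request"]
      .reduce("AWS4#{@secret_access_key}") { |k, msg| OpenSSL::HMAC.digest("SHA256", k, msg) }
  end
end

presigner = SimplePresigner.new(
  access_key_id: "AKIDEXAMPLE",          # placeholder, not a real credential
  secret_access_key: "not-a-real-secret", # placeholder
  region: "us-west-2",
  bucket: "my-bucket"
)

# Roughly the number of URLs a 600-thumb page with a two-entry srcset would need.
urls = (1..1200).map { |i| presigner.presigned_get_url("path/to/thumb_#{i}.jpg") }
puts urls.length
puts urls.first
```

Even with the key cached, every URL still costs a SHA-256 of the canonical request plus an HMAC, which is the part that can’t be skipped.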

Different S3 ACLs for different resources

OK, so the “public bucket” approach I am using now will work fine for most of my assets. It is a minority that actually need to be access controlled.

While “access them all with presigned URLs so the app is the one deciding if a given request gets access” has a certain software engineering consistency appeal, the performance and convenience advantages of public_read S3 ACL are maybe too great to give up when 90%+ of my assets work fine with it.

Really, this whole long post is probably to convince myself that this needs to be done, because it seems like such a complicated mess… but it is, I think, the lesser evil.

What makes this hard is that the management interface needs to let a manager CHANGE the public-readability status of an asset. And each of my assets might have 12 derivatives, so that’s 13 files to change, which can’t be done instantaneously if you wait for S3 to confirm, which probably means a background job. And you open yourself up to making a mistake and having a resource in the wrong state.

It might make sense to have an architecture that minimizes the number of times state has to be changed. All of our assets start out in a non-published draft state, then are later published; but for most of our resources destined for publication, it’s okay if they have public_read ACL in ‘draft’ state. Maybe there’s another flag for whether to really protect/restrict access securely, that can be set on ingest/creation only for the minority of assets that need it? Then it only needs to be changed if a mistake was made, or a decision changed?

Changing “access state” on S3 could be done by one of two methods. You could have everything in the same bucket, and actually change the S3 ACL. Or you could have two separate buckets, one for public files and one for securely protected files. Then, changing the ‘state’ requires a move (copy then delete) of the file from one bucket to another. While the copy approach seems more painful, it has a lot of advantages: you can easily see if an object has the ‘right’ permissions by just seeing what bucket it is in (while using S3’s “block public access” features on the non-public bucket), making it easier to audit manually or automatically; and you can slap a CDN on top of the “public” bucket just as simply as ever, rather than having mixed public/nonpublic content in the same bucket.
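A minimal sketch of the two-bucket “move”, assuming a client object that responds to `copy_object` and `delete_object` with the same keyword arguments as `Aws::S3::Client` in the aws-sdk-s3 gem. `move_object`, the bucket names, and the keys are hypothetical, and a recording fake stands in for the real client so the example runs without AWS. With an original plus a dozen derivatives per asset, this loop probably belongs in a background job.

```ruby
# Moves one object between buckets as copy-then-delete.
# `client` is duck-typed on Aws::S3::Client#copy_object / #delete_object.
def move_object(client, key:, from_bucket:, to_bucket:)
  # copy_source is "source-bucket/key"; keys with special characters would
  # need URL-encoding here, which this sketch skips.
  client.copy_object(bucket: to_bucket, key: key, copy_source: "#{from_bucket}/#{key}")
  # Reached only if the copy call returned without raising.
  client.delete_object(bucket: from_bucket, key: key)
end

# Stand-in for the real S3 client: just records what it was asked to do.
class RecordingFakeS3
  attr_reader :calls

  def initialize
    @calls = []
  end

  def copy_object(**args)
    @calls << [:copy_object, args]
  end

  def delete_object(**args)
    @calls << [:delete_object, args]
  end
end

fake = RecordingFakeS3.new
keys = ["originals/page1.tiff", "derivatives/page1/thumb.jpg"] # hypothetical keys
keys.each do |key|
  move_object(fake, key: key, from_bucket: "example-public", to_bucket: "example-protected")
end
puts fake.calls.length
# prints 4 (one copy and one delete per key)
```

If a job dies between the copy and the delete, the object exists in both buckets, which is one of the “wrong state” cases an audit would need to catch.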

Pro

  • The majority of our files that don’t need to be secured can still benefit from the convenience and performance advantages of public_read ACL.
  • That includes still being able to use a straightforward CDN on top of the public bucket, and HTTP cache-forever these files too.
  • It also means no major additional load is put on our app for serving the majority of assets that are public.

Con

  • Additional complexity for app. It has to manage putting files in two different buckets with different ACLs, and generating URLs to the two classes differently.
  • Opportunity for bugs where an asset is in the ‘wrong’ bucket/ACL. Probably need a regular automated audit of some kind — making sure you didn’t leave behind a file in ‘public’ bucket that isn’t actually pointed to by the app is a pain to audit.
  • It is expensive to switch the access state of an asset. A book with 600 pages each with 12 derivatives, is over 7K files that need to have their ACLs changed and/or copied to another bucket if the visibility status changes.
  • If we try to minimize need to change ACL state, by leaving files destined to be public with public_read even before publication and having separate state for “really secure on S3” — this is a more confusing mental model for staff asset managers, with more opportunity for human error. Should think carefully of how this is exposed in staff UI.
  • For protected things on S3, you still need to use one of the above methods of giving users access, if any users are to be given access after an auth check.
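The audit mentioned above could be as simple as a set comparison, sketched here under the assumption that you can list the keys actually in the public bucket (for instance via the SDK’s `list_objects_v2`) and ask the app which keys it believes should be public. `audit_public_bucket` and the sample keys are hypothetical.

```ruby
require "set"

# Compares what is actually in the public bucket against what the app
# believes should be there. Both arguments are plain lists of S3 keys.
def audit_public_bucket(keys_in_public_bucket:, keys_app_considers_public:)
  in_bucket = keys_in_public_bucket.to_set
  expected = keys_app_considers_public.to_set
  {
    # Publicly readable but unknown to the app, or meant to be protected.
    unexpected_public: (in_bucket - expected).sort,
    # The app will generate public URLs for these, but they will 404.
    missing_from_public: (expected - in_bucket).sort
  }
end

report = audit_public_bucket(
  keys_in_public_bucket: ["a/thumb.jpg", "b/thumb.jpg", "leftover/thumb.jpg"],
  keys_app_considers_public: ["a/thumb.jpg", "b/thumb.jpg", "c/thumb.jpg"]
)
puts "unexpected: #{report[:unexpected_public].join(', ')}"
puts "missing: #{report[:missing_from_public].join(', ')}"
```

The `unexpected_public` list is the one that matters for confidentiality: a file left behind in the public bucket after a move.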

I don’t love this solution, but this post is a bunch of words to basically convince myself that it is the lesser evil nonetheless.
