Unexpected performance characteristics when exploring migrating a Rails app to Heroku

I work at a small non-profit research institute. I work on a Rails app that is a “digital collections” or “digital asset management” app. Basically it manages and provides access (public as well as internal) to lots of files and descriptions of those files, mostly images.

It’s currently deployed on some self-managed Amazon EC2 instances (one for web, one for bg workers, one in which postgres is installed, etc). It gets pretty low traffic in general web/ecommerce/Rails terms. The app is definitely not very optimized — we know it’s kind of a RAM hog, we know it has many actions whose response time is undesirable. But it works “good enough” on its current infrastructure for current use, such that optimizing it hasn’t been the highest priority.

We are considering moving it from self-managed EC2 to heroku, largely because we don’t really have the capacity to manage the infrastructure we currently have, especially after some recent layoffs.

Our Rails app is currently served by passenger on an EC2 t2.medium (4G of RAM).

I expected the performance characteristics moving to heroku “standard” dynos would be about the same as they are on our current infrastructure. But I was surprised to see some degradation:

  • Responses seem much slower to come back when deployed, mainly for our slowest actions. Quick actions are just as quick on heroku, but slower ones (or perhaps actions that involve more memory allocations?) are much slower on heroku.
  • The application instances seem to take more RAM running on heroku dynos than they do on our EC2 (this one in particular mystifies me).

I am curious if anyone with more heroku experience has any insight into what’s going on here. I know how to do profiling and performance optimization (I’m more comfortable with profiling CPU time with ruby-prof than I am with trying to profile memory allocations with say derailed_benchmarks). But it’s difficult work, and I wasn’t expecting to have to do more of it as part of a migration to heroku, when performance characteristics were acceptable on our current infrastructure.

Response Times (CPU)

Again, yep, I know these are fairly slow response times. But they are “good enough” on our current infrastructure (EC2 t2.medium); I wasn’t expecting them to get worse on heroku (standard-1x dyno, backed by heroku pg standard-0).

Fast pages are about the same, but slow pages (that create a lot of objects in memory?) are a lot slower.

This is not load testing; I am not testing under high traffic or concurrent requests. This is just accessing demo versions of the app manually, one page at a time, to see response times when the app is only handling one response at a time. So it’s not about how many web workers are running or fit into RAM or anything; one is sufficient.

Each row gives the action (with Rails-logged allocations), then the response time on the existing EC2 t2.medium and on the heroku standard-1x dyno:

  • Slow reporting page that does a few very expensive SQL queries, but they do not return a lot of objects (Rails logging reports Allocations: 8704). EC2: ~3800ms. Heroku: ~3200ms (faster pg?)
  • Fast page with a few AR/SQL queries returning just a few objects each, a few partials, etc. (Allocations: 8205). EC2: 81-120ms. Heroku: ~120ms
  • A fairly small “item” page (Allocations: 40210). EC2: ~200ms. Heroku: ~300ms
  • A medium size item page, loads a lot more AR models, has a larger byte size page response (Allocations: 361292). EC2: ~430ms. Heroku: 600-700ms
  • One of our largest pages, fetches a lot of AR instances, does a lot of allocations, returns a very large page response (Allocations: 1983733). EC2: 3000-4000ms. Heroku: 5000-7000ms

Fast-ish responses (and from this limited sample, actually responses with few allocations even if slow waiting on IO?) are about the same. But our slowest/highest allocating actions are ~50% slower on heroku? Again, I know these allocations and response times are not great even on our existing infrastructure; but why do they get so much worse on heroku? (No, there were no heroku memory errors or swapping happening).

RAM use of an app instance

We currently deploy with passenger (free), running 10 workers on our 4GB t2.medium.

To compare apples to apples, I deployed using passenger on a heroku standard-1x, with just one worker instance (because that’s actually all I can fit on a standard-1x!), to compare the size of a single worker from one infrastructure to the other.

On our legacy infrastructure, on a server that’s been up for 8 days of production traffic, passenger-status looks something like this:

  Requests in queue: 0
  * PID: 18187   Sessions: 0       Processed: 1074398   Uptime: 8d 23h 32m 12s
    CPU: 7%      Memory  : 340M    Last used: 1s
  * PID: 18206   Sessions: 0       Processed: 78200   Uptime: 8d 23h 32m 12s
    CPU: 0%      Memory  : 281M    Last used: 22s
  * PID: 18225   Sessions: 0       Processed: 2951    Uptime: 8d 23h 32m 12s
    CPU: 0%      Memory  : 197M    Last used: 8m 8
  * PID: 18244   Sessions: 0       Processed: 258     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 161M    Last used: 1h 2
  * PID: 18261   Sessions: 0       Processed: 127     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 158M    Last used: 1h 2
  * PID: 18278   Sessions: 0       Processed: 105     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 169M    Last used: 3h 2
  * PID: 18295   Sessions: 0       Processed: 96      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 163M    Last used: 3h 2
  * PID: 18312   Sessions: 0       Processed: 91      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 169M    Last used: 13h
  * PID: 18329   Sessions: 0       Processed: 92      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 163M    Last used: 13h
  * PID: 18346   Sessions: 0       Processed: 80      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 162M    Last used: 13h

We can see, yeah, this app is low traffic; most of those workers don’t see a lot of use. The first worker, which has handled by far the most traffic, has a private RSS of 340M (the other workers, having handled fewer requests, are much slimmer). Kind of overweight, not sure where all that RAM is going, but it is what it is. I could maybe hope to barely fit 3 workers on a heroku standard-2x (1024MB) dyno, if these sizes were the same on heroku.

This is after a week of production use — if I restart passenger on a staging server, and manually access some of my largest, hungriest, most-allocating pages a few times, I can only see Private RSS use of like 270MB.

However, on the heroku standard-1x, with one passenger worker, using the heroku log-runtime-metrics feature to look at memory… private RSS is I believe what should correspond to passenger’s report, and what heroku uses for memory capacity limiting…

Immediately after restarting my app, it’s at sample#memory_total=184.57MB sample#memory_rss=126.04MB. After manually accessing a few of my “hungriest” actions, I see: sample#memory_total=511.36MB sample#memory_rss=453.24MB . Just a few manual requests not a week of production traffic, and 33% more RAM than on my legacy EC2 infrastructure after a week of production traffic. Actually approaching the limits of what can fit in a standard-1x (512MB) dyno as just one worker.

Now, are heroku’s memory measurements being done differently than passenger-status does them? Possibly. It would be nice to compare apples to apples, and passenger hypothetically has a service that would let you access passenger-status results from heroku… but unfortunately I have been unable to get it to work. (Ideas welcome).

Other variations tried on heroku

Trying the heroku gaffneyc/jemalloc buildpack with heroku config:set JEMALLOC_ENABLED=true (still with passenger, one worker instance) doesn’t seem to have made any significant difference; maybe 5% RAM savings, or maybe it’s just a fluke.

Switching to puma (puma5 with the experimental possibly memory-saving features turned on; just one worker with one thread), doesn’t make any difference in response time performance (none expected), but… maybe does reduce RAM usage somehow? After a few sample requests of some of my hungriest pages, I see sample#memory_total=428.11MB sample#memory_rss=371.88MB, still more than my baseline, but not drastically so. (with or without jemalloc buildpack seems to make no difference). Odd.
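For reference, the puma configuration I mean is roughly the following. This is a sketch from my recollection of Puma 5.0’s DSL; the exact method name for the experimental option is an assumption, so check the Puma 5 release notes/docs before copying it.

# config/puma.rb (sketch)
workers 1      # a single worker process, to compare per-worker RSS
threads 1, 1   # a single thread, to keep the comparison close to passenger's

preload_app!

# Puma 5.0's experimental "possibly memory-saving" option; the DSL method
# name here is my recollection, verify against the Puma 5 docs.
nakayoshi_fork true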

So what should I conclude?

I know this app could use a fitness regime; but it performs acceptably on current infrastructure.

We are exploring heroku because of staffing capacity issues, hoping not to have to do so much ops. But if we trade ops for having to spend a lot of time on challenging (not really suitable for a junior dev) performance optimization… that’s not what we were hoping for!

But perhaps I don’t know what I’m doing, and this haphazard anecdotal comparison is not actually data, and I shouldn’t conclude much from it? Let me know, ideally with advice on how to do it better?

Or… are there reasons to expect different performance characteristics from heroku? Might it be running on underlying AWS infrastructure that has fewer resources than my t2.medium?

Or, starting to guess at hypotheses: maybe the fact that the heroku standard tier does not run on “dedicated” compute resources means I should expect a lot more variance compared to my own t2.medium, and as a result, when deploying on heroku you need to optimize more (so the worst case of variance isn’t so bad) than when running on your own EC2? Maybe that’s just part of what you get with heroku: unless you pay for performance dynos, it is even more important to have a well-performing app? (Yeah, I know I could use more caching, but that of course brings its own complexities; I wasn’t expecting to have to add it in as part of a heroku migration.)

Or… I find it odd that it seems like slower (or more allocating?) actions are the ones that are worse. Is there any reason that memory allocations would be even more expensive on a heroku standard dyno than on my own EC2 t2.medium?

And why would the app workers seem to use so much more RAM on heroku than on my own EC2 anyway?

Any feedback or ideas welcome!

faster_s3_url: Optimized S3 url generation in ruby

Subsequent to my previous investigation about S3 URL generation performance, I ended up writing a gem with optimized implementations of S3 URL generation.

github: faster_s3_url

It has no dependencies (not even aws-sdk). It can speed up both public and presigned URL generation by around an order of magnitude. In benchmarks on my 2015 MacBook compared to aws-sdk-s3: public URLs from 180 in 10ms to 2200 in 10ms; presigned URLs from 10 in 10ms to 300 in 10ms (!!).

While if you are only generating a couple S3 URLs at a time you probably wouldn’t notice aws-sdk-ruby’s poor performance, if you are generating even just hundreds at a time, and especially for presigned URLs, it can really make a difference.
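Basic use looks roughly like this. The constructor keyword arguments and method names below are from my memory of the gem’s README, so treat them as illustrative and check the README for the exact API; bucket, region, and credential values are placeholders.

require "faster_s3_url"

builder = FasterS3Url::Builder.new(
  bucket_name: "my-bucket",
  region: "us-east-1",
  access_key_id: ENV["AWS_ACCESS_KEY_ID"],
  secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
)

builder.public_url("path/to/image.jpg")
builder.presigned_url("path/to/image.jpg",
  response_content_disposition: 'attachment; filename="image.jpg"')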

faster_s3_url supports the full API for aws-sdk-s3 presigned URLs, including custom params like response_content_disposition. Its tests actually test that results match what aws-sdk-s3 would generate.

For shrine users, faster_s3_url includes a Shrine storage sub-class that can be a drop-in replacement for Shrine::Storage::S3, so all your S3 URL generation via shrine uses the optimized implementation.

Key in giving me the confidence to think I could pull off an independent S3 presigned URL implementation was seeing WeTransfer’s wt_s3_signer gem be successful. wt_s3_signer makes some assumptions/restrictions to get even higher performance than faster_s3_url (two or three times as fast) — but the restrictions/assumptions and API needed to get that performance weren’t suitable for my use cases, so I implemented my own.

Delete all S3 key versions with ruby AWS SDK v3

If your S3 bucket is versioned, then deleting an object from s3 will leave a previous version there, as a sort of undo history. You may have a “noncurrent expiration lifecycle policy” set which will delete the old versions after so many days, but within that window, they are there.

What if you were deleting something that accidentally included some kind of sensitive or confidential information, and you really want it gone?

To make matters worse, if your bucket is public, the version is public too, and can be requested by an unauthenticated user that has the URL including a versionID, with a URL that looks something like: https://mybucket.s3.amazonaws.com/path/to/someting.pdf?versionId=ZyxTgv_pQAtUS8QGBIlTY4eKmANAYwHT To be fair, it would be pretty hard to “guess” this versionID! But if it’s really sensitive info, that might not be good enough.

It was a bit tricky for me to figure out how to do this with the latest version of ruby SDK (as I write, “v3“, googling sometimes gave old versions).

It turns out you need to first retrieve a list of all versions with bucket.object_versions . With no arg, that will return ALL the versions in the bucket, which could be a lot to retrieve, not what you want when focused on just deleting certain things.

If you wanted to delete all versions in a certain “directory”, that’s actually easiest of all:

s3_bucket.object_versions(prefix: "path/to/").batch_delete!

But what if you want to delete all versions from a specific key? As far as I can tell, this is trickier than it should be.

# danger! This may delete more than you wanted
s3_bucket.object_versions(prefix: "path/to/something.doc").batch_delete!

Because of how S3 “paths” (which are really just prefixes) work, that will ALSO delete all versions for path/to/something.doc2 or path/to/something.docdocdoc etc, for anything else with that as a prefix. There probably aren’t keys like that in your bucket, but that seems dangerously sloppy to assume, that’s how we get super weird bugs later.

I guess there’d be no better way than this?

key = "path/to/something.doc"
s3_bucket.object_versions(prefix: key).each do |object_version|
  object_version.delete if object_version.object_key == key
end
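One possible improvement, sketched below on the assumption that Aws::S3::Bucket#delete_objects maps onto the underlying DeleteObjects API: filter the matching versions first, then delete them in batches of up to 1000 per request, rather than one DELETE request per version.

key = "path/to/something.doc"

# Collect only the versions whose key matches exactly...
matching_versions = s3_bucket.object_versions(prefix: key).select do |object_version|
  object_version.object_key == key
end

# ...then delete them in chunks of up to 1000 (the DeleteObjects per-request
# limit), one API call per chunk instead of one request per version.
matching_versions.each_slice(1000) do |chunk|
  s3_bucket.delete_objects(
    delete: {
      objects: chunk.map { |v| { key: v.object_key, version_id: v.version_id } },
      quiet: true
    }
  )
end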

Is there anyone reading this who knows more about this than me, and can say if there’s a better way, or confirm if there isn’t?

Github Actions tutorial for ruby CI on Drifting Ruby

I’ve been using travis for free automated testing (“continuous integration”, CI) on my open source projects for a long time. It works pretty well. But it’s got some little annoyances here and there, including with github integration, that I don’t really expect to get fixed after its acquisition by private equity. They also seem to have cut off actual support channels (other than ‘forums’) for free use; I used to get really good (if not rapid) support when having troubles, now I kinda feel on my own.

So after hearing about the pretty flexible and powerful newish Github Actions feature, I was interested in considering that as an alternative. It looks like it should be free for public/open source projects on github. And will presumably have good integration with the rest of github and few kinks. Yeah, this is an example of how a platform getting an advantage starting out by having good third-party integration can gradually come to absorb all of that functionality itself; but I just need something that works (and, well, is free for open source), I don’t want to spend a lot of time on CI, I just want it to work and get out of the way. (And Github clearly introduced this feature to try to avoid being overtaken by Gitlab, which had integrated flexible CI/CD).

So anyway. I was interested in it, but having a lot of trouble figuring out how to set it up. Github Actions is a very flexible tool, a whole platform really, which you can use to set up almost any kind of automated task you want, in many different languages. Which made it hard for me to figure out “Okay, I just want tests to run on all PR commits, and report back to the PR if it’s mergeable”.

And it was really hard to figure this out from the docs, it’s such a flexible abstract tool. And I have found it hard to find third party write-ups and tutorials and blogs and such — in part because Github Actions was in beta development for so long, that some of the write-ups I did find were out of date.

Fortunately Drifting Ruby has provided a great tutorial, which gets you started with basic ruby CI testing. It looks pretty straightforward to, for instance, figure out how to swap in rspec for rake test. And I always find it easier to google for solutions to additional fancy things I want to do, finding results either in official docs or third-party blogs, when I have the basic skeleton in place.

I hope to find time to experiment with Github Actions in the future. I am writing this blog post in part to log for myself the Drifting Ruby episode so I don’t lose it! The show summary has this super useful template:

.github/workflows/main.yml
name: CI
on:
  push:
    branches: [ master, develop ]
  pull_request:
    branches: [ master, develop ]
jobs:
  test:
    # services:
    #   db:
    #     image: postgres:11
    #     ports: ['5432:5432']
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Ruby
        uses: ruby/setup-ruby@v1.45.0
        with:
          ruby-version: 2.7.1
      - uses: Borales/actions-yarn@v2.3.0
        with:
          cmd: install
      - name: Install Dependencies
        run: |
          # sudo apt install -yqq libpq-dev
          gem install bundler
      - name: Install Gems
        run: |
          bundle install
      - name: Prepare Database
        run: |
          bundle exec rails db:prepare
      - name: Run Tests
        # env:
        #   DATABASE_URL: postgres://postgres:@localhost:5432/databasename
        #   RAILS_MASTER_KEY: ${{secrets.RAILS_MASTER_KEY}}
        run: |
          bundle exec rails test
      - name: Create Coverage Artifact
        uses: actions/upload-artifact@v2
        with:
          name: code-coverage
          path: coverage/
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Ruby
        uses: ruby/setup-ruby@v1.45.0
        with:
          ruby-version: 2.7.1
      - name: Install Brakeman
        run: |
          gem install brakeman
      - name: Run Brakeman
        run: |
          brakeman -f json > tmp/brakeman.json || exit 0
      - name: Brakeman Report
        uses: devmasx/brakeman-linter-action@v1.0.0
        env:
          REPORT_PATH: tmp/brakeman.json
          GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}

More benchmarking optimized S3 presigned_url generation

In a recent post, I explored profiling and optimizing S3 presigned_url generation in ruby to be much faster. In that post, I got down to using an Aws::Sigv4::Signer instance from the AWS SDK, but wondered if there was a bunch more optimization to be done within that black box.

Julik posted a comment on that post, letting us know that they at WeTransfer have already spent some time investigating and creating an optimized S3 URL signer, at https://github.com/WeTransfer/wt_s3_signer . Nice! It is designed for somewhat different use cases than mine, but is still useful — and can be benchmarked against the other techniques.

Some things to note:

  • wt_s3_signer does not presently do URI escaping; it may or may not be noticeably slower if it did; it will not generate working URLs if your S3 keys include any characters that need to be escaped in the URI
  • wt_s3_signer gets ultimate optimizations by having you re-use a signer object, that has a fixed/common datestamp and expiration and other times. That doesn’t necessarily fit into the APIs I want to fit it into — but can we still get performance advantage by re-creating the object each time with those things? (Answer, yes, although not quite as much. )
  • wt_s3_signer does not let you supply additional query parameters, such as response_content_disposition or response_content_type. I actually need this feature; and need it to be different for each presigned_url created even in a batch.
  • wt_s3_signer’s convenience for_s3_bucket constructor does at least one network request to S3… to look up the appropriate AWS region, I guess? That makes it far too expensive to call the for_s3_bucket convenience constructor once per URL; but I don’t want to do that anyway, I’d rather just pass in the known region, as well as the known bucket base URL, etc. Fortunately the code is already factored well to give us a many-argument plain constructor where we can just pass that all in, with no network lookups triggered.
  • wt_s3_signer insists on looking up AWS credentials from the standard locations; there’s no way to actually pass in an access_key_id and secret_access_key explicitly, which is a problem for some use cases where an app needs to use more than one set of credentials.

Benchmarks

So the benchmarks! This time I switched to benchmark-ips, cause, well, it’s just better. I am benchmarking 1200 URL generations again.

I am benchmarking re-using a single WT::S3Signer object for all 1200 URLs, as the gem intends. But also compared to instantiating a WT::S3Signer for each URL generation — using WT::S3Signer.new, not WT::S3Signer.for_s3_bucket — the for_s3_bucket version cannot be used instantiated once per URL generation without crazy bad performance (I did try, although it’s not included in these benchmarks).
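For orientation, the “re-used” pattern looks roughly like this. Method names are from my reading of the wt_s3_signer README and should be treated as illustrative rather than authoritative; the bucket name and object_keys collection are placeholders.

bucket = Aws::S3::Bucket.new("my-bucket")  # placeholder bucket name

# One signer, created once with a fixed expiration...
signer = WT::S3Signer.for_s3_bucket(bucket, expires_in: 900)

# ...then re-used for every URL in the batch.
urls = object_keys.map { |key| signer.presigned_get_url(object_key: key) }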

I include all the presigned_url techniques I demo’d in the last post, but for clarity took any public url techniques out.

Calculating -------------------------------------
   sdk presigned_url      1.291  (± 0.0%) i/s -      7.000  in   5.459268s
use inline instantiated Aws::Sigv4::Signer directly for presigned url (with escaping)
                          4.950  (± 0.0%) i/s -     25.000  in   5.082505s
Re-use Aws::Sigv4::Signer for presigned url (with escaping)
                          5.458  (±18.3%) i/s -     27.000  in   5.037205s
Re-use Aws::Sigv4::Signer for presigned url (without escaping)
                          5.751  (±17.4%) i/s -     29.000  in   5.087846s
wt_s3_signer re-used     45.925  (±15.2%) i/s -    228.000  in   5.068666s
wt_s3_signer instantiated each time
                         15.924  (±18.8%) i/s -     75.000  in   5.016276s

Comparison:
wt_s3_signer re-used:       45.9 i/s
wt_s3_signer instantiated each time:       15.9 i/s - 2.88x  (± 0.00) slower
Re-use Aws::Sigv4::Signer for presigned url (without escaping):        5.8 i/s - 7.99x  (± 0.00) slower
Re-use Aws::Sigv4::Signer for presigned url (with escaping):        5.5 i/s - 8.41x  (± 0.00) slower
use inline instantiated Aws::Sigv4::Signer directly for presigned url (with escaping):        5.0 i/s - 9.28x  (± 0.00) slower
   sdk presigned_url:        1.3 i/s - 35.58x  (± 0.00) slower

Wow! Re-using a single WT::S3Signer, as the gem intends, is a LOT LOT faster than anything else — 35x faster than the built-in AWS SDK presigned_url method!

But even instantiating a new WT::S3Signer for each URL — while significantly slower than re-use — is still significantly faster than any of the methods using an AWS SDK Aws::Sigv4::Signer directly, and still a lot lot faster than the AWS SDK presigned_url method.

So this has promise; even if you don’t re-use the signer object, it’s better than any other option. I may try to PR and/or fork to get some of the features I’d need in there… although the license is problematic for many projects I work on. With the benchmarking showing the value of this approach, I could also just try to reimplement from scratch based on the Amazon instructions/example code that wt_s3_signer itself used, and/or the ruby AWS SDK implementation.

Delivery patterns for non-public resources hosted on S3

I work at the Science History Institute on our Digital Collections app (written in Rails), which is kind of a “digital asset management” app combined with a public catalog of our collection.

We store many high-resolution TIFF images that can be 100MB+ each, as well as, currently, a handful of PDFs and audio files. We have around 31,000 digital assets, which make up about 1.8TB. In addition to originals, we have “derivatives” for each file (JPG conversions of a TIFF original at various sizes; MP3 conversions of FLAC originals; etc) — around 295,000 derivatives (~10 per original) taking up around 205GB. Not a huge amount of data compared to some, but big enough to be something to deal with, and we expect it could grow by an order of magnitude in the next couple years.

We store them all — originals and derivatives — in S3, which generally works great.

We currently store them all in public S3 buckets, and when we need an image thumb url for an <img src>, we embed a public S3 URL (as opposed to pre-signed URLs) right in our HTML source. Having the user-agent get the resources directly from S3 is awesome, because our app doesn’t have to worry about handling that portion of the “traffic”, something S3 is quite good at (and there are CDN options which work seamlessly with S3 etc; although our traffic is currently fairly low and we aren’t using a CDN).

But this approach stops working if some of your assets cannot be public, and need to be access-controlled with some kind of authorization. And we are about to start hosting a class of assets that are exactly that.

Another notable part of our app is that in its current design it can have a LOT of img src thumbs on a page. Maybe 600 small thumbs (one for each scanned page of a book), each of which might use an img srcset to deliver multiple resolutions. We use Javascript lazy-load code so the browser doesn’t actually try to load all these img src unless they come into the viewport, but it’s still a lot of URLs generated on the page, and potentially a lot of image loads. While this might be excessive and a design in need of improvement, a 10×10 grid of postage-stamp-sized thumbs on a page (each of which could use a srcset) does not seem unreasonable, right? There can be a lot of URLs on a page in an “asset management” type app, it’s how it is.

As I looked around for advice on this or analysis of the various options, I didn’t find much. So, in my usual verbose style, I’m going to outline my research and analysis of the various options here. None of the options are as magically painless as using public bucket public URL on S3, alas.

All public-read ACLs, Public URLs

What we’re doing now. The S3 bucket is set as public, all files have S3 public-read ACL set, and we use S3 “public” URLs as <img src> in our app. Which might look like https://my-bucket.s3.us-west-2.amazonaws.com/path/to/thumb.jpg .

For actual downloads, we might still use an S3 presigned url , not for access control (the object is already public), but to specify a content-disposition response header for S3 to use on the fly.
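For example, something like this with the aws-sdk-s3 resource API (the bucket and key are placeholders, and the resource relies on region/credentials from the environment):

obj = Aws::S3::Resource.new.bucket("my-public-bucket").object("path/to/original.tif")

download_url = obj.presigned_url(
  :get,
  expires_in: 24 * 60 * 60, # presigned URLs can last at most a week
  response_content_disposition: 'attachment; filename="original.tif"'
)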

Pro

  • URLs are persistent and stable and can be bookmarked, or indexed by search engines. (We really want our images in Google Images for visibility) And since the URLs are permanent and good indefinitely, they aren’t a problem for HTML including these urls in source to be cached indefinitely. (as long as you don’t move your stuff around in your S3 buckets).
  • S3 public URLs are much cheaper to generate than the cryptographically presigned URLs, so it’s less of a problem generating 1200+ of them in a page response. (And can be optimized an order of magnitude further beyond the ruby SDK implementation).
  • S3 can scale to handle a lot of traffic, and Cloudfront or another CDN can easily be used to scale further. Putting a CDN on top of a public bucket is trivial. Our Rails app is entirely uninvolved in delivering the actual images, so we don’t need to use precious Rails workers on delivering images.

Con

  • Some of our materials are still being worked on by staff, and haven’t actually been published yet. But they are still in S3 with a public-read ACL. They have hard-to-guess URLs that shouldn’t be referred to on any publicly viewable web page — but we know that shouldn’t be relied upon for anything truly confidential.
    • That has been an acceptable design so far, as none of these materials are truly confidential, even if not yet published to our site. But this is about to stop being acceptable as we include more truly confidential materials.

All protected ACL, REDIRECT to presigned URL

This is the approach Rails’ ActiveStorage takes in its standard setup/easy path. It assumes all resources will be stored to S3 without a public ACL; a random user can’t access them via S3 without a time-limited presigned URL being supplied by the app.

ActiveStorage’s standard implementation will give you a URL to your Rails app itself when you ask for a URL for an S3-stored asset — a rails URL is what might be in your <img src> urls. That Rails URL will redirect to a unique temporary S3 presigned URL that allows access to the non-public resource.
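Outside of ActiveStorage, the general shape of that pattern in a Rails controller might look something like this. Asset, can_see?, and s3_object here are hypothetical names standing in for your own model, authorization check, and Aws::S3::Object, not real APIs; only presigned_url is the real aws-sdk-s3 call.

class AssetFilesController < ApplicationController
  def show
    asset = Asset.find(params[:id])  # hypothetical model

    unless current_user&.can_see?(asset)  # hypothetical authorization check
      head :forbidden
      return
    end

    # Redirect to a unique, short-lived presigned URL generated per request.
    redirect_to asset.s3_object.presigned_url(:get, expires_in: 5 * 60)
  end
end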

Pro

  • This pattern allows your app to decide, based on the current request/logged-in user and asset, whether to grant access, on a case-by-case basis. (Although it’s not clear to me where the hooks are in ActiveStorage for this; I don’t actually use ActiveStorage, and it’s easy enough to implement this pattern generally, with authorization logic).
  • S3 is still delivering assets directly to your users, so scaling issues are still between S3 and the requestor, and your app doesn’t have to get involved.
  • The URLs that show up in your delivered HTML pages, say as <img src> or <a href> URLs, point at your app, and are still persistent and indefinitely valid — so the HTML is still indefinitely cacheable by any HTTP cache. They will redirect to a unique-per-user and temporary presigned URL, but that’s not what’s in the HTML source.
    • You can even move your images around (to different buckets/locations or entirely different services) without invalidating the cache of the HTML. The URLs in your cached HTML don’t change; where they redirect to does. (This may be ActiveStorage’s motivation for this design?)

Cons

  • Might this interfere with Google Images indexing? While it’s hard (for me) to predict what might affect Google Images indexing, my own current site’s experience seems to say it’s actually fine. Google is willing to index an image “at” a URL that actually HTTP 302 redirects to a presigned S3 URL. Even though on every access the redirect will be to a different URL, Google doesn’t seem to think this is fishy. Seems to be fine.
  • It makes figuring out how to put a CDN in the mix more of a chore; you can’t just put it in front of your S3 bucket, since you only want to CDN/cache public URLs, so you may need to use more sophisticated CDN features or setup or choices.
  • The asset responses themselves, at presigned URLs, are not cacheable by an HTTP cache, either browser caching or intermediate. (Or at least not for more than a week, the maximum expiry of presigned urls).
  • This is the big one. Let’s say you have 40 <img src> thumbnails on a page, and use this method. Every browser page load will result in an additional 40 requests to your app. This potentially requires you to scale your app much larger to handle the same number of actual page requests, because the total requests hitting your app are now (eg) 40x the page requests.
    • This has been reported as an actual problem by Rails ActiveStorage users. An app can suddenly handle far less traffic because it’s spending so much time doing all these redirects.
    • Therefore, ActiveStorage users/developers then tried to figure out how to get ActiveStorage to instead use the “All public-read ACLs, Public URLs delivered directly” model we listed above. It is now possible to do that with ActiveStorage (some answers in that StackOverflow), which is great, because it’s a great model when all your assets can be publicly available… but that was already easy enough to do without AS; we’re here because that’s not my situation and I need something else!
    • On another platform that isn’t Rails, the performance concerns might be less, but Rails can be, well, slow here. In my app, a response that does nothing but redirect to https://example.com can still take 100ms to return! I think an out-of-the-box Rails app would be a bit faster, I’m not sure what is making mine so slow. But even at 50ms, an extra (eg) 40x50ms == 2000ms of worker time for every page delivery is a price to pay.
    • In my app where many pages may actually have not 40 but 600+ thumbs on them… this can be really bad. Even if JS lazy-loading is used, it just seems like asking for trouble.

All protected ACL, PROXY to presigned URL

Okay, just like above, but the app action, instead of redirecting to S3… actually reads the bytes from S3 on the back-end, and delivers them to the user-agent directly, as a sort of proxy.

The pros/cons are pretty similar to redirect solution, but mostly with a lot of extra cons….

Extra Pro

  • I guess it’s an extra pro that the fact it’s on S3 is completely invisible to the user-agent, so it can’t possibly mess up Google Images indexing or anything like that.

Extra Cons

  • If you were worried about the scaling implications of tying up extra app workers with the redirect solution, this is so much worse, as app workers are now tied up for as long as it takes to proxy all those bytes from S3 (hopefully the nginx or passenger you have in front of your web app means you aren’t worried about slow clients, but that byte shuffling from S3 will still add up).
  • For very large assets, such as I have, this is likely incompatible with a heroku deploy, because of heroku’s 30s request timeout.

One reason I mention this option is that I believe it is basically what a hyrax app (some shared code used in our business domain) does. Hyrax isn’t necessarily using S3, but I believe it does have the Rails app involved in proxying and delivering bytes for all files (including derivatives), including for <img src>. So that approach is working for them well enough that maybe it shouldn’t be totally dismissed. But it doesn’t seem right to me — I really liked the much better scaling curve of our app when we moved it away from sufia (a hyrax predecessor), and got it to stop proxying bytes like this. Plus I think this is probably a barrier to deploying hyrax apps to heroku, and we are interested in investigating heroku with our app.

All protected ACL, have nginx proxy to presigned URL?

OK, like the above “proxy” solution, but with a twist. A Rails app is not the right technology for proxying lots of bytes.

But nginx is; that’s honestly its core use case, it’s literally built for proxying, right? It should be able to handle lots of them concurrently with reasonable CPU/memory resources. If we can get nginx doing the proxying, we don’t need to worry about tying up Rails workers doing it.

I got really excited about this for a second… but it’s kind of a confusing mess. What URLs are we actually delivering in <img src> in HTML source? If they are Rails app URLs that then trigger an nginx proxy, using something like nginx x-accel but to a remote (presigned S3) URL instead of a local file, we have all the same downsides as the REDIRECT option above, without any real additional benefit (unless you REALLY want to hide that it’s from S3).

If instead we want to embed URLs in the HTML source that will end up being handled directly by nginx without touching the Rails app… it’s just really confusing to figure out how to set nginx up to proxy non-public content from S3. nginx has to be creating signed requests… but we also want to access-control it somehow; it should only be creating these when the app has given it permission on a per-request basis… there are a variety of nginx third-party modules that look like they could maybe be useful to put this together, some more maintained/documented than others… and it just gets really confusing.

PLUS if you want to deploy to heroku (which we are considering), this nginx still couldn’t be running on heroku, because of that 30s limit; it would have to be running on your own non-heroku host somewhere.

I think if I were a larger commercial company with a product involving lots and lots of user-submitted images that I needed to access control and wanted to store on S3…. I might do some more investigation down this path. But for my use case… I think this is just too complicated for us to maintain, if it can be made to work at all.

All Protected ACL, put presigned URLs in HTML source

Protect all your S3 assets with non-public ACLs, so they can only be accessed after your app decides the requester has privileges to see them, via a presigned URL. But instead of using a redirect or proxy, just generate presigned URLs and use them directly in <img src> for thumbs or <a href> for downloads etc.

Pro

  • We can control access at the app level
  • No extra requests for redirects or proxies, we aren’t requiring our app to have a lot more resources to handle an additional request per image thumb loaded.
  • Simple.

Con

  • HTML source now includes limited-time-expiring URLs in <img src> etc, so can’t be cached indefinitely, even for public pages. (Although can be cached for up to a week, the maximum expiry of S3 presigned URLs, which might be good enough).
  • Presigned S3 URLs are really expensive to generate. It’s actually infeasible to include hundreds of them on a page; it can take almost 1ms per URL generated. This can be optimized somewhat with custom code, but is still really expensive. This is the main blocker here I think, for what otherwise might be the “simplest thing that will work”.

Different S3 ACLs for different resources

OK, so the “public bucket” approach I am using now will work fine for most of my assets. It is a minority that actually need to be access controlled.

While “access them all with presigned URLs so the app is the one deciding if a given request gets access” has a certain software engineering consistency appeal — the performance and convenience advantages of the public_read S3 ACL are maybe too great to give up when 90%+ of my assets work fine with it.

Really, this whole long post is probably to convince myself that this needs to be done, because it seems like such a complicated mess… but it is, I think the lesser evil.

What makes this hard is that the management interface needs to let a manager CHANGE the public-readability status of an asset. And each of my assets might have 12 derivatives, so that’s 13 files to change, which can’t be done instantaneously if you wait for S3 to confirm, which probably means a background job. And you open yourself up to making a mistake and having a resource in the wrong state.

It might make sense to have an architecture that minimizes the number of times state has to be changed. All of our assets start out in a non-published draft state, then are later published; but for most of our resources destined for publication, it’s okay if they have a public_read ACL in ‘draft’ state. Maybe there’s another flag for whether to really protect/restrict access securely, that can be set on ingest/creation only for the minority of assets that need it? So it only needs to be changed if a mistake were made, or a decision changed?

Changing “access state” on S3 could be done by one of two methods. You could have everything in the same bucket, and actually change the S3 ACL. Or you could have two separate buckets, one for public files and one for securely protected files. Then, changing the ‘state’ requires a move (copy then delete) of the file from one bucket to another. While the copy approach seems more painful, it has a lot of advantages: you can easily see if an object has the ‘right’ permissions by just seeing what bucket it is in (while using S3’s “block public access” features on the non-public bucket), making it easier to audit manually or automatically; and you can slap a CDN on top of the “public” bucket just as simply as ever, rather than having mixed public/nonpublic content in the same bucket.
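A rough sketch of the two-bucket “state change”, assuming the aws-sdk-s3 resource API; the bucket names and the helper method are hypothetical. copy_from does a server-side copy, so the bytes never pass through our app.

s3 = Aws::S3::Resource.new  # region/credentials from environment
PUBLIC_BUCKET    = s3.bucket("my-public-bucket")
PROTECTED_BUCKET = s3.bucket("my-protected-bucket")

# Hypothetical helper: "move" one stored file from one access state to the other.
def change_access_state(key, from:, to:)
  # Server-side copy; very large originals may need multipart_copy: true.
  to.object(key).copy_from(from.object(key))
  from.object(key).delete
end

change_access_state("path/to/derivative.jpg", from: PUBLIC_BUCKET, to: PROTECTED_BUCKET)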

Pro

  • The majority of our files that don’t need to be secured can still benefit from the convenience and performance advantages of public_read ACL.
  • Including that we can still use a straightforward CDN on top of the public bucket, and HTTP cache-forever these files too.
  • Including no major additional load put on our app for serving the majority of assets that are public

Con

  • Additional complexity for app. It has to manage putting files in two different buckets with different ACLs, and generating URLs to the two classes differently.
  • Opportunity for bugs where an asset is in the ‘wrong’ bucket/ACL. Probably need a regular automated audit of some kind — making sure you didn’t leave behind a file in ‘public’ bucket that isn’t actually pointed to by the app is a pain to audit.
  • It is expensive to switch the access state of an asset. A book with 600 pages each with 12 derivatives, is over 7K files that need to have their ACLs changed and/or copied to another bucket if the visibility status changes.
  • If we try to minimize need to change ACL state, by leaving files destined to be public with public_read even before publication and having separate state for “really secure on S3” — this is a more confusing mental model for staff asset managers, with more opportunity for human error. Should think carefully of how this is exposed in staff UI.
  • For protected things on S3, you still need to use one of the above methods of giving users access, if any users are to be given access after an auth check.

I don’t love this solution, but this post is a bunch of words to basically convince myself that it is the lesser evil nonetheless.

Speeding up S3 URL generation in ruby

It looks like the AWS SDK is very slow at generating S3 URLs, both public and presigned, and that you can generate around an order of magnitude faster in both cases. This can matter if you are generating hundreds of S3 URLs at once.

My app

The app I work on is a “digital collections” or “digital asset management” app. It is about displaying lists of files, so it displays a LOT of thumbnails. The thumbnails are all stored in S3, and at present we generate URLs directly to S3 in src‘s on the page.

Some of our pages can have 600 thumbnails. (Say, a digitized medieval manuscript with 600 pages). Also, we use srcset to offer the browser two resolutions for each image, so that’s 1200 URLs.

Is this excessive, should we not put 600 URLs on a page? Maybe, although it’s what our app does at present. But 100 thumbnails on a page does not seem excessive; imagine a 10×10 grid of postage-stamp-sized thumbs, why not? And they each could have multiple URLs in a srcset.

It turns out that S3 URL generation can be slow enough to be a bottleneck with 1200 generations in a page, or in some cases even 100. But it can be optimized.

On Benchmarking

It’s hard to do benchmarking in a reliable way. I just used Benchmark.bmbm here; it is notable that on different runs of my comparisons, I could see results differ by 10-20%. But this should be sufficient for relative comparisons and basic orders of magnitude. Exact numbers will of course differ on different hardware/platform anyway. (benchmark-ips might possibly be a way to get somewhat more reliable results, but I didn’t remember it until I was well into this. There may be other options?).
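For concreteness, the comparisons below all have roughly this Benchmark.bmbm shape; s3_object and naive_public_url here are placeholders for an Aws::S3::Object and the naive implementation defined later in the post, and the real script is linked further down.

require 'benchmark'

Benchmark.bmbm do |x|
  # Each report block generates 1200 URLs; bmbm runs a rehearsal pass first
  # to reduce (but not eliminate) run-to-run variance.
  x.report("original AWS SDK public_url") { 1200.times { s3_object.public_url } }
  x.report("naive implementation")        { 1200.times { naive_public_url(shrine_file) } }
end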

I ran benchmarks on my 2015 Macbook 2.9 GHz Dual-Core Intel Core i5.

I’m used to my MacBook being faster than our deployed app on an EC2 instance, but in this case running benchmarks on EC2 had very similar results. (Of course, EC2 instance CPU performance can be quite variable).

Public S3 URLs

A public S3 URL might look like https://bucket_name.s3.amazonaws.com/path/to/my/object.rb . Or it might have a custom domain name, possibly to a CDN. Pretty simple, right?

Using shrine, you might generate it like model.image_url(public: true), which calls Aws::S3::Object#public_url. Other dependencies or your own code might call the AWS SDK method as well.

I had noticed in earlier profiling that generating S3 URLs seemed to be taking much longer than I expected, looking like a bottleneck for my app. We use shrine, but shrine doesn’t add much overhead here, it’s pretty much just calling out to the AWS SDK public_url or presigned_url methods.

It seems like generating these URLs should be very simple, right? Here’s a “naive” implementation based on a shrine UploadedFile argument. Obviously it would be easy to use a custom or CDN hostname in this implementation instead.

def naive_public_url(shrine_file)
  "https://#{["#{shrine_file.storage.bucket.name}.s3.amazonaws.com", *shrine_file.storage.prefix, shrine_file.id].join('/')}"
end

naive_public_url(model.image)
#=> "https://somebucket.s3.amazonaws.com/path/to/image.jpg"

Benchmark generating 1200 URLs with naive implementation vs a straight call of S3 AWS SDK public_url…

original AWS SDK public_url implementation 0.053043 0.000275 0.053318 ( 0.053782)
naive implementation 0.004730 0.000016 0.004746 ( 0.004760)

53ms vs 5ms, it’s an order of magnitude slower indeed.

53ms is not peanuts when you are trying to keep a web response under 200ms, although it may not be terrible. But let’s see if we can figure out why it’s so slow anyway.

Examining with ruby-prof points to what we could see in the basic implementation in AWS SDK source code, no need to dig down the stack. The most expensive elements are the URI.parse and the URI-safe escaping. Are we missing anything from our naive implementation then?

Well, the URI.parse is just done to make sure we are operating only on the path portion of the URL. But I can’t figure out any way bucket.url would return anything but a hostname-only URL with an empty path anyway, all the examples in docs are such. Maybe it could somehow include a path, but I can’t figure out any way the URL being parsed would have a ? query component or # fragment, and without that it’s safe to just append things without a parse. (Even without that assumption, there will be faster ways than a parse, which is quite slow!) Also just calling bucket.url is a bit expensive, and can deal with some live arn: lookups we won’t be using.

URI Escaping, the pit of confusing alternatives

What about escaping? Escaping can be such a confusing topic with S3, with different libraries at different times handling it differently/wrongly, that it would be sane to just never use any characters in an S3 key that need any escaping; maybe put some validation on your setters to ensure this. And then you don’t need to take the performance hit of escaping.

But okay, maybe we really need/want escaping to ensure any valid S3 key is turned into a valid S3 URL. Can we do escaping more efficiently?

The original implementation splits the path on / and then runs each component through the SDK’s own Seahorse::Util.uri_escape(s). That method’s implementation uses CGI.escape, but then does two gsub‘s to alter the value somewhat, not being happy with CGI.escape. Those extra gsubs are more performance hit. I think we can use ERB::Util.url_encode instead of CGI.escape + gsubs to get the same behavior, which might get us some speed-up.
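Here is the equivalence I mean, roughly. The gsubs below are my paraphrase of what Seahorse::Util.uri_escape does based on reading the SDK source, applied to a single made-up path segment; treat it as a sanity check, not a proof.

require 'cgi'
require 'erb'

segment = "weird key & ~tilde.jpg"

# Roughly what the SDK helper does: CGI.escape, then patch up '+' (space)
# and '%7E' ('~'), which CGI.escape handles differently than we want.
sdk_style = CGI.escape(segment).gsub('+', '%20').gsub('%7E', '~')

# ERB::Util.url_encode produces the same result in a single pass.
erb_style = ERB::Util.url_encode(segment)

sdk_style == erb_style # => true, at least for examples like this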

But we also seem to be escaping more than is necessary. For instance it will escape any ! in a key to %21, and it turns out this isn’t at all necessary, the URL resolves quite fine without escaping this. If we escape only what is needed, can we go even faster?

I think what we actually need is what URI.escape does — and since URI.escape doesn’t escape /, we don’t need to split on / first, saving us even more time. Annoyingly, URI.escape is marked obsolete/deprecated! But its stdlib implementation is relatively simple pure ruby; it would be easy enough to copy it into our codebase.
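If we did copy it in, a minimal stand-in could look something like this, using what I believe is URI.escape’s default “unsafe” character class (check against the stdlib source before relying on it):

# Percent-encode everything outside URI.escape's default safe set.
# Note this intentionally does NOT escape '/', so whole keys can be passed in.
UNSAFE = %r{[^\-_.!~*'()a-zA-Z\d;/?:@&=+$,\[\]]}

def uri_escape_like(string)
  string.gsub(UNSAFE) do |char|
    char.unpack("C*").map { |byte| format("%%%02X", byte) }.join
  end
end

uri_escape_like("path/to/señor file.jpg")
# => "path/to/se%C3%B1or%20file.jpg"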

Even faster? The somewhat maintenance-neglected but still working at present escape_utils gem has a C implementation of some escaping routines. It’s hard when many implementations aren’t clear on exactly what they are escaping, but I think the escape_uri (note i on the end not l) is doing the same thing as URI.escape. Alas, there seems to be no escape_utils implementation that corresponds to CGI.escape or ERB::Util.url_encode.

So now we have a bunch of possibilities, depending on if we are willing to change escaping semantics and/or use our naive implementation of hostname-supplying.

Time to generate the 1200 URLs, relative to the original AWS SDK public_url implementation:

  • Original AWS SDK public_url: 100%
  • Optimized AWS SDK public_url (avoid the URI.parse, use ERB::Util.url_encode; should be functionally identical, same output, I think!): 60%
  • Naive implementation (no escaping of S3 key for URL at all): 7.5%
  • Naive + ERB::Util.url_encode (should be functionally identical escaping to the original implementation, ie over-escaping): 28%
  • Naive + URI.escape (escaping we think is sufficient, and can be done much faster): 15%
  • Naive + EscapeUtils.escape_uri (we think identical to URI.escape but a faster C implementation): 11%

We have a bunch of opportunities for much faster implementations, even with existing over-escaping implementation. Here’s the file I used to benchmark.

Presigned S3 URLs

A presigned URL is used to give access to non-public content, and/or to specify response headers you’d like S3 to include with the response, such as Content-Disposition. Presigned S3 URLs all have an expiration (max one week), and involve a cryptographic signature.

I expect most people are using the AWS SDK for these, rather than reinvent an implementation of the cryptographic signing protocol.

And we’d certainly expect these to be slower than public URLs, because of the crypto signature involved. But can they be optimized? It looks like yes, again by about an order of magnitude.

Benchmarking with AWS SDK presigned_url, 1200 URL generations can take around 760-900ms. Wow, that’s a lot — this is definitely enough to matter, especially in a web app response you’d like to keep under 200ms, and this is likely to be a bottleneck.

We do expect the signing to take longer than a public url, but can we do better?

Look at what the SDK is doing, re-implement a quicker path

The presigned_url method just instantiates and calls out to an Aws::S3::Presigner. First idea: what if we create a single Aws::S3::Presigner, and re-use it 1200 times, instead of instantiating it 1200 times, passing it the same args #presigned_url would? I tried that; it was only a minor performance improvement.
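That first idea looks roughly like this (the client, bucket name, and object_keys are placeholders):

presigner = Aws::S3::Presigner.new(client: s3_client)

# One Presigner instance, re-used for every URL in the batch.
urls = object_keys.map do |key|
  presigner.presigned_url(:get_object, bucket: "my-bucket", key: key, expires_in: 900)
end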

OK, let’s look at the Aws::S3::Presigner implementation. It’s got kind of a convoluted way of getting a URL, building a Seahorse::Client::Request, and then doing something weird with it… maybe modifying it to not actually go to the network, but just act as if it had… returning headers and a signed URL, and then we throw out the headers and just use the signed URL… phew! Ultimately though it does the actual signing work with another object, an Aws::Sigv4::Signer.

What if we just instantiate one of these ourselves, with the same arguments the Presigner would use for our use cases, and then call presign_url on it with the same args the Presigner would? Let’s re-use a Signer object 1200 times instead of instantiating it each time, in case that matters.

We still need to create the public_url in order to sign it. Let’s use our replacement naive implementation with URI.escape escaping.

# AWS_CLIENT here is an already-constructed Aws::S3::Client.
AWS_SIG4_SIGNER = Aws::Sigv4::Signer.new(
  service: 's3',
  region: AWS_CLIENT.config.region,
  credentials_provider: AWS_CLIENT.config.credentials,
  unsigned_headers: Aws::S3::Presigner::BLACKLISTED_HEADERS,
  uri_escape_path: false
)

def naive_with_uri_escape_escaping(shrine_file)
  # Because URI.escape does NOT escape `/`, we don't need to split it,
  # which is what actually saves us the time.
  path = URI.escape(shrine_file.id)
  "https://#{["#{shrine_file.storage.bucket.name}.s3.amazonaws.com", *shrine_file.storage.prefix, path].join('/')}"
end

# not yet handling custom query params eg for content-disposition
def direct_aws_sig4_signer(url)
  AWS_SIG4_SIGNER.presign_url(
    http_method: "GET",
    url: url,
    headers: {},
    body_digest: 'UNSIGNED-PAYLOAD',
    expires_in: 900, # seconds
    time: nil
  ).to_s
end

direct_aws_sig4_signer( naive_with_uri_escape_escaping( shrine_uploaded_file ) )
# => presigned S3 url

Yes, it’s much faster!

Bingo! Now I measure 1200 URLs in 170-220ms, around 25% of the time. Still too slow to want to do 1200 of them on a single page, and around 4x slower than SDK public_url.

Interestingly, while we expect the cryptographic signature to take some extra time… that seems to be at most 10% of the overhead that the logic to sign a URL was adding? We experimented with re-using an Aws::Sigv4::Signer vs instantiating one each time; and applying URI-escaping or not. These did make noticeable differences, but not astounding ones.

This optimized version would have to be enhanced to be able to handle additional query param options such as specified content-disposition, I optimistically hope that can be done without changing the performance characteristics much.

Could it be optimized even more, by profiling within the Aws::Sigv4::Signer implementation? Maybe, but it doesn’t really seem worth it — we are already introducing some fragility into our code by using lower-level APIs and hoping they will remain valid even if AWS changes some things in the future. I don’t really want to re-implement Aws::Sigv4::Signer, just glad to have it available as a tool I can use like this already.

The Numbers

The script I used to compare performance in different ways of creating presigned S3 URLs (with a couple public URLs for comparison) is available in a gist, and here is the output of one run:

user system total real
sdk public_url 0.054114 0.000335 0.054449 ( 0.054802)
naive S3 public url 0.004575 0.000009 0.004584 ( 0.004582)
naive S3 public url with URI.escape 0.009892 0.000090 0.009982 ( 0.011209)
sdk presigned_url 0.756642 0.005855 0.762497 ( 0.789622)
re-use instantiated SDK Presigner 0.817595 0.005955 0.823550 ( 0.859270)
use inline instantiated Aws::Sigv4::Signer directly for presigned url (with escaping) 0.216338 0.001941 0.218279 ( 0.226991)
Re-use Aws::Sigv4::Signer for presigned url (with escaping) 0.185855 0.001124 0.186979 ( 0.188798)
Re-use Aws::Sigv4::Signer for presigned url (without escaping) 0.178457 0.001049 0.179506 ( 0.180920)

So what to do?

Possibly there are optimizations that would make sense in the AWS SDK gem itself? But it would actually take a lot more work to be sure what can be done without breaking some use cases.

I think there is no need to use URI.parse in public_url, the URIs can just be treated as strings and concatenated. But is there an edge case I’m missing?

Using different URI escaping method definitely helps in public_url; but how many other people who aren’t me care about optimizing public_url; and what escaping method is actually required/expected, is changing it a backwards compat problem; and is it okay maintenance-wise to make the S3 object use a different escaping mechanism than the common SDK Seahorse::Util.uri_escape workhorse, which might be used in places with different escaping requirements?

For presigned_urls, cutting out a lot of the wrapper code and using a Aws::Sigv4::Signer directly seems to have significant performance benefits, but what edge cases get broken there, and do they matter, and can a regression be avoided through alternate performant maintainable code?

Figuring this all out would take a lot more research (and figuring out how to use the test suite for the ruby SDK more facilely than I can right now; it’s a test suite for the whole SDK, and it’s a bear to run the whole thing).

Although if any Amazon maintainers of the ruby SDK, or other experts in its internals, see this and have an opinion, I am curious as to their thoughts.

But I am a lot more confident that some of these optimizations will work fine for my use cases. One of the benefits of using shrine is that all of my code already accesses S3 URL generation via shrine API. So I could easily swap in a locally optimized version, either with a shrine plugin, or just a local sub-class of the shrine S3 storage class. So I may consider doing that.

A custom local OHMS front-end

Here at the Science History Institute, we’ve written a custom OHMS viewer front-end, to integrate seamlessly with our local custom “content management system” (a Rails-based digital repository app with source available), and provide some local functionality like the ability to download certain artifacts related to the oral history.

We spent quite a bit of energy on the User Interface/User Experience (UI/UX) and think we have something that ended up somewhat novel or innovative in an interesting way.

The Center for Oral History here has been conducting interviews with scientists since 1979, and publishing them in print. We only have a handful of oral histories with synchronized OHMS content yet, although more are available in our digital collections as audio without synchronized transcripts.

We aren’t yet advertising the new features to the public at large, waiting until we have a bit more content — and we might continue to tweak this initial draft of the software.

But we wanted to share this initial soft release with you, our colleagues in the library/archives/museum sector, because we’re pretty happy with how it came out, wanted to show off a bit, and thought it might be a useful model to others.

Here are some examples you can look at, that have OHMS content and use our custom OHMS viewer front-end:

[Screenshot: our custom OHMS viewer front-end]

Consider checking out our interface on a smartphone as well; it works pretty nicely there too. (We try to build every single feature and page so that it works nicely on small and touch screens; this one was a bit of a challenge.)

Also, before doing this work, we tried to find as many examples as we could of different UI’s for this kind of functionality, from other custom OHMS viewers, or similar software like Aviary. We didn’t find that many, but if you want to see the few we did find to compare and contrast as well, you can see them in our internal wiki: https://chemheritage.atlassian.net/wiki/x/AQCrKQ

Compare ours to the ‘out of the box’ OHMS viewer, for instance here:

[Screenshot: the stock "out of the box" OHMS viewer]

Note on OHMS architecture

The way the OHMS software standardly works is that there is a centrally-hosted ("cloud") OHMS application for metadata editors, in which you enter and prepare the metadata.

This centrally cloud-hosted app (which does not have public source code) produces an XML file for each Oral History.

The OHMS project then also gives you an open source PHP app that an institution is responsible for running themselves (although there is at least one third-party vendor that will host it for you), which you give your XML file(s) to. So the actual end-users are not using the centrally located OHMS server; this means the OHMS organization doesn't have to worry about providing a web app that can scale to every institution's end-users, and it also means the OHMS central server could completely disappear and institutions' end-user-facing OHMS content would be unaffected. So it's a pretty reasonable architecture for the organizational practicalities.

So that means we simply "replace" the "standard" PHP OHMS viewer with our own implementation, which is integrated into our Rails app on the back-end, with some custom Javascript on the front-end, mainly for the 'search' functionality. The OHMS architecture makes it 'natural' to use a custom or alternate front-end like this, although it requires some reverse-engineering of the OHMS XML format (and embedded markup conventions), which isn't always super well-documented. We'll see how "forwards compatible" it ends up being, but so far I think OHMS has really prioritized not breaking backwards compatibility with the existing large base of installed PHP viewers.

It allows us to do some work to push the UI/UX forward. But perhaps more importantly, and our main motivation, it allows us to integrate much better into our existing web app, instead of using the iframe approach that is standard with the default OHMS viewer. It's easier to get consistent styling and functionality with the rest of our app, as well as to naturally include features relevant to our app and use cases but not in the standard OHMS viewer, like artifact downloads. And it also allows us to avoid having to run a PHP application in our Rails shop.

Encrypting patron data (in Rails): why and how

Special guest post by Eddie Rubeiz

I’m Eddie Rubeiz. Along with the owner of this blog, Jonathan Rochkind, and our system administrator, Dan, I work on the Science History Institute’s digital collections website, where you will find, among other marvels, this picture of the inventor of Styrofoam posing with a Santa “sculpture”, which predates the invention of the term “Styrofoam”:

Ray McIntire posed with Styrofoam Santa Claus

Our work, unlike the development of polystyrene, is not shrouded in secrecy. That is as it should be: we are a nonprofit, and the files we store are mostly in the public domain. Our goal is to remove as many barriers to access as we can, to make our public collection as public as it can be. Most of our materials are open to the public and don't require us to collect much personal information. So what use could we have for encryption?

Sensitive Data

Well, once in a while, a patron will approach our staff asking that a particular physical item in our collections be photographed. The patron is often a researcher who’s already working with our physical materials. In some of those cases, we determine the item — a rare book, or a scientific instrument, for instance — is also a good fit with the rest of our digital collections, and we add it in to our queue so it can be ingested and made available not just to the researcher, but to the general public.

In many cases, by the time we determined an item was a good fit, we had already done much of the work of cataloging it. The resulting pile of metadata, stored in a Google spreadsheet, then had to be copied and pasted from our request spreadsheet to our digitization queue. To save time over the long run, we decided last December to keep track of these requests inside our Rails-based digital collections web app, thus allowing us to keep track of the entire pipeline in the same place, from the moment a patron asked us to photograph an item all the way until the point it is presented, fully described and indexed, to the public.

Accepting patrons' names and addresses into our database is problematic. As librarians, we're inclined to encrypt this information; as software developers, we're wary of the added complexity of encryption, and all the ways we might get it wrong. On the one hand, you don't want private information to be seen by an attacker. On the other hand, you don't want to throw out the only copy of your encryption key, out of an excess of caution, and find yourself locked out of your own vault. Encryption tends to be difficult to understand, explain, install, and maintain.

Possible Security Solutions

This post on Securing Sensitive Data in Rails offers a pretty good overview of data security options in a Ruby/Rails context, and was very helpful in getting us started thinking about it.

Here are the solutions we considered:

0) Don’t store the names or emails at all. Instead, we could use arbitrary IDs to allow everyone involved to keep track of the request. (Think of those pager buzzers some restaurants hand out, which buzz when your table is ready. They allow the restaurant greeters to avoid keeping track of your name and number in much the same way.) The person who handled the initial conversation with the patron, not our database, would thus be in charge of keeping track of which ID goes with which patron.

1) Disk-level encryption: simply encrypt the drives the database is stored on. If those drives are stolen, an attacker needs the encryption key to decipher anything on the drives — not just the database. Backup copies of the database stored in other unsecured locations remain vulnerable.

2) Database-level encryption: the database encrypts and decrypts data using a key that is sent (with every query) by the database adapter on the webserver. (See e.g. PGCrypto for ActiveRecord). See also postgres documentation on encryption options. One challenge with this approach, since the encryption key is sent with many db queries, is keeping it out of any logs.

3) Encrypt just the names and emails — per-column encryption — at the application logic level. The data sits encrypted in the database; the app is in charge of decrypting it as it reads it, and re-encrypting it before writing it back. If an attacker gets hold of the database, they get all of our collection info (which is public anyway), plus two columns of encrypted gobbledygook. To read those columns, the attacker would need the key. In the simplest case, they could obtain this by breaking into one of our web/application servers (on a different machine). But at least our DB backups alone are secure and don't need to be treated as if they held confidential info.

Our solution: per-column encryption with the lockbox gem

We weighed our options: 0) and 1) were too bureaucratic and not particularly secure either. The relative merits of 2) and 3) are debated at length in this post and others like it. We eventually settled on 3) as the path that affords us the best security given that our web server and DB are on separate servers.

Within 3), and given that our site is a Ruby on Rails site, we gave two tools a test drive: attr_encrypted and lockbox. The Securing Sensitive Data in Rails post I mentioned before is by lockbox's author, ankane, which raised our confidence that the lockbox author had the background to implement encryption correctly. After tinkering with each, it appeared that both lockbox and attr_encrypted worked as advertised, but lockbox seemed better designed, coming with fewer initial settings for us to agonize over, while offering a variety of ways to customize it later on should we be unsatisfied with the defaults. Furthermore:

  • lockbox works with blind indexing, whereas in attr_encrypted searches and joins on the encrypted data are not available. We do not currently need to search on the columns, and these requests are fairly infrequent (perhaps a hundred in any given year, with only a few active at a time). But it's good to know we won't have to switch encryption libraries in the future if we ever do need that functionality.
  • lockbox offers better support for key management services such as Vault, AWS KMS, and Google Cloud KMS, which we consider the logical next step in securing data. For now we're just leaving keys on the disk of servers that need them, but we may take this next step eventually; if we were storing birth dates or social security numbers, we would probably up the priority of this.
  • attr_encrypted has not been updated for over a year, whereas lockbox is under active development.

We had a proof of concept up and running on our development server within an afternoon, and it only took a few days to get things working in production, with some basic tests.
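
For a sense of what "working" looks like, the model-level usage is roughly this simple. This is a sketch with hypothetical model and column names, not our actual code; newer Lockbox releases spell the macro has_encrypted, older ones encrypts:

```ruby
# config/initializers/lockbox.rb
Lockbox.master_key = ENV["LOCKBOX_MASTER_KEY"]

# A migration adds ciphertext columns, e.g.:
#   add_column :oral_history_requests, :patron_name_ciphertext, :text
#   add_column :oral_history_requests, :patron_email_ciphertext, :text

class OralHistoryRequest < ApplicationRecord
  # hypothetical model/column names, for illustration only
  has_encrypted :patron_name, :patron_email
end

# Reads and writes look like plain attributes; only the *_ciphertext
# columns are actually stored in the database.
request = OralHistoryRequest.create!(
  patron_name:  "Ada Lovelace",
  patron_email: "ada@example.org"
)
request.patron_email # => "ada@example.org" (decrypted by the app)
```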

An important part of deciding to use lockbox was figuring out what to do if someone did gain access to our encryption key. The existing documentation for Lockbox key rotation was a bit sparse, but this was quickly remedied by Andrew Kane, the developer of Lockbox, once we reached out to him. The key realization (pardon the pun) was that Lockbox uses both a master key and a series of secondary keys for each encrypted column. The secondary keys are the product of a recipe that includes the master key and the names of the tables and columns to be encrypted.
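
In Lockbox's API, that derivation looks roughly like this (a sketch; the table and column names are hypothetical):

```ruby
# Lockbox derives a per-column secondary key from the master key plus the table
# and ciphertext column names; Lockbox.attribute_key exposes that derivation.
Lockbox.master_key = ENV["LOCKBOX_MASTER_KEY"]

column_key = Lockbox.attribute_key(
  table: "oral_history_requests",        # hypothetical table name
  attribute: "patron_email_ciphertext"   # hypothetical column name
)
# column_key can decrypt (only) that one column, e.g. during a manual rotation
```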

If someone gets access to your key, you currently need to:

  • figure out what all your secondary keys are
  • use them to decrypt all your stuff
  • generate a new master key
  • re-encrypt everything using your new keys
  • burn all the old keys.

However, Andrew, within hours of us reaching out via a Github Issue, added some code to Lockbox that drastically simplifies this process; this will be available in the next release.

It's worth noting in retrospect how many choices were available to us in this decision, and thus how much research was needed to narrow them down. The time-consuming part was figuring out what to do; once we had made up our minds, the actual work of implementing our chosen solution took only a few hours, some of which involved being confused about some of the lockbox documentation, which has since been improved. Lockbox is a great piece of software, and our pull request to implement it in our app is notably concise.

If you have been thinking you maybe should be treating patron data more securely in your Rails app, but thought you didn’t have time to deal with it, we recommend considering lockbox. It may be easier and quicker than you think!

Another byproduct of our investigations was a heightened awareness of technological security in the rest of our organization, which is of course a never-ending project. Where else might this same data be stored that is even less secure than our Rails app? In a nonprofit with over a hundred employees, there are always some data stores that are guarded more securely than others, and focusing so carefully on one particular tool naturally leads one to notice other areas where we will want to do more. One day at a time!

Intentionally considering fixity checking

In our digital collections app rewrite at the Science History Institute, we took a moment to step back and be intentional about how we approach "fixity checking" features and UI, to make sure they are well-supporting the needs they're meant to. I think we do a good job of providing UI to let repository managers and technical staff get a handle on a reliable fixity checking service, and others may be interested in seeing it as an example to consider. Much of our code was implemented by my colleague Eddie Rubeiz.

What is “fixity checking”?

In the field of digital preservation, “fixity” and “fixity checking” basically just means:

  • Having a calculated checksum/digest value for a file
  • Periodically recalculating that value and making sure it matches the recorded expected value, to make sure there has been no file corruption.

See more at the Digital Preservation Coalition’s Digital Preservation Handbook.

Do we really need fixity checking?

I have been part of some conversations with peers wondering if we really need to be doing all this fixity checking. Modern file/storage systems are pretty good at preventing byte corruption, whether on-premises or cloud PaaS; many have their own low-level "fixity checking" with built-in recovery happening anyway. And it can get kind of expensive doing all that fixity checking, whether in cloud platform fees or local hardware, or just time spent on the systems. Reports of actual fixity check failures (that are not false positives) happening in production are rare to possibly nonexistent.

However, I think everyone I've heard questioning it is still doing it. We're not sure we don't need it; industry/field best practices still mostly suggest doing it; we're a conservative/cautious bunch.

Myself, I was skeptical of whether we needed to do fixity checking — but when we did our data migration to a new system, it was super helpful to have the feature available to help ensure all data was migrated properly. Now I think it's probably worthwhile to have the feature in a digital preservation system; I also think it's probably good enough to "fixity check" files way less often than many of us do, maybe as infrequently as once a year?

But, if we’re gonna do fixity checking, we might as well do it right, and reliably.

Pitfalls of Fixity Check Features, and Requirements

Fixity checks are something you need for reliability, but might rarely use or even look at — and that means it's easy to have them not working and have nobody notice. It's a "requirements checklist" thing: institutions want to be able to say the app supports it, but some may not actually prioritize spending much time to make sure it's working, or that the exposed UI is good enough to accomplish its purpose.

And in fact, when we were implementing the first version of our app on sufia (the predecessor to hyrax), we realized that the UI in sufia for reporting fixity check results on a given file object seemed to be broken, and we weren't totally sure it was recording/keeping the results of its checks. (If a fixity check fails in a forest, and…) This may have been affecting other institutions who hadn't noticed either, not sure. It's sort of like thinking you have backups but never testing them; it's a pitfall of "just in case" reliability features. (I did spend a chunk of time understanding what was going on and submitting code to hyrax to fix it up a bit.)

If you have an app that does regular fixity checking, it’s worth considering: Are you sure it’s happening, instead of failing to run (properly or at all) due to an error? How would you check that? Do you have the data and/or UX you need to be confident fixity checking is working as intended, in the absence of any fixity failures?

A fixity check system might send a “push” alert in case of a fixity check failure — but that will usually be rare to nonexistent.  We decided that in addition to being able to look at current fixity check status on an individual File/Asset — we need some kind of “Fixity Health Summary” dashboard, that tells you how many fixity checks have been done, which Files (if any) lack fixity checks, if any haven’t gotten a fixity check in longer than expected, total count of any failing fixity check, etc.

This still relies on someone to look at it, but at least there is some way in the UI to answer the question “Are fixity checks happening as expected”.

Fixity Check record data model

Basically following the lead set by sufia/hyrax, we keep a history of multiple past fixity checks.

In our new app, which uses ordinary ActiveRecord postgres rdbms, it’s just a one-to-many association between Asset (our file model class) and a FixityCheck model. 
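
In code, the data model is about as plain as it sounds (a sketch; the actual columns on our FixityCheck table are described a bit further down):

```ruby
# A sketch of the one-to-many data model in ordinary ActiveRecord.
class Asset < ApplicationRecord
  has_many :fixity_checks, dependent: :destroy
end

class FixityCheck < ApplicationRecord
  belongs_to :asset
end
```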

Having many fixity statuses on record instead of one did end up significantly complicating the code compared to keeping only the latest fixity check result. You often want to do SQL queries based on the date and/or status of the latest fixity check result, and getting this from "the record from the set of associated FixityChecks with the latest date" can be kind of tricky to do in SQL, especially when fetching or reporting over many/all of your Assets.
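
For instance, just fetching each Asset's single most recent check ends up needing something like a GROUP BY sub-query. A sketch, with illustrative column names:

```ruby
# Sketch: only the most recent FixityCheck per asset, via an
# (asset_id, MAX(created_at)) sub-query. Column names are illustrative.
latest_checks = FixityCheck.where(<<~SQL)
  (asset_id, created_at) IN (
    SELECT asset_id, MAX(created_at)
    FROM fixity_checks
    GROUP BY asset_id
  )
SQL
```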

Still, it might be a good idea/requirement? I’m not really sure, or sure what I’d do if I had it to do over, but it ended up this way in our implementation.

Still, we don't want to keep every past fixity check on record — it would eventually fill up our database if we're doing regular fixity checks. So what do we want to keep? If a record keeps passing fixity every day, there's no new info from keeping them all, so we decided to mostly just keep fixity checks which establish windows on status changes. (I think Hyrax does similar at present).

  • The first fixity check
  • The N most recent fixity checks (where N may be 1)
  • Any failed checks.
  • The check right before or right after any failed check, to establish the maximum window that the item may have been failing fixity, as a sort of digital provenance context. (The idea is that maybe something failed, and then you restored it from backup, and then it passed again).

We have some code that looks through all fixity checks for a given work and deletes any checks not spec'd as keepable per the rules above, which we normally call after recording any additional fixity check.
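
A rough sketch of that pruning rule (not our exact code; passed? is an assumed boolean on the FixityCheck model):

```ruby
require "set"

# Keep: the first check, the N most recent, any failures, and the checks
# immediately before/after a failure. Everything else gets deleted.
def prune_fixity_checks(asset, keep_recent: 5)
  checks = asset.fixity_checks.order(created_at: :asc).to_a
  return if checks.empty?

  keep = Set.new
  keep << checks.first
  keep.merge(checks.last(keep_recent))

  checks.each_with_index do |check, i|
    next if check.passed?
    keep << check                           # any failed check
    keep << checks[i - 1] if i > 0          # the check right before a failure
    keep << checks[i + 1] if checks[i + 1]  # ...and right after
  end

  (checks - keep.to_a).each(&:destroy)
end
```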

Our “FixityCheck” database table includes a bunch of description of exactly what happened: the date of the fixity check, status (success or failure), expected and actual digest values, the location of the file checked (ie S3 bucket and path), as well as of course the foreign key to the Asset “model” object that the file corresponds to.

We also store the digest algorithm used. We use SHA512, due to the general/growing understanding that MD5 and SHA1 are outdated and should not be used, and SHA512 is a good choice. But we want to record this in the database for record-keeping purposes, and to accommodate any future changes to digest algorithm, which may require historical data points using different algorithms to coexist in the database.

The Check: Use shrine API to calculate via streaming bytes from cloud storage

The process of doing a fixity check is pretty straightforward — you just have to compute a checksum!

Because we’re going to be doing this a lot, on some fairly big files (generally we store ~100MB TIFFs, but we have some even larger ones) we want the code that does the check to be as efficient as possible.

Our files are stored in S3, and we thought doing it as efficiently as possible meant calculating the SHA512 from a stream of bytes being read from S3, without ever storing them to disk. Reading/writing from disk is actually a pretty slow thing for a process to do, and also risks clogging up disk IO pipelines if lots of processes are doing it at once. And by streaming, calculating iteratively based on the bytes as we fetch them over the network (which the SHA512 algorithm and most other modern digesting algorithms are capable of), we can get the computation done faster.

We are careful to use the proper shrine API to get a stream from our remote storage, avoid shrine caching read bytes to disk, and pass it to the proper ruby OpenSSL::Digest API to calculate the SHA512 from streamed bytes.  Here is our implementation. (Shrine 3.0 may make this easier).
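
The heart of it is something like the following sketch, assuming Shrine 3's rewindable: false option (which keeps the remote IO from being buffered to a tempfile); uploaded_file here stands for a Shrine::UploadedFile backed by the S3 storage:

```ruby
require "openssl"

# Sketch: stream bytes from S3 via shrine and feed them to OpenSSL::Digest
# incrementally, never writing them to disk.
def sha512_for(uploaded_file)
  digest = OpenSSL::Digest.new("SHA512")

  uploaded_file.open(rewindable: false) do |io|
    while (chunk = io.read(16 * 1024)) # read in 16KB chunks until EOF
      digest << chunk
    end
  end

  digest.hexdigest
end
```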

Calculate for 1/Nth of all assets per night

If our goal is to fixity check every file once every 7 days, then we want to spread that out by checking 1/7th of our assets every night. In fact we wanted to parameterize that as N; although N == 7 for us at present, we want the freedom to make it a lot higher without a code rewrite. To keep it less confusing, I'll keep writing as if N is 7.

At first, we considered just taking an arbitrary 1/7th of all Assets, just take the Asset PK, turn it into an integer with random distribution (say MD5 it, I dunno, whatever), and modulo 7.

But we decided that instead taking the 1/7th of Assets that have been least recently checked (or never checked; sort nulls first) has some nice properties. You always check the things most in need of being checked, including recently created assets without yet a check. If some error keeps some thing from being checked or having a check recorded, it’ll still be first in line for the next nightly check.

It's a little bit tricky to find that list of things to check in SQL because of our data model, but a little "group by" will do it; here's our code. We use ActiveRecord find_each to make sure we're being efficient with memory use when iterating through thousands+ of records.
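
Something in this spirit (a sketch, not our exact code; it assumes the has_many association sketched earlier):

```ruby
# Sketch: select the 1/Nth of assets least recently checked, never-checked first,
# then iterate in batches with find_each.
n = 7
batch_size = (Asset.count / n.to_f).ceil

asset_ids = Asset.
  left_outer_joins(:fixity_checks).
  group("assets.id").
  order(Arel.sql("MAX(fixity_checks.created_at) ASC NULLS FIRST")).
  limit(batch_size).
  pluck("assets.id")

Asset.where(id: asset_ids).find_each do |asset|
  # run the fixity check and record a FixityCheck row here...
end
```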

And we do some batching of result writes in postgres transactions to try to speed things up yet further (not actually sure how well that works). Here's our rake task for doing nightly fixity checking, which can show a nice progress bar when run interactively. We think it's important to have good "developer UI" for all this stuff if you actually want it to be used regularly; the more frustrating it is to use, the less it will get used. Developers are users too!

It ends up taking somewhere around 1-2s per File to check fixity and record the check, for our files which are typically 100MB or so each. The time it takes to fixity check a file mainly scales with the size of the file; we think it's mostly just waiting on streaming the bytes from S3 (even more than the CPU time of actually calculating the digest). So it should be pretty parallelizable, although we haven't really tried parallelizing it, because this is fast enough for us at our scale. (We have around ~25K Files, 1.5TB of total original content).

Notification UI

If a fixity check fails, we want a "push" notification, to actually contact someone and tell them it failed. Currently we do that with both an email and an error registered with the Honeybadger error reporting service we already use. (Since we already have honeybadger errors being reported to a Slack channel with a honeybadger integration, this means it goes to our Slack too.)

Admin UI for individual asset fixity status

In the administration page for an individual Asset, you want to be able to confirm the fixity check status, and when the last time a check happened was. Also, you might want to see when the earliest fixity check on record is, and look at the complete recorded history of fixity checks (what’s the point of keeping them around if you aren’t going to show them in any admin UI?)

[Screenshot: fixity check status and history on an individual Asset's admin page]

That "Fixity check history" link is a little expand/contract collapsible control; the history underneath it does not start out expanded. Note it also confirms the digest algorithm used (sha512), and what the actual recorded digest checksum at that timestamp was.

As you can see we also give a “Schedule a check now” button — this actually queues up a fixity check as a background ActiveJob, it’s usually complete within 10 or 20 seconds. This “schedule now” button is useful if you have any concerns, or are trying to diagnose or debug something.

If there’s a failure, you might need a bit more information:

[Screenshot: fixity check failure details on an Asset's admin page]

The actual as well as expected digest value; the postgres PK for the table recording this logged info, for a developer to really get into it; and a reverse engineered AWS S3 Console URL that will (after you login to your AWS account with privs) take you to the S3 console view of the key, so you can investigate the file directly from S3, download it, whatever.

(Yeah, all our files are in S3).

Fixity Health Dashboard Admin UI

As discussed, we decided it’s important not just to be able to see fixity check info for a specified known item, but to get a sense of general “fixity health”.

So we provide a dashboard that most importantly will tell you:

  • If any assets have a currently failing fixity check
  • If any assets haven’t been fixity checked in longer than expected (for us at present, last 7 days).
    • But there may be db records for Assets that are still in process of ingesting; these aren’t expected to have fixity checks (although if they are days old and still not fully ingested, it may indicate a problem in our ingest pipeline!)
    • And if an Asset was ingested only in the past 24 hours, maybe it just hasn't gotten its first nightly check, so that's also okay.

It gives some large red or green thumbs-up or thumbs-down icons based on these values, so a repository/collections manager that may not look at this page very often or be familiar with the details of what everything means can immediately know if fixity check health is good or bad.

[Screenshot: fixity health dashboard showing a healthy (green) status]

In addition to the big green/red health summary info at the top, there’s some additional “Asset and Fixity Descriptive Statistics” that will help an administrator, especially a more technical staff member, get more of a sense of what’s going on with our assets and their fixity checks in general/summary, perhaps especially useful for diagnosing a ‘red’ condition.

Here’s another example from a very unhealthy development instance. You can see the list of assets failing fixity check is hyperlinked, so you can go to the administrative page for that asset to get more info, as above.

[Screenshot: fixity health dashboard on a very unhealthy development instance]

The nature of our schema (a one-to-many asset-to-fixity-checks instead of a single fixity check on record) makes it a bit tricky to write the SQL for this, involving GROUP BYs and inner subqueries and such.

The SQL is also a bit expensive, despite trying to index what can be indexed — I think whole-table aggregate statistics are just inherently expensive (at least in postgres). Our fixity health summary report page can take ~2 seconds to return in production, which is not terrible by some standards, but not great — and we have a much smaller corpus than some, so it will presumably slow down roughly linearly with the number of Assets. One approach to dealing with that I can think of is caching (possibly with calculation in a bg job), but it's not bad enough for us to require that attention at present.

Current mysteries/bugs

So, we're pretty happy with this feature set — although fixity check features are something we don't actually use or look at that much, so I'm not sure what being happy with it means. But if we're going to do fixity checking, we might as well do our best to make it reliable and give collection/repository managers the info they need to know it's being done reliably. We think we've done pretty well, and better than a lot of things we've seen.

There are a couple outstanding mysteries in our code.

  1. While we thought we wrote things and set things up to fixity check 1/7th of the collection every night… it seems to be checking 100% of the collection every night instead. Haven’t spent the time to get to the bottom of that and find the bug.
  2. For a while, we were getting fixity check failures that seemed to be false positives. After a failure, if we went to the Asset detail page for the failed asset and clicked "schedule fixity check now", it would pass. (This is one reason that "fixity check now" button is a useful feature!) Not sure if there's a race condition or some other kind of bug in our code (or shrine code) that fetches bytes. Or it could also have just been a byproduct of some of our syncing/migration logic that was in operation before we went totally live with the new site — I don't believe we've gotten any fixity failures since we fully cut over to the newly launched site, so possibly we won't ever again and won't have to worry about it. But in the interests of full disclosure, wanted to admit it.