Rails auto-scaling on Heroku

We are investigating moving our medium-small-ish Rails app to heroku.

We looked at both the Rails Autoscale add-on available on the heroku marketplace, and the hirefire.io service, which is not listed on the heroku marketplace and which I almost didn’t realize existed.

I guess hirefire.io doesn’t have any kind of a partnership with heroku, but it still uses the heroku API to provide an autoscale service. hirefire.io ended up looking more fully-featured and lower priced than Rails Autoscale; so the main purpose of this post is just to increase visibility of hirefire.io, and therefore competition in the field, which benefits us consumers.

Background: Interest in auto-scaling Rails background jobs

At first I didn’t realize there was such a thing as “auto-scaling” on heroku, but once I did, I realized it could indeed save us lots of money.

I am more interested in scaling Rails background workers than web workers though — our background workers are busiest when we are doing “ingests” into our digital collections/digital asset management system, so the work is highly variable. Auto-scaling up when there is ingest work piling up can give us really nice ingest throughput while keeping costs low.

On the other hand, our web traffic is fairly low and probably isn’t going to go up by an order of magnitude (non-profit cultural institution here). And after discovering that a “standard” dyno is just too slow, we will likely be running a performance-m or performance-l anyway — which likely can handle all anticipated traffic on its own. If we have an auto-scaling solution, we might configure it for web dynos, but we are especially interested in good features for background scaling.

There is a heroku built-in autoscale feature, but it only works for performance dynos, and won’t do anything for Rails background job dynos, so that was right out.

The Rails Autoscale add-on on the heroku marketplace could work for Rails bg jobs; and then we found hirefire.io.

Pricing: Pretty different

hirefire

As of now (January 2021), hirefire.io has pretty simple and affordable pricing: $15/month per heroku application, auto-scaling as many dynos and process types as you like.

By default, hirefire.io can only check your app’s metrics once per minute to decide whether a scaling event should occur. If you want checks more frequent than that (up to once every 15 seconds), you have to pay an additional $10/month, for $25/month per heroku application.

Even though it is not a heroku add-on, hirefire does advertise that they bill pro-rated to the second, just like heroku and heroku add-ons.

Rails autoscale

Rails Autoscale has a more tiered approach to pricing, based on the number and type of dynos you are scaling. Starting at $9/month for 1-3 standard dynos, the next tier up is $39 for up to 9 standard dynos, all the way up to $279 (!) for 1 to 99 dynos. If you have performance dynos involved, it runs from $39/month for 1-3 performance dynos, up to $599/month for up to 99 performance dynos.

For our anticipated uses… if we only scale bg dynos, I might want to scale from (low) 1 or 2 to (high) 5 or 6 standard dynos, so we’d be at $39/month. Our web dynos are likely to be performance and I wouldn’t want/need to scale more than probably 2, but that puts us into performance dyno tier, so we’re looking at $99/month.

This is of course significantly more expensive than hirefire.io’s flat rate.

Metric Resolution

Since Hirefire has an additional charge for finer than 1-minute resolution on autoscaling checks, we’ll discuss resolution here in this section too. Rails Autoscale has the same resolution for all tiers, and I think it’s generally 10 seconds, so approximately the same as hirefire if you pay the extra $10 for increased resolution.

Configuration

Let’s look at configuration screens to get a sense of feature-sets.

Rails Autoscale

web dynos

To configure web dynos, here’s what you get, with default values:

The metric Rails Autoscale uses for scaling web dynos is time in heroku routing queue, which seems right to me — when things are spending longer in heroku routing queue before getting to a dyno, it means scale up.

worker dynos

For scaling worker dynos, Rails Autoscale can scale a dyno type named “worker” — it understands the ruby queuing libraries Sidekiq, Resque, Delayed Job, and Que. I’m not certain if there are options for writing custom adapter code for other backends.

Here are the configuration options — sorry, these aren’t the defaults; I’ve already customized them and lost track of what the defaults are.

You can see that worker dynos are scaled based on the metric “number of jobs queued”, and you can tell it to only pay attention to certain queues if you want.

Hirefire

Hirefire has far more options for customization than Rails Autoscale, which can make it a bit overwhelming, but also potentially more powerful.

web dynos

You can actually configure as many Heroku process types as you have for autoscale, not just ones named “web” and “worker”. And for each, you have your choice of several metrics to be used as scaling triggers.

For web, I think Queue Time (percentile, average) matches what Rails Autoscale does, configured to percentile, 95, and is probably the best to use unless you have a reason to use another. (“Rails Autoscale tracks the 95th percentile queue time, which for most applications will hover well below the default threshold of 100ms.“)

Here’s the configuration Hirefire makes available if you are scaling on “queue time” like Rails Autoscale; configuration may vary for other metrics.

I think if you fill in the right numbers, you can configure to work equivalently to Rails Autoscale.

worker dynos

If you have more than one heroku process type for workers — say, working on different queues — Hirefire can scale them independently, with entirely separate configuration. This is pretty handy, and I don’t think Rails Autoscale offers this. (Update: I may be wrong; Rails Autoscale says they do support this, so check it yourself if it matters to you.)

For worker dynos, you could choose to scale based on actual “dyno load”, but I think this is probably mostly for types of processes where there isn’t the ability to look at “number of jobs”. A “number of jobs in queue” like Rails Autoscale does makes a lot more sense to me as an effective metric for scaling queue-based bg workers.

Hirefire’s metric is slightly different than Rails Autoscale’s “jobs in queue”. For recognized ruby queue systems (a larger list than Rails Autoscale’s; and you can write your own custom adapter for whatever you like), it actually measures jobs in queue plus workers currently busy. So queued+in-progress, rather than Rails Autoscale’s just-queued. I actually have a bit of trouble wrapping my head around the implications of this, but basically, it means that Hirefire’s “jobs in queue” metric strategy is intended to try to scale all the way to emptying your queue, or reaching your max scale limit, whichever comes first. I think this may make sense and work out at least as well or perhaps better than Rails Autoscale’s approach?
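To make the difference concrete, here is a loose illustration using Resque’s own introspection (Resque.info is a real Resque API; exactly what Hirefire measures internally is its own implementation detail, this is just to show the two quantities involved):

queued      = Resque.info[:pending]   # jobs waiting in queues
in_progress = Resque.info[:working]   # workers currently busy with a job

rails_autoscale_style_metric = queued                # "jobs in queue"
hirefire_style_metric        = queued + in_progress  # queued plus in-progress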

Here’s the configuration Hirefire makes available for worker dynos scaling on the “job queue” metric.

Since the metric isn’t the same as Rails Autoscale’s, we can’t configure this to work identically. But there are a whole bunch of configuration options, some similar to Rails Autoscale’s.

The most important thing here is that “Ratio” configuration. It may not be obvious, but with the way the hirefire metric works, you are basically meant to configure this to equal the number of workers/threads you have on each dyno. I have it configured to 3 because my heroku worker processes use resque, with resque_pool, configured to run 3 resque workers on each dyno. If you use sidekiq, set ratio to your configured concurrency — or if you are running more than one sidekiq process, processes*concurrency. Basically how many jobs your dyno can be concurrently working is what you should normally set for ‘ratio’.
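In other words (my paraphrase, with illustrative numbers; only the resque_pool value of 3 is from my actual setup):

# "ratio" is roughly how many jobs one dyno can work concurrently
resque_pool_workers = 3                                        # resque-pool workers per dyno
ratio_for_resque    = resque_pool_workers                      # => 3

sidekiq_processes   = 1                                        # sidekiq processes per dyno (hypothetical)
sidekiq_concurrency = 10                                       # threads per sidekiq process (hypothetical)
ratio_for_sidekiq   = sidekiq_processes * sidekiq_concurrency  # => 10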

Hirefire not a heroku plugin

Hirefire isn’t actually a heroku plugin. In addition to that meaning separate invoicing, there can be some other inconveniences.

Since hirefire can only interact with the heroku API, for some metrics (including the “queue time” metric that is probably optimal for web dyno scaling) you have to configure your app to log regular statistics to heroku’s “Logplex” system. This can add a lot of noise to your log, and for heroku logging add-ons that are tiered based on number of log lines or bytes, it can push you up to higher pricing tiers.
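For illustration only, the kind of thing that ends up in your logs is a per-request queue time figure. This is not Hirefire’s actual helper or its required log format (their gem handles that for you); it’s just a minimal Rack middleware sketch, assuming Heroku’s router sets X-Request-Start as milliseconds since the epoch:

class LogQueueTime
  def initialize(app)
    @app = app
  end

  def call(env)
    if (start_ms = env["HTTP_X_REQUEST_START"])
      # rough queue time: now minus when the heroku router first saw the request
      queue_ms = (Time.now.to_f * 1000).to_i - start_ms.to_i
      Rails.logger.info("queue_time=#{queue_ms}ms") if queue_ms >= 0
    end
    @app.call(env)
  end
end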

If you use papertrail, I think you should be able to use its log filtering feature to solve this, keeping that noise out of your logs and avoiding impact on log transfer limits. However, if you ever have cause to look at heroku’s raw logs, that noise will still be there.

Support and Docs

I asked a couple questions of both Hirefire and Rails Autoscale as part of my evaluation, and got back well-informed and easy-to-understand answers quickly from both. Support for both seems to be great.

I would say the documentation is decent-but-not-exhaustive for both products. Hirefire may have slightly more complete documentation.

Other Features?

There are other things you might want to compare, various kinds of observability (bar chart or graph of dynos or observed metrics) and notification. I don’t have time to get into the details (and didn’t actually spend much time exploring them to evaluate), but they seem to offer roughly similar features.

Conclusion

Rails Autoscale is quite a bit more expensive than hirefire.io’s flat rate, once you get past Rails Autoscale’s most basic tier (scaling no more than 3 standard dynos).

It’s true that autoscaling saves you money over not autoscaling, so even an expensive price could be considered a ‘cut’ of that, and possibly for many ecommerce sites even $99 a month might be a drop in the bucket (!)… but the price difference with hirefire (which has a flat rate regardless of dynos) is so significant that it seems to me it would take a lot of additional features/value to justify it.

And it’s not clear that Rails Autoscale has any feature advantage. In general, hirefire.io seems to have more features and flexibility.

Until 2021, hirefire.io could only analyze metrics with 1-minute resolution, so perhaps Rails Autoscale’s finer resolution was a “killer feature” then?

Honestly, I wonder if this price difference is sustained by Rails Autoscale only because most customers aren’t aware of hirefire.io, since it’s not listed on the heroku marketplace? Single-invoice billing is handy, but probably not worth $80+ a month. I guess hirefire’s logplex noise is a bit inconvenient?

Or is there something else I’m missing? Pricing competition is good for the consumer.

And are there any other heroku autoscale solutions, that can handle Rails bg job dynos, that I still don’t know about?

Update, a day after writing: djcp writes on a reddit thread:

I used to be a principal engineer for the heroku add-ons program.

One issue with hirefire is they request account level oauth tokens that essentially give them ability to do anything with your apps, where Rails Autoscaling worked with us to create a partnership and integrate with our “official” add-on APIs that limits security concerns and are scoped to the application that’s being scaled.

Part of the reason for hirefire working the way it does is historical, but we’ve supported the endpoints they need to scale for “official” partners for years now.

A lot of heroku customers use hirefire so please don’t think I’m spreading FUD, but you should be aware you’re giving a third party very broad rights to do things to your apps. They probably won’t, of course, but what if there’s a compromise?

“Official” add-on providers are given limited scoped tokens to (mostly) only the actions / endpoints they need, minimizing blast radius if they do get compromised.

You can read some more discussion at that thread.

Gem authors, check your release sizes

Most gems should probably be a couple hundred kb at most. I’m talking about the package actually stored in and downloaded from rubygems by an app using the gem.

After all, source code is just text, and it doesn’t take up much space. OK, maybe some gems have a couple images in there.

But if you are looking at your gem in rubygems and realize that it’s 10MB or bigger… and that it seems to be getting bigger with every release… something is probably wrong and worth looking into.

One way to look into it is to look at the actual gem package. If you use the handy bundler rake task to release your gem (and I recommend it), you have a ./pkg directory in your source you last released from. Inside it are “.gem” files for each release you’ve made from there, unless you’ve cleaned it up recently.
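(That “handy bundler rake task” is the one you get from this single line in a Rakefile, which provides rake build, rake install, and rake release:)

require "bundler/gem_tasks"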

.gem files, it turns out, are just tar files, which have more tar and gz files inside them. We can go in, extract the contents, and use the handy unix utility du -sh to see what is taking up all the space.

How I found the bytes

jrochkind-chf kithe (master ?) $ cd pkg

jrochkind-chf pkg (master ?) $ ls
kithe-2.0.0.beta1.gem        kithe-2.0.0.pre.rc1.gem
kithe-2.0.0.gem            kithe-2.0.1.gem
kithe-2.0.0.pre.beta1.gem    kithe-2.0.2.gem

jrochkind-chf pkg (master ?) $ mkdir exploded

jrochkind-chf pkg (master ?) $ cp kithe-2.0.0.gem exploded/kithe-2.0.0.tar

jrochkind-chf pkg (master ?) $ cd exploded

jrochkind-chf exploded (master ?) $ tar -xvf kithe-2.0.0.tar
 x metadata.gz
 x data.tar.gz
 x checksums.yaml.gz

jrochkind-chf exploded (master ?) $  mkdir unpacked_data_tar

jrochkind-chf exploded (master ?) $ tar -xvf data.tar.gz -C unpacked_data_tar/

jrochkind-chf exploded (master ?) $ cd unpacked_data_tar/
/Users/jrochkind/code/kithe/pkg/exploded/unpacked_data_tar

jrochkind-chf unpacked_data_tar (master ?) $ du -sh *
 4.0K    MIT-LICENSE
  12K    README.md
 4.0K    Rakefile
 160K    app
 8.0K    config
  32K    db
 100K    lib
 300M    spec

jrochkind-chf unpacked_data_tar (master ?) $ cd spec

jrochkind-chf spec (master ?) $ du -sh *
 8.0K    derivative_transformers
 300M    dummy
  12K    factories
  24K    indexing
  72K    models
 4.0K    rails_helper.rb
  44K    shrine
  12K    simple_form_enhancements
 8.0K    spec_helper.rb
 188K    test_support
 4.0K    validators

jrochkind-chf spec (master ?) $ cd dummy/

jrochkind-chf dummy (master ?) $ du -sh *
 4.0K    Rakefile
  56K    app
  24K    bin
 124K    config
 4.0K    config.ru
 8.0K    db
 300M    log
 4.0K    package.json
  12K    public
 4.0K    tmp

Doh! In this particular gem, I have a dummy rails app, and it has 300MB of logs, because I haven’t bothered trimming them in a while, and they are winding up included in the gem release package distributed to rubygems and downloaded by all consumers! Even if they were small, I don’t want these in the released gem package at all!

That’s not good! It only turns into 12MB instead of 300MB, because log files are so compressible and there is compression involved in assembling the rubygems package. But I have no idea how much space it’s actually taking up on consuming applications’ machines. This is very irresponsible!

What controls what files are included in the gem package?

Your .gemspec file, of course. The s.files setting is an array of every file to include in the gem package. Well, plus s.test_files is another array of files that aren’t supposed to be necessary to run the gem, but are needed to test it.

(Rubygems was set up to allow automated *testing* of gems after download, which is why test files are included in the release package. I am not sure how useful this is, or who if anyone does it; although I believe that some linux distro packagers try to make use of it, for better or worse.)

But nobody wants to list every file in your gem individually, manually editing the array every time you add, remove, or move one. Fortunately, gemspec files are executable ruby code, so you can use ruby as a shortcut.

I have seen two main ways of doing this, with different “gem skeleton generators” taking one of two approaches.

Sometimes a shell out to git is used — the idea is that everything you have checked into your git should be in the gem release package, no more, no less. For instance, one of my gems has this in it; not sure where it came from or who/what generated it.

# include everything tracked by git, except test/spec/feature files
spec.files = `git ls-files -z`.split("\x0").reject do |f|
  f.match(%r{^(test|spec|features)/})
end

In that case, it wouldn’t have included anything in ./spec already, so this obviously isn’t actually the gem we were looking at before.

But with this approach, in addition to being able to use ruby logic to manipulate the results, nothing excluded by your .gitignore file will end up included in your gem package. Great!

In kithe, which we were looking at before, those log files were in the .gitignore (they weren’t in my repo!), so if I had been using that git-shellout technique, they wouldn’t have been included in the gem release package.

But… I wasn’t. Instead this gem has a gemspec that looks like:

s.test_files = Dir["spec/**/*"]

Just include every single file inside ./spec in the test_files list. Oops. Then I get all those log files!

One way to fix

I don’t really know which is to be preferred, the git-shellout approach or the dir-glob approach. I suspect it is the subject of historical religious wars in rubydom, from when there were still more people around to argue about such things. Any opinions? Or another approach?

Without being in the mood to restructure this gemspec in any way, I just did the simplest thing to keep those log files out…

s.test_files = Dir["spec/**/*"].delete_if {|a| a =~ %r{/dummy/log/}}

Build the package without releasing, with the handy bundler-supplied rake build task… and my gem release package size goes from 12MB to 64K (which actually kind of sounds like a minimum block size or something, right?).

Phew! That’s a big difference! Sorry for anyone using previous versions and winding up downloading all that cruft! (Actually this particular gem is mostly a proof of concept at this point and I don’t think anyone else is using it).

Check your gem sizes!

I’d be willing to bet there are lots of released gems with heavily bloated release packages like this. This isn’t the first one I’ve realized was my fault. Because who pays attention to gem sizes anyway? Apparently not many!

But rubygems does list them, so it’s pretty easy to see. Are your gem release packages multiple megs, when there’s no good reason for them to be? Do they get bigger every release by far more than the bytes of lines of code you think were added? At some point in gem history was there a big jump from hundreds of KB to multiple MB? When nothing particularly actually happened to gem logic to lead to that?

All hints that you might be including things you didn’t mean to include, possibly things that grow each release.

You don’t need to have a dummy rails app in your repo to accidentally do this (I accidentally did it once with a gem that had nothing to do with rails). There could be other kinds of log files. Or test coverage or performance metric files, or any other artifacts of your build or your development, especially ones that grow over time — things that aren’t actually meant to be, or needed, as part of the gem release package!

It’s good to sanity check your gem release packages now and then. In most cases, your gem release package should be hundreds of KB at most, not MBs. Help keep your users’ installs and builds faster and slimmer!
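If you want to automate that sanity check, something like this could live in a Rakefile or CI step. The 1MB threshold and the pkg/ path are just assumptions for illustration; adjust for your gem:

# warn loudly if the most recently built package in ./pkg is suspiciously large
built_gem = Dir["pkg/*.gem"].max_by { |f| File.mtime(f) }
abort "No built gem found in ./pkg" unless built_gem

size_kb = File.size(built_gem) / 1024.0
if size_kb > 1024
  abort format("%s is %.0f KB -- accidentally packaging logs or build artifacts?", built_gem, size_kb)
else
  puts format("%s is %.0f KB", built_gem, size_kb)
end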

Updating SolrCloud configuration in ruby

We have an app that uses Solr. We currently run Solr in legacy “not cloud” mode. Our solr configuration directory is on disk on the Solr server, and it’s up to our processes to get our desired solr configuration there, and to update it when it changes.

We are in the process of moving to Solr in “SolrCloud mode”, probably via the SearchStax managed Solr service. Our Solr “Cloud” might only have one node, but “SolrCloud mode” gives us access to additional APIs for managing our solr configuration, as opposed to writing it directly to disk (which may not be possible at all in SolrCloud mode? And certainly isn’t possible with managed SearchStax).

That is, the Solr ConfigSets API, although you might also want to use a few pieces of the Collection Management API for associating a configset with a Solr collection.

Basically, you are taking your desired solr config directory, zipping it up, and uploading it to Solr as a “config set” [or “configset”] with a certain name. Then you can create collections using this config set, or reassign which named configset an existing collection uses.
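Concretely, the configset “upload” is just an HTTP POST of a zip file to Solr’s /admin/configs endpoint. Here’s a minimal sketch of that raw call using the ruby stdlib plus the rubyzip gem; the url, paths, and configset name are example values:

require "net/http"
require "uri"
require "zip" # rubyzip gem

# zip up a local solr conf directory in memory
def zipped_conf(conf_dir)
  Zip::OutputStream.write_buffer do |zio|
    Dir.glob("#{conf_dir}/**/*").select { |f| File.file?(f) }.each do |path|
      zio.put_next_entry(path.delete_prefix("#{conf_dir}/"))
      zio.write File.binread(path)
    end
  end.string
end

solr = URI.parse("https://example.com/solr")
response = Net::HTTP.start(solr.host, solr.port, use_ssl: solr.scheme == "https") do |http|
  request = Net::HTTP::Post.new("#{solr.path}/admin/configs?action=UPLOAD&name=myConfigset")
  request["Content-Type"] = "application/octet-stream"
  request.body = zipped_conf("./solr/conf")
  http.request(request)
end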

I wasn’t able to find any existing ruby gems for interacting with these Solr APIs. RSolr is a “ruby client for interacting with solr”, but it was written before most of these administrative APIs existed for Solr, and doesn’t seem to have been updated to deal with them (unless I missed it); RSolr seems to be mostly/only about querying solr, plus some limited indexing.

But no worries, it’s not too hard to wrap the specific API I want to use in some ruby. Which did seem far better to me than writing the specific HTTP requests each time (and making sure you are dealing with errors etc!). (And yes, I will share the code with you).

I decided I wanted an object that was bound to a particular solr collection at a particular solr instance; and was backed by a particular local directory with solr config. That worked well for my use case, and I wound up with an API that looks like this:

updater = SolrConfigsetUpdater.new(
  solr_url: "https://example.com/solr",
  conf_dir: "./solr/conf",
  collection_name: "myCollection"
)

# will zip up ./solr/conf and upload it as named MyConfigset:
updater.upload("myConfigset")

updater.list #=> ["myConfigSet"]
updater.config_name # what configset name is MyCollection currently configured to use?
# => "oldConfigSet"

# what if we try to delete the one it's using?
updater.delete("oldConfigSet")
# => raises SolrConfigsetUpdater::SolrError with message:
# "Can not delete ConfigSet as it is currently being used by collection [myCollection]"

# okay let's change it to use the new one and delete the old one

updater.update_config_name("myConfigset")
# now MyCollection uses this new configset, although we possibly
# need to reload the collection to make that so
updater.reload
# now let's delete the one we're not using
updater.delete("oldConfigSet")

OK, great. There were some tricks in there in trying to catch the apparently multiple ways Solr can report different kinds of errors, to make sure Solr-reported errors turn into exceptions ideally with good error messages.

Now, in addition to uploading a configset initially for a collection you are creating to use, the main use case I have is wanting to UPDATE the configuration to new values in an existing collection. Sure, this often requires a reindex afterwards.

If you have the recently released Solr 8.7, it will let you overwrite an existing configset, so this can be done pretty easily.

updater.upload(updater.config_name, overwrite: true)
updater.reload

But prior to Solr 8.7 you can not overwrite an existing configset. And SearchStax doesn’t yet have Solr 8.7. So one way or another, we need to do a dance where we upload the configset under a new name, then switch the collection to use it.

Having this updater object that lets us easily execute the relevant Solr API calls makes it easy to experiment with different logic flows for this. For instance, in a Solr listserv thread, Alex Halovnic suggests a somewhat complicated 8-step workaround, which we can implement like so:

current_name = updater.config_name
temp_name = "#{current_name}_temp"

updater.create(from: current_name, to: temp_name)
updater.change_config_name(temp_name)
updater.reload
updater.delete(current_name)
updater.upload(configset_name: current_name)
updater.change_config_name(current_name)
updater.reload
updater.delete(temp_name)

That works. But talking to Dann Bohn at Penn State University, he shared a different algorithm, which goes like:

  • Make a cryptographic digest hash of the entire solr directory, which we’re going to use in the configset name.
  • Check if the collection is already using a configset named $name_$digest, which if it already is, you’re done, no change needed.
  • Otherwise, upload the configset with the fingerprint-based name, switch the collection to use it, reload, delete the configset that the collection used to use.

At first this seemed like overkill to me, but after thinking and experimenting with it, I like it! It is really quick to make a digest of a handful of files, that’s not a big deal. (I use the first 7 chars of hex SHA256.) And even if we had Solr 8.7, I like that we can avoid doing any operation on solr at all if there have been no changes — I really want to use this operation much like a Rails db:migrate, running it on every deploy to make sure the solr schema matches the one in the repo for the deploy.
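A sketch of what that digest-based flow can look like, using the updater object from above (the digest helper and the 7-character truncation here are my own illustration of what was just described, not Dann’s actual code):

require "digest"

# fingerprint the local solr config directory
def conf_digest(conf_dir)
  files = Dir.glob("#{conf_dir}/**/*").select { |f| File.file?(f) }.sort
  Digest::SHA256.hexdigest(files.map { |f| File.read(f) }.join)[0, 7]
end

desired_name = "myCollection_#{conf_digest('./solr/conf')}"
current_name = updater.config_name

unless current_name == desired_name
  updater.upload(desired_name)
  updater.update_config_name(desired_name)
  updater.reload
  updater.delete(current_name)
end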

Dann also shared his open source code with me, which was helpful for seeing how to make the digest, how to make a Zip file in ruby, etc. Thanks Dann!

Sharing my code

So I also wrote some methods to implement those variant updating strategies: Dann’s, Alex Halovnic’s from the listserv, etc.

I thought about wrapping this all up as a gem, but I didn’t really have the time to make it good enough for that. My API is a little bit janky; I didn’t spend the extra time thinking it out really well to minimize the need for future backwards-incompatible changes, like I would if it were a gem. I also couldn’t figure out a great way to write automated tests for this that I would find particularly useful; so in my code base it’s actually not currently test-covered (shhhhh), but in a gem I’d want to solve that somehow.

But I did try to write the code to be general-purpose/flexible so other people could use it for their use cases; I tried to document it to my highest standards; and I put it all in one file, which actually might not be the best OO abstraction/design, but makes it easier for you to copy and paste the single file for your own use. :)

So you can find my code here; it is apache-licensed; and you are welcome to copy and paste it and do whatever you like with it, including making a gem yourself if you want. Maybe I’ll get around to making it a gem in the future myself, I dunno, curious if there’s interest.

The SearchStax proprietary API’s

SearchStax has its own APIs that I think can be used for updating configsets and setting collections to use certain configsets, etc. When I started exploring them, I found they aren’t the worst vendor APIs I’ve seen, but I did find them a bit cumbersome to work with. The auth system involves a lot of steps (why can’t you just create an API Key from the SearchStax Web GUI?).

Overall I found them harder to use than just the standard Solr Cloud API’s, which worked fine in the SearchStax deployment, and have the added bonus of being transferable to any SolrCloud deployment instead of being SearchStax-specific. While the SearchStax docs and support try to steer you to the SearchStax specific API’s, I don’t think there’s really any good reason for this. (Perhaps the custom SearchStax API’s were written long ago when Solr API’s weren’t as complete?)

SearchStax support suggested that the SearchStax APIs were somehow more secure; but my SearchStax Solr APIs are protected behind HTTP basic auth, and if I’ve created basic auth credentials (or an IP addr allowlist), those APIs will be available to anyone with auth to access Solr whether I use em or not! And support also suggested that SearchStax API use would be logged, whereas my direct Solr API use would not be, which seems to be true at least in the default setup; I can probably configure solr logging differently, but it just isn’t that important to me for these particular functions.

So after some initial exploration with SearchStax API, I realized that SolrCloud API (which I had never used before) could do everything I need and was more straightforward and transferable to use, and I’m happy with my decision to go with that.

Are you talking to Heroku redis in cleartext or SSL?

In a “typical” Redis installation, you might be talking to redis on localhost or on a private network, and clients typically talk to redis in cleartext. Redis doesn’t even natively support communications over SSL. (Or maybe it does now with redis6?)

However, the Heroku redis add-on (the one from Heroku itself) supports SSL connections via “Stunnel”, a tool popular with other redis users for getting SSL redis connections too. (Or maybe via native redis with redis6? Not sure if you’d know the difference, or if it matters.)

There are heroku docs on all of this which say:

While you can connect to Heroku Redis without the Stunnel buildpack, it is not recommend. The data traveling over the wire will be unencrypted.

Perhaps especially because on heroku your app does not talk to redis via localhost or on a private network, but on a public network.

But I think I’ve worked on heroku apps before that missed this advice and are still talking to heroku redis in the clear. I just happened to run across it when I got curious about the REDIS_TLS_URL env/config variable I noticed heroku setting.

Which brings us to another thing: that heroku doc is out of date; it doesn’t mention the REDIS_TLS_URL config variable, just the REDIS_URL one. The difference? The TLS version will be a url beginning with rediss:// instead of redis:// (note the extra s), which many redis clients use as a convention for “SSL connection to redis, probably via stunnel since redis itself doesn’t support it”. Those docs provide ruby and go examples which instead use REDIS_URL and write code to swap the redis:// for rediss:// and even hard-code port number adjustments, which is silly!

(While I continue to be very impressed with heroku as a product, I keep running into weird things like this outdated documentation, that does not match my experience/impression of heroku’s all-around technical excellence, and makes me worry if heroku is slipping…).

The docs also mention a weird driver: ruby arg for initializing the Redis client; I’m not sure what it’s for, and it doesn’t seem necessary.

The docs are correct that you have to tell the ruby Redis client not to try to verify the SSL cert against trusted root certs, since this implementation uses a self-signed cert. Otherwise you will get an error that looks like: OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain)

So, it can be as simple as:

redis_client = Redis.new(url: ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })

$redis = redis_client
# and/or
Resque.redis = redis_client

I don’t use sidekiq on this project currently, but to get the SSL connection with VERIFY_NONE, looking at the sidekiq docs, you might have to do something like(?):

redis_conn = proc {
  Redis.new(url: ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })
}

Sidekiq.configure_client do |config|
  config.redis = ConnectionPool.new(size: 5, &redis_conn)
end

Sidekiq.configure_server do |config|
  config.redis = ConnectionPool.new(size: 25, &redis_conn)
end

(Not sure what values you should pick for connection pool size).

While the sidekiq docs mention heroku in passing, they don’t mention need for SSL connections — I think awareness of this heroku feature and their recommendation you use it may not actually be common!

Update: Beware REDIS_URL can also be rediss

On one of my apps I saw a REDIS_URL which used redis: and a REDIS_TLS_URL which used (secure) rediss:.

But on another app, it provides *only* a REDIS_URL, which is rediss — meaning you have to set verify_mode: OpenSSL::SSL::VERIFY_NONE when passing it to the ruby redis client. So you have to be prepared to do this with REDIS_URL values too — I think it shouldn’t hurt to set the ssl_params option even if you pass it a non-ssl redis: url, so just set it all the time?

This second app was heroku-20 stack, and the first was heroku-18 stack, is that the difference? No idea.

Documented anywhere? I doubt it. Definitely seems sloppy for what I expect of heroku, making me get a bit suspicious of whether heroku is sticking to the really impressive level of technical excellence and documentation I expect from them.

So, your best bet is to check for both REDIS_TLS_URL and REDIS_URL, preferring the TLS one if present, realizing the REDIS_URL can have a rediss:// value in it too.
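Putting that together, a minimal sketch of what I mean (assuming, as discussed above, that passing ssl_params is harmless even for a plain redis: url):

require "redis"
require "openssl"

# prefer the explicitly-TLS url, fall back to REDIS_URL (which may itself be rediss://)
redis_url = ENV["REDIS_TLS_URL"] || ENV["REDIS_URL"]

redis_client = Redis.new(
  url: redis_url,
  ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE }
)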

The heroku docs also say you don’t get a secure TLS redis connection on “hobby” plans, but I’m not sure that’s actually true anymore on heroku-20? Not trusting the docs is not a good sign.

Comparing performance of a Rails app on different Heroku formations

I develop a “digital collections” or “asset management” app, which manages and makes digitized historical objects and their descriptions available to the public, from the collections here at the Science History Institute.

The app receives a relatively low level of traffic (according to Google Analytics, around 25K pageviews a month), although we want it to be able to handle spikes without falling down. It is not the most performance-optimized app; it does have some relatively slow responses and can be RAM-hungry. But it works adequately on our current infrastructure: web traffic is handled on a single AWS EC2 t2.medium instance, with 10 passenger processes (free version of passenger, so no multi-threading).

We are currently investigating the possibility of moving our infrastructure to heroku. After realizing that heroku standard dynos did not seem to have the performance characteristics I had expected, I decided to approach performance testing more methodically, to compare different heroku dyno formations to each other and to our current infrastructure. Our basic research question is probably: What heroku formation do we need to have similar performance to our existing infrastructure?

I am not an expert at doing this — I did some research, read some blog posts, did some thinking, and embarked on this. I am going to lead you through how I approached this and what I found. Feedback or suggestions are welcome. The most surprising result I found was much poorer performance from heroku standard dynos than I expected, and specifically that standard dynos would not match performance of present infrastructure.

What URLs to use in test

Some older load-testing tools only support testing one URL over and over. I decided I wanted to test a larger sample list of URLs — to be a more “realistic” load, and also because repeatedly requesting only one URL might accidentally use caches in ways you aren’t expecting, giving you unrepresentative results. (Our app does not currently use fragment caching, but caches you might not even be thinking about include postgres’s built-in automatic caches, or passenger’s automatic turbocache (which I don’t think we have turned on)).

My initial thought was to get a list of such URLs from production logs of our already-in-production app, to get a sample of what real traffic looks like. There were a couple of barriers to using production logs as URLs:

  1. Some of those URLs might require authentication, or be POST requests. The bulk of our app’s traffic is GET requests available without authentication, and I didn’t feel like the added complexity of setting up anything else in a load test was worthwhile.
  2. Our app on heroku isn’t fully functional yet. Without having connected it to a Solr or background job workers, only certain URLs are available.

In fact, a large portion of our traffic is an “item” or “work” detail page like this one. Additionally, those are the pages that can be the biggest performance challenge, since the current implementation includes a thumbnail for every scanned page or other image, so response time unfortunately scales with number of pages in an item.

So I decided a good list of URLs was simply a representative sample of those “work detail” pages. In fact, rather than a completely random sample, I took the 50 largest/slowest work pages, and then added in another 150 randomly chosen from our current ~8K pages. And gave them all a randomly shuffled order.

In our app, every time a browser requests a work detail page, the JS on that page makes an additional request for a JSON document that powers our page viewer. So for each of those 200 work detail pages, I added the JSON request URL, for a more “realistic” load, making 400 total URLs.
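A hypothetical sketch of assembling such a list (the model name, ordering column, and routes here are illustrative placeholders, not our app’s actual code):

# 50 largest/slowest works plus 150 random others, each with its JSON viewer url
largest = Work.order(member_count: :desc).limit(50).to_a
random  = Work.where.not(id: largest.map(&:id)).order(Arel.sql("RANDOM()")).limit(150).to_a

urls = (largest + random).flat_map do |work|
  ["/works/#{work.id}", "/works/#{work.id}/viewer_json"]
end.shuffle

File.write("sample_works.txt", urls.join("\n"))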

Performance: “base speed” vs “throughput under load”

Thinking about it, I realized there were two kinds of “performance” or “speed” to think about.

You might just have a really slow app; to exaggerate, let’s say typical responses are 5 seconds. That’s under low/no traffic: a single browser is the only thing interacting with the app; it makes a single request and has to wait 5 seconds for a response.

That number might be changed by optimizations or performance regressions in your code (including your dependencies). It might also be changed by moving or changing hardware or virtualization environment — including giving your database more CPU/RAM resources, etc.

But that number will not change by horizontally scaling your deployment — adding more puma or passenger processes or threads, scaling out hosts with a load balancer or heroku dynos. None of that will change this base speed, because it’s just how long the app takes to prepare a response when not under load, how slow it is in a test with only one web worker, where adding web workers won’t matter because they won’t be used.

Then there’s what happens to the app actually under load by multiple users at once. The base speed is kind of a lower bound on throughput under load — page response time is never going to get better than 5s for our hypothetical very slow app (without changing the underlying base speed). But it can get a lot worse if it’s hammered by traffic. This throughput under load can be affected not only by changing base speed, but also by various forms of horizontal scaling — how many puma or passenger processes you have with how many threads each, and how many CPUs they have access to, as well as the number of heroku dynos or other hosts behind a load balancer.

(I had been thinking about this distinction already, but Nate Berkopec’s great blog post on scaling Rails apps gave me the “speed” vs “throughput” terminology to use).

For our situation, we are not changing the code at all. But we are changing the host architecture from a manual EC2 t2.medium to heroku dynos (of various possible types) in a way that could affect base speed, and we’re also changing our scaling architecture in a way that could change throughput under load on top of that — from one t2.medium with 10 passenger processes to possibly multiple heroku dynos behind heroku’s load balancer, and also (for Reasons) switching from free passenger to trying puma with multiple threads per process. (We are running puma 5 with its new experimental performance features turned on.)

So we’ll want to get a sense of base speed of the various host choices, and also look at how throughput under load changes based on various choices.

Benchmarking tool: wrk

We’re going to use wrk.

There are LOTS of choices for HTTP benchmarking/load testing, with really varying complexity and from different eras of web history. I got a bit overwhelmed by it, but settled on wrk. Some other choices didn’t have all the features we need (some way to test a list of URLs, with at least some limited percentile distribution reporting). Others were much more flexible and complicated and I had trouble even figuring out how to use them!

wrk does need a custom lua script in order to handle a list of URLs. I found a nice script here, and modified it slightly to take filename from an ENV variable, and not randomly shuffle input list.

It’s a bit confusing understanding the meaning of “threads” vs “connections” in wrk arguments. This blog post from appfolio clears it up a bit. I decided to leave threads set to 1, and vary connections for load — so -c1 -t1 is a “one URL at a time” setting we can use to test “base speed”, and we can benchmark throughput under load by increasing connections.

We want to make sure we run the test for long enough to touch all 400 URLs in our list at least once, even in the slower setups, to have a good comparison — ideally it would go through the list more than once, but for my own ergonomics I had to get through a lot of tests, so it ended up less than ideal. (Should I have put fewer than 400 URLs in? Not sure).

Conclusions in advance

As benchmarking posts go (especially when I’m the one writing them), I’m about to drop a lot of words and data on you. So to maximize the audience that sees the conclusions (because they surprise me, and I want feedback/pushback on them), I’m going to give you some conclusions up front.

Our current infrastructure has the web app on a single EC2 t2.medium, which is a burstable EC2 type — our relatively low-traffic app does not exhaust its burst credits. Measuring base speed (just one concurrent request at a time), we found that performance dynos seem to have about the CPU speed of a bursting t2.medium (just a hair slower).

But standard dynos are as a rule 2 to 3 times slower; additionally they are highly variable, and that variability can be over hours/days. A 3 minute period can have measured response times 2 or more times slower than another 3 minute period a couple hours later. But they seem to typically be 2-3x slower than our current infrastructure.

Under load, they scale about how you’d expect if you knew how many CPUs are present, no real surprises. Our existing t2.medium has two CPUs, so can handle 2 simultaneous requests as fast as 1, and after that degrades linearly.

A single performance-L ($500/month) has 4 CPUs (8 hyperthreads), so scales under load much better than our current infrastructure.

A single performance-M ($250/month) has only 1 CPU (!), so scales pretty terribly under load.

Testing scaling with 4 standard-2x’s ($200/month total), we see that it scales relatively evenly. Although lumpily, because of variability, and it starts out performing so much worse that even as it scales “evenly” it’s still out-performed by all other architectures. :( (At these relatively fast median response times you might say it’s still fast enough, who cares; but in our fat tail of slower pages it gets more distressing.)

Now we’ll give you lots of measurements, or you can skip all that to my summary discussion or conclusions for our own project at the end.

Let’s compare base speed

OK, let’s get to actual measurements! For “base speed” measurements, we’ll be telling wrk to use only one connection and one thread.

Existing t2.medium: base speed

Our current infrastructure is one EC2 t2.medium. This EC2 instance type has two vCPUs and 4GB of RAM. On that single EC2 instance, we run passenger (free not enterprise) set to have 10 passenger processes, although the base speed test with only one connection should only touch one of the workers. The t2 is a “burstable” type, and we do always have burst credits (this is not a high traffic app; verified we never exhausted burst credits in these tests), so our test load may be taking advantage of burst cpu.

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://[current staging server]
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://staging-digital.sciencehistory.org
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   311.00ms  388.11ms   2.37s    86.45%
     Req/Sec    11.89      8.96    40.00     69.95%
   Latency Distribution
      50%   90.99ms
      75%  453.40ms
      90%  868.81ms
      99%    1.72s
   966 requests in 3.00m, 177.43MB read
 Requests/sec:      5.37
 Transfer/sec:      0.99MB

I’m actually feeling pretty good about those numbers on our current infrastructure! 90ms median, not bad, and even 453ms 75th percentile is not too bad. Now, our test load involves some JSON responses that are quicker to deliver than corresponding HTML page, but still pretty good. The 90th/99th/and max request (2.37s) aren’t great, but I knew I had some slow pages, this matches my previous understanding of how slow they are in our current infrastructure.

90th percentile is ~9 times the 50th percentile.

I don’t have an understanding of why the two different Req/Sec and Requests/sec values are so different, and don’t totally understand what to do with the Stdev and +/- Stdev values, so I’m just going to stick to looking at the latency percentiles; I think “latency” could also be called “response time” here.

But ok, this is our baseline for this workload. And doing this 3 minute test at various points over the past few days, I can say it’s nicely regular and consistent; occasionally I got a slower run, but the 50th percentile was usually 90ms–105ms, right around there.

Heroku standard-2x: base speed

From previous mucking about, I learned I can only reliably fit one puma worker in a standard-1x, and heroku says “we typically recommend a minimum of 2 processes, if possible” (for routing algorithmic reasons when scaled to multiple dynos), so I am just starting at a standard-2x with two puma workers each with 5 threads, matching heroku recommendations for a standard-2x dyno.
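For reference, the puma setup this implies is the usual heroku-guide style, driven by the WEB_CONCURRENCY and RAILS_MAX_THREADS config vars. A sketch of a typical config/puma.rb, not necessarily our exact file:

# config/puma.rb
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads threads_count, threads_count

preload_app!

port        ENV.fetch("PORT", 3000)
environment ENV.fetch("RACK_ENV", "development")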

So one thing I discovered is that benchmarks from a heroku standard dyno are really variable, but here is a typical run:

$ heroku dyno:resize
 type     size         qty  cost/mo
 ───────  ───────────  ───  ───────
 web      Standard-2X  1    50

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   645.08ms  768.94ms   4.41s    85.52%
     Req/Sec     5.78      4.36    20.00     72.73%
   Latency Distribution
      50%  271.39ms
      75%  948.00ms
      90%    1.74s
      99%    3.50s
   427 requests in 3.00m, 74.51MB read
 Requests/sec:      2.37
 Transfer/sec:    423.67KB

I had heard that heroku standard dynos would have variable performance, because they are shared multi-tenant resources. I had been thinking of this like during a 3 minute test I might see around the same median with more standard deviation — but instead, what it looks like to me is that running this benchmark on Monday at 9am might give very different results than at 9:50am or Tuesday at 2pm. The variability is over a way longer timeframe than my 3 minute test — so that’s something learned.

Running this here and there over the past week, the above results seem to me typical of what I saw. (To get better than “seem typical” on this resource, you’d have to run a test, over several days or a week I think, probably not hammering the server the whole time, to get a sense of actual statistical distribution of the variability).

I sometimes saw tests that were quite a bit slower than this, up to a 500ms median. I rarely if ever saw results much faster than this on a standard-2x. 90th percentile is ~6x median, a lower ratio than on my current infrastructure, but it still gets up there, to 1.74s instead of 864ms.

This typical run is quite a bit slower than our current infrastructure: its median response time is 3x ours, with the 90th percentile and max being around 2x. This was worse than I expected.

Heroku performance-m: base speed

Although we might be able to fit more puma workers in RAM, we’re running a single-connection base speed test, so it shouldn’t matter, and we won’t adjust it.

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-M  1    250

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   377.88ms  481.96ms   3.33s    86.57%
     Req/Sec    10.36      7.78    30.00     37.03%
   Latency Distribution
      50%  117.62ms
      75%  528.68ms
      90%    1.02s
      99%    2.19s
   793 requests in 3.00m, 145.70MB read
 Requests/sec:      4.40
 Transfer/sec:    828.70KB

This is a lot closer to the ballpark of our current infrastructure. It’s a bit slower (117ms median instead of 90ms median), but in running this now and then over the past week it was remarkably, thankfully, consistent. Median and 99th percentile are both 28% slower (it comforts me that those numbers are the same in these two runs!); that doesn’t bother me so much if it’s predictable and regular, which it appears to be. The max still appears to me a little bit less regular on heroku for some reason; since performance dynos are supposed to be non-shared AWS resources, you wouldn’t expect that, but slow requests are slow, ok.

90th percentile is ~9x median, about the same as my current infrastructure.

heroku performance-l: base speed

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-L  1    500

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   471.29ms  658.35ms   5.15s    87.98%
     Req/Sec    10.18      7.78    30.00     36.20%
   Latency Distribution
      50%  123.08ms
      75%  635.00ms
      90%    1.30s
      99%    2.86s
   704 requests in 3.00m, 130.43MB read
 Requests/sec:      3.91
 Transfer/sec:    741.94KB

No news is good news, it looks very much like performance-m, which is exactly what we expected, because this isn’t a load test. It tells us that performance-m and performance-l seem to have similar CPU speeds and similar predictable non-variable regularity, which is what I find running this test periodically over a week.

90th percentile is ~10x median, about the same as current infrastructure.

The higher max is just evidence of what I mentioned: the speed of the slowest request did seem to vary more than on our manual t2.medium; I can’t really explain why.

Summary: Base speed

Not sure how helpful this visualization is, charting 50th, 75th, and 90th percentile responses across architectures.

But basically: performance dynos perform similarly to my (bursting) t2.medium. Can’t explain why performance-l seems slightly slower than performance-m, might be just incidental variation when I ran the tests.

The standard-2x is about twice as slow as my (bursting) t2.medium. Again recall standard-2x results varied a lot every time I ran them, the one I reported seems “typical” to me, that’s not super scientific, admittedly, but I’m confident that standard-2x are a lot slower in median response times than my current infrastructure.

Throughput under load

Ok, now we’re going to test using wrk to use more connections. In fact, I’ll test each setup with various number of connections, and graph the result, to get a sense of how each formation can handle throughput under load. (This means a lot of minutes to get all these results, at 3 minutes per number of connection test, per formation!).

An additional thing we can learn from this test, on heroku we can look at how much RAM is being used after a load test, to get a sense of the app’s RAM usage under traffic to understand the maximum number of puma workers we might be able to fit in a given dyno.

Existing t2.medium: Under load

A t2.medium has 4G of RAM and 2 CPUs. We run 10 passenger workers (no multi-threading, since we are free, rather than enterprise, passenger). So what do we expect? With 2 CPUs and more than 2 workers, I’d expect it to handle 2 simultaneous streams of requests almost as well as 1; 3-10 should be quite a bit slower because they are competing for the 2 CPUs. Over 10, performance will probably become catastrophic.

2 connections are exactly flat with 1, as expected for our two CPUs, hooray!

Then it goes up in a strikingly even line. Going over 10 (to 12) simultaneous connections doesn’t matter, even though we’ve exhausted our workers; I guess at this point there’s so much competition for the two CPUs already.

The slope of this curve is really nice too actually. Without load, our median response time is 100ms, but even at a totally overloaded 12 overloaded connections, it’s only 550ms, which actually isn’t too bad.

We can make a graph that in addition to median also has 75th, 90th, and 99th percentile response time on it:

It doesn’t tell us too much; it tells us the upper percentiles rise at about the same rate as the median. At 1 simultaneous connection 90th percentile of 846ms is about 9 times the median of 93ms; at 10 requests the 90th percentile of 3.6 seconds is about 8 times the median of 471ms.

This does remind us that under load, when things get slow, this has more of a disastrous effect on already-slow requests than on fast requests. When not under load, even our 90th percentile was kind of sort of barely acceptable at 846ms, but under load at 3.6 seconds it really isn’t.

Single Standard-2X dyno: Under load

A standard-2X dyno has 1G of RAM. The (amazing, excellent, thanks schneems) heroku puma guide suggests running two puma workers with 5 threads each. At first I wanted to try running three workers, which seemed to fit into available RAM — but under heavy load-testing I was getting Heroku R14 Memory Quota Exceeded errors, so we’ll just stick with the heroku docs recommendations. Two workers with 5 threads each fit with plenty of headroom.

A standard-2x dyno runs on shared (multi-tenant) underlying Amazon virtual hardware. So while it is running on hardware with 4 CPUs (each of which can run two “hyperthreads”), the puma doc suggests “it is best to assume only one process can execute at a time” on standard dynos.

What do we expect? Well, if it really only had one CPU, it would immediately start getting bad at 2 simultaneous connections, and just get worse from there. When we exceed the two-worker count, will it get even worse? What about when we exceed the 10-thread (2 workers * 5 threads) count?

You’d never run just one dyno if you were expecting this much traffic; you’d always horizontally scale. This very artificial test is just to get a sense of its characteristics.

Also, remember that standard-2x’s are just really variable; I could get much worse or better runs than this, but I graphed numbers from a run that seemed typical.

Well, it really does act like 1 CPU, 2 simultaneous connections is immediately a lot worse than 1.

The line isn’t quite as straight as in our existing t2.medium, but it’s still pretty straight; I’d attribute the slight lumpiness to just the variability of shared-architecture standard dyno, and figure it would get perfectly straight with more data.

It degrades at about the same rate of our baseline t2.medium, but when you start out slower, that’s more disastrous. Our t2.medium at an overloaded 10 simultaneous requests is 473ms (pretty tolerable actually), 5 times the median at one request only. This standard-2x has a median response time of 273 ms at only one simultaneous request, and at an overloaded 10 requests has a median response time also about 5x worse, but that becomes a less tolerable 1480ms.

Does also graphing the 75th, 90th, and 99th percentile tell us much?

Eh, I think the lumpiness is still just standard shared-architecture variability.

The rate of “getting worse” as we add more overloaded connections is actually a bit better than it was on our t2.medium, but since it already starts out so much slower, we’ll just call it a wash. (On t2.medium, 90th percentile without load is 846ms and under an overloaded 10 connections 3.6s. On this single standard-2x, it’s 1.8s and 5.2s).

I’m not sure how much these charts with various percentiles on them tell us, so I won’t include them for every architecture from here on.

standard-2x, 4 dynos: Under load

OK, realistically we already know you shouldn’t have just one standard-2x dyno under that kind of load. You’d scale out, either manually or perhaps using something like the neat Rails Autoscale add-on.

Let’s measure with 4 dynos. Each is still running 2 puma workers, with 5 threads each.

What do we expect? Hm, treating each dyno as if it has only one CPU, we’d expect it to be able to handle traffic pretty levelly up to 4 simultaneous connections, distributed to 4 dynos. It’s going to do worse after that, but up to 8 there is still one puma worker per connection, so maybe it gets even worse only after 8?

Well… I think that actually is relatively flat from 1 to 4 simultaneous connections, except for lumpiness from variability. But lumpiness from variability is huge! We’re talking 250ms median measured at 1 connection, up to 369ms measured median at 2, down to 274ms at 3.

And then maybe, yeah, a fairly shallow slope up to 8 simultaneous connections, then steeper.

But it’s all a fairly shallow slope compared to our base t2.medium. At 8 connections (after which we pretty much max out), the standard-2x median of 464ms is only 1.8 times the median at 1 connection. Compared to the t2.medium’s increase of 3.7 times.

As we’d expect, scaling out to 4 dynos (with four cpus/8 hyperthreads) helps us scale well — the problem is the baseline is so slow to begin (with very high bounds of variability making it regularly even slower).

performance-m: Under load

A performance-m has 2.5 GB of memory. It only has one physical CPU, although two “vCPUs” (two hyperthreads). And these are all yours; it is not shared with other tenants.

By testing under load, I demonstrated I could actually fit 12 workers on there without any memory limit errors. But is there any point to doing that with only 1 CPU/2 hyperthreads? Under a bit of testing, it appeared not.

The heroku puma docs recommend only 2 processes with 5 threads. You could do a whole little mini-experiment just trying to measure/optimize process/thread count on performance-m! We’ve already got too much data here, but in some experimentation it looked to me like 5 processes with 2 threads each performed better (and certainly no worse) than 2 processes with 5 threads — if you’ve got the RAM just sitting there anyway (as we do), why not?

I actually tested with 6 puma processes with 2 threads each. There is still a large amount of RAM headroom we aren’t going to use even under load.

What do we expect? Well, with the 2 “hyperthreads” perhaps it can handle 2 simultaneous requests nearly as well as 1 (or not?); after that, we expect it to degrade quickly same as our original t2.medium did.

It can handle 2 connections slightly better than you’d expect if there really were only 1 CPU, so I guess a hyperthread does give you something. Then the slope picks up, as you’d expect; and it looks like it does get steeper after 4 simultaneous connections, yup.

performance-l: Under load

A performance-l ($500/month) costs twice as much as a performance-m ($250/month), but has far more than twice the resources. performance-l has a whopping 14GB of RAM compared to performance-m’s 2.5GB; and performance-l has 4 real CPUs/8 hyperthreads available to use (visible using the nproc technique in the heroku puma article).

Because we have plenty of RAM to do so, we’re going to run 10 worker processes to match our original t2.medium’s. We still ran with 2 threads, just cause it seems like maybe you should never run a puma worker with only one thread? But who knows, maybe 10 workers with 1 thread each would perform better; plenty of room (but not plenty of my energy) for yet more experimentation.

What do we expect? The graph should be pretty flat up to 4 simultaneous connections, then it should start getting worse, pretty evenly as simultaneous connections rise all the way up to 12.

It is indeed pretty flat up to 4 simultaneous connections. Then up to 8 it’s still not too bad — median at 8 is only ~1.5 median at 1(!). Then it gets worse after 8 (oh yeah, 8 hyperthreads?).

But the slope is wonderfully shallow all the way. Even at 12 simultaneous connections, the median response time of 266ms is only 2.5x what it was at one connection. (In our original t2.medium, at 12 simultaneous connections median response time was over 5x what it was at 1 connection).

This thing is indeed a monster.

Summary Comparison: Under load

We showed a lot of graphs that look similar, but they all had different scales on the y-axis. Let’s plot median response times under load of all architectures on the same graph, and see what we’re really dealing with.

The blue t2.medium is our baseline, what we have now. We can see that there isn’t really a similar heroku option, we have our choice of better or worse.

The performance-l is just plain better than what we have now. It starts out performing about the same as what we have now for 1 or 2 simultaneous connections, but then scales so much flatter.

The performance-m also starts out about the same, but scales so much worse than even what we have now. (It’s that 1 real CPU instead of 2, I guess?)

The standard-2x scaled to 4 dynos… has its own characteristics. Its baseline is pretty terrible; it’s 2 to 3 times as slow as what we have now even when not under load. But then it scales pretty well, since it’s 4 dynos after all, and it doesn’t get worse as fast as performance-m does. But it started out so bad that it remains far worse than our original t2.medium even under load. Adding more dynos to standard-2x will help it remain steady under even higher load, but won’t help its underlying problem: it’s just slower than everyone else.

Discussion: Thoughts and Surprises

  • I had been thinking of a t2.medium (even with burst) as “typical” (it is after all much slower than my 2015 Macbook), and had been assuming (in retrospect with no particular basis) that a heroku standard dyno would perform similarly.
    • Most discussion and heroku docs, as well as the naming itself, suggest that a ‘standard’ dyno is, well, standard, and performance dynos are for “super scale, high traffic apps”, which is not me.
    • But in fact, heroku standard dynos are much slower and more variable in performance than a bursting t2.medium. I suspect they are also slower than other things you might consider “typical” non-heroku options.



  • My conclusion is honestly that “standard” dynos are really “for very fast, well-optimized apps that can handle slow and variable CPU” and “performance” dynos are really “standard, matching the CPU speeds you’d get from a typical non-heroku option”. But this is not how they are documented or usually talked about. Are other people having really different experiences/conclusions than me? If so, why, or where have I gone wrong?
    • This of course has implications for estimating your heroku budget if considering switching over. :(
    • If you have a well-optimized fast app, say even 95th percentile is 200ms (on bursting t2.medium), then you can handle standard slowness — so what if your 95th percentile is now 600ms (and during some time periods even much slower, 1s or worse, due to variability). That’s not so bad for a 95th percentile.
    • One way to get a very fast app is of course caching. There is lots of discussion of using caching in Rails; sometimes the message (explicit or implicit) is “you have to use lots of caching to get reasonable performance cause Rails is so slow.” What if many of these people are on heroku, and it’s really that you have to use lots of caching to get reasonable performance on a heroku standard dyno??
    • I personally don’t think caching is maintenance free; in my experience, properly doing cache invalidation and dealing with the significant processing spikes when you choose to invalidate your entire cache (cause cached HTML needs to change) lead to real maintenance/development cost. I have not needed caching to meet my performance goals on the present architecture.
    • Everyone doesn’t necessarily have the same performance goals/requirements. Mine, for a low-traffic non-commercial site, are maybe more modest; I just need users not to be super annoyed. But whatever your performance goals, you’re going to have to spend more time on optimization on a heroku standard dyno than on something with a much faster CPU — like a standard affordable mid-tier EC2. Am I wrong?


  • One significant factor on heroku standard dyno performance is that they use shared/multi-tenant infrastructure. I wonder if they’ve actually gotten lower performance over time, as many customers (who you may be sharing with) have gotten better at maximizing their utilization, so the shared CPUs are typically more busy? Like a frog boiling, maybe nobody noticed that standard dynos have become lower performance? I dunno, brainstorming.
    • Or maybe there are so many apps that start on heroku instead of switching from somewhere else, that people just don’t realize that standard dynos are much slower than other low/mid-tier options?
    • I was expecting to pay a premium for heroku — but even standard-2x’s are a significant premium over paying for t2.medium EC2 yourself, one I found quite reasonable…. performance dynos are of course even more premium.


  • I had a sort of baked-in premise that most Rails apps are “IO-bound”, that they spend more time waiting on IO than using CPU. I don’t know where I got that idea; I heard it once a long time ago and it became part of my mental model. I now do not believe this is true of my app, and I do not in fact believe it is true of most Rails apps in 2020. I would hypothesize that most Rails apps today are in fact CPU-bound.

  • The performance-m dyno only has one CPU. I had somehow also been assuming that it would have two CPUs — I’m not sure why, maybe just because at that price! It would be a much better deal with two CPUs.
    • Instead we have a huge jump from $250 performance-m to $500 performance-l that has 4x the CPUs and ~5x the RAM.
    • So it doesn’t make financial sense to have more than one performance-m dyno, you might as well go to performance-l. But this really complicates auto-scaling, whether using Heroku’s built-in feature or the awesome Rails Autoscale add-on. I am not sure I can afford a performance-l all the time, and a performance-m might be sufficient most of the time. But if 20% of the time I’m going to need more (or even 5%, or even unexpectedly-mentioned-in-national-media), it would be nice to set things up to autoscale up…. I guess to a financially irrational 2 or more performance-m’s? :(

  • The performance-l is a very big machine, significantly beefier than my current infrastructure. And it has far more RAM than I need/can use with only 4 physical cores. If I consider standard dynos to be pretty effectively low tier (as I do), heroku to me is kind of missing mid-tier options. A 2 CPU option at 2.5G or 5G of RAM would make a lot of sense to me, and actually be exactly what I need… really I think performance-m would make more sense with 2 CPUs at its existing already-premium price point, to earn being called a “performance” dyno. Maybe heroku is intentionally trying to set options to funnel people to the highest-priced performance-l.

Conclusion: What are we going to do?

In my investigations of heroku, my opinion of the developer UX and general service quality only increases. It’s a great product, that would increase our operational capacity and reliability, and substitute for so many person-hours of sysadmin/operational time if we were self-managing (even on cloud architecture like EC2).

But I had originally been figuring we’d use standard dynos (even more affordably, possibly auto-scaled with Rails Autoscale plugin), and am disappointed that they end up looking so much lower performance than our current infrastructure.

Could we use them anyway? Response time going from 100ms to 300ms — hey, 300ms is still fine, even if I’m sad to lose those really nice numbers I got from a bit of optimization. But this app has a wide long tail; our 75th percentile going from 450ms to 1s, our 90th percentile going from 860ms to 1.74s, and our 99th going from 2.3s to 4.4s — that’s a lot harder to swallow. Especially when we know that due to standard dyno variability, a slow-ish page that on my present architecture is reliably 1.5s could really be anywhere from 3 to 9 seconds(!) on heroku.

I would anticipate having to spend a lot more developer time on optimization on heroku standard dynos — or, in this small over-burdened non-commercial shop, not prioritizing that (or not having the skills for it), and having our performance just get bad.

So I’m really reluctant to suggest moving our app to heroku with standard dynos.

A performance-l dyno is going to let us not have to think about performance any more than we do now, while scaling under high-traffic better than we do now — I suspect we’d never need to scale to more than one performance-l dyno. But it’s pricey for us.

A performance-m dyno has a base speed that’s fine, but scales very poorly and unaffordably. It doesn’t handle an increase in load very well as one dyno, and to get more CPUs you have to pay far too much (especially compared to the standard dynos I had been assuming I’d use).

So I don’t really like any of my options. If we do heroku, maybe we’ll try a performance-m, and “hope” our traffic is light enough that a single one will do? Maybe with Rails autoscale for traffic spikes, even though 2 performance-m dynos isn’t financially efficient? If we are scaling to 2 (or more!) performance-m’s more than very occasionally, switch to performance-l, which means we need to make sure we have the budget for it?

faster_s3_url: Optimized S3 url generation in ruby

Subsequent to my previous investigation about S3 URL generation performance, I ended up writing a gem with optimized implementations of S3 URL generation.

github: faster_s3_url

It has no dependencies (not even aws-sdk). It can speed up both public and presigned URL generation by around an order of magnitude. In benchmarks on my 2015 MacBook compared to aws-sdk-s3: public URLs from 180 in 10ms to 2200 in 10ms; presigned URLs from 10 in 10ms to 300 in 10ms (!!).

While if you are only generating a couple S3 URLs at a time you probably wouldn’t notice aws-sdk-ruby’s poor performance, if you are generating even just hundreds at a time, and especially for presigned URLs, it can really make a difference.

faster_s3_url supports the full API for aws-sdk-s3 presigned URLs, including custom params like response_content_disposition. Its tests actually test that results match what aws-sdk-s3 would generate.

For shrine users, faster_s3_url includes a Shrine storage sub-class that can be a drop-in replacement for Shrine::Storage::S3, to have all your S3 URL generation via shrine use the optimized implementation.

Key in giving me the confidence to think I could pull off an independent S3 presigned URL implementation was seeing WeTransfer’s wt_s3_signer gem be successful. wt_s3_signer makes some assumptions/restrictions to get even higher performance than faster_s3_url (two or three times as fast) — but the restrictions/assumptions and API needed to get that performance weren’t suitable for my use cases, so I implemented my own.

Delete all S3 key versions with ruby AWS SDK v3

If your S3 bucket is versioned, then deleting an object from s3 will leave a previous version there, as a sort of undo history. You may have a “noncurrent expiration lifecycle policy” set which will delete the old versions after so many days, but within that window, they are there.

What if you were deleting something that accidentally included some kind of sensitive or confidential information, and you really want it gone?

To make matters worse, if your bucket is public, the version is public too, and can be requested by an unauthenticated user that has the URL including a versionID, with a URL that looks something like: https://mybucket.s3.amazonaws.com/path/to/someting.pdf?versionId=ZyxTgv_pQAtUS8QGBIlTY4eKmANAYwHT To be fair, it would be pretty hard to “guess” this versionID! But if it’s really sensitive info, that might not be good enough.

It was a bit tricky for me to figure out how to do this with the latest version of ruby SDK (as I write, “v3“, googling sometimes gave old versions).

It turns out you need to first retrieve a list of all versions with bucket.object_versions. With no args, that will return ALL the versions in the bucket, which could be a lot to retrieve, and not what you want when focused on just deleting certain things.
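For context, a minimal sketch of getting that bucket handle with aws-sdk-s3 (the bucket name and region here are placeholders):

require "aws-sdk-s3"

# bucket name and region are placeholders for your own
s3_bucket = Aws::S3::Resource.new(region: "us-east-1").bucket("mybucket")

# With no args, this enumerates EVERY object version in the bucket -- potentially a lot
s3_bucket.object_versions.first(10)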

If you wanted to delete all versions in a certain “directory”, that’s actually easiest of all:

s3_bucket.object_versions(prefix: "path/to/").batch_delete!

But what if you want to delete all versions from a specific key? As far as I can tell, this is trickier than it should be.

# danger! This may delete more than you wanted
s3_bucket.object_versions(prefix: "path/to/something.doc").batch_delete!

Because of how S3 “paths” (which are really just prefixes) work, that will ALSO delete all versions for path/to/something.doc2 or path/to/something.docdocdoc etc, for anything else with that as a prefix. There probably aren’t keys like that in your bucket, but that seems dangerously sloppy to assume, that’s how we get super weird bugs later.

I guess there’d be no better way than this?

key = "path/to/something.doc"
s3_bucket.object_versions(prefix: key).each do |object_version|
  object_version.delete if object_version.object_key == key
end

Is there anyone reading this who knows more about this than me, and can say if there’s a better way, or confirm if there isn’t?

Github Actions tutorial for ruby CI on Drifting Ruby

I’ve been using travis for free automated testing (“continuous integration”, CI) on my open source projects for a long time. It works pretty well. But it’s got some little annoyances here and there, including with github integration, that I don’t really expect to get fixed after its acquisition by private equity. They also seem to have cut off actual support channels (other than ‘forums’) for free use; I used to get really good (if not rapid) support when having troubles, now I kinda feel on my own.

So after hearing about the pretty flexible and powerful newish Github Actions feature, I was interested in considering that as an alternative. It looks like it should be free for public/open source projects on github. And will presumably have good integration with the rest of github and few kinks. Yeah, this is an example of how a platform getting an advantage starting out by having good third-party integration can gradually come to absorb all of that functionality itself; but I just need something that works (and, well, is free for open source), I don’t want to spend a lot of time on CI, I just want it to work and get out of the way. (And Github clearly introduced this feature to try to avoid being overtaken by Gitlab, which had integrated flexible CI/CD).

So anyway. I was interested in it, but having a lot of trouble figuring out how to set it up. Github Actions is a very flexible tool, a whole platform really, which you can use to set up almost any kind of automated task you want, in many different languages. Which made it hard for me to figure out “Okay, I just want tests to run on all PR commits, and report back to the PR if it’s mergeable”.

And it was really hard to figure this out from the docs, it’s such a flexible abstract tool. And I have found it hard to find third party write-ups and tutorials and blogs and such — in part because Github Actions was in beta development for so long, that some of the write-ups I did find were out of date.

Fortunately Drifting Ruby has provided a great tutorial, which gets you started with basic ruby CI testing. It looks pretty straightforward to, for instance, figure out how to swap in rspec for rake test. And I always find it easier to google for solutions to additional fancy things I want to do, finding results either in official docs or third-party blogs, when I have the basic skeleton in place.

I hope to find time to experiment with Github Actions in the future. I am writing this blog post in part to log for myself the Drifting Ruby episode so I don’t lose it! The show summary has this super useful template:

.github/workflows/main.yml

name: CI
on:
  push:
    branches: [ master, develop ]
  pull_request:
    branches: [ master, develop ]
jobs:
  test:
    # services:
    #   db:
    #     image: postgres:11
    #     ports: ['5432:5432']
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Ruby
        uses: ruby/setup-ruby@v1.45.0
        with:
          ruby-version: 2.7.1
      - uses: Borales/actions-yarn@v2.3.0
        with:
          cmd: install
      - name: Install Dependencies
        run: |
          # sudo apt install -yqq libpq-dev
          gem install bundler
      - name: Install Gems
        run: |
          bundle install
      - name: Prepare Database
        run: |
          bundle exec rails db:prepare
      - name: Run Tests
        # env:
        #   DATABASE_URL: postgres://postgres:@localhost:5432/databasename
        #   RAILS_MASTER_KEY: ${{secrets.RAILS_MASTER_KEY}}
        run: |
          bundle exec rails test
      - name: Create Coverage Artifact
        uses: actions/upload-artifact@v2
        with:
          name: code-coverage
          path: coverage/
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Ruby
        uses: ruby/setup-ruby@v1.45.0
        with:
          ruby-version: 2.7.1
      - name: Install Brakeman
        run: |
          gem install brakeman
      - name: Run Brakeman
        run: |
          brakeman -f json > tmp/brakeman.json || exit 0
      - name: Brakeman Report
        uses: devmasx/brakeman-linter-action@v1.0.0
        env:
          REPORT_PATH: tmp/brakeman.json
          GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}

Delivery patterns for non-public resources hosted on S3

I work at the Science History Institute on our Digital Collections app (written in Rails), which is kind of a “digital asset management” app combined with a public catalog of our collection.

We store many high-resolution TIFF images that can be 100MB+ each, as well as, currently, a handful of PDFs and audio files. We have around 31,000 digital assets, which make up about 1.8TB. In addition to originals, we have “derivatives” for each file (JPG conversions of a TIFF original at various sizes; MP3 conversions of FLAC originals; etc) — around 295,000 derivatives (~10 per original) taking up around 205GB. Not a huge amount of data compared to some, but big enough to be something to deal with, and we expect it could grow by an order of magnitude in the next couple years.

We store them all — originals and derivatives — in S3, which generally works great.

We currently store them all in public S3 buckets, and when we need an image thumb url for an <img src>, we embed a public S3 URL (as opposed to pre-signed URLs) right in our HTML source. Having the user-agent get the resources directly from S3 is awesome, because our app doesn’t have to worry about handling that portion of the “traffic”, something S3 is quite good at (and there are CDN options which work seamlessly with S3 etc; although our traffic is currently fairly low and we aren’t using a CDN).

But this approach stops working if some of your assets can not be public, and need to be access-controlled with some kind of authorization. And we are about to start hosting a class of assets that are such.

Another notable part of our app is that in its current design it can have a LOT of img src thumbs on a page. Maybe 600 small thumbs (one for each scanned page of a book), each of which might use an img srcset to deliver multiple resolutions. We use Javascript lazy-load code so the browser doesn’t actually try to load all these img src unless they are put in the viewport, but it’s still a lot of URLs generated on the page, and potentially a lot of image loads. While this might be excessive and a design in need of improvement, a 10×10 grid of postage-stamp-sized thumbs on a page (each of which could use a srcset) does not seem unreasonable, right? There can be a lot of URLs on a page in an “asset management” type app, it’s how it is.

As I looked around for advice on this or analysis of the various options, I didn’t find much. So, in my usual verbose style, I’m going to outline my research and analysis of the various options here. None of the options are as magically painless as using public bucket public URL on S3, alas.

All public-read ACLs, Public URLs

What we’re doing now. The S3 bucket is set as public, all files have S3 public-read ACL set, and we use S3 “public” URLs as <img src> in our app. Which might look like https://my-bucket.s3.us-west-2.amazonaws.com/path/to/thumb.jpg .

For actual downloads, we might still use an S3 presigned url , not for access control (the object is already public), but to specify a content-disposition response header for S3 to use on the fly.
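A quick hedged sketch of that with aws-sdk-s3 (bucket name, key, and filename here are placeholders):

# Presign not to restrict access, but to ask S3 to send a content-disposition
# header on the fly. Bucket name, key, and filename are placeholders.
object = Aws::S3::Object.new(bucket_name: "my-public-bucket", key: "path/to/original.tiff")

download_url = object.presigned_url(
  :get,
  expires_in: 15 * 60, # seconds
  response_content_disposition: 'attachment; filename="original.tiff"'
)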

Pro

  • URLs are persistent and stable and can be bookmarked, or indexed by search engines. (We really want our images in Google Images for visibility) And since the URLs are permanent and good indefinitely, they aren’t a problem for HTML including these urls in source to be cached indefinitely. (as long as you don’t move your stuff around in your S3 buckets).
  • S3 public URLs are much cheaper to generate than the cryptographically presigned URLs, so it’s less of a problem generating 1200+ of them in a page response. (And can be optimized an order of magnitude further beyond the ruby SDK implementation).
  • S3 can scale to handle a lot of traffic, and Cloudfront or another CDN can easily be used to scale further. Putting a CDN on top of a public bucket is trivial. Our Rails app is entirely uninvolved in delivering the actual images, so we don’t need to use precious Rails workers on delivering images.

Con

  • Some of our materials are still being worked on by staff, and haven’t actually been published yet. But they are still in S3 with a public-read ACL. They have hard-to-guess URLs that shouldn’t be referred to on any publicly viewable web page — but we know that shouldn’t be relied upon for anything truly confidential.
    • That has been an acceptable design so far, as none of these materials are truly confidential, even if not yet published to our site. But this is about to stop being acceptable as we include more truly confidential materials.

All protected ACL, REDIRECT to presigned URL

This is the approach Rails’ ActiveStorage takes in its standard setup/easy path. It assumes all resources will be stored to S3 without public ACL; a random user can’t access them via S3 without a time-limited presigned URL being supplied by the app.

ActiveStorage’s standard implementation will give you a URL to your Rails app itself when you ask for a URL for an S3-stored asset — a rails URL is what might be in your <img src> urls. That Rails URL will redirect to a unique temporary S3 presigned URL that allows access to the non-public resource.
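To make the pattern concrete, here’s a rough hand-rolled sketch of the same idea (this is not ActiveStorage’s actual code; the controller, model, s3_key attribute, and authorization check are hypothetical names):

# Not ActiveStorage's implementation; a hand-rolled sketch of the same pattern.
# Asset, its s3_key attribute, and authorized_to_view? are hypothetical.
class AssetDeliveryController < ApplicationController
  def show
    asset = Asset.find(params[:id])

    unless authorized_to_view?(asset)
      head :forbidden
      return
    end

    s3_object = Aws::S3::Bucket.new("my-protected-bucket").object(asset.s3_key)

    # Redirect to a short-lived presigned URL; S3 serves the actual bytes.
    redirect_to s3_object.presigned_url(:get, expires_in: 5 * 60)
  end
end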

Pro

  • This pattern allows your app to decide, based on current request/logged-in-user and asset, whether to grant access, on a case-by-case basis. (Although it’s not clear to me where the hooks are in ActiveStorage for this; I don’t actually use ActiveStorage, and it’s easy enough to implement this pattern generally, with your own authorization logic.)
  • S3 is still delivering assets directly to your users, so scaling issues are still between S3 and the requestor, and your app doesn’t have to get involved.
  • The URLs that show up in your delivered HTML pages, say as <img src> or <a href> URLs, point at your app, and are still persistent and indefinitely valid — so the HTML is still indefinitely cacheable by any HTTP cache. They will redirect to a unique-per-user and temporary presigned URL, but that’s not what’s in the HTML source.
    • You can even move your images around (to different buckets/locations or entirely different services) without invalidating the cache of the HTML. The URLs in your cached HTML don’t change; where they redirect to does. (This may be ActiveStorage’s motivation for this design?)

Cons

  • Might this interfere with Google Images indexing? While it’s hard (for me) to predict what might affect Google Images indexing, my own current site’s experience seems to say it’s actually fine. Google is willing to index an image “at” a URL that actually HTTP 302 redirects to a presigned S3 URL. Even though on every access the redirect will be to a different URL, Google doesn’t seem to think this is fishy. Seems to be fine.
  • Makes figuring out how to put a CDN in the mix more of a chore, you can’t just put it in front of your S3, as you only want to CDN/cache public URLs, but may need to use more sophisticated CDN features or setup or choices.
  • The asset responses themselves, at presigned URLs, are not cacheable by an HTTP cache, either browser caching or intermediate. (Or at least not for more than a week, the maximum expiry of presigned urls).
  • This is the big one. Let’s say you have 40 <img src> thumbnails on a page, and use this method. Every browser page load will result in an additional 40 requests to your app. This potentially requires you to scale your app much larger to handle the same amount of actual page requests, because your actual page requests are now (eg) 40x.
    • This has been reported as an actual problem by Rails ActiveStorage users. An app can suddenly handle far less traffic because it’s spending so much time doing all these redirects.
    • Therefore, ActiveStorage users/developers then tried to figure out how to get ActiveStorage to instead use the “All public-read ACLs, Public URLs delivered directly” model we listed above. It is now possible to do that with ActiveStorage (some answers in that StackOverflow), which is great, because it’s a great model when all your assets can be publicly available… but that was already easy enough to do without AS; we’re here cause that’s not my situation and I need something else!
    • On another platform that isn’t Rails, the performance concerns might be less, but Rails can be, well, slow here. In my app, a response that does nothing but redirect to https://example.com can still take 100ms to return! I think an out-of-the-box Rails app would be a bit faster, I’m not sure what is making mine so slow. But even at 50ms, an extra (eg) 40x50ms == 2000ms of worker time for every page delivery is a price to pay.
    • In my app, where many pages may actually have not 40 but 600+ thumbs on them… this can be really bad. Even if JS lazy-loading is used, it just seems like asking for trouble.

All protected ACL, PROXY to presigned URL

Okay, just like above, but the app action, instead of redirecting to S3… actually reads the bytes from S3 on the back-end, and delivers them to the user-agent directly, as a sort of proxy.

The pros/cons are pretty similar to redirect solution, but mostly with a lot of extra cons….

Extra Pro

  • I guess it’s an extra pro that the fact it’s on S3 is completely invisible to the user-agent, so it can’t possibly mess up Google Images indexing or anything like that.

Extra Cons

  • If you were worried about the scaling implications of tying up extra app workers with the redirect solution, this is so much worse, as app workers are now tied up for as long as it takes to proxy all those bytes from S3 (hopefully the nginx or passenger you have in front of your web app means you aren’t worried about slow clients, but that byte shuffling from S3 will still add up).
  • For very large assets, such as I have, this is likely incompatible with a heroku deploy, because of heroku’s 30s request timeout.

One reason I mention this option is that I believe it is basically what a hyrax app (some shared code used in our business domain) does. Hyrax isn’t necessarily using S3, but I believe it does have the Rails app involved in proxying and delivering bytes for all files (including derivatives), including for <img src>. So that approach is working for them well enough, so maybe it shouldn’t be totally dismissed. But it doesn’t seem right to me — I really liked the much better scaling curve of our app when we moved it away from sufia (a hyrax predecessor), and got it to stop proxying bytes like this. Plus I think this is probably a barrier to deploying hyrax apps to heroku, and we are interested in investigating heroku with our app.

All protected ACL, have nginx proxy to presigned URL?

OK, like the above “proxy” solution, but with a twist. A Rails app is not the right technology for proxying lots of bytes.

But nginx is; that’s honestly its core use case, it’s literally built for proxying, right? It should be able to handle lots of them concurrently with reasonable CPU/memory resources. If we can get nginx doing the proxying, we don’t need to worry about tying up Rails workers doing it.

I got really excited about this for a second… but it’s kind of a confusing mess. What URLs are we actually delivering in <img src> in HTML source? If they are Rails app URLs, which then trigger an nginx proxy using something like nginx x-accel but to a remote (presigned S3) URL instead of a local file, we have all the same downsides as the REDIRECT option above, without any real additional benefit (unless you REALLY want to hide that it’s from S3).

If instead we want to embed URLs in the HTML source that will end up being handled directly by nginx without touching the Rails app… it’s just really confusing to figure out how to set nginx up to proxy non-public content from S3. nginx has to be creating signed requests… but we also want to access-control it somehow; it should only be creating these when the app has given it permission on a per-request basis… there are a variety of nginx third-party modules that look like they could maybe be useful to put this together, some more maintained/documented than others… and it just gets really confusing.

PLUS if you want to deploy to heroku (which we are considering), this nginx still couldn’t be running on heroku, cause of that 30s limit; it would have to be running on your own non-heroku host somewhere.

I think if I were a larger commercial company with a product involving lots and lots of user-submitted images that I needed to access control and wanted to store on S3…. I might do some more investigation down this path. But for my use case… I think this is just too complicated for us to maintain, if it can be made to work at all.

All Protected ACL, put presigned URLs in HTML source

Protect all your S3 assets with non-public ACLs, so they can only be accessed after your app decides the requester has privileges to see them, via a presigned URL. But instead of using a redirect or proxy, just generate presigned URLs and use them directly in <img src> for thumbs or <a href> for downloads etc.

Pro

  • We can control access at the app level
  • No extra requests for redirects or proxies, we aren’t requiring our app to have a lot more resources to handle an additional request per image thumb loaded.
  • Simple.

Con

  • HTML source now includes limited-time-expiring URLs in <img src> etc, so can’t be cached indefinitely, even for public pages. (Although can be cached for up to a week, the maximum expiry of S3 presigned URLs, which might be good enough).
  • Presigned S3 URLs are really expensive to generate. It’s actually infeasible to include hundreds of them on a page, can take almost 1ms per URL generated. This can be optimized somewhat with custom code, but still really expensive. This is the main blocker here I think, for what otherwise might be “simplest thing that will work”.

Different S3 ACLs for different resources

OK, so the “public bucket” approach I am using now will work fine for most of my assets. It is a minority that actually need to be access controlled.

While “access them all with presigned URLs so the app is the one deciding if a given request gets access” has a certain software-engineering-consistency appeal — the performance and convenience advantages of the public_read S3 ACL are maybe too great to give up when 90%+ of my assets work fine with it.

Really, this whole long post is probably to convince myself that this needs to be done, because it seems like such a complicated mess… but it is, I think the lesser evil.

What makes this hard is that the management interface needs to let a manager CHANGE the public-readability status of an asset. And each of my assets might have 12 derivatives, so that’s 13 files to change, which can’t be done instantaneously if you wait for S3 to confirm, which probably means a background job. And you open yourself up to making a mistake and having a resource in the wrong state.

It might make sense to have an architecture that minimizes the number of times state has to be changed. All of our assets start out in a non-published draft state, then are later published; but for most of our resources destined for publication, it’s okay if they have public_read ACL in ‘draft’ state. Maybe there’s another flag for whether to really protect/restrict access securely, that can be set on ingest/creation only for the minority of assets that need it? So it only needs to be changed if a mistake were made, or a decision changed?

Changing “access state” on S3 could be done by one of two methods. You could have everything in the same bucket, and actually change the S3 ACL. Or you could have two separate buckets, one for public files and one for securely protected files. Then, changing the ‘state’ requires a move (copy then delete) of the file from one bucket to another. While the copy approach seems more painful, it has a lot of advantages: you can easily see if an object has the ‘right’ permissions by just seeing what bucket it is in (while using S3’s “block public access” features on the non-public bucket), making it easier to audit manually or automatically; and you can slap a CDN on top of the “public” bucket just as simply as ever, rather than having mixed public/nonpublic content in the same bucket.
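Either way, the S3 API calls themselves are simple; a sketch of both approaches with aws-sdk-s3 (the bucket names and the key variable are placeholders):

# Option 1: one bucket, flip the object ACL in place
s3_client = Aws::S3::Client.new(region: "us-east-1")
s3_client.put_object_acl(bucket: "my-bucket", key: key, acl: "private") # or "public-read"

# Option 2: two buckets, "move" (copy then delete) between public and protected
source = Aws::S3::Object.new(bucket_name: "my-public-bucket", key: key)
source.move_to(bucket: "my-protected-bucket", key: key)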

Pro

  • The majority of our files that don’t need to be secured can still benefit from the convenience and performance advantages of public_read ACL.
  • Including: we can still use a straightforward CDN on top of the public bucket, and HTTP cache-forever these files too.
  • Including no major additional load put on our app for serving the majority of assets that are public

Con

  • Additional complexity for app. It has to manage putting files in two different buckets with different ACLs, and generating URLs to the two classes differently.
  • Opportunity for bugs where an asset is in the ‘wrong’ bucket/ACL. Probably need a regular automated audit of some kind — making sure you didn’t leave behind a file in ‘public’ bucket that isn’t actually pointed to by the app is a pain to audit.
  • It is expensive to switch the access state of an asset. A book with 600 pages each with 12 derivatives, is over 7K files that need to have their ACLs changed and/or copied to another bucket if the visibility status changes.
  • If we try to minimize need to change ACL state, by leaving files destined to be public with public_read even before publication and having separate state for “really secure on S3” — this is a more confusing mental model for staff asset managers, with more opportunity for human error. Should think carefully of how this is exposed in staff UI.
  • For protected things on S3, you still need to use one of the above methods of giving users access, if any users are to be given access after an auth check.

I don’t love this solution, but this post is a bunch of words to basically convince myself that it is the lesser evil nonetheless.

Speeding up S3 URL generation in ruby

It looks like the AWS SDK is very slow at generating S3 URLs, both public and presigned, and that you can generate around an order of magnitude faster in both cases. This can matter if you are generating hundreds of S3 URLs at once.

My app

The app I work on is a “digital collections” or “digital asset management” app. It is about displaying lists of files, so it displays a LOT of thumbnails. The thumbnails are all stored in S3, and at present we generate URLs directly to S3 in src‘s on the page.

Some of our pages can have 600 thumbnails. (Say, a digitized medieval manuscript with 600 pages). Also, we use srcset to offer the browser two resolutions for each image, so that’s 1200 URLs.

Is this excessive, should we not put 600 URLs on a page? Maybe, although it’s what our app does at present. But 100 thumbnails on a page does not seem excessive; imagine a 10×10 grid of postage-stamp-sized thumbs, why not? And they each could have multiple URLs in a srcset.

It turns out that S3 URL generation can be slow enough to be a bottleneck with 1200 generations in a page, or in some cases even 100. But it can be optimized.

On Benchmarking

It’s hard to do benchmarking in a reliable way. I just used Benchmark.bmbm here; it is notable that on different runs of my comparisons, I could see results differ by 10-20%. But this should be sufficient for relative comparisons and basic orders of magnitude. Exact numbers will of course differ on different hardware/platform anyway. (benchmark-ips might possibly be a way to get somewhat more reliable results, but I didn’t remember it until I was well into this. There may be other options?).

I ran benchmarks on my 2015 Macbook 2.9 GHz Dual-Core Intel Core i5.

I’m used to my MacBook being faster than our deployed app on an EC2 instance, but in this case running benchmarks on EC2 had very similar results. (Of course, EC2 instance CPU performance can be quite variable).

Public S3 URLs

A public S3 URL might look like https://bucket_name.s3.amazonaws.com/path/to/my/object.rb . Or it might have a custom domain name, possibly to a CDN. Pretty simple, right?

Using shrine, you might generate it like model.image_url(public: true). Which calls Aws::S3::Object#public_url. Other dependencies or your own code might call the AWS SDK method as well.

I had noticed in earlier profiling that generating S3 URLs seemed to be taking much longer than I expected, looking like a bottleneck for my app. We use shrine, but shrine doesn’t add much overhead here, it’s pretty much just calling out to the AWS SDK public_url or presigned_url methods.

It seems like generating these URLs should be very simple, right? Here’s a “naive” implementation based on a shrine UploadedFile argument. Obviously it would be easy to use a custom or CDN hostname in this implementation alternately.

def naive_public_url(shrine_file)
  "https://#{["#{shrine_file.storage.bucket.name}.s3.amazonaws.com", *shrine_file.storage.prefix, shrine_file.id].join('/')}"
end

naive_public_url(model.image)
#=> "https://somebucket.s3.amazonaws.com/path/to/image.jpg"

Benchmark generating 1200 URLs with naive implementation vs a straight call of S3 AWS SDK public_url…

original AWS SDK public_url implementation 0.053043 0.000275 0.053318 ( 0.053782)
naive implementation 0.004730 0.000016 0.004746 ( 0.004760)

53ms vs 5ms, it’s an order of magnitude slower indeed.

53ms is not peanuts when you are trying to keep a web response under 200ms, although it may not be terrible. But let’s see if we can figure out why it’s so slow anyway.

Examining with ruby-prof points to what we could see in the basic implementation in AWS SDK source code, no need to dig down the stack. The most expensive elements are the URI.parse and the URI-safe escaping. Are we missing anything from our naive implementation then?

Well, the URI.parse is just done to make sure we are operating only on the path portion of the URL. But I can’t figure out any way bucket.url would return anything but a hostname-only URL with an empty path anyway, all the examples in docs are such. Maybe it could somehow include a path, but I can’t figure out any way the URL being parsed would have a ? query component or # fragment, and without that it’s safe to just append things without a parse. (Even without that assumption, there will be faster ways than a parse, which is quite slow!) Also just calling bucket.url is a bit expensive, and can deal with some live arn: lookups we won’t be using.

URI Escaping, the pit of confusing alternatives

What about escaping? Escaping can be such a confusing topic with S3, with different libraries at different times handling it differently/wrong, that it would be sane to just never use any characters in an S3 key that need any escaping, and maybe put some validation on your setters to ensure this. Then you don’t need to take the performance hit of escaping.

But okay, maybe we really need/want escaping to ensure any valid S3 key is turned into a valid S3 URL. Can we do escaping more efficiently?

The original implementation splits the path on / and then runs each component through the SDK’s own Seahorse::Util.uri_escape(s). That method’s implementation uses CGI.escape, but then does two gsub‘s to alter the value somewhat, not being happy with CGI.escape. Those extra gsubs are more performance hit. I think we can use ERB::Util.url_encode instead of CGI.escape + gsubs to get the same behavior, which might get us some speed-up.
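A quick check of that equivalence (the key segment here is just an example, and the SDK’s escaping is simplified to the gsub that matters for it):

require "cgi"
require "erb"

segment = "file with space.jpg"

# Roughly what the SDK's escaping does: CGI.escape, then fix up the "+"
CGI.escape(segment).gsub("+", "%20") # => "file%20with%20space.jpg"

# ERB::Util.url_encode gets the same result in one pass
ERB::Util.url_encode(segment)        # => "file%20with%20space.jpg"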

But we also seem to be escaping more than is necessary. For instance it will escape any ! in a key to %21, and it turns out this isn’t at all necessary, the URL resolve quite fine without escaping this. If we escape only what is needed, can we go even faster?

I think what we actually need is what URI.escape does — and since URI.escape doesn’t escape /, we don’t need to split on / first, saving us even more time. Annoyingly, URI.escape is marked obsolete/deprecated! But its stdlib implementation is relatively simple pure ruby; it would be easy enough to copy it into our codebase.
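For example (I believe URI::DEFAULT_PARSER.escape is the same implementation URI.escape delegates to, minus the deprecation warning; the key is a placeholder):

require "uri"

key = "path/to/file with space.jpg"

# Escapes the space but leaves `/` alone, so no need to split and re-join the path
URI::DEFAULT_PARSER.escape(key) # => "path/to/file%20with%20space.jpg"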

Even faster? The somewhat maintenance-neglected but still working at present escape_utils gem has a C implementation of some escaping routines. It’s hard when many implementations aren’t clear on exactly what they are escaping, but I think the escape_uri (note i on the end not l) is doing the same thing as URI.escape. Alas, there seems to be no escape_utils implementation that corresponds to CGI.escape or ERB::Util.url_encode.

So now we have a bunch of possibilities, depending on if we are willing to change escaping semantics and/or use our naive implementation of hostname-supplying.

Times are relative to the original SDK implementation:

  • Original AWS SDK public_url: 100%
  • Optimized AWS SDK public_url (avoid the URI.parse, use ERB::Util.url_encode; should be functionally identical, same output, I think!): 60%
  • Naive implementation (no escaping of S3 key for URL at all): 7.5%
  • Naive + ERB::Util.url_encode (should be functionally identical escaping to original implementation, i.e. over-escaping): 28%
  • Naive + URI.escape (we think is sufficient escaping, can be done much faster): 15%
  • Naive + EscapeUtils.escape_uri (we think is identical to URI.escape, but a faster C implementation): 11%

We have a bunch of opportunities for much faster implementations, even with existing over-escaping implementation. Here’s the file I used to benchmark.

Presigned S3 URLs

A presigned URL is used to give access to non-public content, and/or to specify response headers you’d like S3 to include with the response, such as Content-Disposition. Presigned S3 URLs all have an expiration (max one week), and involve a cryptographic signature.

I expect most people are using the AWS SDK for these, rather than reinvent an implementation of the cryptographic signing protocol.

And we’d certainly expect these to be slower than public URLs, because of the crypto signature involved. But can they be optimized? It looks like yes, again by about an order of magnitude.

Benchmarking with AWS SDK presigned_url, 1200 URL generations can take around 760-900ms. Wow, that’s a lot — this is definitely enough to matter, especially in a web app response you’d like to keep under 200ms, and this is likely to be a bottleneck.

We do expect the signing to take longer than a public url, but can we do better?

Look at what the SDK is doing, re-implement a quicker path

The presigned_url method just instantiates and calls out to an Aws::S3::Presigner. First idea, what if we create a single Aws::S3::Presigner, and re-use it 1200 times, instead of instantiating it 1200 times, passing it the same args #presigned_url would? Tried that, it was only minor performance improvement.
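That experiment looked something like this (a sketch; s3_client, the bucket name, and keys are placeholders):

# Re-use one Presigner for all 1200 generations instead of letting the SDK
# instantiate one per call. s3_client, bucket name, and keys are placeholders.
presigner = Aws::S3::Presigner.new(client: s3_client)

urls = keys.map do |key|
  presigner.presigned_url(
    :get_object,
    bucket: "my-bucket",
    key: key,
    expires_in: 900 # seconds
  )
end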

OK, let’s look at the Aws::S3::Presigner implementation. It’s got kind of a convoluted way of getting a URL: building a Seahorse::Client::Request, and then doing something weird with it… maybe modifying it to not actually go to the network, but just act as if it had… returning headers and a signed URL, and then we throw out the headers and just use the signed URL… phew! Ultimately though it does the actual signing work with another object, an Aws::Sigv4::Signer.

What if we just instantiate one of these ourselves, with the same arguments the Presigner would use for our use cases, and then call presign_url on it with the same args the Presigner would. Let’s re-use a Signer object 1200 times instead of instantiating it each time, in case that matters.

We still need to create the public_url in order to sign it. Let’s use our replacement naive implementation with URI.escape escaping.

# AWS_CLIENT: an already-configured Aws::S3::Client
AWS_SIG4_SIGNER = Aws::Sigv4::Signer.new(
  service: 's3',
  region: AWS_CLIENT.config.region,
  credentials_provider: AWS_CLIENT.config.credentials,
  unsigned_headers: Aws::S3::Presigner::BLACKLISTED_HEADERS,
  uri_escape_path: false
)

def naive_with_uri_escape_escaping(shrine_file)
  # because URI.escape does NOT escape `/`, we don't need to split on it,
  # which is what actually saves us the time.
  path = URI.escape(shrine_file.id)
  "https://#{["#{shrine_file.storage.bucket.name}.s3.amazonaws.com", *shrine_file.storage.prefix, path].join('/')}"
end

# not yet handling custom query params eg for content-disposition
def direct_aws_sig4_signer(url)
  AWS_SIG4_SIGNER.presign_url(
    http_method: "GET",
    url: url,
    headers: {},
    body_digest: 'UNSIGNED-PAYLOAD',
    expires_in: 900, # seconds
    time: nil
  ).to_s
end

direct_aws_sig4_signer( naive_with_uri_escape_escaping( shrine_uploaded_file ) )
# => presigned S3 url

Yes, it’s much faster!

Bingo! Now I measure 1200 URLs in 170-220ms, around 25% of the time. Still too slow to want to do 1200 of them on a single page, and around 4x slower than SDK public_url.

Interestingly, while we expect the cryptographic signature to take some extra time… that seems to be at most 10% of the overhead that the logic to sign a URL was adding? We experimented with re-using an Aws::Sigv4::Signer vs instantiating one each time; and applying URI-escaping or not. These did make noticeable differences, but not astounding ones.

This optimized version would have to be enhanced to be able to handle additional query param options such as specified content-disposition, I optimistically hope that can be done without changing the performance characteristics much.

Could it be optimized even more, by profiling within the Aws::Sigv4::Signer implementation? Maybe, but it doesn’t really seem worth it — we are already introducing some fragility into our code by using lower-level APIs and hoping they will remain valid even if AWS changes some things in the future. I don’t really want to re-implement Aws::Sigv4::Signer, just glad to have it available as a tool I can use like this already.

The Numbers

The script I used to compare performance in different ways of creating presigned S3 URLs (with a couple public URLs for comparison) is available in a gist, and here is the output of one run:

user system total real
sdk public_url 0.054114 0.000335 0.054449 ( 0.054802)
naive S3 public url 0.004575 0.000009 0.004584 ( 0.004582)
naive S3 public url with URI.escape 0.009892 0.000090 0.009982 ( 0.011209)
sdk presigned_url 0.756642 0.005855 0.762497 ( 0.789622)
re-use instantiated SDK Presigner 0.817595 0.005955 0.823550 ( 0.859270)
use inline instantiated Aws::Sigv4::Signer directly for presigned url (with escaping) 0.216338 0.001941 0.218279 ( 0.226991)
Re-use Aws::Sigv4::Signer for presigned url (with escaping) 0.185855 0.001124 0.186979 ( 0.188798)
Re-use Aws::Sigv4::Signer for presigned url (without escaping) 0.178457 0.001049 0.179506 ( 0.180920)

So what to do?

Possibly there are optimizations that would make sense in the AWS SDK gem itself? But it would actually take a lot more work to be sure what can be done without breaking some use cases.

I think there is no need to use URI.parse in public_url, the URIs can just be treated as strings and concatenated. But is there an edge case I’m missing?

Using different URI escaping method definitely helps in public_url; but how many other people who aren’t me care about optimizing public_url; and what escaping method is actually required/expected, is changing it a backwards compat problem; and is it okay maintenance-wise to make the S3 object use a different escaping mechanism than the common SDK Seahorse::Util.uri_escape workhorse, which might be used in places with different escaping requirements?

For presigned_urls, cutting out a lot of the wrapper code and using a Aws::Sigv4::Signer directly seems to have significant performance benefits, but what edge cases get broken there, and do they matter, and can a regression be avoided through alternate performant maintainable code?

Figuring this all out would take a lot more research (and figuring out how to use the test suite for the ruby SDK more facilely than I can right now; it’s a test suite for the whole SDK, and it’s a bear to run the whole thing).

Although if any Amazon maintainers of the ruby SDK, or other experts in its internals, see this and have an opinion, I am curious as to their thoughts.

But I am a lot more confident that some of these optimizations will work fine for my use cases. One of the benefits of using shrine is that all of my code already accesses S3 URL generation via shrine API. So I could easily swap in a locally optimized version, either with a shrine plugin, or just a local sub-class of the shrine S3 storage class. So I may consider doing that.
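For example, a minimal sketch of what that local sub-class might look like. This is my own hypothetical code, not the faster_s3_url gem; it assumes Shrine::Storage::S3 exposes bucket and prefix as used in the snippets above, and that our keys never need URI escaping.

# A hypothetical local override, NOT the faster_s3_url gem itself. Assumes
# Shrine::Storage::S3 exposes #bucket and #prefix, and keys need no escaping.
class FastPublicUrlS3Storage < Shrine::Storage::S3
  def url(id, public: nil, **options)
    if public
      # plain string concatenation instead of the slower SDK public_url
      "https://#{["#{bucket.name}.s3.amazonaws.com", *prefix, id].join('/')}"
    else
      super
    end
  end
end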