Notes on retrying all jobs with ActiveJob retry_on

I would like to configure all my ActiveJobs to retry on failure, and I’d like to do so with the ActiveJob retry_on method.

So I’m going to configure it in my ApplicationJob class, in order to retry on any error, maybe something like:

class ApplicationJob < ActiveJob::Base
  retry_on StandardError # other args to be discussed
end

Why use ActiveJob retry_on for this? Why StandardError?

Many people use backend-specific logic for retries, especially with Sidekiq. That’s fine!

I like the idea of using the ActiveJob functionality:

  • I currently use resque (more on challenges with retry there later), but plan to switch to something else at some point medium-term. Maybe sidekiq, but maybe delayed_job or good_job. (Just using the DB and not needing redis is attractive to me, as is open source). I like the idea of not having to redo this setup when I switch back-ends, or am trying out different ones.
  • In general, I like the promise of ActiveJob as swappable commoditized backends
  • I like what I see as good_job’s philosophy here, why have every back-end reinvent the wheel when a feature can be done at the ActiveJob level? That can help keep the individual back-end smaller, and less “expensive” to maintain. good_job encourages you to use ActiveJob retries I think.

Note, dhh is on record from 2018 saying he thinks setting up retries for all StandardError is a bad idea. But I don’t really understand why! He says “You should know why you’d want to retry, and the code should document that knowledge.” — but the fact that so many ActiveJob back-ends provide “retry all jobs” functionality makes it seem to me an established common need and best practice, and why shouldn’t you be able to do it with ActiveJob alone?

dhh thinks ActiveJob retry is for specific targeted retries maybe, and the backend retry should be used for generic universal ones? Honestly I don’t see myself doing many specific targeted retries. Making all your jobs idempotent (important! best practice for ActiveJob always!) and just having them all retry on any error seems to me to be the way to go: a more efficient use of developer time, and sufficient for at least a relatively simple app.

One situation I have where a retry is crucial, is when I have a fairly long-running job (say it takes more than 60 seconds to run; I have some unavoidably!), and the machine running the jobs needs to restart. It might interrupt the job. It is convenient if it is just automatically retried — put back in the queue to be run again by restarted or other job worker hosts! Otherwise it’s just sitting there failed, never to run again, requiring manual action. An automatic retry will take care of it almost invisibly.

Resque and Resque Scheduler

Resque by default doesn’t support future-scheduled jobs. You can add them with the resque-scheduler plugin. But I had a perhaps irrational desire to avoid this — resque and its ecosystem have at different times had different amounts of maintenance/abandonment, and I’m (perhaps irrationally) reluctant to complexify my resque stack.

And do I need future scheduling for retries? For my most important use cases, it’s totally fine if I retry just once, immediately, with a wait: 0. Sure, that won’t take care of all potential use cases, but it’s a good start.

I thought even without resque supporting future-scheduling, i could get away with:

retry_on StandardError, wait: 0

Alas, this won’t actually work, it still ends up being converted to a future-schedule call, which gets rejected by the resque_adapter bundled with Rails unless you have resque-scheduler installed.

But of course, resque can handle wait: 0 semantically, if the code is willing to do it by queuing an ordinary resque job. I don’t know if it’s a good idea, but a simple patch to the Rails-bundled resque_adapter will make it willing to accept “scheduled” jobs when the time to be scheduled is actually “now”, just enqueuing them normally, while still raising on attempts to actually future-schedule. For me, it makes retry_on ... wait: 0 work with just plain resque.
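The shape of the patch is roughly this; a sketch only (an initializer that prepends a module to the Rails-bundled adapter), and the real patch may differ in details:

# config/initializers/resque_adapter_allow_immediate_schedule.rb
# Sketch: let plain resque accept "scheduled" jobs whose scheduled time is
# now/past (like retry_on ... wait: 0) by just enqueuing them normally,
# while still raising on genuine future scheduling.
module ResqueAdapterImmediateSchedulePatch
  def enqueue_at(job, timestamp)
    # in Rails 6.1 the adapter receives an epoch float for timestamp
    if timestamp <= Time.current.to_f
      enqueue(job)
    else
      super # still raises unless resque-scheduler is installed
    end
  end
end

ActiveSupport.on_load(:active_job) do
  ActiveJob::QueueAdapters::ResqueAdapter.prepend(ResqueAdapterImmediateSchedulePatch)
end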

Note: retry_on attempts count includes first run

So wanting to retry just once, I tried something like this:

# Will never actually retry
retry_on StandardError, attempts: 1

My job was never actually retried this way! It looks like the attempts count includes the first run: it’s the total number of times the job will be run, including the very first one before any “retries”! So attempts: 1 means “never retry” and does nothing. Oops. If you actually want to retry only once, in my Rails 6.1 app this is what did it for me:

# will actually retry once
retry_on StandardError, attempts: 2

(I think this means the default, attempts: 5, actually means your job can be run a total of 5 times: one original run and 4 retries. I guess that’s what was intended?)
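Putting the pieces together for my plain-resque setup, the ApplicationJob configuration I ended up with looks something like this (wait: 0 because plain resque can’t future-schedule, attempts: 2 for one original run plus one retry):

class ApplicationJob < ActiveJob::Base
  # retry everything once, immediately
  retry_on StandardError, wait: 0, attempts: 2
end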

Note: job_id stays the same through retries, hooray

By the way, I checked, and at least in Rails 6.1, the ActiveJob#job_id stays the same on retries. If the job runs once and is retried twice more, it’ll have the same job_id each time, you’ll see three Performing lines in your logs, with the same job_id.

Phew! I think that’s the right thing to do, so we can easily correlate these as retries of the same jobs in our logs. And if we’re keeping the job_id somewhere to check back and see if it succeeded or failed or whatever, it stays consistent on retry.

Glad this is what ActiveJob is doing!

Logging isn’t great, but can be customized

Rails will automatically log retries with a line that looks like this:

Retrying TestFailureJob in 0 seconds, due to a RuntimeError.
# logged at `info` level

Eventually when it decides its attempts are exhausted, it’ll say something like:

Stopped retrying TestFailureJob due to a RuntimeError, which reoccurred on 2 attempts.
# logged at `error` level

This does not include the job-id though, which makes it harder than it should be to correlate with other log lines about this job, and follow the job’s whole course through your log file.

It’s also inconsistent with other default ActiveJob log lines, which include:

  • the Job ID in text
  • tags (Rails tagged logging system) with the job id and the string "[ActiveJob]". Because of the way the Rails code applies these only around perform/enqueue, retry/discard related log lines apparently end up not included.
  • The Exception message not just the class when there’s a class.

You can see all the built-in ActiveJob logging in the nicely compact ActiveJob::LogSubscriber class. And you can see how the log line for retry is kind of inconsistent with eg perform.

Maybe this inconsistency has persisted so long in part because few people actually use ActiveJob retry; they’re all still using their backend’s own retry functionality? I did try a PR to Rails for at least consistent formatting (my PR doesn’t do tagging). Not sure if it will go anywhere; I think blind PRs to Rails usually do not.

In the meantime, after trying a bunch of different things, I think I figured out a reasonable way to use the ActiveSupport::Notifications/LogSubscriber API to customize logging for the retry-related events while leaving the Rails defaults untouched for the others. See my solution here.
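The rough shape of it is below; this is a sketch rather than my exact linked solution, and it assumes the Rails 6.1 payload keys and the inherit_all: option that I believe Rails 6.1 added to attach_to for exactly this kind of subclassing:

# config/initializers/active_job_log_subscriber.rb
# Subclass the built-in subscriber so every event we don't override keeps
# its default formatting; override just the retry-related events to include
# job_id and the exception message; then swap the subscribers.
class JobLogSubscriber < ActiveJob::LogSubscriber
  def enqueue_retry(event)
    job, ex, wait = event.payload.values_at(:job, :error, :wait)
    info do
      "Retrying #{job.class} (Job ID: #{job.job_id}) in #{wait.to_i} seconds, due to #{ex&.class} (#{ex&.message})."
    end
  end

  def retry_stopped(event)
    job, ex = event.payload.values_at(:job, :error)
    error do
      "Stopped retrying #{job.class} (Job ID: #{job.job_id}) due to #{ex&.class} (#{ex&.message}), which reoccurred on #{job.executions} attempts."
    end
  end
end

ActiveJob::LogSubscriber.detach_from :active_job
JobLogSubscriber.attach_to :active_job, inherit_all: true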

(Thanks to BigBinary blog for showing up in google and giving me a head start into figuring out how ActiveJob retry logging was working.)

(note: There’s also this: https://github.com/armandmgt/lograge_active_job But I’m not sure how working/maintained it is. It seems to only customize activejob exception reports, not retry and other events. It would be an interesting project to make an up-to-date activejob-lograge that applied to ALL ActiveJob logging, expressing every event as key/values and using lograge formatter settings to output. I think we see exactly how we’d do that, with a custom log subscriber as we’ve done above!)

Warning: ApplicationJob configuration won’t work for emails

You might think since we configured retry_on on ApplicationJob, all our bg jobs are now set up for retrying.

Oops! Not deliver_later emails.

The good_job README explains that the deliver_later mailer jobs don’t descend from ApplicationJob. (I am curious if there’s any good reason for this; it seems like it would be nice if they did!)

The good_job README provides one way to configure the built-in Rails mailer superclass for retries.

You could maybe also try setting delivery_job on that mailer superclass to use a custom delivery job (thanks again BigBinary for the pointer)… maybe one that subclasses the default class to deliver emails as normal, but lets you set some custom options like retry_on? Not sure if this would be preferable in any way.
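A sketch of what that might look like (the class name is hypothetical; ActionMailer::MailDeliveryJob is the Rails 6 default delivery job):

# app/jobs/retrying_mail_delivery_job.rb
class RetryingMailDeliveryJob < ActionMailer::MailDeliveryJob
  retry_on StandardError, wait: 0, attempts: 2
end

# config/application.rb
# config.action_mailer.delivery_job = "RetryingMailDeliveryJob"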

logging URI query params with lograge

The lograge gem for taming Rails logs by default will log the path component of the URI, but leave out the query string/query params.

For instance, perhaps you have a URL to your app /search?q=libraries.

lograge will log something like:

method=GET path=/search format=html

The q=libraries part is completely left out of the log. I kinda want that part, it’s important.

The lograge README provides instructions for “logging request parameters”, by way of the params hash.

I’m going to modify them slightly to:

  • use the more recent custom_payload config instead of custom_options. (I’m not certain why there are both, but I think it’s mostly for legacy reasons, and the newer custom_payload is the one to reach for.)
  • If we just put params in there, then a bunch of ugly <ActionController::Parameters show up in the log if you have nested hash params. We could fix that with params.to_unsafe_h, but…
  • We should really use request.filtered_parameters instead to make sure we’re not logging anything that’s been filtered out with Rails 6 config.filter_parameters. (Thanks /u/ezekg on reddit). This also converts to an ordinary hash that isn’t ActionController::Parameters, taking care of previous bullet point.
  • (It kind of seems like lograge README could use a PR updating it?)
  config.lograge.custom_payload do |controller|
    exceptions = %w(controller action format id)
    { params: controller.request.filtered_parameters.except(*exceptions) }
  end

That gets us a log line that might look something like this:

method=GET path=/search format=html controller=SearchController action=index status=200 duration=107.66 view=87.32 db=29.00 params={"q"=>"foo"}

OK. The params hash isn’t exactly the same as the query string, it can include things not in the URL query string (like controller and action, that we have to strip above, among others), and it can in some cases omit things that are in the query string. It just depends on your routing and other configuration and logic.

The params hash itself is what default rails logs… but what if we just log the actual URL query string instead? Benefits:

  • it’s easier to search the logs for actually an exact specific known URL (which can get more complicated like /search?q=foo&range%5Byear_facet_isim%5D%5Bbegin%5D=4&source=foo or something). Which is something I sometimes want to do, say I got a URL reported from an error tracking service and now I want to find that exact line in the log.
  • I actually like having the exact actual URL (well, starting from path) in the logs.
  • It’s a lot simpler, we don’t need to filter out controller/action/format/id etc.
  • It’s actually a bit more concise? And part of what I’m dealing with in general using lograge is trying to reduce my bytes of logfile for papertrail!

Drawbacks?

  • if you had some kind of structured log search (I don’t at present, but I guess I could with papertrail features by switching to json format?), it might be easier to do something like “find a /search with q=foo and source=ef” without worrying about other params
  • To the extent that params hash can include things not in the actual url, is that important to log like that?
  • ….?

Curious what other people think… am I crazy for wanting the actual URL in there, not the params hash?

At any rate, it’s pretty easy to do. Note we use filtered_path rather than fullpath to again take account of Rails 6 parameter filtering, and thanks again /u/ezekg:

  config.lograge.custom_payload do |controller|
    {
      path: controller.request.filtered_path
    }
  end

This is actually overwriting the default path to be one that has the query string too:

method=GET path=/search?q=libraries format=html ...

You could of course add a different key fullpath instead, if you wanted to keep path as it is, perhaps for easier collation in some kind of log analyzing system that wants to group things by same path invariant of query string.

I’m gonna try this out!

Meanwhile, on lograge…

As long as we’re talking about lograge… based on commit history, the history of Issues and Pull Requests, and the fact that CI isn’t currently running (travis.org, grr) and doesn’t even try to test on Rails 6.0+ (although lograge seems to work fine), one might worry that lograge is currently un/under-maintained. There’s been no comment on a GH issue filed in May asking about project status.

It still seems to be one of the more popular solutions for taming Rails’ kind of out-of-control logs. It’s mentioned for instance in docs from papertrail and honeybadger, and many many other blog posts.

What will its future be?

Looking around for other possibilities, I found semantic_logger (rails_semantic_logger). It’s got similar features. It seems to be much more maintained. It’s got a respectable number of github stars, although not nearly as many as lograge, and it’s not featured in blogs and third-party platform docs nearly as much.

It’s also a bit more sophisticated and featureful, for better or worse. Mainly I’m thinking of how it tries to improve app performance by moving logging to a background thread. This is neat… and also can lead to a whole new class of bugs, mysterious warnings, or configuration burden.

For now I’m sticking to the more popular lograge, but I wish it had CI up that was testing with Rails 6.1, at least!

Incidentally, trying to get Rails to log more compactly like both lograge and rails_semantic_logger do… is somewhat more complicated than you might expect, as demonstrated by the code in both projects that does it! semantic_logger especially is hundreds of lines of somewhat baroque code split across several files. A refactor of logging around Rails 5 (I think?) to use ActiveSupport::LogSubscriber made it possible to customize Rails logging like this (although I think both lograge and rails_semantic_logger still do some monkey-patching too!), but in the end didn’t make it all that easy, obvious, or future-proof. This may discourage many other alternatives for the initial primary use case of both lograge and rails_semantic_logger: turn a Rails action into one log line, with a structured format.

Notes on Cloudfront in front of Rails Assets on Heroku, with CORS

Heroku really recommends using a CDN in front of your Rails app static assets — unlike in non-heroku setups where a web server like nginx might be serving them, on heroku static assets will otherwise be served directly by your Rails app, consuming limited/expensive dyno resources.

After evaluating a variety of options (including some heroku add-ons), I decided AWS Cloudfront made the most sense for us — simple enough, cheap, and we are already using other direct AWS services (including S3 and SES).

While heroku has an article on using Cloudfront, which even covers Rails specifically, and even CORS issues specifically, I found it a bit too vague to get me all the way there. And while there are lots of blog posts you can find on this topic, I found many of them outdated (Rails has introduced new API; Cloudfront has also changed its configuration options!), or otherwise spotty/thin.

So while I’m not an expert on this stuff, I’m going to tell you what I was able to discover, and what I did to set up Cloudfront as a CDN in front of Rails static assets running on heroku — although there’s really nothing specific to heroku here, it applies to any context where Rails is directly serving assets in production.

First how I set up Rails, then Cloudfront, then some notes and concerns. Btw, you might not need to care about CORS here, but one reason you might is if you are serving any fonts (including font-awesome or other icon fonts!) from Rails static assets.

Rails setup

In config/environments/production.rb

# set heroku config var RAILS_ASSET_HOST to your cloudfront
# hostname, will look like `xxxxxxxx.cloudfront.net`
config.asset_host = ENV['RAILS_ASSET_HOST']

config.public_file_server.headers = {
  # CORS:
  'Access-Control-Allow-Origin' => "*", 
  # tell Cloudfront to cache a long time:
  'Cache-Control' => 'public, max-age=31536000' 
}

Cloudfront Setup

I changed some things from default. The only one that’s absolutely necessary — if you want CORS to work — seemed to be changing Allowed HTTP Methods to include OPTIONS.

Click on “Create Distribution”. All defaults except:

  • Origin Domain Name: your heroku app host like app-name.herokuapp.com
  • Origin protocol policy: Switch to “HTTPS Only”. Seems like a good idea to ensure secure traffic between cloudfront and origin, no?
  • Allowed HTTP Methods: Switch to GET, HEAD, OPTIONS. In my experimentation, necessary for CORS from a browser to work — which AWS docs also suggest.
  • Cached HTTP Methods: Click “OPTIONS” too now that we’re allowing it, I don’t see any reason not to?
  • Compress objects automatically: yes
    • Sprockets is creating .gz versions of all your assets, but they’re going to be completely ignored in a Cloudfront setup either way. ☹️ (Is there a way to tell Sprockets to stop doing it? WHO KNOWS not me, it’s so hard to figure out how to reliably talk to Sprockets). But we can get what it was trying to do by having Cloudfront compress stuff for us, which seems like a good idea, Google PageSpeed will like it, etc.
    • I noticed by experimentation that Cloudfront will compress CSS and JS (sometimes with brotli sometimes gz, even with the same browser, don’t know how it decides, don’t care), but is smart enough not to bother trying to compress a .jpg or .png (which already has internal compression).
  • Comment field: If there’s a way to edit it after you create the distribution, I haven’t found it, so pick a good one!

Notes on CORS

AWS docs here and here suggest that for CORS support you also need to configure the Cloudfront distribution to forward additional headers — Origin, and possibly Access-Control-Request-Headers and Access-Control-Request-Method. Which you can do by setting up a custom “cache policy”. Or maybe instead by setting the “Origin Request Policy”. Or maybe instead by setting custom cache header settings differently using the Use legacy cache settings option. It got confusing — and none of these settings seemed to be necessary for CORS to work for me, nor could I see any of these settings making any difference in CloudFront behavior or what headers were included in responses.

Maybe they would matter more if I were trying to use a more specific Access-Control-Allow-Origin than just setting it to *? But about that….

If you set Access-Control-Allow-Origin to a single host, MDN docs say you have to also return a Vary: Origin header. Easy enough to add that to your Rails config.public_file_server.headers. But I couldn’t get Cloudfront to forward/return this Vary header with its responses. Trying all manner of cache policy settings, referring to AWS’s quite confusing documentation on the Vary header in Cloudfront and trying to do what it said — I couldn’t get it to happen.
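For reference, the Rails side of that attempt was just something like this (with example.org standing in for your real app origin):

config.public_file_server.headers = {
  'Access-Control-Allow-Origin' => 'https://example.org',
  'Vary' => 'Origin',
  'Cache-Control' => 'public, max-age=31536000'
}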

And what if you actually need more than one allowed origin? Per the Access-Control-Allow-Origin spec, as again explained by MDN, you can’t just include more than one in the header; it’s only allowed one: “If the server supports clients from multiple origins, it must return the origin for the specific client making the request.” And you can’t do that with Rails’ static/global config.public_file_server.headers; we’d need to use and set up rack-cors instead, or something else.

So I just said, eh, * is probably just fine. I don’t think it actually involves any security issues for rails static assets to do this? I think it’s probably what everyone else is doing?

The only setup I needed for this to work was setting Cloudfront to allow OPTIONS HTTP method, and setting Rails config.public_file_server.headers to include 'Cache-Control' => 'public, max-age=31536000'.

Notes on Cache-Control max-age

A lot of the existing guides don’t have you setting config.public_file_server.headers to include 'Cache-Control' => 'public, max-age=31536000'.

But without this, will Cloudfront actually be caching at all? If with every single request to cloudfront, cloudfront makes a request to the Rails app for the asset and just proxies it — we’re not really getting much of the point of using Cloudfront in the first place, to avoid the traffic to our app!

Well, it turns out yes, Cloudfront will cache anyway. Maybe because of the Cloudfront Default TTL setting? My Default TTL was left at the Cloudfront default, 86400 seconds (one day). So I’d think that maybe Cloudfront would be caching resources for a day when I’m not supplying any Cache-Control or Expires headers?

In my observation, it was actually caching for less than this though. Maybe an hour? (Want to know if it’s caching or not? Look at headers returned by Cloudfront. One easy way to do this? curl -IXGET https://whatever.cloudfront.net/my/asset.jpg, you’ll see a header either x-cache: Miss from cloudfront or x-cache: Hit from cloudfront).

Of course, Cloudfront doesn’t promise to cache for as long as it’s allowed to, it can evict things for its own reasons/policies before then, so maybe that’s all that’s going on.

Still, Rails assets are fingerprinted, so they are cacheable forever, so why not tell Cloudfront that? Maybe more importantly, if Rails isn’t returning a Cache-Control header, then Cloudfront isn’t returning one to actual user-agents either, which means they won’t know they can cache the response in their own caches, and they’ll keep requesting/checking it on every reload too, which is not great for your far-too-large CSS and JS application files!

So, I think it’s probably a great idea to set the far-future Cache-Control header with config.public_file_server.headers as I’ve done above. We tell Cloudfront it can cache for the max-allowed-by-spec one year, and this also (I checked) gets Cloudfront to forward the header on to user-agents who will also know they can cache.

Note on limiting Cloudfront Distribution to just static assets?

The CloudFront distribution created above will actually proxy/cache our entire Rails app, you could access dynamic actions through it too. That’s not what we intend it for, our app won’t generate any URLs to it that way, but someone could.

Is that a problem?

I don’t know?

Some blog posts suggest limiting it to only being willing to proxy/cache static assets instead, but this is actually a pain to do for a couple reasons:

  1. Cloudfront has changed their configuration for “path patterns” since many blog posts were written (unless you are using “legacy cache settings” options), such that I’m confused about how to do it at all, or whether there’s even still a way to get a distribution to stop caching/proxying/serving anything but a given path pattern.
  2. Modern Rails with webpacker has static assets at both /assets and /packs, so you’d need two path patterns, making it even more confusing. (Why Rails why? Why aren’t packs just at public/assets/packs so all static assets are still under /assets?)

I just gave up on figuring this out and figured it isn’t really a problem that Cloudfront is willing to proxy/cache/serve things I am not intending for it? Is it? I hope?

Note on Rails asset_path helper and asset_host

You may have realized that Rails has both asset_path and asset_url helpers for linking to an asset. (And similar helpers with dashes instead of underscores in sass, and probably different implementations, via sass-rails)

Normally asset_path returns a relative URL without a host, and asset_url returns a URL with a hostname in it. Since using an external asset_host requires we include the host with all URLs for assets to properly target the CDN… you might think you have to stop using asset_path anywhere and just use asset_url. You would be wrong.

It turns out if config.asset_host is set, asset_path starts including the host too. So everything is fine using asset_path. Not sure if at that point it’s a synonym for asset_url? I think not entirely, because I think in fact once I set config.asset_host, some of my uses of asset_url actually started erroring and failing tests? And I had to actually only use asset_path? In ways I don’t really understand and can’t explain.

Ah, Rails.

Heroku release phase, rails db:migrate, and command failure

If you use capistrano to deploy a Rails app, it will typically run a rails db:migrate with every deploy, to apply any database schema changes.

If you are deploying to heroku you might want to do the same thing. The heroku “release phase” feature makes this possible. (Introduced in 2017, the release phase feature is one of heroku’s more recent major features, as heroku dev has seemed to really stabilize and/or stagnate).

The release phase docs mention “running database schema migrations” as a use case, and there are a few ((1), (2), (3)) blog posts on the web suggesting doing exactly that with Rails. Basically as simple as adding release: bundle exec rake db:migrate to your Procfile.

While some of the blog posts do remind you that “If the Release Phase fails the app will not be deployed”, I have found the implications of this to be more confusing in practice than one would originally assume. Particularly because on heroku changing a config var triggers a release; and it can be confusing to notice when such a release has failed.

It pays to consider the details a bit so you understand what’s going on, and possibly consider somewhat more complicated release logic than simply calling out to rake db:migrate.

1) What if a config var change makes your Rails app unable to boot?

I don’t know how unusual this is, but I actually had a real-world bug like this when in the process of setting up our heroku app. Without confusing things with the details, we can simulate such a bug simply by putting this in, say, config/application.rb:

if ENV['FAIL_TO_BOOT']
  raise "I am refusing to boot"
end

Obviously my real bug was weirder, but the result was the same — with some settings of one or more heroku configuration variables, the app would raise an exception during boot. And we hadn’t noticed this in testing, before deploying to heroku.

Now, on heroku, using CLI or web dashboard, set the config var FAIL_TO_BOOT to “true”.

Without a release phase, what happens?

  • The release is successful! If you look at the release in the dashboard (“Activity” tab) or heroku releases, it shows up as successful. Which means heroku brings up new dynos and shuts down the previous ones, that’s what a release is.
  • The app crashes when heroku tries to start it in the new dynos.
  • The dynos will be in “crashed” state when looked at in heroku ps or dashboard.
  • If a user tries to access the web app, they will get the generic heroku-level “could not start app” error screen (unless you’ve customized your heroku error screens, as usual).
  • You can look in your heroku logs to see the error and stack trace that prevented app boot.

Downside: your app is down.

Upside: It is pretty obvious that your app is down, and (relatively) why.

With a db:migrate release phase, what happens?

The Rails db:migrate rake task has a dependency on the rails :environment task, meaning it boots the Rails app before executing. You just changed your config variable FAIL_TO_BOOT: true such that the Rails app can’t boot. Changing the config variable triggered a release.

As part of the release, the db:migrate release phase is run… which fails.

  • The release is not successful, it failed.
  • You don’t get any immediate feedback to that effect in response to your heroku config:add command or on the dashboard GUI in the “settings” tab. You may go about your business assuming it succeeded.
  • If you look at the release in heroku releases or dashboard “activity” tab you will see it failed.
  • You do get an email that it failed. Maybe you notice it right away, or maybe you notice it later, and have to figure out “wait, which release failed? And what were the effects of that? Should I be worried?”
  • The effects are:
    • The config variable appears changed in heroku’s dashboard or in response to heroku config:get etc.
    • The old dynos without the config variable change are still running. They don’t have the change. If you open a one-off dyno, it will be using the old release, and have the old (eg) ENV['FAIL_TO_BOOT'] value.
    • ANY subsequent attempts at a release will keep failing, so long as the app is in a state (based on the current config variables) where it can’t boot.

Again, this really happened to me! It is a fairly confusing situation.

Upside: Your app is actually still up, even though you broke it, the old release that is running is still running, that’s good?

Downside: It’s really confusing what happened. You might not notice at first. Things remain in a messed up inconsistent and confusing state until you notice, figure out what’s going on, what release caused it, and how to fix it.

It’s a bit terrifying that any config variable change could do this. But I guess most people don’t run into it like I did, since I haven’t seen it mentioned?

2) A heroku pg:promote is a config variable change that will create a release in which the db:migrate release phase fails.

heroku pg:promote is a command that will change which of multiple attached heroku postgreses are attached as the “primary” database, pointed to by the DATABASE_URL config variable.

For a typical app with only one database, you still might use pg:promote for a database upgrade process; for setting up or changing a postgres high-availability leader/follower; or, for what I was experimenting with it for, using heroku’s postgres-log-based rollback feature.

I had assumed that pg:promote was a zero-downtime operation. But, in debugging its interaction with my release phase, I noticed that pg:promote actually creates TWO heroku releases.

  1. First it creates a release labelled Detach DATABASE, in which there is no DATABASE_URL configuration variable at all.
  2. Then it creates another release labelled Attach DATABASE in which the DATABASE_URL configuration variable is defined to its new value.

Why does it do this instead of one release that just changes the DATABASE_URL? I don’t know. My app (like most Rails and probably other apps) can’t actually function without DATABASE_URL set, so if that first release ever actually runs, it will just error out. Does this mean there’s an instant with a “bad” release deployed, that pg:promote isn’t actually zero-downtime? I am not sure, it doesn’t seem right (I did file a heroku support ticket asking…).

But under normal circumstances, either it’s not a problem, or most people(?) don’t notice.

But what if you have a db:migrate release phase?

When it tries to do release (1) above, that release will fail. Because it tries to run db:migrate, and it can’t do that without a DATABASE_URL set, so it raises, the release phase exits with an error code, and the release fails.

Actually what happens is that without DATABASE_URL set, the Rails app will assume a postgres URL in a “default” location, try to connect to it, and fail, with an error message (hello googlers!) like:

ActiveRecord::ConnectionNotEstablished: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

Now, release (2) is coming down the pike seconds later, this is actually fine, and will be zero outage. We had a release that failed (so never was deployed), and seconds later the next correct release succeeds. Great!

The only problem is that we got an email notifying us that release 1 failed, and it’s also visible as failing in the heroku release list, etc.

A “background” release failing (one not in response to a git push or other code push to heroku) is already a confusing situation — and a “false positive” that actually means “nothing unexpected or problematic happened, just ignore this and carry on” is… really not something I want. (I call this the “error notification crying wolf”, right? I try to make sure my error notifications never do it, because it takes your time away from flow unnecessarily, and/or makes it much harder to stay vigilant to real errors).

Now, there is a fairly simple solution to this particular problem. Here’s what I did. I changed my heroku release phase from rake db:migrate to a custom rake task, say release: bundle exec rake my_custom_heroku_release_phase, defined like so:

task :my_custom_heroku_release_phase do
  if ENV['DATABASE_URL']
    Rake::Task["db:migrate"].invoke
  else
    $stderr.puts "\n!!! WARNING, no ENV['DATABASE_URL'], not running rake db:migrate as part of heroku release !!!\n\n"
  end
end

Now that release (1) above at least won’t fail, it has the same behavior as a “traditional” heroku app without a release phase.

Swallow-and-report all errors?

When a release fails because a release phase has failed as result of a git push to heroku, that’s quite clear and fine!

But the confusion of the “background” release failure, triggered by a config var change, is high enough that part of me wants to just rescue StandardError in there, and prevent a failed release phase from ever exiting with a failure code, so heroku will never use a db:migrate release phase to abort a release.

Just return the behavior to the pre-release-phase heroku behavior — you can put your app in a situation where it will be crashed and not work, but maybe that’s better than a mysterious inconsistent heroku app state that happens in the background and that you find out about only through asynchronous email notifications from heroku that are difficult to understand/diagnose. It’s all much more obvious.

On the other hand, if a db:migrate has failed not because of some unrelated boot process problem that is going to keep the app from launching anyway even if it were released, but simply because the db:migrate itself actually failed… you kind of want the release to fail? That’s good? Keep the old release running, not a new release with code that expects a db migration that didn’t happen?

So I’m not really sure.

If you did want to rescue-swallow-and-notify, the custom rake task for your heroku release logic — instead of just telling heroku to run a standard thing like db:migrate on release — is certainly convenient.
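A sketch of that variant of the release task above; the error-reporting call would be whatever your error tracking service provides, and stderr stands in for it here:

task :my_custom_heroku_release_phase do
  if ENV['DATABASE_URL']
    begin
      Rake::Task["db:migrate"].invoke
    rescue StandardError => e
      # report to your error tracking service here, then swallow the error,
      # so the release itself still succeeds (pre-release-phase behavior)
      $stderr.puts "!!! WARNING: rake db:migrate failed during release: #{e.class}: #{e.message}"
    end
  else
    $stderr.puts "\n!!! WARNING, no ENV['DATABASE_URL'], not running rake db:migrate as part of heroku release !!!\n\n"
  end
end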

Also, do you really always want to db:migrate anyway? What about db:schema:load?

Another alternative… if you are deploying an app with an empty database, standard Rails convention is to run rails db:schema:load instead of db:migrate. The db:migrate will probably work anyway, but will be slower, and somewhat more error-prone.

I guess this could come up on heroku with an initial deploy or (for some reason) a database that’s been nuked and restarted, or perhaps a Heroku “Review app”? (I don’t use those yet)

stevenharman has a solution that actually checks the database, and runs the appropriate rails task depending on state, here in this gist.

I’d probably do it as a rake task instead of a bash file if I were going to do that. I’m not doing it at all yet.
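If I did, a rake-task version of the check might look roughly like this; a sketch, untested, and unlike the gist it just warns and skips when it can’t reach a database at all, to avoid the failed pg:promote release described above:

task conditional_db_prepare: :environment do
  begin
    if ActiveRecord::Base.connection.table_exists?("schema_migrations")
      Rake::Task["db:migrate"].invoke
    else
      # empty database: load the schema instead of replaying every migration
      Rake::Task["db:schema:load"].invoke
    end
  rescue ActiveRecord::NoDatabaseError, ActiveRecord::ConnectionNotEstablished => e
    $stderr.puts "!!! WARNING: can't reach a database (#{e.class}), skipping db:migrate/db:schema:load"
  end
end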

Note that stevenharman’s solution will actually catch a non-existing or non-connectable database and not try to run migrations… but it will print an error message and exit 1 in that case, failing the release — meaning that you will get a failed release in the pg:promote case mentioned above!

Rails auto-scaling on Heroku

We are investigating moving our medium-small-ish Rails app to heroku.

We looked at both the Rails Autoscale add-on available on heroku marketplace, and the hirefire.io service which is not listed on heroku marketplace and I almost didn’t realize it existed.

I guess hirefire.io doesn’t have any kind of a partnership with heroku, but still uses the heroku API to provide an autoscale service. hirefire.io ended up looking more fully-featured and lower-priced than Rails Autoscale; so the main service of this post is just trying to increase visibility of hirefire.io and therefore competition in the field, which benefits us consumers.

Background: Interest in auto-scaling Rails background jobs

At first I didn’t realize there was such a thing as “auto-scaling” on heroku, but once I did, I realized it could indeed save us lots of money.

I am more interested in scaling Rails background workers than I am web workers though — our background workers are busiest when we are doing “ingests” into our digital collections/digital asset management system, so the work is highly variable. Auto-scaling up more workers when there is ingest work piling up can give us really nice ingest throughput while keeping costs low.

On the other hand, our web traffic is fairly low and probably isn’t going to go up by an order of magnitude (non-profit cultural institution here). And after discovering that a “standard” dyno is just too slow, we will likely be running a performance-m or performance-l anyway — which likely can handle all anticipated traffic on its own. If we have an auto-scaling solution, we might configure it for web dynos, but we are especially interested in good features for background scaling.

There is a heroku built-in autoscale feature, but it only works for performance dynos, and won’t do anything for Rails background job dynos, so that was right out.

What could work for Rails bg jobs is the Rails Autoscale add-on on the heroku marketplace; and then we found hirefire.io.

Pricing: Pretty different

hirefire

As of now January 2021, hirefire.io has pretty simple and affordable pricing. $15/month/heroku application. Auto-scaling as many dynos and process types as you like.

hirefire.io by default can only check your app’s metrics once per minute to decide if a scaling event should occur. If you want checks more frequent than that (up to once every 15 seconds), you have to pay an additional $10/month, for $25/month/heroku application.

Even though it is not a heroku add-on, hirefire does advertise that they bill pro-rated to the second, just like heroku and heroku add-ons.

Rails autoscale

Rails autoscale has a more tiered approach to pricing that is based on number and type of dynos you are scaling. Starting at $9/month for 1-3 standard dynos, the next tier up is $39 for up to 9 standard dynos, all the way up to $279 (!) for 1 to 99 dynos. If you have performance dynos involved, from $39/month for 1-3 performance dynos, up to $599/month for up to 99 performance dynos.

For our anticipated uses… if we only scale bg dynos, I might want to scale from (low) 1 or 2 to (high) 5 or 6 standard dynos, so we’d be at $39/month. Our web dynos are likely to be performance and I wouldn’t want/need to scale more than probably 2, but that puts us into performance dyno tier, so we’re looking at $99/month.

This is of course significantly more expensive than hirefire.io’s flat rate.

Metric Resolution

Since Hirefire had an additional charge for finer than 1-minute resolution on checks for autoscaling, we’ll discuss resolution here in this section too. Rails Autoscale has the same resolution for all tiers, and I think it’s generally 10 seconds, so approximately the same as hirefire if you pay the extra $10 for increased resolution.

Configuration

Let’s look at configuration screens to get a sense of feature-sets.

Rails Autoscale

web dynos

To configure web dynos, here’s what you get, with default values:

The metric Rails Autoscale uses for scaling web dynos is time in heroku routing queue, which seems right to me — when things are spending longer in heroku routing queue before getting to a dyno, it means scale up.

worker dynos

For scaling worker dynos, Rails Autoscale can scale a dyno type named “worker” — it can understand the ruby queuing libraries Sidekiq, Resque, Delayed Job, or Que. I’m not certain if there are options for writing custom adapter code for other backends.

Here’s what the configuration options are — sorry these aren’t the defaults, I’ve already customized them and lost track of what defaults are.

You can see that worker dynos are scaled based on the metric “number of jobs queued”, and you can tell it to only pay attention to certain queues if you want.

Hirefire

Hirefire has far more options for customization than Rails Autoscale, which can make it a bit overwhelming, but also potentially more powerful.

web dynos

You can actually configure as many Heroku process types as you have for autoscale, not just ones named “web” and “worker”. And for each, you have your choice of several metrics to be used as scaling triggers.

For web, I think Queue Time (percentile, average) matches what Rails Autoscale does, configured to percentile, 95, and is probably the best to use unless you have a reason to use another. (“Rails Autoscale tracks the 95th percentile queue time, which for most applications will hover well below the default threshold of 100ms.“)

Here’s what configuration Hirefire makes available if you are scaling on “queue time” like Rails Autoscale, configuration may vary for other metrics.

I think if you fill in the right numbers, you can configure to work equivalently to Rails Autoscale.

worker dynos

If you have more than one heroku process type for workers — say, working on different queues — Hirefire can scale them independently, with entirely separate configuration. This is pretty handy, and I don’t think Rails Autoscale offers this. (Update: I may be wrong, Rails Autoscale says they do support this, so check on it yourself if it matters to you).

For worker dynos, you could choose to scale based on actual “dyno load”, but I think this is probably mostly for types of processes where there isn’t the ability to look at “number of jobs”. A “number of jobs in queue” like Rails Autoscale does makes a lot more sense to me as an effective metric for scaling queue-based bg workers.

Hirefire’s metric is slightly different than Rails Autoscale’s “jobs in queue”. For recognized ruby queue systems (a larger list than Rails Autoscale’s; and you can write your own custom adapter for whatever you like), it actually measures jobs in queue plus workers currently busy. So queued+in-progress, rather than Rails Autoscale’s just queued. I actually have a bit of trouble wrapping my head around the implications of this, but basically, it means that Hirefire’s “jobs in queue” metric strategy is intended to try to scale all the way to emptying your queue, or reaching your max scale limit, whichever comes first. I think this may make sense and work out at least as well or perhaps better than Rails Autoscale’s approach?

Here’s what configuration Hirefire makes available for worker dynos scaling on “job queue” metric.

Since the metric isn’t the same as Rails Autoscale’s, we can’t configure this to work identically. But there are a whole bunch of configuration options, some similar to Rails Autoscale’s.

The most important thing here is that “Ratio” configuration. It may not be obvious, but with the way the hirefire metric works, you are basically meant to configure this to equal the number of workers/threads you have on each dyno. I have it configured to 3 because my heroku worker processes use resque, with resque_pool, configured to run 3 resque workers on each dyno. If you use sidekiq, set ratio to your configured concurrency — or if you are running more than one sidekiq process, processes*concurrency. Basically how many jobs your dyno can be concurrently working is what you should normally set for ‘ratio’.

Hirefire not a heroku plugin

Hirefire isn’t actually a heroku plugin. In addition to that meaning separate invoicing, there can be some other inconveniences.

Since hirefire can only interact with the heroku API, for some metrics (including the “queue time” metric that is probably optimal for web dyno scaling) you have to configure your app to log regular statistics to heroku’s “Logplex” system. This can add a lot of noise to your log, and for heroku logging add-ons that are tiered based on number of log lines or bytes, can push you up to higher pricing tiers.

If you use papertrail, I think you should be able to use its log filtering feature to solve this, keeping that noise out of your logs and avoiding impact on log data transfer limits. However, if you ever have cause to look at heroku’s raw logs, that noise will still be there.

Support and Docs

I asked a couple questions of both Hirefire and Rails Autoscale as part of my evaluation, and got back well-informed and easy-to-understand answers quickly from both. Support for both seems to be great.

I would say the documentation is decent-but-not-exhaustive for both products. Hirefire may have slightly more complete documentation.

Other Features?

There are other things you might want to compare, various kinds of observability (bar chart or graph of dynos or observed metrics) and notification. I don’t have time to get into the details (and didn’t actually spend much time exploring them to evaluate), but they seem to offer roughly similar features.

Conclusion

Rails Autoscale is quite a bit more expensive than hirefire.io’s flat rate, once you get past Rails Autoscale’s most basic tier (scaling no more than 3 standard dynos).

It’s true that autoscaling saves you money over not autoscaling, so even an expensive price could be considered a ‘cut’ of that, and possibly for many ecommerce sites even $99 a month might be a drop in the bucket (!)… but this price difference is so significant with hirefire (which has a flat rate regardless of dynos) that it seems to me it would take a lot of additional features/value to justify.

And it’s not clear that Rails Autoscale has any feature advantage. In general, hirefire.io seems to have more features and flexibility.

Until 2021, hirefire.io could only analyze metrics with 1-minute resolution, so perhaps that was a “killer feature”?

Honestly I wonder if this price difference is sustained by Rails Autoscale only because most customers aren’t aware of hirefire.io, it not being listed on the heroku marketplace? Single-invoice billing is handy, but probably not worth $80+ a month. I guess hirefire’s logplex noise is a bit inconvenient?

Or is there something else I’m missing? Pricing competition is good for the consumer.

And are there any other heroku autoscale solutions, that can handle Rails bg job dynos, that I still don’t know about?

Update, a day after writing: djcp on a reddit thread writes:

I used to be a principal engineer for the heroku add-ons program.

One issue with hirefire is they request account level oauth tokens that essentially give them ability to do anything with your apps, where Rails Autoscaling worked with us to create a partnership and integrate with our “official” add-on APIs that limits security concerns and are scoped to the application that’s being scaled.

Part of the reason for hirefire working the way it does is historical, but we’ve supported the endpoints they need to scale for “official” partners for years now.

A lot of heroku customers use hirefire so please don’t think I’m spreading FUD, but you should be aware you’re giving a third party very broad rights to do things to your apps. They probably won’t, of course, but what if there’s a compromise?

“Official” add-on providers are given limited scoped tokens to (mostly) only the actions / endpoints they need, minimizing blast radius if they do get compromised.

You can read some more discussion at that thread.

Gem authors, check your release sizes

Most gems should probably be a couple hundred kb at most. I’m talking about the package actually stored in and downloaded from rubygems by an app using the gem.

After all, source code is just text, and it doesn’t take up much space. OK, maybe some gems have a couple images in there.

But if you are looking at your gem in rubygems and realize that it’s 10MB or bigger… and that it seems to be getting bigger with every release… something is probably wrong and worth looking into.

One way to look into it is to look at the actual gem package. If you use the handy bundler rake task to release your gem (and I recommend it), you have a ./pkg directory in the source checkout you last released from. Inside it are “.gem” files for each release you’ve made from there, unless you’ve cleaned it up recently.

.gem files are just tar files, it turns out, that have more tar and gz files inside them. We can go into one, extract its contents, and use the handy unix utility du -sh to see what is taking up all the space.

How I found the bytes

jrochkind-chf kithe (master ?) $ cd pkg

jrochkind-chf pkg (master ?) $ ls
kithe-2.0.0.beta1.gem        kithe-2.0.0.pre.rc1.gem
kithe-2.0.0.gem            kithe-2.0.1.gem
kithe-2.0.0.pre.beta1.gem    kithe-2.0.2.gem

jrochkind-chf pkg (master ?) $ mkdir exploded

jrochkind-chf pkg (master ?) $ cp kithe-2.0.0.gem exploded/kithe-2.0.0.tar

jrochkind-chf pkg (master ?) $ cd exploded

jrochkind-chf exploded (master ?) $ tar -xvf kithe-2.0.0.tar
 x metadata.gz
 x data.tar.gz
 x checksums.yaml.gz

jrochkind-chf exploded (master ?) $  mkdir unpacked_data_tar

jrochkind-chf exploded (master ?) $ tar -xvf data.tar.gz -C unpacked_data_tar/

jrochkind-chf exploded (master ?) $ cd unpacked_data_tar/
/Users/jrochkind/code/kithe/pkg/exploded/unpacked_data_tar

jrochkind-chf unpacked_data_tar (master ?) $ du -sh *
 4.0K    MIT-LICENSE
  12K    README.md
 4.0K    Rakefile
 160K    app
 8.0K    config
  32K    db
 100K    lib
 300M    spec

jrochkind-chf unpacked_data_tar (master ?) $ cd spec

jrochkind-chf spec (master ?) $ du -sh *
 8.0K    derivative_transformers
 300M    dummy
  12K    factories
  24K    indexing
  72K    models
 4.0K    rails_helper.rb
  44K    shrine
  12K    simple_form_enhancements
 8.0K    spec_helper.rb
 188K    test_support
 4.0K    validators

jrochkind-chf spec (master ?) $ cd dummy/

jrochkind-chf dummy (master ?) $ du -sh *
 4.0K    Rakefile
  56K    app
  24K    bin
 124K    config
 4.0K    config.ru
 8.0K    db
 300M    log
 4.0K    package.json
  12K    public
 4.0K    tmp

Doh! In this particular gem, I have a dummy rails app, and it has 300MB of logs, because I haven’t bothered trimming them in a while, and they are winding up included in the gem release package distributed to rubygems and downloaded by all consumers! Even if they were small, I don’t want these in the released gem package at all!

That’s not good! It only turns into 12MB instead of 300MB, because log files are so compressible and there is compression involved in assembling the rubygems package. But I have no idea how much space it’s actually taking up on consuming applications’ machines. This is very irresponsible!

What controls what files are included in the gem package?

Your .gemspec file of course. The line s.files = is an array of every file to include in the gem package. Well, plus s.test_files is another array of more files, that aren’t supposed to be necessary to run the gem, but are to test it.

(Rubygems was set up to allow automated *testing* of gems after download, which is why test files are included in the release package. I am not sure how useful this is, or who if anyone does it; although I believe that some linux distro packagers try to make use of it, for better or worse.)

But nobody wants to list every file in your gem individually, manually editing the array every time you add, remove, or move one. Fortunately, gemspec files are executable ruby code, so you can use ruby as a shortcut.

I have seen two main ways of doing this, with different “gem skeleton generators” taking one of two approaches.

Sometimes a shell out to git is used — the idea is that everything you have checked into your git should be in the gem release package, no more and no less. For instance, one of my gems has this in it; I’m not sure where it came from or who/what generated it.

spec.files = `git ls-files -z`.split("\x0").reject do |f|
  f.match(%r{^(test|spec|features)/})
end

In that case, it wouldn’t have included anything in ./spec already, so this obviously isn’t actually the gem we were looking at before.

But in this case, in addition to using ruby logic to manipulate the results, nothing excluded by your .gitignore file will end up included in your gem package, great!

In the kithe gem we were looking at before, those log files were in the .gitignore (they weren’t in my repo!), so if I had been using that git-shellout technique, they wouldn’t have ended up in the gem release package in the first place.

But… I wasn’t. Instead this gem has a gemspec that looks like:

s.test_files = Dir["spec/**/*"]

Just include every single file inside ./spec in the test_files list. Oops. Then I get all those log files!

One way to fix

I don’t really know which is to be preferred of the git-shellout approach vs the dir-glob approach. I suspect it is the subject of historical religious wars in rubydom, when there were still more people around to argue about such things. Any opinions? Or another approach?

Without being in the mood to restructure this gemspec in any way, I just did the simplest thing to keep those log files out…

Dir["spec/**/*"].delete_if {|a| a =~ %r{/dummy/log/}}

Build the package without releasing with the handy bundler-supplied rake build task… and my gem release package size goes from 12MB to 64K. (Which actually kind of sounds like a minimum block size or something, right?)

Phew! That’s a big difference! Sorry for anyone using previous versions and winding up downloading all that cruft! (Actually this particular gem is mostly a proof of concept at this point and I don’t think anyone else is using it).

Check your gem sizes!

I’d be willing to bet there are lots of released gems with heavily bloated release packages like this. This isn’t the first one I’ve realized was my fault. Because who pays attention to gem sizes anyway? Apparently not many!

But rubygems does list them, so it’s pretty easy to see. Are your gem release packages multiple megs, when there’s no good reason for them to be? Do they get bigger every release by far more than the bytes of lines of code you think were added? At some point in gem history was there a big jump from hundreds of KB to multiple MB? When nothing particularly actually happened to gem logic to lead to that?

All hints that you might be including things you didn’t mean to include, possibly things that grow each release.

You don’t need to have a dummy rails app in your repo to accidentally do this (I accidentally did it once with a gem that had nothing to do with rails). There could be other kinds of log files. Or test coverage or performance metric files, or any other artifacts of your build or your development, especially ones that grow over time — things that aren’t actually meant to be or needed as part of the gem release package!

It’s good to sanity check your gem release packages now and then. In most cases, your gem release package should be hundreds of KB at most, not MBs. Help keep your users’ installs and builds faster and slimmer!
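One quick way to check, from a gem’s source checkout, is to load the gemspec in ruby and see what it will actually package; a sketch (adjust the gemspec filename for your gem):

# list the total size and the ten biggest files the gemspec will package
spec  = Gem::Specification.load("kithe.gemspec")
files = (spec.files + spec.test_files).uniq.select { |f| File.file?(f) }

puts "#{files.size} files, #{files.sum { |f| File.size(f) } / 1024} KB total (uncompressed)"
files.sort_by { |f| -File.size(f) }.first(10).each do |f|
  puts format("%10d  %s", File.size(f), f)
end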

Updating SolrCloud configuration in ruby

We have an app that uses Solr. We currently run a Solr in legacy “not cloud” mode. Our solr configuration directory is on disk on the Solr server, and it’s up to our processes to get our desired solr configuration there, and to update it when it changes.

We are in the process of moving to a Solr in “SolrCloud mode“, probably via the SearchStax managed Solr service. Our Solr “Cloud” might only have one node, but “SolrCloud mode” gives us access to additional APIs for managing our solr configuration, as opposed to writing it directly to disk (which may not be possible at all in SolrCloud mode, and certainly isn’t when using managed SearchStax).

That is, the Solr ConfigSets API, although you might also want to use a few pieces of the Collection Management API for associating a configset with a Solr collection.

Basically, you are taking your desired solr config directory, zipping it up, and uploading it to Solr as a “config set” [or “configset”] with a certain name. Then you can create collections using this config set, or reassign which named configset an existing collection uses.
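For concreteness, the raw call is roughly “zip the conf directory and POST it to the ConfigSets UPLOAD endpoint”; here is a sketch using rubyzip and Net::HTTP, with the URL and configset name being illustrative:

require "zip"       # rubyzip gem
require "net/http"

conf_dir = "./solr/conf"
solr_url = "https://example.com/solr"
name     = "myConfigset"

# zip the conf directory into an in-memory buffer
zip_io = Zip::OutputStream.write_buffer do |zio|
  Dir.glob("#{conf_dir}/**/*").select { |f| File.file?(f) }.each do |path|
    zio.put_next_entry(path.delete_prefix("#{conf_dir}/"))
    zio.write File.read(path)
  end
end

uri  = URI("#{solr_url}/admin/configs?action=UPLOAD&name=#{name}")
resp = Net::HTTP.post(uri, zip_io.string, "Content-Type" => "application/octet-stream")
raise "Solr error: #{resp.body}" unless resp.is_a?(Net::HTTPSuccess)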

I wasn’t able to find any existing ruby gems for interacting with these Solr APIs. RSolr is a “ruby client for interacting with solr”, but was written before most of these administrative APIs existed for Solr, and doesn’t seem to have been updated to deal with them (unless I missed it); RSolr seems to be mostly/only about querying solr, plus some limited indexing.

But no worries, it’s not too hard to wrap the specific API’s I want to use in some ruby, which seemed far better to me than writing out the specific HTTP requests each time (and making sure you are dealing with errors, etc.!). (And yes, I will share the code with you.)

I decided I wanted an object that was bound to a particular solr collection at a particular solr instance; and was backed by a particular local directory with solr config. That worked well for my use case, and I wound up with an API that looks like this:

updater = SolrConfigsetUpdater.new(
  solr_url: "https://example.com/solr",
  conf_dir: "./solr/conf",
  collection_name: "myCollection"
)

# will zip up ./solr/conf and upload it as a configset named myConfigset:
updater.upload("myConfigset")

updater.list #=> ["oldConfigSet", "myConfigset"]
updater.config_name # what configset name is myCollection currently configured to use?
# => "oldConfigSet"

# what if we try to delete the one it's using?
updater.delete("oldConfigSet")
# => raises SolrConfigsetUpdater::SolrError with message:
# "Can not delete ConfigSet as it is currently being used by collection [myCollection]"

# okay let's change it to use the new one and delete the old one

updater.change_config_name("myConfigset")
# now myCollection uses this new configset, although we possibly
# need to reload the collection to make that so
updater.reload
# now let's delete the one we're not using
updater.delete("oldConfigSet")

OK, great. There were some tricks in there for catching the apparently multiple ways Solr can report different kinds of errors, to make sure Solr-reported errors turn into ruby exceptions, ideally with good error messages.

Now, in addition to uploading a configset initially for a collection you are creating to use, the main use case I have is wanting to UPDATE the configuration to new values in an existing collection. Sure, this often requires a reindex afterwards.

If you have the recently released Solr 8.7, it will let you overwrite an existing configset, so this can be done pretty easily.

updater.upload(updater.config_name, overwrite: true)
updater.reload

But prior to Solr 8.7 you can not overwrite an existing configset. And SearchStax doesn’t yet have Solr 8.7. So one way or another, we need to do a dance where we upload the configset under a new name, then switch the collection to use it.

Having this updater object that lets us easily execute the relevant Solr API calls makes it easy to experiment with different logic flows for this. For instance, in a Solr listserv thread, Alex Halovnic suggests a somewhat complicated 8-step workaround, which we can implement like so:

current_name = updater.config_name
temp_name = "#{current_name}_temp"

updater.create(from: current_name, to: temp_name)
updater.change_config_name(temp_name)
updater.reload
updater.delete(current_name)
updater.upload(current_name)
updater.change_config_name(current_name)
updater.reload
updater.delete(temp_name)

That works. But talking to Dann Bohn at Penn State University, he shared a different algorithm, which goes like:

  • Make a cryptographic digest hash of the entire solr directory, which we’re going to use in the configset name.
  • Check if the collection is already using a configset named $name_$digest; if it already is, you’re done, no change needed.
  • Otherwise, upload the configset with the fingerprint-based name, switch the collection to use it, reload, delete the configset that the collection used to use.

At first this seemed like overkill to me, but after thinking and experimenting with it, I like it! It is really quick to make a digest of a handful of files; that’s not a big deal. (I use the first 7 chars of a hex SHA256.) And even if we had Solr 8.7, I like that we can avoid doing any operation on solr at all if there have been no changes — I really want to use this operation much like Rails db:migrate, running it on every deploy to make sure the solr schema matches the one in the repo for that deploy.
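
Here’s a hedged sketch of that flow in ruby, using the SolrConfigsetUpdater object from above (the update_via_digest method name and the directory-fingerprinting details are just my illustration, not exactly Dann’s or my production code):

require "digest"

def update_via_digest(updater, conf_dir:)
  # fingerprint the whole config directory: SHA256 over every file's contents,
  # in stable sorted order, truncated to the first 7 hex chars
  sha = Digest::SHA256.new
  Dir.glob("#{conf_dir}/**/*").sort.each do |path|
    sha.update(File.binread(path)) if File.file?(path)
  end
  new_name = "solrconfig_#{sha.hexdigest[0, 7]}"

  old_name = updater.config_name
  return if old_name == new_name # config unchanged, nothing to do

  updater.upload(new_name)
  updater.change_config_name(new_name)
  updater.reload
  updater.delete(old_name)
end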

Dann also shared his open source code with me, which was helpful for seeing how to make the digest, how to make a Zip file in ruby, etc. Thanks Dann!

Sharing my code

So I also wrote some methods implementing those variant updating strategies: Dann’s, Alex Halovnic’s from the listserv, etc.

I thought about wrapping this all up as a gem, but I didn’t really have the time to make it good enough for that. My API is a little bit janky; I didn’t spend the extra time to think it out really well and minimize the need for future backwards-incompatible changes, like I would if it were a gem. I also couldn’t figure out a great way to write automated tests for this that I would find particularly useful, so in my code base it’s actually not currently test-covered (shhhhh), but in a gem I’d want to solve that somehow.

But I did try to write the code to be general-purpose/flexible so other people could use it for their own use cases; I tried to document it to my highest standards; and I put it all in one file, which actually might not be the best OO abstraction/design, but makes it easier for you to copy and paste the single file for your own use. :)

So you can find my code here; it is apache-licensed; and you are welcome to copy and paste it and do whatever you like with it, including making a gem yourself if you want. Maybe I’ll get around to making it a gem in the future myself, I dunno, curious if there’s interest.

The SearchStax proprietary API’s

SearchStax has its own API’s that can, I think, be used for updating configsets and setting collections to use certain configsets, etc. When I started exploring them, they aren’t the worst vendor API’s I’ve seen, but I did find them a bit cumbersome to work with. The auth system involves a lot of steps (why can’t you just create an API key from the SearchStax web GUI?).

Overall I found them harder to use than just the standard Solr Cloud API’s, which worked fine in the SearchStax deployment, and have the added bonus of being transferable to any SolrCloud deployment instead of being SearchStax-specific. While the SearchStax docs and support try to steer you to the SearchStax specific API’s, I don’t think there’s really any good reason for this. (Perhaps the custom SearchStax API’s were written long ago when Solr API’s weren’t as complete?)

SearchStax support suggested that the SearchStax APIs were somehow more secure; but my SearchStax Solr API’s are protected behind HTTP basic auth, and if I’ve created basic auth credentials (or an IP address allowlist), those API’s will be available to anyone with auth to access Solr whether I use them or not! Support also suggested that SearchStax API use would be logged, whereas my direct Solr API use would not be, which seems to be true at least in the default setup. I could probably configure solr logging differently, but it just isn’t that important to me for these particular functions.

So after some initial exploration with SearchStax API, I realized that SolrCloud API (which I had never used before) could do everything I need and was more straightforward and transferable to use, and I’m happy with my decision to go with that.

Are you talking to Heroku redis in cleartext or SSL?

In a “typical” Redis installation, you might be talking to redis on localhost or on a private network, and clients typically talk to redis in cleartext. Redis doesn’t even natively support communication over SSL. (Or maybe it does now with redis6?)

However, the Heroku redis add-on (the one from Heroku itself) supports SSL connections via “Stunnel”, a tool popular with other redis users to get SSL redis connections too. (Or maybe via native redis with redis6? Not sure if you’d know the difference, or if it matters.)

There are heroku docs on all of this which say:

While you can connect to Heroku Redis without the Stunnel buildpack, it is not recommended. The data traveling over the wire will be unencrypted.

Perhaps especially because on heroku your app does not talk to redis via localhost or on a private network, but on a public network.

But I think I’ve worked on heroku apps before that missed this advice and are still talking to redis in the clear. I just happened to run across it when I got curious about the REDIS_TLS_URL env/config variable I noticed heroku setting.

Which brings us to another thing: that heroku doc is out of date, and doesn’t mention the REDIS_TLS_URL config variable, just the REDIS_URL one. The difference? The TLS version will be a url beginning with rediss:// instead of redis:// (note the extra s), which many redis clients use as a convention for “SSL connection to redis, probably via stunnel since redis itself doesn’t support it”. The docs provide ruby and go examples which instead use REDIS_URL and write code to swap the redis:// for rediss:// and even hard-code port number adjustments, which is silly!

(While I continue to be very impressed with heroku as a product, I keep running into weird things like this outdated documentation that don’t match my impression of heroku’s all-around technical excellence, and make me worry that heroku is slipping…).

The docs also mention a weird driver: ruby argument for initializing the Redis client; I’m not sure what it’s for, and it doesn’t seem necessary.

The docs are correct that you have to tell the ruby Redis client not to try to verify the SSL cert against trusted root certs, because this implementation uses a self-signed cert. Otherwise you will get an error that looks like: OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain)

So it can be as simple as:

redis_client = Redis.new(url: ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })

$redis = redis_client
# and/or
Resque.redis = redis_client

I don’t use sidekiq on this project currently, but to get the SSL connection with VERIFY_NONE, looking at the sidekiq docs, you might have to do something like(?):

redis_conn = proc {
  Redis.new(url: ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })
}

Sidekiq.configure_client do |config|
  config.redis = ConnectionPool.new(size: 5, &redis_conn)
end

Sidekiq.configure_server do |config|
  config.redis = ConnectionPool.new(size: 25, &redis_conn)
end

(Not sure what values you should pick for connection pool size).

While the sidekiq docs mention heroku in passing, they don’t mention the need for SSL connections — I think awareness of this heroku feature, and heroku’s recommendation that you use it, may not actually be common!

Update: Beware REDIS_URL can also be rediss

On one of my apps I saw a REDIS_URL which used redis: and a REDIS_TLS_URL which used (secure) rediss:.

But on another app, heroku provides *only* a REDIS_URL, which is rediss:// — meaning you have to set verify_mode: OpenSSL::SSL::VERIFY_NONE when passing it to the ruby redis client. So you have to be prepared to do this with REDIS_URL values too — I think it shouldn’t hurt to set the ssl_params option even if you pass a non-SSL redis:// url, so maybe just set it all the time?

This second app was heroku-20 stack, and the first was heroku-18 stack, is that the difference? No idea.

Is this documented anywhere? I doubt it. It definitely seems sloppy for what I expect of heroku, and makes me a bit suspicious of whether heroku is sticking to the really impressive level of technical excellence and documentation I expect from them.

So your best bet is to check for both REDIS_TLS_URL and REDIS_URL, preferring the TLS one if present, and realizing that REDIS_URL can have a rediss:// value in it too.
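
Something like this, then (with the caveat, as above, that I’m assuming passing ssl_params for a plain redis:// url is harmless):

require "redis"
require "openssl"

# prefer the TLS url if heroku provides one, but REDIS_URL itself may be rediss://
redis_url = ENV["REDIS_TLS_URL"] || ENV["REDIS_URL"]

redis_client = Redis.new(
  url: redis_url,
  ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE }
)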

The heroku docs also say you don’t get a secure TLS redis connection on “hobby” plans, but I’m not sure that’s actually true anymore on heroku-20? Not trusting the docs is not a good sign.

Comparing performance of a Rails app on different Heroku formations

I develop a “digital collections” or “asset management” app, which manages and makes digitized historical objects and their descriptions available to the public, from the collections here at the Science History Institute.

The app receives a relatively low level of traffic (according to Google Analytics, around 25K pageviews a month), although we want it to be able to handle spikes without falling down. It is not the most performance-optimized app; it does have some relatively slow responses and can be RAM-hungry. But it works adequately on our current infrastructure: web traffic is handled on a single AWS EC2 t2.medium instance, with 10 passenger processes (free version of passenger, so no multi-threading).

We are currently investigating the possibility of moving our infrastructure to heroku. After realizing that heroku standard dynos did not seem to have the performance characteristics I had expected, I decided to approach performance testing more methodically, to compare different heroku dyno formations to each other and to our current infrastructure. Our basic research question is probably: what heroku formation do we need to have similar performance to our existing infrastructure?

I am not an expert at doing this — I did some research, read some blog posts, did some thinking, and embarked on this. I am going to lead you through how I approached this and what I found. Feedback or suggestions are welcome. The most surprising result I found was much poorer performance from heroku standard dynos than I expected, and specifically that standard dynos would not match performance of present infrastructure.

What URLs to use in test

Some older load-testing tools only support testing one URL over and over. I decided I wanted to test a larger sample list of URLs — to be a more “realistic” load, and also because repeatedly requesting only one URL might accidentally use caches in ways you aren’t expecting, giving you unrepresentative results. (Our app does not currently use fragment caching, but caches you might not even be thinking about include postgres’s built-in automatic caches, or passenger’s automatic turbocache (which I don’t think we have turned on).)

My initial thought was to get a list of such URLs from our already-in-production app’s logs, to get a sample of what real traffic looks like. There were a couple of barriers to using production log URLs:

  1. Some of those URLs might require authentication, or be POST requests. The bulk of our app’s traffic is GET requests available without authentication, and I didn’t feel like the added complexity of setting up anything else in the load test was worthwhile.
  2. Our app on heroku isn’t fully functional yet. Without having connected it to a Solr or background job workers, only certain URLs are available.

In fact, a large portion of our traffic is an “item” or “work” detail page like this one. Additionally, those are the pages that can be the biggest performance challenge, since the current implementation includes a thumbnail for every scanned page or other image, so response time unfortunately scales with number of pages in an item.

So I decided a good list of URLs was simply a representative sample of those “work detail” pages. In fact, rather than a completely random sample, I took the 50 largest/slowest work pages, then added another 150 randomly chosen from our current ~8K pages, and gave them all a randomly shuffled order.

In our app, every time a browser requests a work detail page, the JS on that page makes an additional request for a JSON document that powers our page viewer. So for each of those 200 work detail pages, I also added the JSON request URL, for a more “realistic” load, making 400 total URLs.
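
For what it’s worth, assembling that list was just a little one-off script, roughly like the sketch below (the model, attribute, and path names here are made up for illustration; they aren’t our actual app’s):

# 50 largest/slowest works plus 150 random others, each doubled with its
# JSON viewer URL, shuffled, and written out for the load-test script
base = "https://staging.example.org"

largest = Work.order(page_count: :desc).limit(50).to_a
random  = Work.where.not(id: largest.map(&:id)).order(Arel.sql("RANDOM()")).limit(150).to_a

urls = (largest + random).flat_map do |work|
  ["#{base}/works/#{work.id}", "#{base}/works/#{work.id}/viewer.json"]
end

File.write("sample_works.txt", urls.shuffle.join("\n"))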

Performance: “base speed” vs “throughput under load”

Thinking about it, I realized there were two kinds of “performance” or “speed” to think about.

You might just have a really slow app; to exaggerate, let’s say typical responses are 5 seconds. That’s under low/no traffic: a single browser is the only thing interacting with the app, it makes a single request, and has to wait 5 seconds for a response.

That number might be changed by optimizations or performance regressions in your code (including your dependencies). It might also be changed by moving or changing hardware or virtualization environment — including giving your database more CPU/RAM resources, etc.

But that number will not change by horizontally scaling your deployment — adding more puma or passenger processes or threads, or scaling out hosts with a load balancer or heroku dynos. None of that will change this base speed, because it’s just how long the app takes to prepare a response when not under load: how slow it is in a test with only one request at a time, where adding web workers won’t matter because they won’t be used.

Then there’s what happens to the app actually under load by multiple users at once. The base speed is kind of a lower bound on throughput under load — page response time is never going to get better than 5s for our hypothetical very slow app (without changing the underlying base speed). But it can get a lot worse if it’s hammered by traffic. This throughput under load can be affected not only by changing base speed, but also by various forms of horizontal scaling — how many puma or passenger processes you have with how many threads each, and how many CPUs they have access to, as well as the number of heroku dynos or other hosts behind a load balancer.

(I had been thinking about this distinction already, but Nate Berkopec’s great blog post on scaling Rails apps gave me the “speed” vs “throughput” terminology to use.)

In our situation, we are not changing the code at all. But we are changing the host architecture from a manual EC2 t2.medium to heroku dynos (of various possible types) in a way that could affect base speed, and we’re also changing our scaling architecture in a way that could change throughput under load on top of that — from one t2.medium with 10 passenger processes to possibly multiple heroku dynos behind heroku’s load balancer, and also (for Reasons) switching from free passenger to trying puma with multiple threads per process. (We are running puma 5 with the new experimental performance features turned on.)

So we’ll want to get a sense of base speed of the various host choices, and also look at how throughput under load changes based on various choices.

Benchmarking tool: wrk

We’re going to use wrk.

There are LOTS of choices for HTTP benchmarking/load testing, with really varying complexity and from different eras of web history. I got a bit overwhelmed by it, but settled on wrk. Some other choices didn’t have all the features we need (some way to test a list of URLs, with at least some limited percentile distribution reporting). Others were much more flexible and complicated and I had trouble even figuring out how to use them!

wrk does need a custom lua script in order to handle a list of URLs. I found a nice script here, and modified it slightly to take filename from an ENV variable, and not randomly shuffle input list.

It’s a bit confusing understanding the meaning of “threads” vs “connections” in wrk arguments. This blog post from appfolio clears it up a bit. I decided to leave threads set to 1, and vary connections for load — so -c1 -t1 is a “one URL at a time” setting we can use to test “base speed”, and we can benchmark throughput under load by increasing connections.

We want to make sure we run the test for long enough to touch all 400 URLs in our list at least once, even in the slower setups, to have a good comparison — ideally it would go through the list more than once, but for my own ergonomics I had to get through a lot of tests, so I ended up with less than ideal. (Should I have put fewer than 400 URLs in? Not sure.)

Conclusions in advance

As benchmarking posts go (especially when I’m the one writing them), I’m about to drop a lot of words and data on you. So to maximize the audience that sees the conclusions (because they surprise me, and I want feedback/pushback on them), I’m going to give you some conclusions up front.

Our current infrastructure has the web app on a single EC2 t2.medium, which is a burstable EC2 type — our relatively low-traffic app does not exhaust its burst credits. Measuring base speed (just one concurrent request at a time), we found that performance dynos seem to have about the CPU speed of a bursting t2.medium (just a hair slower).

But standard dynos are as a rule 2 to 3 times slower; additionally they are highly variable, and that variability can be over hours/days. A 3 minute period can have measured response times 2 or more times slower than another 3 minute period a couple hours later. But they seem to typically be 2-3x slower than our current infrastructure.

Under load, they scale about how you’d expect if you knew how many CPUs are present, no real surprises. Our existing t2.medium has two CPUs, so can handle 2 simultaneous requests as fast as 1, and after that degrades linearly.

A single performance-L ($500/month) has 4 CPUs (8 hyperthreads), so scales under load much better than our current infrastructure.

A single performance-M ($250/month) has only 1 CPU (!), so scales pretty terribly under load.

Testing scaling with 4 standard-2x’s ($200/month total), we see that it scales relatively evenly. Although lumpily, because of variability; and it starts out performing so much worse that even as it scales “evenly” it’s still out-performed by all the other architectures. :( (At these relatively fast median response times you might say it’s still fast enough, who cares; but in our fat tail of slower pages it gets more distressing.)

Now we’ll give you lots of measurements, or you can skip all that to my summary discussion or conclusions for our own project at the end.

Let’s compare base speed

OK, let’s get to actual measurements! For “base speed” measurements, we’ll be telling wrk to use only one connection and one thread.

Existing t2.medium: base speed

Our current infrastructure is one EC2 t2.medium. This EC2 instance type has two vCPUs and 4GB of RAM. On that single EC2 instance, we run passenger (free not enterprise) set to have 10 passenger processes, although the base speed test with only one connection should only touch one of the workers. The t2 is a “burstable” type, and we do always have burst credits (this is not a high traffic app; verified we never exhausted burst credits in these tests), so our test load may be taking advantage of burst cpu.

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://[current staging server]
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://staging-digital.sciencehistory.org
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   311.00ms  388.11ms   2.37s    86.45%
     Req/Sec    11.89      8.96    40.00     69.95%
   Latency Distribution
      50%   90.99ms
      75%  453.40ms
      90%  868.81ms
      99%    1.72s
   966 requests in 3.00m, 177.43MB read
 Requests/sec:      5.37
 Transfer/sec:      0.99MB

I’m actually feeling pretty good about those numbers on our current infrastructure! 90ms median, not bad, and even a 453ms 75th percentile is not too bad. Now, our test load involves some JSON responses that are quicker to deliver than the corresponding HTML pages, but still pretty good. The 90th/99th/max request (2.37s) aren’t great, but I knew I had some slow pages; this matches my previous understanding of how slow they are on our current infrastructure.

90th percentile is ~9 times the 50th percentile.

I don’t have an understanding of why the Req/Sec and Requests/sec values are so different, and don’t totally understand what to do with the Stdev and +/- Stdev values, so I’m just going to stick to looking at the latency percentiles; I think “latency” could also be called “response time” here.

But ok, this is our baseline for this workload. And doing this 3 minute test at various points over the past few days, I can say it’s nicely regular and consistent; occasionally I got a slower run, but the 50th percentile was usually 90ms–105ms, right around there.

Heroku standard-2x: base speed

From previous mucking about, I learned I can only reliably fit one puma worker in a standard-1x, and heroku says “we typically recommend a minimum of 2 processes, if possible” (for routing algorithmic reasons when scaled to multiple dynos), so I am just starting at a standard-2x with two puma workers each with 5 threads, matching heroku recommendations for a standard-2x dyno.
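
For reference, the puma side of that is a bog-standard heroku-style config/puma.rb driven by those same env variables, roughly like this sketch (our real config also turns on the experimental puma 5 features mentioned earlier, omitted here):

# config/puma.rb (sketch)
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

preload_app!
port ENV.fetch("PORT", 3000)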

So one thing I discovered is that benchmarks from a heroku standard dyno are really variable, but here are typical ones:

$ heroku dyno:resize
 type     size         qty  cost/mo
 ───────  ───────────  ───  ───────
 web      Standard-2X  1    50

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   645.08ms  768.94ms   4.41s    85.52%
     Req/Sec     5.78      4.36    20.00     72.73%
   Latency Distribution
      50%  271.39ms
      75%  948.00ms
      90%    1.74s
      99%    3.50s
   427 requests in 3.00m, 74.51MB read
 Requests/sec:      2.37
 Transfer/sec:    423.67KB

I had heard that heroku standard dynos would have variable performance, because they are shared multi-tenant resources. I had been thinking of this like during a 3 minute test I might see around the same median with more standard deviation — but instead, what it looks like to me is that running this benchmark on Monday at 9am might give very different results than at 9:50am or Tuesday at 2pm. The variability is over a way longer timeframe than my 3 minute test — so that’s something learned.

Running this here and there over the past week, the above results seem to me typical of what I saw. (To get better than “seem typical” on this resource, you’d have to run a test, over several days or a week I think, probably not hammering the server the whole time, to get a sense of actual statistical distribution of the variability).

I sometimes saw tests that were quite a bit slower than this, up to a 500ms median. I rarely if ever saw results much faster than this on a standard-2x. 90th percentile is ~6x median, less than on my current infrastructure, but that still gets up there to 1.74s instead of 868ms.

This typical run is quite a bit slower than our current infrastructure: its median response time is about 3x our current one, with the 90th percentile and max being around 2x. This was worse than I expected.

Heroku performance-m: base speed

Although we might be able to fit more puma workers in RAM, we’re running a single-connection base speed test, so it shouldn’t matter, and we won’t adjust it.

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-M  1    250

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   377.88ms  481.96ms   3.33s    86.57%
     Req/Sec    10.36      7.78    30.00     37.03%
   Latency Distribution
      50%  117.62ms
      75%  528.68ms
      90%    1.02s
      99%    2.19s
   793 requests in 3.00m, 145.70MB read
 Requests/sec:      4.40
 Transfer/sec:    828.70KB

This is a lot closer to the ballpark of our current infrastructure. It’s a bit slower (117ms median instead of 90ms median), but in running this now and then over the past week it was remarkably, thankfully, consistent. Median and 99th percentile are both 28% slower (it makes me feel comforted that those numbers are the same in these two runs!); that doesn’t bother me so much if it’s predictable and regular, which it appears to be. The max still appears to me a little bit less regular on heroku for some reason; since performance dynos are supposed to be non-shared AWS resources, you wouldn’t expect that, but slow requests are slow, ok.

90th percentile is ~9x median, about the same as my current infrastructure.

Heroku performance-l: base speed

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-L  1    500

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   471.29ms  658.35ms   5.15s    87.98%
     Req/Sec    10.18      7.78    30.00     36.20%
   Latency Distribution
      50%  123.08ms
      75%  635.00ms
      90%    1.30s
      99%    2.86s
   704 requests in 3.00m, 130.43MB read
 Requests/sec:      3.91
 Transfer/sec:    741.94KB

No news is good news, it looks very much like performance-m, which is exactly what we expected, because this isn’t a load test. It tells us that performance-m and performance-l seem to have similar CPU speeds and similar predictable non-variable regularity, which is what I find running this test periodically over a week.

90th percentile is ~10x median, about the same as current infrastructure.

The higher max is just evidence of what I mentioned: the speed of the slowest requests did seem to vary more than on our manual t2.medium; I can’t really explain why.

Summary: Base speed

Not sure how helpful this visualization is, charting 50th, 75th, and 90th percentile responses across architectures.

But basically: performance dynos perform similarly to my (bursting) t2.medium. Can’t explain why performance-l seems slightly slower than performance-m, might be just incidental variation when I ran the tests.

The standard-2x is about twice as slow as my (bursting) t2.medium. Again recall standard-2x results varied a lot every time I ran them, the one I reported seems “typical” to me, that’s not super scientific, admittedly, but I’m confident that standard-2x are a lot slower in median response times than my current infrastructure.

Throughput under load

Ok, now we’re going to tell wrk to use more connections. In fact, I’ll test each setup with various numbers of connections, and graph the result, to get a sense of how each formation can handle throughput under load. (This means a lot of minutes to get all these results, at 3 minutes per connection-count test, per formation!)

An additional thing we can learn from this test, on heroku we can look at how much RAM is being used after a load test, to get a sense of the app’s RAM usage under traffic to understand the maximum number of puma workers we might be able to fit in a given dyno.

Existing t2.medium: Under load

A t2.medium has 4G of RAM and 2 CPUs. We run 10 passenger workers (no multi-threading, since we are free, rather than enterprise, passenger). So what do we expect? With 2 CPUs and more than 2 workers, I’d expect it to handle 2 simultaneous streams of requests almost as well as 1; 3-10 should be quite a bit slower because they are competing for the 2 CPUs. Over 10, performance will probably become catastrophic.

2 connections are exactly flat with 1, as expected for our two CPUs, hooray!

Then it goes up at a strikingly even line. Going over 10 (to 12) simultaneous connections doesn’t matter, even though we’ve exhausted our workers, I guess at this point there’s so much competition for the two CPUs already.

The slope of this curve is really nice too, actually. Without load, our median response time is 100ms, but even at a totally overloaded 12 connections, it’s only 550ms, which actually isn’t too bad.

We can make a graph that in addition to median also has 75th, 90th, and 99th percentile response time on it:

It doesn’t tell us too much; it tells us the upper percentiles rise at about the same rate as the median. At 1 simultaneous connection 90th percentile of 846ms is about 9 times the median of 93ms; at 10 requests the 90th percentile of 3.6 seconds is about 8 times the median of 471ms.

This does remind us that under load, when things get slow, it has a more disastrous effect on already-slow requests than on fast ones. When not under load, even our 90th percentile was kind of sort of barely acceptable at 846ms, but under load at 3.6 seconds it really isn’t.

Single Standard-2X dyno: Under load

A standard-2X dyno has 1G of RAM. The (amazing, excellent, thanks schneems) heroku puma guide suggests running two puma workers with 5 threads each. At first I wanted to try running three workers, which seemed to fit into available RAM — but under heavy load-testing I was getting Heroku R14 Memory Quota Exceeded errors, so we’ll just stick with the heroku docs recommendations. Two workers with 5 threads each fit with plenty of headroom.

A standard-2x dyno runs on shared (multi-tenant) underlying Amazon virtual hardware. So while it is running on hardware with 4 CPUs (each of which can run two “hyperthreads”), the puma doc suggests “it is best to assume only one process can execute at a time” on standard dynos.

What do we expect? Well, if it really only had one CPU, it would immediately start getting bad at 2 simultaneous connections, and just get worse from there. When we exceed the two-worker count, will it get even worse? What about when we exceed the 10 thread (2 workers * 5 threads) count?

You’d never run just one dyno if you were expecting this much traffic; you’d always horizontally scale. This very artificial test is just to get a sense of its characteristics.

Also, we remember that standard-2x’s are just really variable; I could get much worse or better runs than this, but graphed numbers from a run that seemed typical.

Well, it really does act like 1 CPU, 2 simultaneous connections is immediately a lot worse than 1.

The line isn’t quite as straight as in our existing t2.medium, but it’s still pretty straight; I’d attribute the slight lumpiness to just the variability of shared-architecture standard dyno, and figure it would get perfectly straight with more data.

It degrades at about the same rate of our baseline t2.medium, but when you start out slower, that’s more disastrous. Our t2.medium at an overloaded 10 simultaneous requests is 473ms (pretty tolerable actually), 5 times the median at one request only. This standard-2x has a median response time of 273 ms at only one simultaneous request, and at an overloaded 10 requests has a median response time also about 5x worse, but that becomes a less tolerable 1480ms.

Does also graphing the 75th, 90th, and 99th percentile tell us much?

Eh, I think the lumpiness is still just standard shared-architecture variability.

The rate of “getting worse” as we add more overloaded connections is actually a bit better than it was on our t2.medium, but since it already starts out so much slower, we’ll just call it a wash. (On t2.medium, 90th percentile without load is 846ms and under an overloaded 10 connections 3.6s. On this single standard-2x, it’s 1.8s and 5.2s).

I’m not sure how much these charts with various percentiles on them tell us, so I won’t include them for every architecture from here on.

standard-2x, 4 dynos: Under load

OK, realistically we already know you shouldn’t have just one standard-2x dyno under that kind of load. You’d scale out, either manually or perhaps using something like the neat Rails Autoscale add-on.

Let’s measure with 4 dynos. Each is still running 2 puma workers, with 5 threads each.

What do we expect? Hm, treating each dyno as if it has only one CPU, we’d expect it to be able to handle traffic pretty levelly up to 4 simultaneous connections, distributed to 4 dynos. It’s going to do worse after that, but up to 8 there is still one puma worker per connection, so maybe it gets even worse after 8?

Well… I think that actually is relatively flat from 1 to 4 simultaneous connections, except for lumpiness from variability. But lumpiness from variability is huge! We’re talking 250ms median measured at 1 connection, up to 369ms measured median at 2, down to 274ms at 3.

And then maybe yeah, a fairly shallow slope up to 8 simultaneous connections, then steeper.

But it’s all a fairly shallow slope compared to our base t2.medium. At 8 connections (after which we pretty much max out), the standard-2x median of 464ms is only 1.8 times the median at 1 connection. Compare that to the t2.medium’s increase of 3.7 times.

As we’d expect, scaling out to 4 dynos (with four CPUs/8 hyperthreads) helps us scale well — the problem is the baseline is so slow to begin with (with very high bounds of variability making it regularly even slower).

performance-m: Under load

A performance-m has 2.5 GB of memory. It only has one physical CPU, although two “vCPUs” (two hyperthreads) — and these are dedicated to your app; it is not shared.

By testing under load, I demonstrated I could actually fit 12 workers on there without any memory limit errors. But is there any point to doing that with only 1 CPU/2 hyperthreads? Under a bit of testing, it appeared not.

The heroku puma docs recommend only 2 processes with 5 threads. You could do a whole little mini-experiment just trying to measure/optimize process/thread count on performance-m! We’ve already got too much data here, but in some experimentation it looked to me like 5 processes with 2 threads each performed better (and certainly no worse) than 2 processes with 5 threads — if you’ve got the RAM just sitting there anyway (as we do), why not?

I actually tested with 6 puma processes with 2 threads each. There is still a large amount of RAM headroom we aren’t going to use even under load.

What do we expect? Well, with the 2 “hyperthreads” perhaps it can handle 2 simultaneous requests nearly as well as 1 (or not?); after that, we expect it to degrade quickly same as our original t2.medium did.

It can handle 2 connections slightly better than you’d expect if there really were only 1 CPU, so I guess a hyperthread does give you something. Then the slope picks up, as you’d expect; and it looks like it does get steeper after 4 simultaneous connections, yup.

performance-l: Under load

A performance-l ($500/month) costs twice as much as a performance-m ($250/month), but has far more than twice the resources. performance-l has a whopping 14GB of RAM compared to performance-m’s 2.5GB; and performance-l has 4 real CPUs/8 hyperthreads available to use (visible using the nproc technique in the heroku puma article).

Because we have plenty of RAM to do so, we’re going to run 10 worker processes to match our original t2.medium’s. We still ran with 2 threads, just cause it seems like maybe you should never run a puma worker with only one thread? But who knows, maybe 10 workers with 1 thread each would perform better; plenty of room (but not plenty of my energy) for yet more experimentation.

What do we expect? The graph should be pretty flat up to 4 simultaneous connections, then it should start getting worse, pretty evenly as simultaneous connections rise all the way up to 12.

It is indeed pretty flat up to 4 simultaneous connections. Then up to 8 it’s still not too bad — the median at 8 is only ~1.5x the median at 1(!). Then it gets worse after 8 (oh yeah, 8 hyperthreads?).

But the slope is wonderfully shallow all the way. Even at 12 simultaneous connections, the median response time of 266ms is only 2.5x what it was at one connection. (In our original t2.medium, at 12 simultaneous connections median response time was over 5x what it was at 1 connection).

This thing is indeed a monster.

Summary Comparison: Under load

We showed a lot of graphs that look similar, but they all had different scales on the y-axis. Let’s plot median response times under load of all architectures on the same graph, and see what we’re really dealing with.

The blue t2.medium is our baseline, what we have now. We can see that there isn’t really a similar heroku option, we have our choice of better or worse.

The performance-l is just plain better than what we have now. It starts out performing about the same as what we have now for 1 or 2 simultaneous connections, but then scales so much flatter.

The performance-m also starts out about the same, but scales so much worse than even what we have now. (It’s that 1 real CPU instead of 2, I guess?)

The standard-2x scaled to 4 dynos… has its own characteristics. Its baseline is pretty terrible; it’s 2 to 3 times as slow as what we have now even when not under load. But then it scales pretty well, since it’s 4 dynos after all; it doesn’t get worse as fast as performance-m does. But it started out so badly that it remains far worse than our original t2.medium even under load. Adding more dynos to standard-2x will help it remain steady under even higher load, but won’t help its underlying problem: it’s just slower than everything else.

Discussion: Thoughts and Surprises

  • I had been thinking of a t2.medium (even with burst) as “typical” (it is after all much slower than my 2015 Macbook), and had been assuming (in retrospect with no particular basis) that a heroku standard dyno would perform similarly.
    • Most discussion and heroku docs, as well as the naming itself, suggest that a ‘standard’ dyno is, well, standard, and performance dynos are for “super scale, high traffic apps”, which is not me.
    • But in fact, heroku standard dynos are much slower and more variable in performance than a bursting t2.medium. I suspect they are slower than other options you might consider non-heroku “typical” options.



  • My conclusion is honestly that “standard” dynos are really “for very fast, well-optimized apps that can handle slow and variable CPU” and “performance” dynos are really “standard, matching the CPU speeds you’d get from a typical non-heroku option”. But this is not how they are documented or usually talked about. Are other people having really different experiences/conclusions than me? If so, why, or where have I gone wrong?
    • This of course has implications for estimating your heroku budget if considering switching over. :(
    • If you have a well-optimized fast app, say even 95th percentile is 200ms (on a bursting t2.medium), then you can handle standard dyno slowness — so what if your 95th percentile is now 600ms (and during some time periods even much slower, 1s or worse, due to variability)? That’s not so bad for a 95th percentile.
    • One way to get a very fast app is of course caching. There is lots of discussion of using caching in Rails, and sometimes the message (explicit or implicit) is “you have to use lots of caching to get reasonable performance cause Rails is so slow.” What if many of these people are on heroku, and it’s really “you have to use lots of caching to get reasonable performance on a heroku standard dyno”??
    • I personally don’t think caching is maintenance-free; in my experience, properly doing cache invalidation, and dealing with the significant processing spikes needed when you choose to invalidate your entire cache (cause cached HTML needs to change), leads to real maintenance/development cost. I have not needed caching to meet my performance goals on our present architecture.
    • Everyone doesn’t necessarily have the same performance goals/requirements. Mine, for a low-traffic non-commercial site, are maybe more modest: I just need users not to be super annoyed. But whatever your performance goals, you’re going to have to spend more time on optimization on a heroku standard dyno than on something with a much faster CPU — like a standard affordable mid-tier EC2. Am I wrong?


  • One significant factor on heroku standard dyno performance is that they use shared/multi-tenant infrastructure. I wonder if they’ve actually gotten lower performance over time, as many customers (who you may be sharing with) have gotten better at maximizing their utilization, so the shared CPUs are typically more busy? Like a frog boiling, maybe nobody noticed that standard dynos have become lower performance? I dunno, brainstorming.
    • Or maybe there are so many apps that start on heroku instead of switching from somewhere else, that people just don’t realize that standard dynos are much slower than other low/mid-tier options?
    • I was expecting to pay a premium for heroku — but even standard-2x’s are a significant premium over paying for t2.medium EC2 yourself, one I found quite reasonable…. performance dynos are of course even more premium.


  • I had a sort of baked-in premise that most Rails apps are “IO-bound”, meaning they spend more time waiting on IO than using CPU. I don’t know where I got that idea; I heard it once a long time ago and it became part of my mental model. I now do not believe this is true of my app, and I do not in fact believe it is true of most Rails apps in 2020. I would hypothesize that most Rails apps today are in fact CPU-bound.

  • The performance-m dyno only has one CPU. I had somehow also been assuming that it would have two CPUs — I’m not sure why, maybe just because at that price! It would be a much better deal with two CPUs.
    • Instead we have a huge jump from $250 performance-m to $500 performance-l that has 4x the CPUs and ~5x the RAM.
    • So it doesn’t make financial sense to have more than one performance-m dyno; you might as well go to performance-l. But this really complicates auto-scaling, whether using Heroku’s feature or the awesome Rails Autoscale add-on. I am not sure I can afford a performance-l all the time, and a performance-m might be sufficient most of the time. But if 20% of the time I’m going to need more (or even 5%, or even unexpectedly-mentioned-in-national-media), it would be nice to set things up to autoscale up…. I guess to a financially irrational 2 or more performance-m’s? :(

  • The performance-l is a very big machine, significantly beefier than my current infrastructure, and it has far more RAM than I need/can use with only 4 physical cores. If I consider standard dynos to be pretty effectively low-tier (as I do), heroku to me is kind of missing mid-tier options. A 2 CPU option at 2.5G or 5G of RAM would make a lot of sense to me, and actually be exactly what I need… really I think performance-m would make more sense with 2 CPUs at its existing already-premium price point, to properly be called a “performance” dyno. Maybe heroku is intentionally trying to set options to funnel people to the highest-priced performance-l.

Conclusion: What are we going to do?

In my investigations of heroku, my opinion of the developer UX and general service quality only increases. It’s a great product, that would increase our operational capacity and reliability, and substitute for so many person-hours of sysadmin/operational time if we were self-managing (even on cloud architecture like EC2).

But I had originally been figuring we’d use standard dynos (even more affordably, possibly auto-scaled with Rails Autoscale plugin), and am disappointed that they end up looking so much lower performance than our current infrastructure.

Could we use them anyway? Response time going from 100ms to 300ms — hey, 300ms is still fine, even if I’m sad to lose those really nice numbers I got from a bit of optimization. But this app has a wide long tail; our 75th percentile going from 450ms to 1s, our 90th percentile going from 860ms to 1.74s, and our 99th going from 2.3s to 4.4s is a lot harder to swallow. Especially when we know that due to standard dyno variability, a slow-ish page that on my present architecture is reliably 1.5s could really be anywhere from 3 to 9 seconds(!) on heroku.

I would anticipate having to spend a lot more developer time on optimization on heroku standard dynos — or, in this small over-burdened non-commercial shop, not prioritizing that (or not having the skills for it), and having our performance just get bad.

So I’m really reluctant to suggest moving our app to heroku with standard dynos.

A performance-l dyno is going to let us not have to think about performance any more than we do now, while scaling under high-traffic better than we do now — I suspect we’d never need to scale to more than one performance-l dyno. But it’s pricey for us.

A performance-m dyno has a base-speed that’s fine, but scales very poorly and unaffordably. Doesn’t handle an increase in load very well as one dyno, and to get more CPUs you have to pay far too much (especially compared to standard dynos I had been assuming I’d use).

So I don’t really like any of my options. If we do heroku, maybe we’ll try a performance-m, and “hope” our traffic is light enough that a single one will do? Maybe with Rails autoscale for traffic spikes, even though 2 performance-m dynos isn’t financially efficient? If we are scaling to 2 (or more!) performance-m’s more than very occasionally, switch to performance-l, which means we need to make sure we have the budget for it?

faster_s3_url: Optimized S3 url generation in ruby

Subsequent to my previous investigation about S3 URL generation performance, I ended up writing a gem with optimized implementations of S3 URL generation.

github: faster_s3_url

It has no dependencies (not even aws-sdk). It can speed up both public and presigned URL generation by around an order of magnitude. In benchmarks on my 2015 MacBook compared to aws-sdk-s3: public URLs from 180 in 10ms to 2200 in 10ms; presigned URLs from 10 in 10ms to 300 in 10ms (!!).

While if you are only generating a couple S3 URLs at a time you probably wouldn’t notice aws-sdk-ruby’s poor performance, if you are generating even just hundreds at a time, and especially for presigned URLs, it can really make a difference.
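
Usage looks roughly like this (a rough sketch; see the gem README for the actual current API):

require "faster_s3_url"

signer = FasterS3Url::Builder.new(
  bucket_name: "my-bucket",
  region: "us-east-1",
  access_key_id: ENV["AWS_ACCESS_KEY_ID"],
  secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
)

signer.public_url("path/to/image.jpg")
signer.presigned_url("path/to/image.jpg", response_content_disposition: "attachment")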

faster_s3_url supports the full API for aws-sdk-s3 presigned URLs, including custom params like response_content_disposition. Its tests actually verify that results match what aws-sdk-s3 would generate.

For shrine users, faster_s3_url includes a Shrine storage sub-class that can be a drop-in replacement for Shrine::Storage::S3, so all your S3 URL generation via shrine uses the optimized implementation.

Key in giving me the confidence to think I could pull off an independent S3 presigned URL implementation was seeing WeTransfer’s wt_s3_signer gem be successful. wt_s3_signer makes some assumptions/restrictions to get even higher performance than faster_s3_url (two or three times as fast) — but the restrictions/assumptions and API needed to get that performance weren’t suitable for my use cases, so I implemented my own.