Managed Solr SaaS Options

I was recently looking for managed Solr “software-as-a-service” (SaaS) options, and had trouble figuring out what was out there. So I figured I’d share what I learned. Even though my knowledge here is far from exhaustive, and I have only looked seriously at one of the ones I found.

The only managed Solr options I found were: WebSolr; SearchStax; and OpenSolr.

Of these, i think WebSolr and SearchStax are more well-known, I couldn’t find anyone with experience with OpenSolr, which perhaps is newer.

Of them all, SearchStax is the only one I actually took for a test drive, so will have the most to say about.

Why we were looking

We run a fairly small-scale app, whose infrastructure is currently 4 self-managed AWS EC2 instances, running respectively: 1) A rails web app 2) Bg workers for the rails web app 3) Postgres, and 4) Solr.

Oh yeah, there’s also a redis running one of those servers, on #3 with pg or #4 with solr, I forget.

Currently we manage this all ourselves, right on the EC2. But we’re looking to move as much as we can into “managed” servers. Perhaps we’ll move to Heroku. Perhaps we’ll use hatchbox. Or if we do stay on AWS resources we manage directly, we’d look at things like using an AWS RDS Postgres instead of installing it on an EC2 ourselves, an AWS ElastiCache for Redis, maybe look into Elastic Beanstalk, etc.

But no matter what we do, we need a Solr, and we’d like to get it managed. Hatchbox has no special Solr support, AWS doesn’t have a Solr service, Heroku does have a solr add-on but you can also use any Solr with it and we’ll get to that later.

Our current Solr use is pretty small scale. We don’t run “SolrCloud mode“, just legacy ordinary Solr. We only have around 10,000 documents in there (tiny for Solr), our index size is only 70MB. Our traffic is pretty low — when I tried to figure out how low, it doesn’t seem we have sufficient logging turned on to answer that specifically but using proxy metrics to guess I’d say 20K-40K requests a day, query as well as add.

This is a pretty small Solr installation, although it is used centrally for the primary functions of the (fairly low-traffic) app. It currently runs on an EC2 t3a.small, which is a “burstable” EC2 type with only 2G of RAM. It does have two vCPUs (that is one core with ‘hyperthreading’). The t3a.small EC2 instance only costs $14/month on-demand price! We know we’ll be paying more for managed Solr, but we want to do get out of the business of managing servers — we no longer really have the staff for it.

WebSolr (didn’t actually try out)

WebSolr is the only managed Solr currently listed as a Heroku add-on. It is also available as a managed Solr independent of heroku.

The pricing in the heroku plans vs the independent plans seems about the same. As a heroku add-on there is a $20 “staging” plan that doesn’t exist in the independent plans. (Unlike some other heroku add-ons, no time-limited free plan is available for WebSolr). But once we go up from there, the plans seem to line up.

Starting at: $59/month for:

  • 1 million document limit
  • 40K requests/day
  • 1 index
  • 954MB storage
  • 5 concurrent requests limit (this limit is not mentioned on the independent pricing page?)

Next level up is $189/month for:

  • 5 million document limit
  • 150K requests/day
  • 4.6GB storage
  • 10 concurrent request limit (again concurrent request limits aren’t mentioned on independent pricing page)

As you can see, WebSolr has their plans metered by usage.

$59/month is around the price range we were hoping for (we’ll need two, one for staging one for production). Our small solr is well under 1 million documents and ~1GB storage, and we do only use one index at present. However, the 40K requests/day limit I’m not sure about, even if we fit under it, we might be pushing up against it.

And the “concurrent request” limit simply isn’t one I’m even used to thinking about. On a self-managed Solr it hasn’t really come up. What does “concurrent” mean exactly in this case, how is it measured? With 10 puma web workers and sometimes a possibly multi-threaded batch index going on, could we exceed a limit of 4? Seems plausible. What happens when they are exceeded? Your Solr request results in an HTTP 429 error!

Do I need to now write the app to rescue those gracefully, or use connection pooling to try to avoid them, or something? Having to rewrite the way our app functions for a particular managed solr is the last thing we want to do. (Although it’s not entirely clear if those connection limits exist on the non-heroku-plugin plans, I suspect they do?).

And in general, I’m not thrilled with the way the pricing works here, and the price points. I am positive for a lot of (eg) heroku customers an additional $189*2=$378/month is peanuts not even worth accounting for, but for us, a small non-profit whose app’s traffic does not scale with revenue, that starts to be real money.

It is not clear to me if WebSolr installations (at “standard” plans) are set up in “SolrCloud mode” or not; I’m not sure what API’s exist for uploading your custom schema.xml (which we’d need to do), or if they expect you to do this only manually through a web UI (that would not be good); I’m not sure if you can upload custom solrconfig.xml settings (this may be running on a shared solr instance with standard solrconfig.xml?).

Basically, all of this made WebSolr not the first one we looked at.

Does it matter if we’re on heroku using a managed Solr that’s not a Heroku plugin?

I don’t think so.

In some cases, you can get a better price from a Heroku plug-in than you could get from that same vendor not on heroku or other competitors. But that doesn’t seem to be the case here, and other that that does it matter?

Well, all heroku plug-ins are required to bill you by-the-minute, which is nice but not really crucial, other forms of billing could also be okay at the right price.

With a heroku add-on, your billing is combined into one heroku invoice, no need to give a credit card to anyone else, and it can be tracked using heroku tools. Which is certainly convenient and a plus, but not essential if the best tool for the job is not a heroku add-on.

And as a heroku add-on, WebSolr provides a WEBSOLR_URL heroku config/env variable automatically to code running on heroku. OK, that’s kind of nice, but it’s not a big deal to set a SOLR_URL heroku config manually referencing the appropriate address. I suppose as a heroku add-on, WebSolr also takes care of securing and authenticating connections between the heroku dynos and the solr, so we need to make sure we have a reasonable way to do this from any alternative.

SearchStax (did take it for a spin)

SearchStax’s pricing tiers are not based on metering usage. There are no limits based on requests/day or concurrent connections. SearchStax runs on dedicated-to-you individual Solr instances (I would guess running on dedicated-to-you individual (eg) EC2, but I’m not sure). Instead the pricing is based on size of host running Solr.

You can choose to run on instances deployed to AWS, Google Cloud, or Azure. We’ll be sticking to AWS (the others, I think, have a slight price premium).

While SearchStax gives you a pricing pages that looks like the “new-way-of-doing-things” transparent pricing, in fact there isn’t really enough info on public pages to see all the price points and understand what you’re getting, there is still a kind of “talk to a salesperson who has a price sheet” thing going on.

What I think I have figured out from talking to a salesperson and support, is that the “Silver” plans (“Starting at $19 a month”, although we’ll say more about that in a bit) are basically: We give you a Solr, we don’t don’t provide any technical support for Solr.

While the “Gold” plans “from $549/month” are actually about paying for Solr consultants to set up and tune your schema/index etc. That is not something we need, and $549+/month is way more than the price range we are looking for.

While the SearchStax pricing/plan pages kind of imply the “Silver” plan is not suitable for production, in fact there is no real reason not to use it for production I think, and the salesperson I talked to confirmed that — just reaffirming that you were on your own managing the Solr configuration/setup. That’s fine, that’s what we want, we just don’t want to mangage the OS or set up the Solr or upgrade it etc. The Silver plans have no SLA, but as far as I can tell their uptime is just fine. The Silver plans only guarantees 72-hour support response time — but for the couple support tickets I filed asking questions while under a free 14-day trial (oh yeah that’s available), I got prompt same-day responses, and knowledgeable responses that answered my questions.

So a “silver” plan is what we are interested in, but the pricing is not actually transparent.

$19/month is for the smallest instance available, and IF you prepay/contract for a year. They call that small instance an NDN1 and it has 1GB of RAM and 8GB of storage. If you pay-as-you-go instead of contracting for a year, that already jumps to $40/month. (That price is available on the trial page).

When you are paying-as-you-go, you are actually billed per-day, which might not be as nice as heroku’s per-minute, but it’s pretty okay, and useful if you need to bring up a temporary solr instance as part of a migration/upgrade or something like that.

The next step up is an “NDN2” which has 2G of RAM and 16GB of storage, and has an ~$80/month pay-as-you-go — you can find that price if you sign-up for a free trial. The discount price price for an annual contract is a discount similar to the NDN1 50%, $40/month — that price I got only from a salesperson, I don’t know if it’s always stable.

It only occurs to me now that they don’t tell you how many CPUs are available.

I’m not sure if I can fit our Solr in the 1G NDN1, but I am sure I can fit it in the 2G NDN2 with some headroom, so I didn’t look at plans above that — but they are available, still under “silver”, with prices going up accordingly.

All SearchStax solr instances run in “SolrCloud” mode — these NDN1 and NDN2 ones we’re looking at just run one node with one zookeeper, but still in cloud mode. There are also “silver” plans available with more than one node in a “high availability” configuration, but the prices start going up steeply, and we weren’t really interested in that.

Because it’s SolrCloud mode though, you can use the standard Solr API for uploading your configuration. It’s just Solr! So no arbitrary usage limits, no features disabled.

The SearchStax web console seems competently implemented; it let’s you create and delete individual Solr “deployments”, manage accounts to login to console (on “silver” plan you only get two, or can pay $10/month/account for more, nah), and set up auth for a solr deployment. They support IP-based authentication or HTTP Basic Auth to the Solr (no limit to how many Solr Basic Auth accounts you can create). HTTP Basic Auth is great for us, because trying to do IP-based from somewhere like heroku isn’t going to work. All Solrs are available over HTTPS/SSL — great!

SearchStax also has their own proprietary HTTP API that lets you do most anything, including creating/destroying deployments, managing Solr basic auth users, basically everything. There is some API that duplicates the Solr Cloud API for adding configsets, I don’t think there’s a good reason to use it instead of standard SolrCloud API, although their docs try to point you to it. There’s even some kind of webhooks for alerts! (which I haven’t really explored).

Basically, SearchStax just seems to be a sane and rational managed Solr option, it has all the features you’d expect/need/want for dealing with such. The prices seem reasonable-ish, generally more affordable than WebSolr, especially if you stay in “silver” and “one node”.

At present, we plan to move forward with it.

OpenSolr (didn’t look at it much)

I have the least to say about this, have spent the least time with it, after spending time with SearchStax and seeing it met our needs. But I wanted to make sure to mention it, because it’s the only other managed Solr I am even aware of. Definitely curious to hear from any users.

Here is the pricing page.

The prices seem pretty decent, perhaps even cheaper than SearchStax, although it’s unclear to me what you get. Does “0 Solr Clusters” mean that it’s not SolrCloud mode? After seeing how useful SolrCloud APIs are for management (and having this confirmed by many of my peers in other libraries/museums/archives who choose to run SolrCloud), I wouldn’t want to do without it. So I guess that pushes us to “executive” tier? Which at $50/month (billed yearly!) is still just fine, around the same as SearchStax.

But they do limit you to one solr index; I prefer SearchStax’s model of just giving you certain host resources and do what you want with it. It does say “shared infrastructure”.

Might be worth investigating, curious to hear more from anyone who did.

Now, what about ElasticSearch?

We’re using Solr mostly because that’s what various collaborative and open source projects in the library/museum/archive world have been doing for years, since before ElasticSearch even existed. So there are various open source libraries and toolsets available that we’re using.

But for whatever reason, there seem to be SO MANY MORE managed ElasticSearch SaaS available. At possibly much cheaper pricepoints. Is this because the ElasticSearch market is just bigger? Or is ElasticSearch easier/cheaper to run in a SaaS environment? Or what? I don’t know.

But there’s the controversial AWS ElasticSearch Service; there’s the Elastic Cloud “from the creators of ElasticSearch”. On Heroku that lists one Solr add-on, there are THREE ElasticSearch add-ons listed: ElasticCloud, Bonsai ElasticSearch, and SearchBox ElasticSearch.

If you just google “managed ElasticSearch” you immediately see 3 or 4 other names.

I don’t know enough about ElasticSearch to evaluate them. There seem on first glance at pricing pages to be more affordable, but I may not know what I’m comparing and be looking at tiers that aren’t actually usable for anything or will have hidden fees.

But I know there are definitely many more managed ElasticSearch SaaS than Solr.

I think ElasticSearch probably does everything our app needs. If I were to start from scratch, I would definitely consider ElasticSearch over Solr just based on how many more SaaS options there are. While it would require some knowledge-building (I have developed a lot of knowlege of Solr and zero of ElasticSearch) and rewriting some parts of our stack, I might still consider switching to ES in the future, we don’t do anything too too complicated with Solr that would be too too hard to switch to ES, probably.

Gem authors, check your release sizes

Most gems should probably be a couple hundred kb at most. I’m talking about the package actually stored in and downloaded from rubygems by an app using the gem.

After all, source code is just text, and it doesn’t take up much space. OK, maybe some gems have a couple images in there.

But if you are looking at your gem in rubygems and realize that it’s 10MB or bigger… and that it seems to be getting bigger with every release… something is probably wrong and worth looking into it.

One way to look into it is to look at the actual gem package. If you use the handy bundler rake task to release your gem (and I recommend it), you have a ./pkg directory in your source you last released from. Inside it are “.gem” files for each release you’ve made from there, unless you’ve cleaned it up recently.

.gem files are just .tar files it turns out. That have more tar and gz files inside them etc. We can go into it, extract contents, and use the handy unix utility du -sh to see what is taking up all the space.

How I found the bytes

jrochkind-chf kithe (master ?) $ cd pkg

jrochkind-chf pkg (master ?) $ ls
kithe-2.0.0.beta1.gem        kithe-2.0.0.pre.rc1.gem
kithe-2.0.0.gem            kithe-2.0.1.gem
kithe-2.0.0.pre.beta1.gem    kithe-2.0.2.gem

jrochkind-chf pkg (master ?) $ mkdir exploded

jrochkind-chf pkg (master ?) $ cp kithe-2.0.0.gem exploded/kithe-2.0.0.tar

jrochkind-chf pkg (master ?) $ cd exploded

jrochkind-chf exploded (master ?) $ tar -xvf kithe-2.0.0.tar
 x metadata.gz
 x data.tar.gz
 x checksums.yaml.gz

jrochkind-chf exploded (master ?) $  mkdir unpacked_data_tar

jrochkind-chf exploded (master ?) $ tar -xvf data.tar.gz -C unpacked_data_tar/

jrochkind-chf exploded (master ?) $ cd unpacked_data_tar/

jrochkind-chf unpacked_data_tar (master ?) $ du -sh *
 4.0K    Rakefile
 160K    app
 8.0K    config
  32K    db
 100K    lib
 300M    spec

jrochkind-chf unpacked_data_tar (master ?) $ cd spec

jrochkind-chf spec (master ?) $ du -sh *
 8.0K    derivative_transformers
 300M    dummy
  12K    factories
  24K    indexing
  72K    models
 4.0K    rails_helper.rb
  44K    shrine
  12K    simple_form_enhancements
 8.0K    spec_helper.rb
 188K    test_support
 4.0K    validators

jrochkind-chf spec (master ?) $ cd dummy/

jrochkind-chf dummy (master ?) $ du -sh *
 4.0K    Rakefile
  56K    app
  24K    bin
 124K    config
 8.0K    db
 300M    log
 4.0K    package.json
  12K    public
 4.0K    tmp

Doh! In this particular gem, I have a dummy rails app, and it has 300MB of logs, cause I haven’t b bothered trimming them in a while, that are winding up including in the gem release package distributed to rubygems and downloaded by all consumers! Even if they were small, I don’t want these in the released gem package at all!

That’s not good! It only turns into 12MB instead of 300MB, because log files are so compressable and there is compression involved in assembling the rubygems package. But I have no idea how much space it’s actually taking up on consuming applications machines. This is very irresponsible!

What controls what files are included in the gem package?

Your .gemspec file of course. The line s.files = is an array of every file to include in the gem package. Well, plus s.test_files is another array of more files, that aren’t supposed to be necessary to run the gem, but are to test it.

(Rubygems was set up to allow automated *testing* of gems after download, is why test files are included in the release package. I am not sure how useful this is, and who if anyone does it; although I believe that some linux distro packagers try to make use of it, for better or worse).

But nobody wants to list every file in your gem individually, manually editing the array every time you add, remove, or move one. Fortunately, gemspec files are executable ruby code, so you can use ruby as a shortcut.

I have seen two main ways of doing this, with different “gem skeleton generators” taking one of two approaches.

Sometimes a shell out to git is used — the idea is that everything you have checked into your git should be in the gem release package, no more or no less. For instance, one of my gems has this in it, not sure where it came from or who/what generated it.

spec.files = `git ls-files -z`.split("\x0").reject do |f|

In that case, it wouldn’t have included anything in ./spec already, so this obviously isn’t actually the gem we were looking at before.

But in this case, in addition to using ruby logic to manipulate the results, nothing excluded by your .gitignore file will end up included in your gem package, great!

In kithe we were looking at before, those log files were in the .gitignore (they weren’t in my repo!), so if I had been using that git-shellout technique, they wouldn’t have been included in the ruby release already.

But… I wasn’t. Instead this gem has a gemspec that looks like:

s.test_files = Dir["spec/*/"]

Just include every single file inside ./spec in the test_files list. Oops. Then I get all those log files!

One way to fix

I don’t really know which is to be preferred of the git-shellout approach vs the dir-glob approach. I suspect it is the subject of historical religious wars in rubydom, when there were still more people around to argue about such things. Any opinions? Or another approach?

Without being in the mood to restructure this gemspec in anyway, I just did the simplest thing to keep those log files out…

Dir["spec/*/"].delete_if {|a| a =~ %r{/dummy/log/}}

Build the package without releasing with the handy bundler supplied rake build task… and my gem release package size goes from 12MB to 64K. (which actually kind of sounds like a minimum block size or something, right?)

Phew! That’s a big difference! Sorry for anyone using previous versions and winding up downloading all that cruft! (Actually this particular gem is mostly a proof of concept at this point and I don’t think anyone else is using it).

Check your gem sizes!

I’d be willing to be there are lots of released gems with heavily bloated release packages like this. This isn’t the first one I’ve realized was my fault. Because who pays attention to gem sizes anyway? Apparently not many!

But rubygems does list them, so it’s pretty easy to see. Are your gem release packages multiple megs, when there’s no good reason for them to be? Do they get bigger every release by far more than the bytes of lines of code you think were added? At some point in gem history was there a big jump from hundreds of KB to multiple MB? When nothing particularly actually happened to gem logic to lead to that?

All hints that you might be including things you didn’t mean to include, possibly things that grow each release.

You don’t need to have a dummy rails app in your repo to accidentally do this (I accidentally did it once with a gem that had nothing to do with rails). There could be other kind of log files. Or test coverage or performance metric files, or any other artifacts of your build or your development, especially ones that grow over time — that aren’t actually meant to or needed as part of the gem release package!

It’s good to sanity check your gem release packages now and then. In most cases, your gem release package should be hundreds of KB at most, not MBs. Help keep your users’ installs and builds faster and slimmer!

Updating SolrCloud configuration in ruby

We have an app that uses Solr. We currently run a Solr in legacy “not cloud” mode. Our solr configuration directory is on disk on the Solr server, and it’s up to our processes to get our desired solr configuration there, and to update it when it changes.

We are in the process of moving to a Solr in “SolrCloud mode“, probably via the SearchStax managed Solr service. Our Solr “Cloud” might only have one node, but “SolrCloud mode” gives us access to additional API’s for managing your solr configuration, as opposed to writing it directly to disk (which may not be possible at all in SolrCloud mode? And certainly isn’t using managed SearchStax).

That is, the Solr ConfigSets API, although you might also want to use a few pieces of the Collection Management API for associating a configset with a Solr collection.

Basically, you are taking your desired solr config directory, zipping it up, and uploading it to Solr as a “config set” [or “configset”] with a certain name. Then you can create collections using this config set, or reassign which named configset an existing collection uses.

I wasn’t able to find any existing ruby gems for interacting with these Solr API’s. RSolr is a “ruby client for interacting with solr”, but was written before most of these administrative API’s existed for Solr, and doesn’t seem to have been updated to deal with them (unless I missed it), RSolr seems to be mostly/only about querying solr, and some limited indexing.

But no worries, it’s not too hard to wrap the specific API I want to use in some ruby. Which did seem far better to me than writing the specific HTTP requests each time (and making sure you are dealing with errors etc!). (And yes, I will share the code with you).

I decided I wanted an object that was bound to a particular solr collection at a particular solr instance; and was backed by a particular local directory with solr config. That worked well for my use case, and I wound up with an API that looks like this:

updater =
  solr_url: "",
  conf_dir: "./solr/conf",
  collection_name: "myCollection"

# will zip up ./solr/conf and upload it as named MyConfigset:

updater.list #=> ["myConfigSet"]
updater.config_name # what configset name is MyCollection currently configured to use?
# => "oldConfigSet"

# what if we try to delete the one it's using?
# => raises SolrConfigsetUpdater::SolrError with message:
# "Can not delete ConfigSet as it is currently being used by collection [myConfigset]"

# okay let's change it to use the new one and delete the old one

# now MyCollection uses this new configset, although we possibly
# need to reload the collection to make that so
# now let's delete the one we're not using

OK, great. There were some tricks in there in trying to catch the apparently multiple ways Solr can report different kinds of errors, to make sure Solr-reported errors turn into exceptions ideally with good error messages.

Now, in addition to uploading a configset initially for a collection you are creating to use, the main use case I have is wanting to UPDATE the configuration to new values in an existing collection. Sure, this often requires a reindex afterwards.

If you have the recently released Solr 8.7, it will let you overwrite an existing configset, so this can be done pretty easily.

updater.upload(updater.config_name, overwrite: true)

But prior to Solr 8.7 you can not overwrite an existing configset. And SearchStax doesn’t yet have Solr 8.7. So one way or another, we need to do a dance where we upload the configset under a new name than switch the collection to use it.

Having this updater object that lets us easily execute relevant Solr API lets us easily experiment with different logic flows for this. For instance in a Solr listserv thread, Alex Halovnic suggests a somewhat complicated 8-step process workaround, which we can implement like so:

current_name = updater.config_name
temp_name = "#{current_name}_temp"

updater.create(from: current_name, to: temp_name)
updater.upload(configset_name: current_name)

That works. But talking to Dann Bohn at Penn State University, he shared a different algorithm, which goes like:

  • Make a cryptographic digest hash of the entire solr directory, which we’re going to use in the configset name.
  • Check if the collection is already using a configset named $name_$digest, which if it already is, you’re done, no change needed.
  • Otherwise, upload the configset with the fingerprint-based name, switch the collection to use it, reload, delete the configset that the collection used to use.

At first this seemed like overkill to me, but after thinking and experimenting with it, I like it! It is really quick to make a digest of a handful of files, that’s not a big deal. (I use first 7 chars of hex SHA256). And even if we had Solr 8.7, I like that we can avoid doing any operation on solr at all if there had been no changes — I really want to use this operation much like a Rails db:migrate, running it on every deploy to make sure the solr schema matches the one in the repo for the depoy.

Dann also shared his open source code with me, which was helpful for seeing how to make the digest, how to make a Zip file in ruby, etc. Thanks Dann!

Sharing my code

So I also wrote some methods to implement those variant updating stragies, Dann’s, and Alex Halovnic’s from the list etc.

I thought about wrapping this all up as a gem, but I didn’t really have the time to make it really good enough for that. My API is a little bit janky, I didn’t spend the extra time think it out really well to minimize the need for future backwards incompat changes like I would if it were a gem. I also couldn’t figure out a great way to write automated tests for this that I would find particularly useful; so in my code base it’s actually not currently test-covered (shhhhh) but in a gem I’d want to solve that somehow.

But I did try to write the code general purpose/flexible so other people could use it for their use cases; I tried to document it to my highest standards; and I put it all in one file which actually might not be the best OO abstraction/design, but makes it easier for you to copy and paste the single file for your own use. :)

So you can find my code here; it is apache-licensed; and you are welcome to copy and paste it and do whatever you like with it, including making a gem yourself if you want. Maybe I’ll get around to making it a gem in the future myself, I dunno, curious if there’s interest.

The SearchStax proprietary API’s

SearchStax has it’s own API’s that can I think be used for updating configsets and setting collections to use certain configsets etc. When I started exploring them, they are’t the worst vendor API’s I’ve seen, but I did find them a bit cumbersome to work with. The auth system involves a lot of steps (why can’t you just create an API Key from the SearchStax Web GUI?).

Overall I found them harder to use than just the standard Solr Cloud API’s, which worked fine in the SearchStax deployment, and have the added bonus of being transferable to any SolrCloud deployment instead of being SearchStax-specific. While the SearchStax docs and support try to steer you to the SearchStax specific API’s, I don’t think there’s really any good reason for this. (Perhaps the custom SearchStax API’s were written long ago when Solr API’s weren’t as complete?)

SearchStax support suggested that the SearchStax APIs were somehow more secure; but my SearchStax Solr API’s are protected behind HTTP basic auth, and if I’ve created basic auth credentials (or IP addr allowlist) those API’s will be available to anyone with auth to access Solr whether I use em or not! And support also suggested that the SearchStax API use would be logged, whereas my direct Solr API use would not be, which seems to be true at least in default setup, I can probably configure solr logging differently, but it just isn’t that important to me for these particular functions.

So after some initial exploration with SearchStax API, I realized that SolrCloud API (which I had never used before) could do everything I need and was more straightforward and transferable to use, and I’m happy with my decision to go with that.

Are you talking to Heroku redis in cleartext or SSL?

In “typical” Redis installation, you might be talking to redis on localhost or on a private network, and clients typically talk to redis in cleartext. Redis doesn’t even natively support communications over SSL. (Or maybe it does now with redis6?)

However, the Heroku redis add-on (the one from Heroku itself) supports SSL connections via “Stunnel”, a tool popular with other redis users use to get SSL redis connections too. (Or maybe via native redis with redis6? Not sure if you’d know the difference, or if it matters).

There are heroku docs on all of this which say:

While you can connect to Heroku Redis without the Stunnel buildpack, it is not recommend. The data traveling over the wire will be unencrypted.

Perhaps especially because on heroku your app does not talk to redis via localhost or on a private network, but on a public network.

But I think I’ve worked on heroku apps before that missed this advice and are still talking to heroku in the clear. I just happened to run across it when I got curious about the REDIS_TLS_URL env/config variable I noticed heroku setting.

Which brings us to another thing, that heroku doc on it is out of date, it doesn’t mention the REDIS_TLS_URL config variable, just the REDIS_URL one. The difference? the TLS version will be a url beginning with rediss:// instead of redis:// , note extra s, which many redis clients use as a convention for “SSL connection to redis probably via stunnel since redis itself doens’t support it”. The redis docs provide ruby and go examples which instead use REDIS_URL and writing code to swap the redis:// for rediss:// and even hard-code port number adjustments, which is silly!

(While I continue to be very impressed with heroku as a product, I keep running into weird things like this outdated documentation, that does not match my experience/impression of heroku’s all-around technical excellence, and makes me worry if heroku is slipping…).

The docs also mention a weird driver: ruby arg for initializing the Redis client that I’m not sure what it is and it doesn’t seem necessary.

The docs are correct that you have to tell the ruby Redis client not to try to verify SSL keys against trusted root certs, and this implementation uses a self-signed cert. Otherwise you will get an error that looks like: OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain)

So, can be as simple as:

redis_client = ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })

$redis = redis_client
# and/or
Resque.redis = redis_client

I don’t use sidekiq on this project currently, but to get the SSL connection with VERIFY_NONE, looking at sidekiq docs maybe on sidekiq docs you might have to(?):

redis_conn = proc { ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })

Sidekiq.configure_client do |config|
  config.redis = 5, &redis_conn)

Sidekiq.configure_server do |config|
  config.redis = 25, &redis_conn)

(Not sure what values you should pick for connection pool size).

While the sidekiq docs mention heroku in passing, they don’t mention need for SSL connections — I think awareness of this heroku feature and their recommendation you use it may not actually be common!

Update: Beware REDIS_URL can also be rediss

On one of my apps I saw a REDIS_URL which used redis: and a REDIS_TLS_URL which uses (secure) rediss:.

But on another app, it provides *only* a REDIS_URL, which is rediss — meaning you have to set the verify_mode: OpenSSL::SSL::VERIFY_NONE when passing it to ruby redis client. So you have to be prepared to do this with REDIS_URL values too — I think it shouldn’t hurt to set the ssl_params option even if you pass it a non-ssl redis: url, so just set it all the time?

This second app was heroku-20 stack, and the first was heroku-18 stack, is that the difference? No idea.

Documented anywhere? I doubt it. Definitely seems sloppy for what I expect of heroku, making me get a bit suspicious of whether heroku is sticking to the really impressive level of technical excellence and documentation I expect from them.

So, your best bet is to check for both REDIS_TLS_URL and REDIS_URL, prefering the TLS one if present, realizing the REDIS_URL can have a rediss:// value in it too.

The heroku docs also say you don’t get secure TLS redis connection on “hobby” plans, but I”m not sure that’s actually true anymore on heroku-20? Not trusting the docs is not a good sign.

Comparing performance of a Rails app on different Heroku formations

I develop a “digital collections” or “asset management” app, which manages and makes digitized historical objects and their descriptions available to the public, from the collections here at the Science History Institute.

The app receives relatively low level of traffic (according to Google Analytics, around 25K pageviews a month), although we want it to be able to handle spikes without falling down. It is not the most performance-optimized app, it does have some relatively slow responses and can be RAM-hungry. But it works adequately on our current infrastructure: Web traffic is handled on a single AWS EC2 t2.medium instance, with 10 passenger processes (free version of passenger, so no multi-threading).

We are currently investigating the possibility of moving our infrastructure to heroku. After realizing that heroku standard dynos did not seem to have the performance characteristics I had expected, I decided to approach performance testing more methodically, to compare different heroku dyno formations to each other and to our current infrastructure. Our basic research question is probably What heroku formation do we need to have similar performance to our existing infrastructure?

I am not an expert at doing this — I did some research, read some blog posts, did some thinking, and embarked on this. I am going to lead you through how I approached this and what I found. Feedback or suggestions are welcome. The most surprising result I found was much poorer performance from heroku standard dynos than I expected, and specifically that standard dynos would not match performance of present infrastructure.

What URLs to use in test

Some older load-testing tools only support testing one URL over and over. I decided I wanted to test a larger sample list of URLs — to be a more “realistic” load, and also because repeatedly requesting only one URL might accidentally use caches in ways you aren’t expecting giving you unrepresentative results. (Our app does not currently use fragment caching, but caches you might not even be thinking about include postgres’s built-in automatic caches, or passenger’s automatic turbocache (which I don’t think we have turned on)).

My initial thought to get a list of such URLs from our already-in-production app from production logs, to get a sample of what real traffic looks like. There were a couple barriers for me to using production logs as URLs:

  1. Some of those URLs might require authentication, or be POST requests. The bulk of our app’s traffic is GET requests available without authentication, and I didn’t feel like the added complexity of setting up anything else in a load traffic was worthwhile.
  2. Our app on heroku isn’t fully functional yet. Without having connected it to a Solr or background job workers, only certain URLs are available.

In fact, a large portion of our traffic is an “item” or “work” detail page like this one. Additionally, those are the pages that can be the biggest performance challenge, since the current implementation includes a thumbnail for every scanned page or other image, so response time unfortunately scales with number of pages in an item.

So I decided a good list of URLs was simply a representative same of those “work detail” pages. In fact, rather than completely random sample, I took the 50 largest/slowest work pages, and then added in another 150 randomly chosen from our current ~8K pages. And gave them all a randomly shuffled order.

In our app, every time a browser requests a work detail page, the JS on that page makes an additional request for a JSON document that powers our page viewer. So for each of those 200 work detail pages, I added the JSON request URL, for a more “realistic” load, and 400 total URLs.

Performance: “base speed” vs “throughput under load”

Thinking about it, I realized there were two kinds of “performance” or “speed” to think about.

You might just have a really slow app, to exagerate let’s say typical responses are 5 seconds. That’s under low/no-traffic, a single browser is the only thing interacting with the app, it makes a single request, and has to wait 5 seconds for a response.

That number might be changed by optimizations or performance regressions in your code (including your dependencies). It might also be changed by moving or changing hardware or virtualization environment — including giving your database more CPU/RAM resources, etc.

But that number will not change by horizontally scaling your deployment — adding more puma or passenger processes or threads, scaling out hosts with a load balancer or heroku dynos. None of that will change this base speed because it’s just how long the app takes to prepare a response when not under load, how slow it is in a test only one web worker , where adding web workers won’t matter because they won’t be used.

Then there’s what happens to the app actually under load by multiple users at once. The base speed is kind of a lower bound on throughput under load — page response time is never going to get better than 5s for our hypothetical very slow app (without changing the underlying base speed). But it can get a lot worse if it’s hammered by traffic. This throughput under load can be effected not only by changing base speed, but also by various forms of horizontal scaling — how many puma or passenger processes you have with how many threads each, and how many CPUs they have access to, as well as number of heroku dynos or other hosts behind a load balancer.

(I had been thinking about this distinction already, but Nate Berkopec’s great blog post on scaling Rails apps gave me the “speed” vs “throughout” terminology to use).

For my condition, we are not changing the code at all. But we are changing the host architecture from a manual EC2 t2.medium to heroku dynos (of various possible types) in a way that could effect base speed, and we’re also changing our scaling architecture in a way that could change throughput under load on top of that — from one t2.medium with 10 passenger process to possibly multiple heroku dynos behind heroku’s load balancer, and also (for Reasons) switching from free passenger to trying puma with multiple threads per process. (we are running puma 5 with new experimental performance features turned on).

So we’ll want to get a sense of base speed of the various host choices, and also look at how throughput under load changes based on various choices.

Benchmarking tool: wrk

We’re going to use wrk.

There are LOTS of choices for HTTP benchmarking/load testing, with really varying complexity and from different eras of web history. I got a bit overwhelmed by it, but settled on wrk. Some other choices didn’t have all the features we need (some way to test a list of URLs, with at least some limited percentile distribution reporting). Others were much more flexible and complicated and I had trouble even figuring out how to use them!

wrk does need a custom lua script in order to handle a list of URLs. I found a nice script here, and modified it slightly to take filename from an ENV variable, and not randomly shuffle input list.

It’s a bit confusing understanding the meaning of “threads” vs “connections” in wrk arguments. This blog post from appfolio clears it up a bit. I decided to leave threads set to 1, and vary connections for load — so -c1 -t1 is a “one URL at a time” setting we can use to test “base speed”, and we can benchmark throughput under load by increasing connections.

We want to make sure we run the test for long enough to touch all 400 URLs in our list at least once, even in the slower setups, to have a good comparison — ideally it would be go through the list more than once, but for my own ergonomics I had to get through a lot of tests so ended up less tha ideal. (Should I have put fewer than 400 URLs in? Not sure).

Conclusions in advance

As benchmarking posts go (especially when I’m the one writing them), I’m about to drop a lot of words and data on you. So to maximize the audience that sees the conclusions (because they surprise me, and I want feedback/pushback on them), I’m going to give you some conclusions up front.

Our current infrastructure has web app on a single EC2 t2.medium, which is a burstable EC2 type — our relatively low-traffic app does not exhaust it’s burst credits. Measuring base speed (just one concurrent request at a time), we found that performance dynos seem to have about the CPU speed of a bursting t2.medium (just a hair slower).

But standard dynos are as a rule 2 to 3 times slower; additionally they are highly variable, and that variability can be over hours/days. A 3 minute period can have measured response times 2 or more times slower than another 3 minute period a couple hours later. But they seem to typically be 2-3x slower than our current infrastructure.

Under load, they scale about how you’d expect if you knew how many CPUs are present, no real surprises. Our existing t2.medium has two CPUs, so can handle 2 simultaneous requests as fast as 1, and after that degrades linearly.

A single performance-L ($500/month) has 4 CPUs (8 hyperthreads), so scales under load much better than our current infrastructure.

A single performance-M ($250/month) has only 1 CPU (!), so scales pretty terribly under load.

Testing scaling with 4 standard-2x’s ($200/month total), we see that it scales relatively evenly. Although lumpily because of variability, and it starts out so much worse performing that even as it scales “evenly” it’s still out-performed by all other arcchitectures. :( (At these relatively fast median response times you might say it’s still fast enough who cares, but in our fat tail of slower pages it gets more distressing).

Now we’ll give you lots of measurements, or you can skip all that to my summary discussion or conclusions for our own project at the end.

Let’s compare base speed

OK, let’s get to actual measurements! For “base speed” measurements, we’ll be telling wrk to use only one connection and one thread.

Existing t2.medium: base speed

Our current infrastructure is one EC2 t2.medium. This EC2 instance type has two vCPUs and 4GB of RAM. On that single EC2 instance, we run passenger (free not enterprise) set to have 10 passenger processes, although the base speed test with only one connection should only touch one of the workers. The t2 is a “burstable” type, and we do always have burst credits (this is not a high traffic app; verified we never exhausted burst credits in these tests), so our test load may be taking advantage of burst cpu.

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://[current staging server]
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   311.00ms  388.11ms   2.37s    86.45%
     Req/Sec    11.89      8.96    40.00     69.95%
   Latency Distribution
      50%   90.99ms
      75%  453.40ms
      90%  868.81ms
      99%    1.72s
   966 requests in 3.00m, 177.43MB read
 Requests/sec:      5.37
 Transfer/sec:      0.99MB

I’m actually feeling pretty good about those numbers on our current infrastructure! 90ms median, not bad, and even 453ms 75th percentile is not too bad. Now, our test load involves some JSON responses that are quicker to deliver than corresponding HTML page, but still pretty good. The 90th/99th/and max request (2.37s) aren’t great, but I knew I had some slow pages, this matches my previous understanding of how slow they are in our current infrastructure.

90th percentile is ~9 times 50th percenile.

I don’t have an understanding of why the two different Req/Sec and Requests/Sec values are so different, and don’t totally understand what to do with the Stdev and +/- Stdev values, so I’m just going to be sticking to looking at the latency percentiles, I think “latency” could also be called “response times” here.

But ok, this is our baseline for this workload. And doing this 3 minute test at various points over the past few days, I can say it’s nicely regular and consistent, occasionally I got a slower run, but 50th percentile was usually 90ms–105ms, right around there.

Heroku standard-2x: base speed

From previous mucking about, I learned I can only reliably fit one puma worker in a standard-1x, and heroku says “we typically recommend a minimum of 2 processes, if possible” (for routing algorithmic reasons when scaled to multiple dynos), so I am just starting at a standard-2x with two puma workers each with 5 threads, matching heroku recommendations for a standard-2x dyno.

So one thing I discovered is that bencharks from a heroku standard dyno are really variable, but here are typical ones:

$ heroku dyno:resize
 type     size         qty  cost/mo
 ───────  ───────────  ───  ───────
 web      Standard-2X  1    50

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   645.08ms  768.94ms   4.41s    85.52%
     Req/Sec     5.78      4.36    20.00     72.73%
   Latency Distribution
      50%  271.39ms
      75%  948.00ms
      90%    1.74s
      99%    3.50s
   427 requests in 3.00m, 74.51MB read
 Requests/sec:      2.37
 Transfer/sec:    423.67KB

I had heard that heroku standard dynos would have variable performance, because they are shared multi-tenant resources. I had been thinking of this like during a 3 minute test I might see around the same median with more standard deviation — but instead, what it looks like to me is that running this benchmark on Monday at 9am might give very different results than at 9:50am or Tuesday at 2pm. The variability is over a way longer timeframe than my 3 minute test — so that’s something learned.

Running this here and there over the past week, the above results seem to me typical of what I saw. (To get better than “seem typical” on this resource, you’d have to run a test, over several days or a week I think, probably not hammering the server the whole time, to get a sense of actual statistical distribution of the variability).

I sometimes saw tests that were quite a bit slower than this, up to a 500ms median. I rarely if ever saw results too much faster than this on a standard-2x. 90th percentile is ~6x median, less than my current infrastructure, but that still gets up there to 1.74 instead of 864ms.

This typical one is quite a bit slower than than our current infrastructure, our median response time is 3x the latency, with 90th and max being around 2x. This was worse than I expected.

Heroku performance-m: base speed

Although we might be able to fit more puma workers in RAM, we’re running a single-connection base speed test, so it shouldn’t matter to, and we won’t adjust it.

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-M  1    250

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   377.88ms  481.96ms   3.33s    86.57%
     Req/Sec    10.36      7.78    30.00     37.03%
   Latency Distribution
      50%  117.62ms
      75%  528.68ms
      90%    1.02s
      99%    2.19s
   793 requests in 3.00m, 145.70MB read
 Requests/sec:      4.40
 Transfer/sec:    828.70KB

This is a lot closer to the ballpark of our current infrastructure. It’s a bit slower (117ms median intead of 90ms median), but in running this now and then over the past week it was remarkably, thankfully, consistent. Median and 99th percentile are both 28% slower (makes me feel comforted that those numbers are the same in these two runs!), that doesn’t bother me so much if it’s predictable and regular, which it appears to be. The max appears to me still a little bit less regular on heroku for some reason, since performance is supposed to be non-shared AWS resources, you wouldn’t expect it to be, but slow requests are slow, ok.

90th percentile is ~9x median, about the same as my current infrastructure.

heroku performance-l: base speed

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-L  1    500

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS

URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   471.29ms  658.35ms   5.15s    87.98%
     Req/Sec    10.18      7.78    30.00     36.20%
   Latency Distribution
      50%  123.08ms
      75%  635.00ms
      90%    1.30s
      99%    2.86s
   704 requests in 3.00m, 130.43MB read
 Requests/sec:      3.91
 Transfer/sec:    741.94KB

No news is good news, it looks very much like performance-m, which is exactly what we expected, because this isn’t a load test. It tells us that performance-m and performance-l seem to have similar CPU speeds and similar predictable non-variable regularity, which is what I find running this test periodically over a week.

90th percentile is ~10x median, about the same as current infrastructure.

The higher Max speed is just evidence of what I mentioned, the speed of slowest request did seem to vary more than on our manual t2.medium, can’t really explain why.

Summary: Base speed

Not sure how helpful this visualization is, charting 50th, 75th, and 90th percentile responses across architectures.

But basically: performance dynos perform similarly to my (bursting) t2.medium. Can’t explain why performance-l seems slightly slower than performance-m, might be just incidental variation when I ran the tests.

The standard-2x is about twice as slow as my (bursting) t2.medium. Again recall standard-2x results varied a lot every time I ran them, the one I reported seems “typical” to me, that’s not super scientific, admittedly, but I’m confident that standard-2x are a lot slower in median response times than my current infrastructure.

Throughput under load

Ok, now we’re going to test using wrk to use more connections. In fact, I’ll test each setup with various number of connections, and graph the result, to get a sense of how each formation can handle throughput under load. (This means a lot of minutes to get all these results, at 3 minutes per number of connection test, per formation!).

An additional thing we can learn from this test, on heroku we can look at how much RAM is being used after a load test, to get a sense of the app’s RAM usage under traffic to understand the maximum number of puma workers we might be able to fit in a given dyno.

Existing t2.medium: Under load

A t2.medium has 4G of RAM and 2 CPUs. We run 10 passenger workers (no multi-threading, since we are free, rather than enterprise, passenger). So what do we expect? With 2 CPUs and more than 2 workers, I’d expect it to handle 2 simultaneous streams of requests almost as well as 1; 3-10 should be quite a bit slower because they are competing for the 2 CPUs. Over 10, performance will probably become catastrophic.

2 connections are exactly flat with 1, as expected for our two CPUs, hooray!

Then it goes up at a strikingly even line. Going over 10 (to 12) simultaneous connections doesn’t matter, even though we’ve exhausted our workers, I guess at this point there’s so much competition for the two CPUs already.

The slope of this curve is really nice too actually. Without load, our median response time is 100ms, but even at a totally overloaded 12 overloaded connections, it’s only 550ms, which actually isn’t too bad.

We can make a graph that in addition to median also has 75th, 90th, and 99th percentile response time on it:

It doesn’t tell us too much; it tells us the upper percentiles rise at about the same rate as the median. At 1 simultaneous connection 90th percentile of 846ms is about 9 times the median of 93ms; at 10 requests the 90th percentile of 3.6 seconds is about 8 times the median of 471ms.

This does remind us that under load when things get slow, this has more of a disastrous effect on already slow requests than fast requests. When not under load, even our 90th percentile was kind of sort of barley acceptable at 846ms, but under load at 3.6 seconds it really isn’t.

Single Standard-2X dyno: Under load

A standard-2X dyno has 1G of RAM. The (amazing, excellent, thanks schneems) heroku puma guide suggests running two puma workers with 5 threads each. At first I wanted to try running three workers, which seemed to fit into available RAM — but under heavy load-testing I was getting Heroku R14 Memory Quota Exceeded errors, so we’ll just stick with the heroku docs recommendations. Two workers with 5 threads each fit with plenty of headroom.

A standard-2x dyno is runs on shared (multi-tenant) underlying Amazon virtual hardware. So while it is running on hardware with 4 CPUs (each of which can run two “hyperthreads“), the puma doc suggests “it is best to assume only one process can execute at a time” on standard dynos.

What do we expect? Well, if it really only had one CPU, it would immediately start getting bad at 2 simulataneous connections, and just get worse from there. When we exceed the two worker count, will it get even worse? What about when we exceed the 10 thread (2 workers * 5 threads) count?

You’d never run just one dyno if you were expecting this much traffic, you’d always horizontally scale. This very artificial test is just to get a sense of it’s characteristics.

Also, we remember that standard-2x’s are just really variable; I could get much worse or better runs than this, but graphed numbers from a run that seemed typical.

Well, it really does act like 1 CPU, 2 simultaneous connections is immediately a lot worse than 1.

The line isn’t quite as straight as in our existing t2.medium, but it’s still pretty straight; I’d attribute the slight lumpiness to just the variability of shared-architecture standard dyno, and figure it would get perfectly straight with more data.

It degrades at about the same rate of our baseline t2.medium, but when you start out slower, that’s more disastrous. Our t2.medium at an overloaded 10 simultaneous requests is 473ms (pretty tolerable actually), 5 times the median at one request only. This standard-2x has a median response time of 273 ms at only one simultaneous request, and at an overloaded 10 requests has a median response time also about 5x worse, but that becomes a less tolerable 1480ms.

Does also graphing the 75th, 90th, and 99th percentile tell us much?

Eh, I think the lumpiness is still just standard shared-architecture variability.

The rate of “getting worse” as we add more overloaded connections is actually a bit better than it was on our t2.medium, but since it already starts out so much slower, we’ll just call it a wash. (On t2.medium, 90th percentile without load is 846ms and under an overloaded 10 connections 3.6s. On this single standard-2x, it’s 1.8s and 5.2s).

I’m not sure how much these charts with various percentiles on them tell us, I’ll not include them for every architecture hence.

standard-2x, 4 dynos: Under load

OK, realistically we already know you shouldn’t have just one standard-2x dyno under that kind of load. You’d scale out, either manually or perhaps using something like the neat Rails Autoscale add-on.

Let’s measure with 4 dynos. Each is still running 2 puma workers, with 5 threads each.

What do we expect? Hm, treating each dyno as if it has only one CPU, we’d expect it to be able to handle traffic pretty levelly up to 4 simultenous connections, distributed to 4 dynos. It’s going to do worse after that, but up to 8 there is still one puma worker per connection so it might get even worse after 8?

Well… I think that actually is relatively flat from 1 to 4 simultaneous connections, except for lumpiness from variability. But lumpiness from variability is huge! We’re talking 250ms median measured at 1 connection, up to 369ms measured median at 2, down to 274ms at 3.

And then maybe yeah, a fairly shallow slope up to 8 simutaneous connections than steeper.

But it’s all fairly shallow slope compared to our base t2.medium. At 8 connections (after which we pretty much max out), the standard-2x median of 464ms is only 1.8 times the median at 1 conection. Compared to the t2.median increase of 3.7 times.

As we’d expect, scaling out to 4 dynos (with four cpus/8 hyperthreads) helps us scale well — the problem is the baseline is so slow to begin (with very high bounds of variability making it regularly even slower).

performance-m: Under load

A performance-m has 2.5 GB of memory. It only has one physical CPU, although two “vCPUs” (two hyperthreads) — and these are all your apps, it is not shared.

By testing under load, I demonstrated I could actually fit 12 workers on there without any memory limit errors. But is there any point to doing that with only 1/2 CPUs? Under a bit of testing, it appeared not.

The heroku puma docs recommend only 2 processes with 5 threads. You could do a whole little mini-experiment just trying to measure/optimize process/thread count on performance-m! We’ve already got too much data here, but in some experimentation it looked to me like 5 processes with 2 threads each performed better (and certainly no worse) than 2 processes with 5 threads — if you’ve got the RAM just sitting there anyway (as we do), why not?

I actually tested with 6 puma processes with 2 threads each. There is still a large amount of RAM headroom we aren’t going to use even under load.

What do we expect? Well, with the 2 “hyperthreads” perhaps it can handle 2 simultaneous requests nearly as well as 1 (or not?); after that, we expect it to degrade quickly same as our original t2.medium did.

It an handle 2 connections slightly better than you’d expect if there really was only 1 CPU, so I guess a hyperthread does give you something. Then the slope picks up, as you’d expect; and it looks like it does get steeper after 4 simultaneous connections, yup.

performance-l: Under load

A performance-l ($500/month) costs twice as much as a performance-m ($250/month), but has far more than twice as much resources. performance-l has a whopping 14GB of RAM compared to performance-m’s 2.5GB; and performance-l has 4 real CPUs/hyperthreads available to use (visible using the nproc technique in the heroku puma article.

Because we have plenty of RAM to do so, we’re going to run 10 worker processes to match our original t2.medium’s. We still ran with 2 threads, just cause it seems like maybe you should never run a puma worker with only one thread? But who knows, maybe 10 workers with 1 thread each would perform better; plenty of room (but not plenty of my energy) for yet more experimentation.

What do we expect? The graph should be pretty flat up to 4 simultaneous connections, then it should start getting worse, pretty evenly as simultaneous connections rise all the way up to 12.

It is indeed pretty flat up to 4 simultaneous connections. Then up to 8 it’s still not too bad — median at 8 is only ~1.5 median at 1(!). Then it gets worse after 8 (oh yeah, 8 hyperthreads?).

But the slope is wonderfully shallow all the way. Even at 12 simultaneous connections, the median response time of 266ms is only 2.5x what it was at one connection. (In our original t2.medium, at 12 simultaneous connections median response time was over 5x what it was at 1 connection).

This thing is indeed a monster.

Summary Comparison: Under load

We showed a lot of graphs that look similar, but they all had different sclaes on the y-axis. Let’s plot median response times under load of all architectures on the same graph, and see what we’re really dealing with.

The blue t2.medium is our baseline, what we have now. We can see that there isn’t really a similar heroku option, we have our choice of better or worse.

The performance-l is just plain better than what we have now. It starts out performing about the same as what we have now for 1 or 2 simultaneous connections, but then scales so much flatter.

The performance-m also starts out about thesame, but sccales so much worse than even what we have now. (it’s that 1 real CPU instead of 2, I guess?).

The standard-2x scaled to 4 dynos… has it’s own characteristics. It’s baseline is pretty terrible, it’s 2 to 3 times as slow as what we have now even not under load. But then it scales pretty well, since it’s 4 dynos after all, it doesn’t get worse as fast as performance-m does. But it started out so bad, that it remains far worse than our original t2.medium even under load. Adding more dynos to standard-2x will help it remain steady under even higher load, but won’t help it’s underlying problem that it’s just slower than everyone else.

Discussion: Thoughts and Surprises

  • I had been thinking of a t2.medium (even with burst) as “typical” (it is after all much slower than my 2015 Macbook), and has been assuming (in retrospect with no particular basis) that a heroku standard dyno would perform similarly.
    • Most discussion and heroku docs, as well as the naming itself, suggest that a ‘standard’ dyno is, well, standard, and performance dynos are for “super scale, high traffic apps”, which is not me.
    • But in fact, heroku standard dynos are much slower and more variable in performance than a bursting t2.medium. I suspect they are slower than other options you might consider non-heroku “typical” options.

  • My conclusion is honestly that “standard” dynos are really “for very fast, well-optimized apps that can handle slow and variable CPU” and “performance” dynos are really “standard, matching the CPU speeds you’d get from a typical non-heroku option”. But this is not how they are documented or usually talked about. Are other people having really different experiences/conclusions than me? If so, why, or where have I gone wrong?
    • This of course has implications for estimating your heroku budget if considering switching over. :(
    • If you have a well-optimized fast app, say even 95th percentile is 200ms (on bursting t2.medium), then you can handle standard slowness — so what your 95th percentile is now 600ms (and during some time periods even much slower, 1s or worse, due to variability). That’s not so bad for a 95th percentile.
    • One way to get a very fast is of course caching. There is lots of discussion of using caching in Rails, sometimes the message (explicit or implicit) is “you have to use lots of caching to get reasonable performance cause Rails is so slow.” What if many of these people are on heroku, and it’s really you have to use lots of caching to get reasonable performance on heroku standard dyno??
    • I personally don’t think caching is maintenance free; in my experience properly doing cache invalidation and dealing with significant processing spikes needed when you choose to invalidate your entire cache (cause cached HTML needs to change) lead to real maintenance/development cost. I have not needed caching to meet my performance goals on present architecture.
    • Everyone doesn’t necessarily have the same performance goals/requirements. Mine of a low-traffic non-commercial site are are maybe more modest, I just need users not to be super annoyed. But whatever your performance goals, you’re going to have to spend more time on optimization on a heroku standard than something with much faster CPU — like a standard affordable mid-tier EC2. Am I wrong?

  • One significant factor on heroku standard dyno performance is that they use shared/multi-tenant infrastructure. I wonder if they’ve actually gotten lower performance over time, as many customers (who you may be sharing with) have gotten better at maximizing their utilization, so the shared CPUs are typically more busy? Like a frog boiling, maybe nobody noticed that standard dynos have become lower performance? I dunno, brainstorming.
    • Or maybe there are so many apps that start on heroku instead of switcching from somewhere else, that people just don’t realize that standard dynos are much slower than other low/mid-tier options?
    • I was expecting to pay a premium for heroku — but even standard-2x’s are a significant premium over paying for t2.medium EC2 yourself, one I found quite reasonable…. performance dynos are of course even more premium.

  • I had a sort of baked-in premise that most Rails apps are “IO-bound”, they spend more time waiting on IO than using CPU. I don’t know where I got that idea, I heard it once a long time ago and it became part of my mental model. I now do not believe this is true true of my app, and I do not in fact believe it is true of most Rails apps in 2020. I would hypothesize that most Rails apps today are in fact CPU-bound.

  • The performance-m dyno only has one CPU. I had somehow also been assuming that it would have two CPUs — I’m not sure why, maybe just because at that price! It would be a much better deal with two CPUs.
    • Instead we have a huge jump from $250 performance-m to $500 performance-l that has 4x the CPUs and ~5x the RAM.
    • So it doesn’t make financial sense to have more than one performance-m dyno, you might as well go to performance-l. But this really complicates auto-scaling, whether using Heroku’s feature , or the awesome Rails Autoscale add-on. I am not sure I can afford a performance-l all the time, and a performance-m might be sufficient most of the time. But if 20% of the time I’m going to need more (or even 5%, or even unexpectedly-mentioned-in-national-media), it would be nice to set things up to autoscale up…. I guess to financially irrational 2 or more performance-m’s? :(

  • The performance-l is a very big machine, that is significantly beefier than my current infrastructure. And has far more RAM than I need/can use with only 4 physical cores. If I consider standard dynos to be pretty effectively low tier (as I do), heroku to me is kind of missing mid-tier options. A 2 CPU option at 2.5G or 5G of RAM would make a lot of sense to me, and actually be exactly what I need… really I think performance-m would make more sense with 2 CPUs at it’s existing already-premium price point, and to be called a “performance” dyno. . Maybe heroku is intentionally trying set options to funnel people to the highest-priced performance-l.

Conclusion: What are we going to do?

In my investigations of heroku, my opinion of the developer UX and general service quality only increases. It’s a great product, that would increase our operational capacity and reliability, and substitute for so many person-hours of sysadmin/operational time if we were self-managing (even on cloud architecture like EC2).

But I had originally been figuring we’d use standard dynos (even more affordably, possibly auto-scaled with Rails Autoscale plugin), and am disappointed that they end up looking so much lower performance than our current infrastructure.

Could we use them anyway? Response time going from 100ms to 300ms — hey, 300ms is still fine, even if I’m sad to lose those really nice numbers I got from a bit of optimization. But this app has a wide long-tail ; our 75th percentile going from 450ms to 1s, our 90th percentile going from 860ms to 1.74s and our 99th going from 2.3s to 4.4s — a lot harder to swallow. Especially when we know that due to standard dyno variability, a slow-ish page that on my present architecture is reliably 1.5s, could really be anywhere from 3 to 9(!) on heroku.

I would anticipate having to spend a lot more developer time on optimization on heroku standard dynos — or, i this small over-burdened non-commercial shop, not prioritizing that (or not having the skills for it), and having our performance just get bad.

So I’m really reluctant to suggest moving our app to heroku with standard dynos.

A performance-l dyno is going to let us not have to think about performance any more than we do now, while scaling under high-traffic better than we do now — I suspect we’d never need to scale to more than one performance-l dyno. But it’s pricey for us.

A performance-m dyno has a base-speed that’s fine, but scales very poorly and unaffordably. Doesn’t handle an increase in load very well as one dyno, and to get more CPUs you have to pay far too much (especially compared to standard dynos I had been assuming I’d use).

So I don’t really like any of my options. If we do heroku, maybe we’ll try a performance-m, and “hope” our traffic is light enough that a single one will do? Maybe with Rails autoscale for traffic spikes, even though 2 performance-m dynos isn’t financially efficient? If we are scaling to 2 (or more!) performance-m’s more than very occasionally, switch to performance-l, which means we need to make sure we have the budget for it?

Deep Dive: Moving ruby projects from Travis to Github Actions for CI

So this is one of my super wordy posts, if that’s not your thing abort now, but some people like them. We’ll start with a bit of context, then get to some detailed looks at Github Actions features I used to replace my travis builds, with example config files and examination of options available.

For me, by “Continuous Integration” (CI), I mostly mean “Running automated tests automatically, on your code repo, as you develop”, on every PR and sometimes with scheduled runs. Other people may mean more expansive things by “CI”.

For a lot of us, our first experience with CI was when Travis-ci started to become well-known, maybe 8 years ago or so. Travis was free for open source, and so darn easy to set up and use — especially for Rails projects, it was a time when it still felt like most services focused on docs and smooth fit for ruby and Rails specifically. I had heard of doing CI, but as a developer in a very small and non-profit shop, I want to spend time writing code not setting up infrastructure, and would have had to get any for-cost service approved up the chain from our limited budget. But it felt like I could almost just flip a switch and have Travis on ruby or rails projects working — and for free!

Free for open source wasn’t entirely selfless, I think it’s part of what helped Travis literally define the market. (Btw, I think they were the first to invent the idea of a “badge” URL for a github readme?) Along with an amazing Developer UX (which is today still a paragon), it just gave you no reason not to use it. And then once using it, it started to seem insane to not have CI testing, nobody would ever again want to develop software without the build status on every PR before merge.

Travis really set a high bar for ease of use in a developer tool, you didn’t need to think about it much, it just did what you needed, and told you what you needed to know in it’s read-outs. I think it’s an impressive engineering product. But then.

End of an era

Travis will no longer be supporting open source projects with free CI.

The free open source travis projects originally ran on, with paid commercial projects on In May 2018, they announced they’d be unifying these on only, but with no announced plan that the policy for free open source would change. This migration seemed to proceed very slowly though.

Perhaps because it was part of preparing the company for a sale, in Jan 2019 it was announced private equity firm Idera had bought travis. At the time the announcement said “We will continue to maintain a free, hosted service for open source projects,” but knowing what “private equity” usually means, some were concerned for the future. (HN discussion).

While the FAQ on the migration to still says that should remain reliable until projects are fully migrated, in fact over the past few months projects largely stopped building, as travis apparently significantly reduced resources on the platform. Some people began manually migrating their free open source projects to where builds still worked. But, while the FAQ also still says “Will Travis CI be getting rid of free users? Travis CI will continue to offer a free tier for public or open-source repositories on” — in fact, travis announced that they are ending the free service for open source. The “free tier” is a limited trial (available not just to open source), and when it expires, you can pay, or apply to a special program for an extension, over and over again.

They are contradicting themselves enough that while I’m not sure exactly what is going to happen, but no longer trust them as a service.

Enter Github Actions

I work mostly on ruby and Rails projects. They are all open source, almost all of them use travis. So while (once moved to they are all currently working, it’s time to start moving them somewhere else, before I have dozens of projects with broken CI and still don’t know how to move them. And the new needs to be free — many of these projects are zero-budget old-school “volunteer” or “informal multi-institutional collaboration” open source.

There might be several other options, but the one I chose is Github Actions — my sense that it had gotten mature enough to start approaching travis level of polish, and all of my projects are github-hosted, and Github Actions is free for unlimited use for open source. (pricing page; Aug 2019 announcement of free for open source). And we are really fortunate that it became mature and stable in time for travis to withdraw open source support (if travis had been a year earlier, we’d be in trouble).

Github Actions is really powerful. It is built to do probably WAY MORE than travis does, definitely way beyond “automated testing” to various flows for deployment and artifact release, to really just about any kind of process for managing your project you want. The logic you can write almost unlimited, all running on github’s machines.

As a result though…. I found it a bit overwhelming to get started. The Github Actions docs are just overwhelmingly abstract, there is so much there, you can almost anything — but I don’t actually want to learn a new platform, I just want to get automated test CI for my ruby project working! There are some language/project speccific Guides available, for node.js, python, a few different Java setups — but not for ruby or Rails! My how Rails has fallen, from when most services like this would be focusing on Rails use cases first. :(

There are some third part guides available that might focus on ruby/rails, but one of the problems is that Actions has been evolving for a few years with some pivots, so it’s easy to find outdated instructions. One I found helpful orientation was this Drifting Ruby screencast. This screencast showed me there is a kind of limited web UI with integrated docs searcher — but i didn’t end up using it, I just created the text config file by hand, same as I would have for travis. Github provides templates for “ruby” or “ruby gem”, but the Drifting Ruby sccreencast said “these won’t really work for our ruby on rails application so we’ll have to set up one manually”, so that’s what I did too. ¯\_(ツ)_/¯

But the cost of all the power github Actions provides is… there are a lot more switches and dials to understand and get right (and maintain over time and across multiple projects). I’m not someone who likes copy-paste without understanding it, so I spent some time trying to understand the relevant options and alternatives; in the process I found some things I might have otherwise copy-pasted from other people’s examples that could be improved. So I give you the results of my investigations, to hopefully save you some time, if wordy comprehensive reports are up your alley.

A Simple Test Workflow: ruby gem, test with multiple ruby versions

Here’s a file for a fairly simple test workflow. You can see it’s in the repo at .github/workflows. The name of the file doesn’t matter — while this one is called ruby.yml, i’ve since moved over to naming the file to match the name: key in the workflow for easier traceability, so I would have called it ci.yml instead.


You can see we say that this workflow should be run on any push to master branch, and also for any pull_request at all. Many other examples I’ve seen define pull_request: branches: ["main"], which seems to mean only run on Pull Requests with main as the base. While that’s most of my PR’s, if there is ever a PR that uses another branch as a base for whatever reason, I still want to run CI! While hypothetically you should be able leave branches out to mean “any branch”, I only got it to work by explicitly saying branches: ["**"]


For this gem, we want to run CI on multiple ruby versions. You can see we define them here. This works similarly to travis matrixes. If you have more than one matrix variable defined, the workflow will run for every combination of variables (hence the name “matrix”).

        ruby: [ '2.4.4', '2.5.1', '2.6.1', '2.7.0', 'jruby-', 'jruby-' ]

In a given run, the current value of the matrix variables is available in github actions “context”, which you can acccess as eg ${{ matrix.ruby }}. You can see how I use that in the name, so that the job will show up with it’s ruby version in it.

    name: Ruby ${{ matrix.ruby }}

Ruby install

While Github itself provides an action for ruby install, it seems most people are using this third-party action. Which we reference as `ruby/setup-ruby@v1`.

You can see we use the matrix.ruby context to tell the setup-ruby action what version of ruby to install, which works because our matrix values are the correct values recognized by the action. Which are documented in the README, but note that values like jruby-head are also supported.

Note, although it isn’t clearly documented, you can say 2.4 to mean “latest available 2.4.x” (rather than it meaning “2.4.0”), which is hugely useful, and I’ve switched to doing that. I don’t believe that was available via travis/rvm ruby install feature.

For a project that isn’t testing under multiple rubies, if we left out the with: ruby-version, the action will conveniently use a .ruby-version file present in the repo.

Note you don’t need to put a gem install bundler into your workflow yourself, while I’m not sure it’s clearly documented, I found the ruby/setup-ruby action would do this for you (installing the latest available bundler, instead of using whatever was packaged with ruby version), btw regardless of whether you are using the bundler-cache feature (see below).

Note on How Matrix Jobs Show Up to Github

With travis, testing for multiple ruby or rails versions with a matrix, we got one (or, well, actually two) jobs showing up on the Github PR:

Each of those lines summaries a collection of matrix jobs (eg different ruby versions). If any of the individual jobs without the matrix failed, the whole build would show up as failed. Success or failure, you could click on “Details” to see each job and it’s status:

I thought this worked pretty well — especially for “green” builds I really don’t need to see the details on the PR, the summary is great, and if I want to see the details I can click through, great.

With Github Actions, each matrix job shows up directly on the PR. If you have a large matrix, it can be… a lot. Some of my projects have way more than 6. On PR:

Maybe it’s just because I was used to it, but I preferred the Travis way. (This also makes me think maybe I should change the name key in my workflow to say eg CI: Ruby 2.4.4 to be more clear? Oops, tried that, it just looks even weirder in other GH contexts, not sure.)

Oh, also, that travis way of doing the build twice, once for “pr” and once for “push”? Github Actions doesn’t seem to do that, it just does one, I think corresponding to travis “push”. While the travis feature seemed technically smart, I’m not sure I ever actually saw one of these builds pass while the other failed in any of my projects, I probably won’t miss it.


Did you have a README badge for travis? Don’t forget to swap it for equivalent in Github Actions.

The image url looks like:$OWNER/$REPOSITORY/workflows/$WORKFLOW_NAME/badge.svg?branch=master, where $WORKFLOW_NAME of course has to be URL-escaped if it ocntains spaces etc.

The github page at, if you select a particular workflow/branch, does, like travis, give you a badge URL/markdown you can copy/paste if you click on the three-dots and then “Create status badge”. Unlike travis, what it gives you to copy/paste is just image markdown, it doesn’t include a link.

But I definitely want the badge to link to viewing the results of the last build in the UI. So I do it manually. Limit to the speciifc workflow and branch that you made the badge for in the UI then just copy and paste the URL from the browser. A bit confusing markdown to construct manually, here’s what it ended up looking like for me:

I copy and paste that from an existing project when I need it in a new one. :shrug:

Require CI to merge PR?

However, that difference in how jobs show up to Github, the way each matrix job shows up separately now, has an even more negative impact on requiring CI success to merge a PR.

If you want to require that CI passes before merging a PR, you configure that at under “Branch protection rules”.When you click “Add Rule”, you can/must choose WHICH jobs are “required”.

For travis, that’d be those two “master” jobs, but for the new system, every matrix job shows up separately — in fact, if you’ve been messing with job names trying to get it right as I have, you have any job name that was ever used in the last 7 days, and they don’t have the Github workflow name appended to them or anything (another reason to put github workflow name in the job name?).

But the really problematic part is that if you edit your list of jobs in the matrix — adding or removing ruby versions as one does, or even just changing the name that shows up for a job — you have to go back to this screen to add or remove jobs as a “required status check”.

That seems really unworkable to me, I’m not sure how it hasn’t been a major problem already for users. It would be better if we could configure “all the checks in the WORKFLOW, whatever they may be”, or perhaps best of all if we could configure a check as required in the workflow YML file, the same place we’re defining it, just a required_before_merge key you could set to true or use a matrix context to define or whatever.

I’m currently not requiring status checks for merge on most of my projects (even though i did with travis), because I was finding it unmanageable to keep the job names sync’d, especially as I get used to Github Actions and kept tweaking things in a way that would change job names. So that’s a bit annoying.

fail_fast: false

By default, if one of the matrix jobs fails, Github Acitons will cancel all remaining jobs, not bother to run them at all. After all, you know the build is going to fail if one job fails, what do you need those others for?

Well, for my use case, it is pretty annoying to be told, say, “Job for ruby 2.7.0 failed, we can’t tell you whether the other ruby versions would have passed or failed or not” — the first thing I want to know is if failed on all ruby versions or just 2.7.0, so now I’d have to spend extra time figuring that out manually? No thanks.

So I set `fail_fast: false` on all of my workflows, to disable this behavior.

Note that travis had a similar (opt-in) fast_finish feature, which worked subtly different: Travis would report failure to Github on first failure (and notify, I think), but would actually keep running all jobs. So when I saw a failure, I could click through to ‘details’ to see which (eg) ruby versions passed, from the whole matrix. This does work for me, so I’d chose to opt-in to that travis feature. Unfortunately, the Github Actions subtle difference in effect makes it not desirable to me.

Note You may see some people referencing a Github Actions continue-on-error feature. I found the docs confusing, but after experimentation what this really does is mark a job as successful even when it fails. It shows up in all GH UI as succeeeded even when it failed, the only way to know it failed would be to click through to the actual build log to see failure in the logged console. I think “continue on error” is a weird name for this; it is not useful to me with regard to fine-tuning fail-fast; or honestly in any other use case I can think of that I have.

Bundle cache?

bundle install can take 60+ seconds, and be a significant drag on your build (not to mention a lot of load on rubygems servers from all these builds). So when travis introduced a feature to cache: bundler: true, it was very popular.

True to form, Github Actions gives you a generic caching feature you can try to configure for your particular case (npm, bundler, whatever), instead of an out of the box feature “just do the right thing you for bundler, you figure it out”.

The ruby/setup-ruby third-party action has a built-in feature to cache bundler installs for you, but I found that it does not work right if you do not have a Gemfile.lock checked into the repo. (Ie, for most any gem, rather than app, project). It will end up re-using cached dependencies even if there are new releases of some of your dependencies, which is a big problem for how I use CI for a gem — I expect it to always be building with latest releases of dependencies, so I can find out of one breaks the build. This may get fixed in the action.

If you have an app (rather than gem) with a Gemfile.lock checked into repo, the bundler-cache: true feature should be just fine.

Otherwise, Github has some suggestions for using the generic cache feature for ruby bundler (search for “ruby – bundler” on this page) — but I actually don’t believe they will work right without a Gemfile.lock checked into the repo either.

Starting from that example, and using the restore-keys feature, I think it should be possible to design a use that works much like travis’s bundler cache did, and works fine without a checked-in Gemfile.lock. We’d want it to use a cache from the most recent previous (similar job), and then run bundle install anyway, and then cache the results again at the end always to be available for the next run.

But I haven’t had time to work that out, so for now my gem builds are simply not using bundler caching. (my gem builds tend to take around 60 seconds to do a bundle install, so that’s in every build now, could be worse).

update nov 27: The ruby/ruby-setup action should be fixed to properly cache-bust when you don’t have a Gemfile.lock checked in. If you are using a matrix for ruby version, as below, you must set the ruby version by setting the BUNDLE_GEMFILE env variable rather than the way we did it below, and there is is a certain way Github Action requires/provides you do that, it’s not just export. See the issue in ruby/ruby-setup project.

Notifications: Not great

Travis has really nice defaults for notifications: The person submitting the PR would get an email generally only on status changes (from pass to fail or fail to pass) rather than on every build. And travis would even figure out what email to send to based on what email you used in your git commits. (Originally perhaps a workaround to lack of Github API at travis’ origin, I found it a nice feature). And then travis has sophisticated notification customization available on a per-repo basis.

Github notifications are unfortunately much more basic and limited. The only notification settings avaialable are for your entire account at, “GitHub Actions”. So they apply to all github workflows in all projects, there are no workflow- or project-specific settings. You can set to receive notification via web push or email or both or neither. You can receive notifications for all builds or only failed builds. That’s it.

The author of a PR is the one who receives the notifications, same as in travis. You will get notifications for every single build, even repeated successes or failures in a series.

I’m not super happy with the notification options. I may end up just turning off Github Actions notifications entirely for my account.

Hypothetically, someone could probably write a custom Github action to give you notifications exactly how travis offered — after all, travis was using public GH API that should be available to any other author, and I think should be usable from within an action. But when I started to think through it, while it seemed an interesting project, I realized it was definitely beyond the “spare hobby time” I was inclined to give to it at present, especially not being much of a JS developer (the language of custom GH actions, generally). (While you can list third-party actions on the github “marketplace”, I don’t think there’s a way to charge for them). .

There are custom third-party actions available to do things like notify slack for build completion; I haven’t looked too much into any of them, beyond seeing that I didn’t see any that would be “like travis defaults”.

A more complicated gem: postgres, and Rails matrix

Let’s move to a different example workflow file, in a different gem. You can see I called this one ci.yml, matching it’s name: CI, to have less friction for a developer (including future me) trying to figure out what’s going on.

This gem does have rails as a dependency and does test against it, but isn’t actually a Rails engine as it happens. It also needs to test against Postgres, not just sqlite3.

Scheduled Builds

At one point travis introduced a feature for scheduling (eg) weekly builds even when no PR/commit had been made. I enthusiastically adopted this for my gem projects. Why?

Gem releases are meant to work on a variety of different ruby versions and different exact versions of dependencies (including Rails). Sometimes a new release of ruby or rails will break the build, and you want to know about that and fix it. With CI builds happening only on new code, you find out about this with some random new code that is unlikely to be related to the failure; and you only find out about it on the next “new code” that triggers a build after a dependency release, which on some mature and stable gems could be a long time after the actual dependency release that broke it.

So scheduled builds for gems! (I have no purpose for scheduled test runs on apps).

Github Actions does have this feature. Hooray. One problem is that you will receive no notification of the result of the scheduled build, success or failure. :( I suppose you could include a third-party action to notify a fixed email address or Slack or something else; not sure how you’d configure that to apply only to the scheduled builds and not the commit/PR-triggered builds if that’s what you wanted. (Or make an custom action to file a GH issue on failure??? But make sure it doesn’t spam you with issues on repeated failures). I haven’t had the time to investigate this yet.

Also oops just noticed this: “In a public repository, scheduled workflows are automatically disabled when no repository activity has occurred in 60 days.” Which poses some challenges for relying on scheduled builds to make sure a stable slow-moving gem isn’t broken by dependency updates. I definitely am committer on gems that are still in wide use and can go 6-12+ months without a commit, because they are mature/done.

I still have it configured in my workflow; I guess even without notifications it will effect the “badge” on the README, and… maybe i’ll notice? Very far from ideal, work in progress. :(

Rails Matrix

OK, this one needs to test against various ruby versions AND various Rails versions. A while ago I realized that an actual matrix of every ruby combined with every rails was far too many builds. Fortunately, Github Actions supports the same kind of matrix/include syntax as travis, which I use.

          - gemfile: rails_5_0
            ruby: 2.4

          - gemfile: rails_6_0
            ruby: 2.7

I use the appraisal gem to handle setting up testing under multiple rails versions, which I highly recommend. You could use it for testing variant versions of any dependencies, I use it mostly for varying Rails. Appraisal results in a separate Gemfile committed to your repo for each (in my case) rails version, eg ./gemfiles/rails_5_0.gemfile. So those values I use for my gemfile matrix key are actually portions of the Gemfile path I’m going to want to use for each job.

Then we just need to tell bundler, in a given matrix job, to use the gemfile we specified in the matrix. The old-school way to do this is with the BUNDLE_GEMFILE environmental variable, but I found it error-prone to make sure it stayed consistently set in each workflow step. I found that the newer (although not that new!) bundle config set gemfile worked swimmingly! I just set it before the bundle install, it stays set for the rest of the run including the actual test run.

    # [...]
    - name: Bundle install
      run: |
        bundle config set gemfile "${GITHUB_WORKSPACE}/gemfiles/${{ matrix.gemfile }}.gemfile"
        bundle install --jobs 4 --retry 3

Note that single braces are used for ordinary bash syntax to reference the ENV variable ${GITHUB_WORKSPACE}, but double braces for the github actions context value interpolation ${{ matrix.gemfile }}.

Works great! Oh, note how we set the name of the job to include both ruby and rails matrix values, important for it showing up legibly in Github UI: name: ${{ matrix.gemfile }}, ruby ${{ matrix.ruby }}. Because of how we constructed our gemfile matrix, that shows up with job names rails_5_0, ruby 2.7.

Still not using bundler caching in this workflow. As before, we’re concerned about the ruby/setup-ruby built-in bundler-cache feature not working as desired without a Gemfile.lock in the repo. This time, I’m also not sure how to get that feature to play nicely with the variant gemfiles and bundle config set gemfile. Github Actions makes you put together a lot more pieces together yourself compared to travis, there are still things I just postponed figuring out for now.

update jan 11: the ruby/setup-ruby action now includes a ruby version matrix example in it’s README. It does require you use the BUNDLE_GEMFILE env variable, rather than the bundle config set gemfile command I used here. This should ordinarily be fine, but is something to watch out for in case other instructions you are following tries to use bundle config set gemfile instead, for reasons or not.


This project needs to build against a real postgres. That is relatively easy to set up in Github Actions.

Postgres normally by default allows connections on localhost without a username/password set, and my past builds (in travis or locally) took advantage of this to not bother setting one, which then the app didn’t have to know about. But the postgres image used for Github Actions doesn’t allow this, you have to set a username/password. So the section of the workflow that sets up postgres looks like:

         image: postgres:9.4
           POSTGRES_USER: postgres
           POSTGRES_PASSWORD: postgres
         ports: ['5432:5432']

5432 is the default postgres port, we need to set it and map it so it will be available as expected. Note you also can specify whatever version of postgres you want, this one is intentionally testing on one a bit old.

OK now our Rails app that will be executed under rspec needs to know that username and password to use in it’s postgres connection; when before it connected without a username/password. That env under the postgres service image is not actually available to the job steps. I didn’t find any way to DRY the username/password in one place, I had to repeat it in another env block, which I put at the top level of the workflow so it would apply to all steps.

And then I had to alter my database.yml to use those ENV variables, in the test environment. On a local dev machine, if your postgres doens’t have a username/password requirement and you don’t set the ENV variables, it keeps working as before.

I also needed to add host: localhost to the database.yml; before, the absence of the host key meant it used a unix-domain socket (filesystem-located) to connect to postgres, but that won’t work in the Github Actions containerized environment.

Note, there are things you might see in other examples that I don’t believe you need:

  • No need for an apt-get of pg dev libraries. I think everything you need is on the default GH Actions images now.
  • Some examples I’ve seen do a thing with options: --health-cmd pg_isready, my builds seem to be working just fine without it, and less code is less code to maintain.


In travis, I took advantage of the travis allow_failures key in most of my gems.

Why? I am testing against various ruby and Rails versions; I want to test against *future* (pre-release, edge) ruby and rails versions, cause its useful to know if I’m already with no effort passing on them, and I’d like to keep passing on them — but I don’t want to mandate it, or prevent PR merges if the build fails on a pre-release dependency. (After all, it could very well be a bug in the dependency too!)

There is no great equivalent to allow_failures in Github Actions. (Note again, continue_on_error just makes failed jobs look identical to successful jobs, and isn’t very helpful here).

I investigated some alternatives, which I may go into more detail on in a future post, but on one project I am trying a separate workflow just for “future ruby/rails allowed failures” which only checks master commits (not PRs), and has a separate badge on README (which is actually pretty nice for advertising to potential users “Yeah, we ALREADY work on rails edge/6.1.rc1!”). Main downside there is having to copy/paste synchronize what’s really the same workflow in two files.

A Rails app

I have many more number of projects I’m a committer on that are gems, but I spend more of my time on apps, one app in specific.

So here’s an example Github Actions CI workflow for a Rails app.

It mostly remixes the features we’ve already seen. It doesn’t need any matrix. It does need a postgres.

It does need some “OS-level” dependencies — the app does some shell-out to media utilities like vips and ffmpeg, and there are integration tests that utilize this. Easy enough to just install those with apt-get, works swimmingly.

        - name: Install apt dependencies
          run: |
            sudo apt-get -y install libvips-tools ffmpeg mediainfo

Update 25 Nov: My apt-get that worked for a couple weeks started failing for some reason on trying to install a libpulse0 dependency of one of those packages, the solution was doing a sudo apt-get update before the sudo apt-get install. I guess this is always good practice? (That forum post also uses apt install and apt update instead of apt-get install and apt-get update, that I can’t tell you much about, I’m really not a linux admin).

In addition to the bundle install, a modern Rails app using webpacker needs yarn install. This just worked for me — no need to include lines for installing npm itself or yarn or any yarn dependencies, although some examples I find online have them. (My yarn installs seem to happen in ~20 seconds, so I’m not motivated to try to figure out caching for yarn).

And we need to create the test database in the postgres, which I do with RAILS_ENV=test bundle exec rails db:create — typical Rails test setup will then automatically run migrations if needed. There might be other (better?) ways to prep the database, but I was having trouble getting rake db:prepare to work, and didn’t spend the time to debug it, just went with something that worked.

    - name: Set up app
       run: |
         RAILS_ENV=test bundle exec rails db:create
         yarn install

Rails test setup usually ends up running migrations automatically is why I think this worked alone, but you could also throw in a RAILS_ENV=test bundle exec rake db:schema:load if you wanted.

Under travis I had to install chrome with addons: chrome: stable to have it available to use with capybara via the webdrivers gem. No need for installing chrome in Github Actions, some (recent-ish?) version of it is already there as part of the standard Github Actions build image.

In this workflow, you can also see a custom use of the github “cache” action to cache a Solr install that the test setup automatically downloads and sets up. In this case the cache doesn’t actually save us any build time, but is kinder on the apache foundation servers we are downloading from with every build otherwise (and have gotten throttled from in the past).


Github Aciton sis a really impressively powerful product. And it’s totally going to work to replace travis for me.

It’s also probably going to take more of my time to maintain. The trade-off of more power/flexibility and focusing on almost limitless use cases is more things th eindividual project has to get right for their use case. For instance figuring out the right configuration to get caching for bundler or yarn right, instead of just writing cache: { yarn: true, bundler: true}. And when you have to figure it out yourself, you can get it wrong, which when you are working on many projects at once means you have a bunch of places to fix.

The amazingness of third-party action “marketplace” means you have to figure out the right action to use (the third-party ruby/setup-ruby instead of the vendor’s actions/setup-ruby), and again if you change your mind about that you have a bunch of projects to update.

Anyway, it is what it is — and I’m grateful to have such a powerful and in fact relatively easy to use service available for free! I could not really live without CI anymore, and won’t have to!

Oh, and Github Actions is giving me way more (free) simultaneous parallel workers than travis ever did, for my many-job builds!

Unexpected performance characteristics when exploring migrating a Rails app to Heroku

I work at a small non-profit research institute. I work on a Rails app that is a “digital collections” or “digital asset management” app. Basically it manages and provides access (public as well as internal) to lots of files and description about those files, mostly images.

It’s currently deployed on some self-managed Amazon EC2 instances (one for web, one for bg workers, one in which postgres is installed, etc). It gets pretty low-traffic in general web/ecommerce/Rails terms. The app is definitely not very optimized — we know it’s kind of a RAM hog, we know it has many actions whose response time is undesirable. But it works “good enough” on it’s current infrastructure for current use, such that optimizing it hasn’t been the highest priority.

We are considering moving it from self-managed EC2 to heroku, largely because we don’t really have the capacity to manage the infrastructure we currently have, especially after some recent layoffs.

Our Rails app is currently served by passenger on an EC2 t2.medium (4G of RAM).

I expected the performance characteristics moving to heroku “standard” dynos would be about the same as they are on our current infrastructure. But was surprised to see some degradation:

  • Responses seem much slower to come back when deployed, mainly for our slowest actions. Quick actions are just as quick on heroku, but slower ones (or perhaps actions that involve more memory allocations?) are much slower on heroku.
  • The application instances seem to take more RAM running on heroku dynos than they do on our EC2 (this one in particular mystifies me).

I am curious if anyone with more heroku experience has any insight into what’s going on here. I know how to do profiling and performance optimization (I’m more comfortable with profiling CPU time with ruby-prof than I am with trying to profile memory allocations with say derailed_benchmarks). But it’s difficult work, and I wasn’t expecting to have to do more of it as part of a migration to heroku, when performance characteristics were acceptable on our current infrastructure.

Response Times (CPU)

Again, yep, know these are fairly slow response times. But they are “good enough” on current infrastruture (EC2 t2.medium), wasn’t expecting them to get worse on heroku (standard-1x dyno, backed by heroku pg standard-0 ).

Fast pages are about the same, but slow pages (that create a lot of objects in memory?) are a lot slower.

This is not load testing, I am not testing under high traffic or for current requests. This is just accessing demo versions of the app manually one page a time, to see response times when the app is only handling one response at a time. So it’s not about how many web workers are running or fit into RAM or anything; one is sufficient.

ActionExisting EC2 t2.mediumHeroku standard-1x dyno
Slow reporting page that does a few very expensive SQL queries, but they do not return a lot of objects. Rails logging reports: Allocations: 8704~3800ms~3200ms (faster pg?)
Fast page with a few AR/SQL queries returning just a few objects each, a few partials, etc. Rails logging reports: Allocations: 820581-120ms~120ms
A fairly small “item” page, Rails logging reports: Allocations: 40210~200ms~300ms
A medium size item page, loads a lot more AR models, has a larger byte size page response. Allocations: 361292~430ms600-700ms
One of our largest pages, fetches a lot of AR instances, does a lot of allocations, returns a very large page response. Allocations: 19837333000-4000ms5000-7000ms

Fast-ish responses (and from this limited sample, actually responses with few allocations even if slow waiting on IO?) are about the same. But our slowest/highest allocating actions are ~50% slower on heroku? Again, I know these allocations and response times are not great even on our existing infrastructure; but why do they get so much worse on heroku? (No, there were no heroku memory errors or swapping happening).

RAM use of an app instance

We currently deploy with passenger (free), running 10 workers on our 4GB t2.medium.

To compare apples to apples, deployed using passenger on a heroku standard-1x. Just one worker instance (because that’s actually all I can fit on a standard-1x!), to compare size of a single worker from one infrastructure to the other.

On our legacy infrastructure, on a server that’s been up for 8 days of production traffic, passenger-status looks something like this:

  Requests in queue: 0
  * PID: 18187   Sessions: 0       Processed: 1074398   Uptime: 8d 23h 32m 12s
    CPU: 7%      Memory  : 340M    Last used: 1s
  * PID: 18206   Sessions: 0       Processed: 78200   Uptime: 8d 23h 32m 12s
    CPU: 0%      Memory  : 281M    Last used: 22s
  * PID: 18225   Sessions: 0       Processed: 2951    Uptime: 8d 23h 32m 12s
    CPU: 0%      Memory  : 197M    Last used: 8m 8
  * PID: 18244   Sessions: 0       Processed: 258     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 161M    Last used: 1h 2
  * PID: 18261   Sessions: 0       Processed: 127     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 158M    Last used: 1h 2
  * PID: 18278   Sessions: 0       Processed: 105     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 169M    Last used: 3h 2
  * PID: 18295   Sessions: 0       Processed: 96      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 163M    Last used: 3h 2
  * PID: 18312   Sessions: 0       Processed: 91      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 169M    Last used: 13h
  * PID: 18329   Sessions: 0       Processed: 92      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 163M    Last used: 13h
  * PID: 18346   Sessions: 0       Processed: 80      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 162M    Last used: 13h

We can see, yeah, this app is low traffic, most of those workers don’t see a lot of use. The first worker, which has handled by far the most traffic has a Private RSS of 340M. (Other workers having handled fewer requests much slimmer). Kind of overweight, not sure where all that RAM is going, but it is what it is. I could maybe hope to barely fit 3 workers on a heroku standard-2 (1024M) instance, if these sizes were the same on Heroku.

This is after a week of production use — if I restart passenger on a staging server, and manually access some of my largest, hungriest, most-allocating pages a few times, I can only see Private RSS use of like 270MB.

However, on the heroku standard-1x, with one passenger worker, using the heroku log-runtime-metrics feature to look at memory… private RSS is I believe what should correspond to passenger’s report, and what heroku uses for memory capacity limiting…

Immediately after restarting my app, it’s at sample#memory_total=184.57MB sample#memory_rss=126.04MB. After manually accessing a few of my “hungriest” actions, I see: sample#memory_total=511.36MB sample#memory_rss=453.24MB . Just a few manual requests not a week of production traffic, and 33% more RAM than on my legacy EC2 infrastructure after a week of production traffic. Actually approaching the limits of what can fit in a standard-1x (512MB) dyno as just one worker.

Now, is heroku’s memory measurements being done differently than passenger-status does them? Possibly. It would be nice to compare apples to apples, and passenger hypothetically has a service that would let you access passenger-status results from heroku… but unfortunately I have been unable to get it to work. (Ideas welcome).

Other variations tried on heroku

Trying the heroku gaffneyc/jemalloc build-pack with heroku config:set JEMALLOC_ENABLED=true (still with passenger, one worker instance) doesn’t seem to have made any significant differences, maybe 5% RAM savings or maybe it’s a just a fluke.

Switching to puma (puma5 with the experimental possibly memory-saving features turned on; just one worker with one thread), doesn’t make any difference in response time performance (none expected), but… maybe does reduce RAM usage somehow? After a few sample requests of some of my hungriest pages, I see sample#memory_total=428.11MB sample#memory_rss=371.88MB, still more than my baseline, but not drastically so. (with or without jemalloc buildpack seems to make no difference). Odd.

So what should I conclude?

I know this app could use a fitness regime; but it performs acceptably on current infrastructure.

We are exploring heroku because of staffing capacity issues, hoping to not to have to do so much ops. But if we trade ops for having to spend much time on challenging (not really suitable for junior dev) performance optimization…. that’s not what we were hoping for!

But perhaps I don’t know what I’m doing, and this haphapzard anecdotal comparison is not actually data and I shoudn’t conclude much from it? Let me know, ideally with advice of how to do it better?

Or… are there reasons to expect different performance chracteristics from heroku? Might it be running on underlying AWS infrastructure that has less resources than my t2.medium?

Or, starting to make guess hypotheses, maybe the fact that heroku standard tier does not run on “dedicated” compute resources means I should expect a lot more variance compared to my own t2.medium, and as a result when deploying on heroku you need to optimize more (so the worst case of variance isn’t so bad) than when running on your own EC? That’s maybe just part of what you get with heroku, unless paying for performance dynos, it is even more important to have an good performing app? (yeah, I know I could use more caching, but that of course brings it’s own complexities, I wasn’t expecting to have to add it in as part of a heroku migration).

Or… I find it odd that it seems like slower (or more allocating?) actions are the ones that are worse. Is there any reason that memory allocations would be even more expensive on a heroku standard dyno than on my own EC2 t2.medium?

And why would the app workers seem to use so much more RAM on heroku than on my own EC2 anyway?

Any feedback or ideas welcome!

faster_s3_url: Optimized S3 url generation in ruby

Subsequent to my previous investigation about S3 URL generation performance, I ended up writing a gem with optimized implementations of S3 URL generation.

github: faster_s3_url

It has no dependencies (not even aws-sdk). It can speed up both public and presigned URL generation by around an order of magnitude. In benchmarks on my 2015 MacBook compared to aws-sdk-s3: public URLs from 180 in 10ms to 2200 in 10ms; presigned URLs from 10 in 10ms to 300 in 10ms (!!).

While if you are only generating a couple S3 URLs at a time you probably wouldn’t notice aws-sdk-ruby’s poor performance, if you are generating even just hundreds at a time, and especially for presigned URLs, it can really make a difference.

faster_s3_url supports the full API for aws-sdk-s3 presigned URLs , including custom params like response_content_disposition. It’s tests actually test that results match what aws-sdk-s3 would generate.

For shrine users, faster_s3_url includes a Shrine storage sub-class that can be drop-in replacement of Shrine::Storage::S3 to just have all your S3 URL generations via shrine be using the optimized implementation.

Key in giving me the confidence to think I could pull off an independent S3 presigned URL implementation was WeTransfer’s wt_s3_signer gem be succesful. wt_s3_signer makes some assumptions/restrictions to get even higher performance than faster_s3_url (two or three times as fast) — but the restrictions/assumptions and API to get that performance weren’t suitable for use cases, so I implemented my own.

Delete all S3 key versions with ruby AWS SDK v3

If your S3 bucket is versioned, then deleting an object from s3 will leave a previous version there, as a sort of undo history. You may have a “noncurrent expiration lifecycle policy” set which will delete the old versions after so many days, but within that window, they are there.

What if you were deleting something that accidentally included some kind of sensitive or confidential information, and you really want it gone?

To make matters worse, if your bucket is public, the version is public too, and can be requested by an unauthenticated user that has the URL including a versionID, with a URL that looks something like: To be fair, it would be pretty hard to “guess” this versionID! But if it’s really sensitive info, that might not be good enough.

It was a bit tricky for me to figure out how to do this with the latest version of ruby SDK (as I write, “v3“, googling sometimes gave old versions).

It turns out you need to first retrieve a list of all versions with bucket.object_versions . With no arg, that will return ALL the versions in the bucket, which could be a lot to retrieve, not what you want when focused on just deleting certain things.

If you wanted to delete all versions in a certain “directory”, that’s actually easiest of all:

s3_bucket.object_versions(prefix: "path/to/").batch_delete!

But what if you want to delete all versions from a specific key? As far as I can tell, this is trickier than it should be.

# danger! This may delete more than you wanted
s3_bucket.object_versions(prefix: "path/to/something.doc").batch_delete!

Because of how S3 “paths” (which are really just prefixes) work, that will ALSO delete all versions for path/to/something.doc2 or path/to/something.docdocdoc etc, for anything else with that as a prefix. There probably aren’t keys like that in your bucket, but that seems dangerously sloppy to assume, that’s how we get super weird bugs later.

I guess there’d be no better way than this?

key = "path/to/something.doc"
s3_bucket.object_versions(prefix: key).each do |object_version|
  object_version.delete if object_version.object_key == key

Is there anyone reading this who knows more about this than me, and can say if there’s a better way, or confirm if there isn’t?