Are you talking to Heroku redis in cleartext or SSL?

In a “typical” Redis installation, you might be talking to redis on localhost or on a private network, and clients typically talk to redis in cleartext. Redis doesn’t even natively support communication over SSL. (Or maybe it does now with redis 6?)

However, the Heroku redis add-on (the one from Heroku itself) supports SSL connections via “Stunnel”, a tool popular with other redis users for getting SSL redis connections too. (Or maybe via native redis with redis 6? Not sure if you’d know the difference, or if it matters.)

There are heroku docs on all of this which say:

While you can connect to Heroku Redis without the Stunnel buildpack, it is not recommended. The data traveling over the wire will be unencrypted.

Perhaps especially because on heroku your app does not talk to redis via localhost or on a private network, but on a public network.

But I think I’ve worked on heroku apps before that missed this advice and are still talking to redis in the clear. I just happened to run across it when I got curious about the REDIS_TLS_URL env/config variable I noticed heroku setting.

Which brings us to another thing: that heroku doc is out of date. It doesn’t mention the REDIS_TLS_URL config variable, just the REDIS_URL one. The difference? The TLS version will be a url beginning with rediss:// instead of redis:// (note the extra s), which many redis clients use as a convention for “SSL connection to redis, probably via stunnel since redis itself doesn’t support it”. The heroku docs provide ruby and go examples which instead use REDIS_URL and write code to swap the redis:// for rediss:// and even hard-code port number adjustments, which is silly!

(While I continue to be very impressed with heroku as a product, I keep running into weird things like this outdated documentation, that does not match my experience/impression of heroku’s all-around technical excellence, and makes me worry if heroku is slipping…).

The docs also mention a weird driver: ruby argument for initializing the Redis client; I’m not sure what it does, and it doesn’t seem necessary.

The docs are correct that you have to tell the ruby Redis client not to try to verify the SSL cert against trusted root certs, since this implementation uses a self-signed cert. Otherwise you will get an error that looks like: OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain)

So, it can be as simple as:

redis_client = Redis.new(url: ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })

$redis = redis_client
# and/or
Resque.redis = redis_client

I don’t use sidekiq on this project currently, but to get the SSL connection with VERIFY_NONE, looking at the sidekiq docs, you might have to(?):

redis_conn = proc {
  Redis.new(url: ENV['REDIS_TLS_URL'], ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE })
}

Sidekiq.configure_client do |config|
  config.redis = ConnectionPool.new(size: 5, &redis_conn)
end

Sidekiq.configure_server do |config|
  config.redis = ConnectionPool.new(size: 25, &redis_conn)
end

(Not sure what values you should pick for connection pool size).

While the sidekiq docs mention heroku in passing, they don’t mention the need for SSL connections. I think awareness of this heroku feature and their recommendation that you use it may not actually be common!

Update: Beware REDIS_URL can also be rediss

On one of my apps I saw a REDIS_URL which used redis: and a REDIS_TLS_URL which used (secure) rediss:.

But another app provides *only* a REDIS_URL, and it’s a rediss one, meaning you have to set verify_mode: OpenSSL::SSL::VERIFY_NONE when passing it to the ruby redis client. So you have to be prepared to do this with REDIS_URL values too. I think it shouldn’t hurt to set the ssl_params option even if you pass a non-ssl redis: url, so maybe just set it all the time?

This second app was heroku-20 stack, and the first was heroku-18 stack, is that the difference? No idea.

Is this documented anywhere? I doubt it. It definitely seems sloppy for what I expect of heroku, making me get a bit suspicious of whether heroku is sticking to the really impressive level of technical excellence and documentation I expect from them.

So your best bet is to check for both REDIS_TLS_URL and REDIS_URL, preferring the TLS one if present, and realizing that the REDIS_URL can have a rediss:// value in it too.
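To sketch that logic in code (the helper name is my own, not anything from heroku docs, and I’m setting ssl_params unconditionally on the theory it shouldn’t hurt for a plain redis:// url):

```ruby
require 'openssl'

# Hypothetical helper: prefer REDIS_TLS_URL, fall back to REDIS_URL (which
# may itself be a rediss:// url), and always disable cert verification since
# heroku redis uses a self-signed cert.
def heroku_redis_options(env = ENV)
  {
    url: env['REDIS_TLS_URL'] || env['REDIS_URL'],
    ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE }
  }
end

# Then: redis_client = Redis.new(**heroku_redis_options)
```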

The heroku docs also say you don’t get a secure TLS redis connection on “hobby” plans, but I’m not sure that’s actually true anymore on heroku-20? Not being able to trust the docs is not a good sign.

Comparing performance of a Rails app on different Heroku formations

I develop a “digital collections” or “asset management” app, which manages and makes digitized historical objects and their descriptions available to the public, from the collections here at the Science History Institute.

The app receives a relatively low level of traffic (according to Google Analytics, around 25K pageviews a month), although we want it to be able to handle spikes without falling down. It is not the most performance-optimized app; it has some relatively slow responses and can be RAM-hungry. But it works adequately on our current infrastructure: web traffic is handled on a single AWS EC2 t2.medium instance, with 10 passenger processes (free version of passenger, so no multi-threading).

We are currently investigating the possibility of moving our infrastructure to heroku. After realizing that heroku standard dynos did not seem to have the performance characteristics I had expected, I decided to approach performance testing more methodically, to compare different heroku dyno formations to each other and to our current infrastructure. Our basic research question is probably: what heroku formation do we need to have performance similar to our existing infrastructure?

I am not an expert at doing this — I did some research, read some blog posts, did some thinking, and embarked on this. I am going to lead you through how I approached this and what I found. Feedback or suggestions are welcome. The most surprising result I found was much poorer performance from heroku standard dynos than I expected, and specifically that standard dynos would not match performance of present infrastructure.

What URLs to use in test

Some older load-testing tools only support testing one URL over and over. I decided I wanted to test a larger sample list of URLs, to be a more “realistic” load, and also because repeatedly requesting only one URL might accidentally hit caches in ways you aren’t expecting, giving you unrepresentative results. (Our app does not currently use fragment caching, but caches you might not even be thinking about include postgres’s built-in automatic caches, or passenger’s automatic turbocache (which I don’t think we have turned on).)

My initial thought was to get a list of such URLs from our already-in-production app’s logs, to get a sample of what real traffic looks like. There were a couple of barriers to using production log URLs:

  1. Some of those URLs might require authentication, or be POST requests. The bulk of our app’s traffic is GET requests available without authentication, and I didn’t feel like the added complexity of setting up anything else in the load test was worthwhile.
  2. Our app on heroku isn’t fully functional yet. Without being connected to Solr or background job workers, only certain URLs are available.

In fact, a large portion of our traffic is an “item” or “work” detail page like this one. Additionally, those are the pages that can be the biggest performance challenge, since the current implementation includes a thumbnail for every scanned page or other image, so response time unfortunately scales with number of pages in an item.

So I decided a good list of URLs was simply a representative sample of those “work detail” pages. In fact, rather than a completely random sample, I took the 50 largest/slowest work pages, then added another 150 randomly chosen from our current ~8K pages, and gave them all a randomly shuffled order.

In our app, every time a browser requests a work detail page, the JS on that page makes an additional request for a JSON document that powers our page viewer. So for each of those 200 work detail pages, I added the JSON request URL, for a more “realistic” load, making 400 total URLs.
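The list-building steps above could be sketched like this (the /works path and .json suffix are made-up illustrations, not our actual routes):

```ruby
# Hypothetical sketch of assembling the test URL list: the 50 slowest work
# pages plus 150 randomly chosen others, each paired with the JSON url the
# page viewer requests, then shuffled into random order.
def build_url_list(slowest_ids, all_ids, random_count: 150)
  ids = slowest_ids + (all_ids - slowest_ids).sample(random_count)
  ids.flat_map { |id| ["/works/#{id}", "/works/#{id}/viewer_images_info.json"] }.shuffle
end
```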

Performance: “base speed” vs “throughput under load”

Thinking about it, I realized there were two kinds of “performance” or “speed” to think about.

You might just have a really slow app; to exaggerate, let’s say typical responses are 5 seconds. That’s under low/no traffic: a single browser is the only thing interacting with the app, it makes a single request, and has to wait 5 seconds for a response.

That number might be changed by optimizations or performance regressions in your code (including your dependencies). It might also be changed by moving or changing hardware or virtualization environment — including giving your database more CPU/RAM resources, etc.

But that number will not change by horizontally scaling your deployment (adding more puma or passenger processes or threads, or scaling out hosts with a load balancer or heroku dynos). None of that will change this base speed, because it’s just how long the app takes to prepare a response when not under load: how slow it is in a test with only one web worker, where adding web workers won’t matter because they won’t be used.

Then there’s what happens to the app actually under load from multiple users at once. The base speed is kind of a lower bound on throughput under load: page response time is never going to get better than 5s for our hypothetical very slow app (without changing the underlying base speed). But it can get a lot worse if it’s hammered by traffic. This throughput under load can be affected not only by changing base speed, but also by various forms of horizontal scaling: how many puma or passenger processes you have with how many threads each, and how many CPUs they have access to, as well as the number of heroku dynos or other hosts behind a load balancer.
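A toy way to see why base speed bounds throughput (an idealized formula, ignoring CPU contention between workers):

```ruby
# If each single-threaded worker takes base_speed_s seconds per response,
# then no matter the traffic, peak throughput is at most workers / base_speed_s
# requests per second, and no request finishes faster than base_speed_s.
def max_throughput(workers:, base_speed_s:)
  workers / base_speed_s.to_f
end

max_throughput(workers: 10, base_speed_s: 5.0) # at most 2 requests/sec
```

Adding workers raises the ceiling on throughput, but only making base speed faster can lower per-request latency.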

(I had been thinking about this distinction already, but Nate Berkopec’s great blog post on scaling Rails apps gave me the “speed” vs “throughput” terminology to use.)

For my situation, we are not changing the code at all. But we are changing the host architecture from a manual EC2 t2.medium to heroku dynos (of various possible types) in a way that could affect base speed, and we’re also changing our scaling architecture in a way that could change throughput under load on top of that: from one t2.medium with 10 passenger processes to possibly multiple heroku dynos behind heroku’s load balancer, and also (for Reasons) switching from free passenger to trying puma with multiple threads per process. (We are running puma 5 with the new experimental performance features turned on.)

So we’ll want to get a sense of base speed of the various host choices, and also look at how throughput under load changes based on various choices.

Benchmarking tool: wrk

We’re going to use wrk.

There are LOTS of choices for HTTP benchmarking/load testing, with really varying complexity and from different eras of web history. I got a bit overwhelmed by it, but settled on wrk. Some other choices didn’t have all the features we need (some way to test a list of URLs, with at least some limited percentile distribution reporting). Others were much more flexible and complicated and I had trouble even figuring out how to use them!

wrk does need a custom lua script in order to handle a list of URLs. I found a nice script here, and modified it slightly to take filename from an ENV variable, and not randomly shuffle input list.

It’s a bit confusing understanding the meaning of “threads” vs “connections” in wrk arguments. This blog post from appfolio clears it up a bit. I decided to leave threads set to 1, and vary connections for load — so -c1 -t1 is a “one URL at a time” setting we can use to test “base speed”, and we can benchmark throughput under load by increasing connections.

We want to make sure we run the test for long enough to touch all 400 URLs in our list at least once, even in the slower setups, to have a good comparison. Ideally it would go through the list more than once, but for my own ergonomics I had to get through a lot of tests, so it ended up less than ideal. (Should I have put fewer than 400 URLs in? Not sure.)
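As a quick sanity check on the 3-minute run length, here’s the arithmetic using the requests/sec wrk reported for the slowest single-connection setup later in this post (~2.37 req/s on a single standard-2x):

```ruby
# Does a 3-minute run at the slowest observed rate cover the 400-url list?
url_count   = 400
slowest_rps = 2.37        # req/s wrk reported on a standard-2x, one connection
duration_s  = 3 * 60
passes = (slowest_rps * duration_s) / url_count
# passes comes out just over 1, so the slowest formation barely touches
# every URL once, which is why a longer run would have been better.
```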

Conclusions in advance

As benchmarking posts go (especially when I’m the one writing them), I’m about to drop a lot of words and data on you. So to maximize the audience that sees the conclusions (because they surprise me, and I want feedback/pushback on them), I’m going to give you some conclusions up front.

Our current infrastructure has the web app on a single EC2 t2.medium, which is a burstable EC2 type; our relatively low-traffic app does not exhaust its burst credits. Measuring base speed (just one concurrent request at a time), we found that performance dynos seem to have about the CPU speed of a bursting t2.medium (just a hair slower).

But standard dynos are as a rule 2 to 3 times slower. Additionally they are highly variable, and that variability can play out over hours/days: a 3-minute period can have measured response times 2 or more times slower than another 3-minute period a couple hours later. But they seem to typically be 2-3x slower than our current infrastructure.

Under load, they scale about how you’d expect if you knew how many CPUs are present, no real surprises. Our existing t2.medium has two CPUs, so can handle 2 simultaneous requests as fast as 1, and after that degrades linearly.

A single performance-L ($500/month) has 4 CPUs (8 hyperthreads), so scales under load much better than our current infrastructure.

A single performance-M ($250/month) has only 1 CPU (!), so scales pretty terribly under load.

Testing scaling with 4 standard-2x’s ($200/month total), we see that it scales relatively evenly, although lumpily because of variability; and it starts out performing so much worse that even as it scales “evenly” it’s still out-performed by all the other architectures. :( (At these relatively fast median response times you might say it’s still fast enough, who cares; but in our fat tail of slower pages it gets more distressing.)

Now we’ll give you lots of measurements, or you can skip all that to my summary discussion or conclusions for our own project at the end.

Let’s compare base speed

OK, let’s get to actual measurements! For “base speed” measurements, we’ll be telling wrk to use only one connection and one thread.

Existing t2.medium: base speed

Our current infrastructure is one EC2 t2.medium. This EC2 instance type has two vCPUs and 4GB of RAM. On that single EC2 instance, we run passenger (free not enterprise) set to have 10 passenger processes, although the base speed test with only one connection should only touch one of the workers. The t2 is a “burstable” type, and we do always have burst credits (this is not a high-traffic app; I verified we never exhausted burst credits in these tests), so our test load may be taking advantage of burst CPU.

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://[current staging server]
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://staging-digital.sciencehistory.org
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   311.00ms  388.11ms   2.37s    86.45%
     Req/Sec    11.89      8.96    40.00     69.95%
   Latency Distribution
      50%   90.99ms
      75%  453.40ms
      90%  868.81ms
      99%    1.72s
   966 requests in 3.00m, 177.43MB read
 Requests/sec:      5.37
 Transfer/sec:      0.99MB

I’m actually feeling pretty good about those numbers on our current infrastructure! 90ms median is not bad, and even 453ms 75th percentile is not too bad. Now, our test load involves some JSON responses that are quicker to deliver than the corresponding HTML pages, but still pretty good. The 90th/99th percentiles and max request (2.37s) aren’t great, but I knew I had some slow pages; this matches my previous understanding of how slow they are in our current infrastructure.

90th percentile is ~9 times the 50th percentile.

I don’t have an understanding of why the two different Req/Sec and Requests/sec values are so different, and don’t totally understand what to do with the Stdev and +/- Stdev values, so I’m just going to stick to looking at the latency percentiles; I think “latency” could also be called “response time” here.

But OK, this is our baseline for this workload. Doing this 3-minute test at various points over the past few days, I can say it’s nicely regular and consistent; occasionally I got a slower run, but the 50th percentile was usually 90ms–105ms, right around there.

Heroku standard-2x: base speed

From previous mucking about, I learned I can only reliably fit one puma worker in a standard-1x, and heroku says “we typically recommend a minimum of 2 processes, if possible” (for routing algorithmic reasons when scaled to multiple dynos), so I am just starting at a standard-2x with two puma workers each with 5 threads, matching heroku recommendations for a standard-2x dyno.

So one thing I discovered is that benchmarks from a heroku standard dyno are really variable, but here are typical ones:

$ heroku dyno:resize
 type     size         qty  cost/mo
 ───────  ───────────  ───  ───────
 web      Standard-2X  1    50

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   645.08ms  768.94ms   4.41s    85.52%
     Req/Sec     5.78      4.36    20.00     72.73%
   Latency Distribution
      50%  271.39ms
      75%  948.00ms
      90%    1.74s
      99%    3.50s
   427 requests in 3.00m, 74.51MB read
 Requests/sec:      2.37
 Transfer/sec:    423.67KB

I had heard that heroku standard dynos would have variable performance, because they are shared multi-tenant resources. I had been thinking of this like during a 3 minute test I might see around the same median with more standard deviation — but instead, what it looks like to me is that running this benchmark on Monday at 9am might give very different results than at 9:50am or Tuesday at 2pm. The variability is over a way longer timeframe than my 3 minute test — so that’s something learned.

Running this here and there over the past week, the above results seem to me typical of what I saw. (To get better than “seem typical” on this resource, you’d have to run a test, over several days or a week I think, probably not hammering the server the whole time, to get a sense of actual statistical distribution of the variability).

I sometimes saw tests that were quite a bit slower than this, up to a 500ms median. I rarely if ever saw results much faster than this on a standard-2x. 90th percentile is ~6x median, a lower ratio than on my current infrastructure, but that still gets up there: 1.74s instead of 868ms.

This typical run is quite a bit slower than our current infrastructure: its median response time is 3x ours, with the 90th percentile and max being around 2x. This was worse than I expected.

Heroku performance-m: base speed

Although we might be able to fit more puma workers in RAM, we’re running a single-connection base speed test, so it shouldn’t matter, and we won’t adjust it.

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-M  1    250

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

$ URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   377.88ms  481.96ms   3.33s    86.57%
     Req/Sec    10.36      7.78    30.00     37.03%
   Latency Distribution
      50%  117.62ms
      75%  528.68ms
      90%    1.02s
      99%    2.19s
   793 requests in 3.00m, 145.70MB read
 Requests/sec:      4.40
 Transfer/sec:    828.70KB

This is a lot closer to the ballpark of our current infrastructure. It’s a bit slower (117ms median instead of 90ms median), but in running this now and then over the past week it was remarkably, thankfully, consistent. Median and 99th percentile are both 28% slower (makes me feel comforted that those numbers are the same in these two runs!); that doesn’t bother me so much if it’s predictable and regular, which it appears to be. The max appears to me still a little bit less regular on heroku for some reason; since performance dynos are supposed to be non-shared AWS resources, you wouldn’t expect that, but slow requests are slow, ok.

90th percentile is ~9x median, about the same as my current infrastructure.

Heroku performance-l: base speed

$ heroku dyno:resize
 type     size           qty  cost/mo
 ───────  ─────────────  ───  ───────
 web      Performance-L  1    500

$ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS
 WEB_CONCURRENCY=2
 RAILS_MAX_THREADS=5

URLS=./sample_works.txt  wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/
 multiplepaths: Found 400 paths
 multiplepaths: Found 400 paths
 Running 3m test @ https://scihist-digicoll.herokuapp.com/
   1 threads and 1 connections
   Thread Stats   Avg      Stdev     Max   +/- Stdev
     Latency   471.29ms  658.35ms   5.15s    87.98%
     Req/Sec    10.18      7.78    30.00     36.20%
   Latency Distribution
      50%  123.08ms
      75%  635.00ms
      90%    1.30s
      99%    2.86s
   704 requests in 3.00m, 130.43MB read
 Requests/sec:      3.91
 Transfer/sec:    741.94KB

No news is good news, it looks very much like performance-m, which is exactly what we expected, because this isn’t a load test. It tells us that performance-m and performance-l seem to have similar CPU speeds and similar predictable non-variable regularity, which is what I find running this test periodically over a week.

90th percentile is ~10x median, about the same as current infrastructure.

The higher max is just more evidence of what I mentioned: the speed of the slowest requests did seem to vary more than on our manual t2.medium; I can’t really explain why.

Summary: Base speed

Not sure how helpful this visualization is, charting 50th, 75th, and 90th percentile responses across architectures.

But basically: performance dynos perform similarly to my (bursting) t2.medium. Can’t explain why performance-l seems slightly slower than performance-m, might be just incidental variation when I ran the tests.

The standard-2x is about twice as slow as my (bursting) t2.medium. Again recall standard-2x results varied a lot every time I ran them, the one I reported seems “typical” to me, that’s not super scientific, admittedly, but I’m confident that standard-2x are a lot slower in median response times than my current infrastructure.

Throughput under load

OK, now we’re going to tell wrk to use more connections. In fact, I’ll test each setup with various numbers of connections, and graph the results, to get a sense of how each formation handles throughput under load. (This means a lot of minutes to get all these results, at 3 minutes per connection-count test, per formation!)

An additional thing we can learn from this test, on heroku we can look at how much RAM is being used after a load test, to get a sense of the app’s RAM usage under traffic to understand the maximum number of puma workers we might be able to fit in a given dyno.

Existing t2.medium: Under load

A t2.medium has 4G of RAM and 2 CPUs. We run 10 passenger workers (no multi-threading, since we have the free rather than enterprise passenger). So what do we expect? With 2 CPUs and more than 2 workers, I’d expect it to handle 2 simultaneous streams of requests almost as well as 1; 3-10 should be quite a bit slower because they are competing for the 2 CPUs. Over 10, performance will probably become catastrophic.

2 connections are exactly flat with 1, as expected for our two CPUs, hooray!

Then it goes up at a strikingly even line. Going over 10 (to 12) simultaneous connections doesn’t matter, even though we’ve exhausted our workers, I guess at this point there’s so much competition for the two CPUs already.

The slope of this curve is really nice too, actually. Without load, our median response time is 100ms, but even at a totally overloaded 12 connections, it’s only 550ms, which actually isn’t too bad.

We can make a graph that in addition to median also has 75th, 90th, and 99th percentile response time on it:

It doesn’t tell us too much; it tells us the upper percentiles rise at about the same rate as the median. At 1 simultaneous connection 90th percentile of 846ms is about 9 times the median of 93ms; at 10 requests the 90th percentile of 3.6 seconds is about 8 times the median of 471ms.

This does remind us that under load, when things get slow, this has more of a disastrous effect on already-slow requests than on fast requests. When not under load, even our 90th percentile was kind of sort of barely acceptable at 846ms, but under load at 3.6 seconds it really isn’t.

Single Standard-2X dyno: Under load

A standard-2X dyno has 1G of RAM. The (amazing, excellent, thanks schneems) heroku puma guide suggests running two puma workers with 5 threads each. At first I wanted to try running three workers, which seemed to fit into available RAM — but under heavy load-testing I was getting Heroku R14 Memory Quota Exceeded errors, so we’ll just stick with the heroku docs recommendations. Two workers with 5 threads each fit with plenty of headroom.
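For concreteness, a config/puma.rb along the lines of the heroku puma guide’s pattern, driven by the WEB_CONCURRENCY and RAILS_MAX_THREADS config vars shown earlier (a sketch, not our exact config):

```ruby
# config/puma.rb sketch: worker and thread counts come from env vars,
# defaulting to the 2-workers/5-threads the heroku puma guide suggests
# for a standard-2x dyno.
workers Integer(ENV.fetch('WEB_CONCURRENCY', 2))

threads_count = Integer(ENV.fetch('RAILS_MAX_THREADS', 5))
threads threads_count, threads_count

# Load the app before forking workers, for copy-on-write memory savings
preload_app!
```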

A standard-2x dyno runs on shared (multi-tenant) underlying Amazon virtual hardware. So while it is running on hardware with 4 CPUs (each of which can run two “hyperthreads”), the puma doc suggests “it is best to assume only one process can execute at a time” on standard dynos.

What do we expect? Well, if it really only had one CPU, it would immediately start getting bad at 2 simultaneous connections, and just get worse from there. When we exceed the two-worker count, will it get even worse? What about when we exceed the 10-thread (2 workers * 5 threads) count?

You’d never run just one dyno if you were expecting this much traffic, you’d always horizontally scale. This very artificial test is just to get a sense of its characteristics.

Also, we remember that standard-2x’s are just really variable; I could get much worse or better runs than this, but graphed numbers from a run that seemed typical.

Well, it really does act like 1 CPU, 2 simultaneous connections is immediately a lot worse than 1.

The line isn’t quite as straight as in our existing t2.medium, but it’s still pretty straight; I’d attribute the slight lumpiness to just the variability of shared-architecture standard dyno, and figure it would get perfectly straight with more data.

It degrades at about the same rate as our baseline t2.medium, but when you start out slower, that’s more disastrous. Our t2.medium at an overloaded 10 simultaneous requests is at 473ms (pretty tolerable actually), 5 times the median at one request. This standard-2x has a median response time of 273ms at only one simultaneous request, and at an overloaded 10 requests its median is also about 5x worse, but that becomes a less tolerable 1480ms.

Does also graphing the 75th, 90th, and 99th percentile tell us much?

Eh, I think the lumpiness is still just standard shared-architecture variability.

The rate of “getting worse” as we add more overloaded connections is actually a bit better than it was on our t2.medium, but since it already starts out so much slower, we’ll just call it a wash. (On t2.medium, 90th percentile without load is 846ms and under an overloaded 10 connections 3.6s. On this single standard-2x, it’s 1.8s and 5.2s).

I’m not sure how much these charts with various percentiles on them tell us, so I won’t include them for every architecture from here on.

standard-2x, 4 dynos: Under load

OK, realistically we already know you shouldn’t have just one standard-2x dyno under that kind of load. You’d scale out, either manually or perhaps using something like the neat Rails Autoscale add-on.

Let’s measure with 4 dynos. Each is still running 2 puma workers, with 5 threads each.

What do we expect? Hm, treating each dyno as if it has only one CPU, we’d expect it to be able to handle traffic pretty levelly up to 4 simultaneous connections, distributed to 4 dynos. It’s going to do worse after that, but up to 8 there is still one puma worker per connection, so it might get even worse after 8?

Well… I think that actually is relatively flat from 1 to 4 simultaneous connections, except for lumpiness from variability. But lumpiness from variability is huge! We’re talking 250ms median measured at 1 connection, up to 369ms measured median at 2, down to 274ms at 3.

And then maybe yeah, a fairly shallow slope up to 8 simultaneous connections, then steeper.

But it’s all a fairly shallow slope compared to our base t2.medium. At 8 connections (after which we pretty much max out), the standard-2x median of 464ms is only 1.8 times the median at 1 connection. Compare the t2.medium increase of 3.7 times.

As we’d expect, scaling out to 4 dynos (with four CPUs/8 hyperthreads) helps us scale well; the problem is the baseline is so slow to begin with (with very high bounds of variability regularly making it even slower).

performance-m: Under load

A performance-m has 2.5 GB of memory. It only has one physical CPU, although two “vCPUs” (two hyperthreads); and these are all dedicated to your app, it is not shared.

By testing under load, I demonstrated I could actually fit 12 workers on there without any memory limit errors. But is there any point to doing that with only 1 CPU (2 hyperthreads)? Under a bit of testing, it appeared not.

The heroku puma docs recommend only 2 processes with 5 threads. You could do a whole little mini-experiment just trying to measure/optimize process/thread count on performance-m! We’ve already got too much data here, but in some experimentation it looked to me like 5 processes with 2 threads each performed better (and certainly no worse) than 2 processes with 5 threads — if you’ve got the RAM just sitting there anyway (as we do), why not?

I actually tested with 6 puma processes with 2 threads each. There is still a large amount of RAM headroom we aren’t going to use even under load.

What do we expect? Well, with the 2 “hyperthreads” perhaps it can handle 2 simultaneous requests nearly as well as 1 (or not?); after that, we expect it to degrade quickly same as our original t2.medium did.

It can handle 2 connections slightly better than you’d expect if there really were only 1 CPU, so I guess a hyperthread does give you something. Then the slope picks up, as you’d expect; and it looks like it does get steeper after 4 simultaneous connections, yup.

performance-l: Under load

A performance-l ($500/month) costs twice as much as a performance-m ($250/month), but has far more than twice the resources. performance-l has a whopping 14GB of RAM compared to performance-m’s 2.5GB; and performance-l has 4 real CPUs (8 hyperthreads) available to use (visible using the nproc technique in the heroku puma article).

Because we have plenty of RAM to do so, we’re going to run 10 worker processes to match our original t2.medium’s. We still ran with 2 threads, just cause it seems like maybe you should never run a puma worker with only one thread? But who knows, maybe 10 workers with 1 thread each would perform better; plenty of room (but not plenty of my energy) for yet more experimentation.

What do we expect? The graph should be pretty flat up to 4 simultaneous connections, then it should start getting worse, pretty evenly as simultaneous connections rise all the way up to 12.

It is indeed pretty flat up to 4 simultaneous connections. Then up to 8 it’s still not too bad — the median at 8 is only ~1.5x the median at 1(!). Then it gets worse after 8 (oh yeah, 8 hyperthreads?).

But the slope is wonderfully shallow all the way. Even at 12 simultaneous connections, the median response time of 266ms is only 2.5x what it was at one connection. (In our original t2.medium, at 12 simultaneous connections median response time was over 5x what it was at 1 connection).

This thing is indeed a monster.

Summary Comparison: Under load

We showed a lot of graphs that look similar, but they all had different scales on the y-axis. Let’s plot median response times under load of all architectures on the same graph, and see what we’re really dealing with.

The blue t2.medium is our baseline, what we have now. We can see that there isn’t really a similar heroku option, we have our choice of better or worse.

The performance-l is just plain better than what we have now. It starts out performing about the same as what we have now for 1 or 2 simultaneous connections, but then scales so much flatter.

The performance-m also starts out about the same, but scales so much worse than even what we have now. (It’s that 1 real CPU instead of 2, I guess?)

The standard-2x scaled to 4 dynos… has its own characteristics. Its baseline is pretty terrible: it’s 2 to 3 times as slow as what we have now even not under load. But then it scales pretty well — since it’s 4 dynos after all, it doesn’t get worse as fast as performance-m does. But it started out so bad that it remains far worse than our original t2.medium even under load. Adding more dynos to standard-2x will help it remain steady under even higher load, but won’t help its underlying problem: it’s just slower than everyone else.

Discussion: Thoughts and Surprises

  • I had been thinking of a t2.medium (even with burst) as “typical” (it is after all much slower than my 2015 Macbook), and had been assuming (in retrospect with no particular basis) that a heroku standard dyno would perform similarly.
    • Most discussion and heroku docs, as well as the naming itself, suggest that a ‘standard’ dyno is, well, standard, and performance dynos are for “super scale, high traffic apps”, which is not me.
    • But in fact, heroku standard dynos are much slower and more variable in performance than a bursting t2.medium. I suspect they are slower than other non-heroku options you might consider “typical”.



  • My conclusion is honestly that “standard” dynos are really “for very fast, well-optimized apps that can handle slow and variable CPU” and “performance” dynos are really “standard, matching the CPU speeds you’d get from a typical non-heroku option”. But this is not how they are documented or usually talked about. Are other people having really different experiences/conclusions than me? If so, why, or where have I gone wrong?
    • This of course has implications for estimating your heroku budget if considering switching over. :(
    • If you have a well-optimized fast app, say even 95th percentile is 200ms (on bursting t2.medium), then you can handle standard slowness — so what if your 95th percentile is now 600ms (and during some time periods even much slower, 1s or worse, due to variability)? That’s not so bad for a 95th percentile.
    • One way to get a very fast app is of course caching. There is lots of discussion of using caching in Rails; sometimes the message (explicit or implicit) is “you have to use lots of caching to get reasonable performance cause Rails is so slow.” What if many of these people are on heroku, and it’s really “you have to use lots of caching to get reasonable performance on a heroku standard dyno”??
    • I personally don’t think caching is maintenance free; in my experience properly doing cache invalidation and dealing with significant processing spikes needed when you choose to invalidate your entire cache (cause cached HTML needs to change) lead to real maintenance/development cost. I have not needed caching to meet my performance goals on present architecture.
    • Everyone doesn’t necessarily have the same performance goals/requirements. Mine, for a low-traffic non-commercial site, are maybe more modest — I just need users not to be super annoyed. But whatever your performance goals, you’re going to have to spend more time on optimization on a heroku standard dyno than on something with a much faster CPU — like a standard affordable mid-tier EC2. Am I wrong?


  • One significant factor on heroku standard dyno performance is that they use shared/multi-tenant infrastructure. I wonder if they’ve actually gotten lower performance over time, as many customers (who you may be sharing with) have gotten better at maximizing their utilization, so the shared CPUs are typically more busy? Like the proverbial boiling frog, maybe nobody noticed that standard dynos have become lower performance? I dunno, brainstorming.
    • Or maybe there are so many apps that start on heroku instead of switching from somewhere else, that people just don’t realize that standard dynos are much slower than other low/mid-tier options?
    • I was expecting to pay a premium for heroku — but even standard-2x’s are a significant premium over paying for t2.medium EC2 yourself, one I found quite reasonable…. performance dynos are of course even more premium.


  • I had a sort of baked-in premise that most Rails apps are “IO-bound” — they spend more time waiting on IO than using CPU. I don’t know where I got that idea; I heard it once a long time ago and it became part of my mental model. I now do not believe this is true of my app, and I do not in fact believe it is true of most Rails apps in 2020. I would hypothesize that most Rails apps today are in fact CPU-bound.

  • The performance-m dyno only has one CPU. I had somehow also been assuming that it would have two CPUs — I’m not sure why, maybe just because at that price! It would be a much better deal with two CPUs.
    • Instead we have a huge jump from $250 performance-m to $500 performance-l that has 4x the CPUs and ~5x the RAM.
    • So it doesn’t make financial sense to have more than one performance-m dyno, you might as well go to performance-l. But this really complicates auto-scaling, whether using Heroku’s feature, or the awesome Rails Autoscale add-on. I am not sure I can afford a performance-l all the time, and a performance-m might be sufficient most of the time. But if 20% of the time I’m going to need more (or even 5%, or even unexpectedly-mentioned-in-national-media), it would be nice to set things up to autoscale up…. I guess to a financially irrational 2 or more performance-m’s? :(

  • The performance-l is a very big machine, significantly beefier than my current infrastructure. And it has far more RAM than I need/can use with only 4 physical cores. If I consider standard dynos to be pretty effectively low-tier (as I do), heroku to me is kind of missing mid-tier options. A 2 CPU option at 2.5G or 5G of RAM would make a lot of sense to me, and actually be exactly what I need… really I think performance-m would make more sense with 2 CPUs at its existing already-premium price point, to earn being called a “performance” dyno. Maybe heroku is intentionally trying to set options to funnel people to the highest-priced performance-l.

Conclusion: What are we going to do?

In my investigations of heroku, my opinion of the developer UX and general service quality only increases. It’s a great product, that would increase our operational capacity and reliability, and substitute for so many person-hours of sysadmin/operational time if we were self-managing (even on cloud architecture like EC2).

But I had originally been figuring we’d use standard dynos (even more affordably, possibly auto-scaled with Rails Autoscale plugin), and am disappointed that they end up looking so much lower performance than our current infrastructure.

Could we use them anyway? Response time going from 100ms to 300ms — hey, 300ms is still fine, even if I’m sad to lose those really nice numbers I got from a bit of optimization. But this app has a wide long-tail; our 75th percentile going from 450ms to 1s, our 90th percentile going from 860ms to 1.74s, and our 99th going from 2.3s to 4.4s — a lot harder to swallow. Especially when we know that due to standard dyno variability, a slow-ish page that on my present architecture is reliably 1.5s could really be anywhere from 3 to 9 seconds(!) on heroku.

I would anticipate having to spend a lot more developer time on optimization on heroku standard dynos — or, in this small over-burdened non-commercial shop, not prioritizing that (or not having the skills for it), and having our performance just get bad.

So I’m really reluctant to suggest moving our app to heroku with standard dynos.

A performance-l dyno is going to let us not have to think about performance any more than we do now, while scaling under high-traffic better than we do now — I suspect we’d never need to scale to more than one performance-l dyno. But it’s pricey for us.

A performance-m dyno has a base-speed that’s fine, but scales very poorly and unaffordably. Doesn’t handle an increase in load very well as one dyno, and to get more CPUs you have to pay far too much (especially compared to standard dynos I had been assuming I’d use).

So I don’t really like any of my options. If we do heroku, maybe we’ll try a performance-m, and “hope” our traffic is light enough that a single one will do? Maybe with Rails autoscale for traffic spikes, even though 2 performance-m dynos isn’t financially efficient? If we are scaling to 2 (or more!) performance-m’s more than very occasionally, switch to performance-l, which means we need to make sure we have the budget for it?

Deep Dive: Moving ruby projects from Travis to Github Actions for CI

So this is one of my super wordy posts, if that’s not your thing abort now, but some people like them. We’ll start with a bit of context, then get to some detailed looks at Github Actions features I used to replace my travis builds, with example config files and examination of options available.

For me, by “Continuous Integration” (CI), I mostly mean “Running automated tests automatically, on your code repo, as you develop”, on every PR and sometimes with scheduled runs. Other people may mean more expansive things by “CI”.

For a lot of us, our first experience with CI was when Travis-ci started to become well-known, maybe 8 years ago or so. Travis was free for open source, and so darn easy to set up and use — especially for Rails projects; it was a time when it still felt like most services focused on docs and smooth fit for ruby and Rails specifically. I had heard of doing CI, but as a developer in a very small non-profit shop, I wanted to spend time writing code not setting up infrastructure, and would have had to get any for-cost service approved up the chain from our limited budget. But it felt like I could almost just flip a switch and have Travis working on ruby or rails projects — and for free!

Free for open source wasn’t entirely selfless, I think it’s part of what helped Travis literally define the market. (Btw, I think they were the first to invent the idea of a “badge” URL for a github readme?) Along with an amazing Developer UX (which is today still a paragon), it just gave you no reason not to use it. And then once using it, it started to seem insane to not have CI testing, nobody would ever again want to develop software without the build status on every PR before merge.

Travis really set a high bar for ease of use in a developer tool. You didn’t need to think about it much; it just did what you needed, and told you what you needed to know in its read-outs. I think it’s an impressive engineering product. But then.

End of an era

Travis will no longer be supporting open source projects with free CI.

The free open source travis projects originally ran on travis-ci.org, with paid commercial projects on travis-ci.com. In May 2018, they announced they’d be unifying these on travis-ci.com only, but with no announced plan that the policy for free open source would change. This migration seemed to proceed very slowly though.

Perhaps because it was part of preparing the company for a sale, in Jan 2019 it was announced private equity firm Idera had bought travis. At the time the announcement said “We will continue to maintain a free, hosted service for open source projects,” but knowing what “private equity” usually means, some were concerned for the future. (HN discussion).

While the FAQ on the migration to travis-ci.com still says that travis-ci.org should remain reliable until projects are fully migrated, in fact over the past few months travis-ci.org projects largely stopped building, as travis apparently significantly reduced resources on the platform. Some people began manually migrating their free open source projects to travis-ci.com where builds still worked. But, while the FAQ also still says “Will Travis CI be getting rid of free users? Travis CI will continue to offer a free tier for public or open-source repositories on travis-ci.com” — in fact, travis announced that they are ending the free service for open source. The “free tier” is a limited trial (available not just to open source), and when it expires, you can pay, or apply to a special program for an extension, over and over again.

They are contradicting themselves enough that, while I’m not sure exactly what is going to happen, I no longer trust them as a service.

Enter Github Actions

I work mostly on ruby and Rails projects. They are all open source, and almost all of them use travis. So while (once moved to travis-ci.com) they are all currently working, it’s time to start moving them somewhere else, before I have dozens of projects with broken CI and still don’t know how to move them. And the new home needs to be free — many of these projects are zero-budget old-school “volunteer” or “informal multi-institutional collaboration” open source.

There might be several other options, but the one I chose is Github Actions — my sense is that it has gotten mature enough to start approaching travis’ level of polish, all of my projects are github-hosted, and Github Actions is free for unlimited use for open source (pricing page; Aug 2019 announcement of free for open source). And we are really fortunate that it became mature and stable in time for travis to withdraw open source support (if travis had been a year earlier, we’d be in trouble).

Github Actions is really powerful. It is built to do probably WAY MORE than travis does — definitely way beyond “automated testing”, from various flows for deployment and artifact release to really just about any kind of process for managing your project you want. The logic you can write is almost unlimited, all running on github’s machines.

As a result though…. I found it a bit overwhelming to get started. The Github Actions docs are just overwhelmingly abstract; there is so much there, you can do almost anything — but I don’t actually want to learn a new platform, I just want to get automated test CI for my ruby project working! There are some language/project-specific Guides available, for node.js, python, a few different Java setups — but not for ruby or Rails! My how Rails has fallen, from when most services like this would be focusing on Rails use cases first. :(

There are some third-party guides available that might focus on ruby/rails, but one of the problems is that Actions has been evolving for a few years with some pivots, so it’s easy to find outdated instructions. One helpful orientation I found was this Drifting Ruby screencast. This screencast showed me there is a kind of limited web UI with integrated docs searcher — but I didn’t end up using it, I just created the text config file by hand, same as I would have for travis. Github provides templates for “ruby” or “ruby gem”, but the Drifting Ruby screencast said “these won’t really work for our ruby on rails application so we’ll have to set up one manually”, so that’s what I did too. ¯\_(ツ)_/¯

But the cost of all the power github Actions provides is… there are a lot more switches and dials to understand and get right (and maintain over time and across multiple projects). I’m not someone who likes copy-paste without understanding it, so I spent some time trying to understand the relevant options and alternatives; in the process I found some things I might have otherwise copy-pasted from other people’s examples that could be improved. So I give you the results of my investigations, to hopefully save you some time, if wordy comprehensive reports are up your alley.

A Simple Test Workflow: ruby gem, test with multiple ruby versions

Here’s a file for a fairly simple test workflow. You can see it’s in the repo at .github/workflows. The name of the file doesn’t matter — while this one is called ruby.yml, I’ve since moved over to naming the file to match the name: key in the workflow for easier traceability, so I would have called it ci.yml instead.

Triggers

You can see we say that this workflow should be run on any push to master branch, and also for any pull_request at all. Many other examples I’ve seen define pull_request: branches: ["main"], which seems to mean only run on Pull Requests with main as the base. While that’s most of my PR’s, if there is ever a PR that uses another branch as a base for whatever reason, I still want to run CI! While hypothetically you should be able to leave branches out to mean “any branch”, I only got it to work by explicitly saying branches: ["**"]
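A sketch of what that triggers section looks like (the branch names here are just examples):

```yaml
on:
  push:
    branches: [ master ]     # pushes to master only
  pull_request:
    branches: [ "**" ]       # PRs against any base branch
```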

Matrix

For this gem, we want to run CI on multiple ruby versions. You can see we define them here. This works similarly to travis matrixes. If you have more than one matrix variable defined, the workflow will run for every combination of variables (hence the name “matrix”).

      matrix:
        ruby: [ '2.4.4', '2.5.1', '2.6.1', '2.7.0', 'jruby-9.1.17.0', 'jruby-9.2.9.0' ]

In a given run, the current value of the matrix variables is available in the github actions “context”, which you can access as eg ${{ matrix.ruby }}. You can see how I use that in the name, so that the job will show up with its ruby version in it.

    name: Ruby ${{ matrix.ruby }}

Ruby install

While Github itself provides an action for ruby install, it seems most people are using this third-party action. Which we reference as `ruby/setup-ruby@v1`.

You can see we use the matrix.ruby context to tell the setup-ruby action what version of ruby to install, which works because our matrix values are the correct values recognized by the action. Which are documented in the README, but note that values like jruby-head are also supported.

Note, although it isn’t clearly documented, you can say 2.4 to mean “latest available 2.4.x” (rather than it meaning “2.4.0”), which is hugely useful, and I’ve switched to doing that. I don’t believe that was available via travis/rvm ruby install feature.

For a project that isn’t testing under multiple rubies, if we left out the with: ruby-version, the action will conveniently use a .ruby-version file present in the repo.

Note you don’t need to put a gem install bundler into your workflow yourself. While I’m not sure it’s clearly documented, I found the ruby/setup-ruby action does this for you (installing the latest available bundler, instead of whatever was packaged with your ruby version), regardless of whether you are using the bundler-cache feature (see below).
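Putting the pieces so far together, here’s a hedged sketch of the whole job — the matrix values and the `bundle exec rake test` command are just stand-ins for whatever your project uses:

```yaml
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        ruby: [ '2.6', '2.7' ]          # example versions; use your own list
    name: Ruby ${{ matrix.ruby }}
    steps:
      - uses: actions/checkout@v2
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: ${{ matrix.ruby }}   # no `gem install bundler` step needed
      - run: bundle install
      - run: bundle exec rake test           # stand-in for your actual test command
```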

Note on How Matrix Jobs Show Up to Github

With travis, testing for multiple ruby or rails versions with a matrix, we got one (or, well, actually two) jobs showing up on the Github PR:

Each of those lines summarizes a collection of matrix jobs (eg different ruby versions). If any of the individual jobs within the matrix failed, the whole build would show up as failed. Success or failure, you could click on “Details” to see each job and its status:

I thought this worked pretty well — especially for “green” builds I really don’t need to see the details on the PR, the summary is great, and if I want to see the details I can click through, great.

With Github Actions, each matrix job shows up directly on the PR. If you have a large matrix, it can be… a lot. Some of my projects have way more than 6. On PR:

Maybe it’s just because I was used to it, but I preferred the Travis way. (This also makes me think maybe I should change the name key in my workflow to say eg CI: Ruby 2.4.4 to be more clear? Oops, tried that, it just looks even weirder in other GH contexts, not sure.)

Oh, also, that travis way of doing the build twice, once for “pr” and once for “push”? Github Actions doesn’t seem to do that, it just does one, I think corresponding to travis “push”. While the travis feature seemed technically smart, I’m not sure I ever actually saw one of these builds pass while the other failed in any of my projects, I probably won’t miss it.

Badge

Did you have a README badge for travis? Don’t forget to swap it for equivalent in Github Actions.

The image url looks like: https://github.com/$OWNER/$REPOSITORY/workflows/$WORKFLOW_NAME/badge.svg?branch=master, where $WORKFLOW_NAME of course has to be URL-escaped if it contains spaces etc.

The github page at https://github.com/owner/repo/actions, if you select a particular workflow/branch, does, like travis, give you a badge URL/markdown you can copy/paste if you click on the three-dots and then “Create status badge”. Unlike travis, what it gives you to copy/paste is just image markdown, it doesn’t include a link.

But I definitely want the badge to link to viewing the results of the last build in the UI. So I do it manually: limit to the specific workflow and branch that you made the badge for in the UI, then just copy and paste the URL from the browser. It’s a bit confusing markdown to construct manually; here’s what it ended up looking like for me:
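A hedged reconstruction of that markdown — $OWNER, $REPOSITORY, and the workflow name CI are placeholders to fill in for your project:

```
[![CI Status](https://github.com/$OWNER/$REPOSITORY/workflows/CI/badge.svg?branch=master)](https://github.com/$OWNER/$REPOSITORY/actions?query=workflow%3ACI+branch%3Amaster)
```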

I copy and paste that from an existing project when I need it in a new one. :shrug:

Require CI to merge PR?

However, that difference in how jobs show up to Github, the way each matrix job shows up separately now, has an even more negative impact on requiring CI success to merge a PR.

If you want to require that CI passes before merging a PR, you configure that at https://github.com/acct/project/settings/branches under “Branch protection rules”. When you click “Add Rule”, you can/must choose WHICH jobs are “required”.

For travis, that’d be those two “master” jobs, but for the new system, every matrix job shows up separately — in fact, if you’ve been messing with job names trying to get them right as I have, you’ll see any job name that was ever used in the last 7 days, and they don’t have the Github workflow name appended to them or anything (another reason to put the github workflow name in the job name?).

But the really problematic part is that if you edit your list of jobs in the matrix — adding or removing ruby versions as one does, or even just changing the name that shows up for a job — you have to go back to this screen to add or remove jobs as a “required status check”.

That seems really unworkable to me, I’m not sure how it hasn’t been a major problem already for users. It would be better if we could configure “all the checks in the WORKFLOW, whatever they may be”, or perhaps best of all if we could configure a check as required in the workflow YML file, the same place we’re defining it, just a required_before_merge key you could set to true or use a matrix context to define or whatever.

I’m currently not requiring status checks for merge on most of my projects (even though i did with travis), because I was finding it unmanageable to keep the job names sync’d, especially as I get used to Github Actions and kept tweaking things in a way that would change job names. So that’s a bit annoying.

fail-fast: false

By default, if one of the matrix jobs fails, Github Actions will cancel all remaining jobs, not bothering to run them at all. After all, you know the build is going to fail if one job fails; what do you need those others for?

Well, for my use case, it is pretty annoying to be told, say, “Job for ruby 2.7.0 failed, we can’t tell you whether the other ruby versions would have passed or failed or not” — the first thing I want to know is if it failed on all ruby versions or just 2.7.0, so now I’d have to spend extra time figuring that out manually? No thanks.

So I set `fail-fast: false` on all of my workflows, to disable this behavior.
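In the workflow YAML, that setting lives under strategy, as a sibling of the matrix (the ruby versions are just example values):

```yaml
    strategy:
      fail-fast: false          # run every matrix job to completion even if one fails
      matrix:
        ruby: [ '2.6', '2.7' ]
```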

Note that travis had a similar (opt-in) fast_finish feature, which worked subtly differently: Travis would report failure to Github on first failure (and notify, I think), but would actually keep running all jobs. So when I saw a failure, I could click through to ‘details’ to see which (eg) ruby versions passed, from the whole matrix. That did work for me, so I’d chosen to opt in to that travis feature. Unfortunately, the Github Actions subtle difference in effect makes it not desirable to me.

Note: You may see some people referencing a Github Actions continue-on-error feature. I found the docs confusing, but after experimentation what this really does is mark a job as successful even when it fails. It shows up in all GH UI as succeeded even when it failed; the only way to know it failed would be to click through to the actual build log to see the failure in the logged console. I think “continue on error” is a weird name for this; it is not useful to me with regard to fine-tuning fail-fast, or honestly in any other use case I can think of that I have.

Bundle cache?

bundle install can take 60+ seconds, and be a significant drag on your build (not to mention a lot of load on rubygems servers from all these builds). So when travis introduced a feature to cache: bundler: true, it was very popular.

True to form, Github Actions gives you a generic caching feature you can try to configure for your particular case (npm, bundler, whatever), instead of an out-of-the-box feature that will “just do the right thing for bundler, you figure it out”.

The ruby/setup-ruby third-party action has a built-in feature to cache bundler installs for you, but I found that it does not work right if you do not have a Gemfile.lock checked into the repo. (Ie, for most any gem, rather than app, project). It will end up re-using cached dependencies even if there are new releases of some of your dependencies, which is a big problem for how I use CI for a gem — I expect it to always be building with the latest releases of dependencies, so I can find out if one breaks the build. This may get fixed in the action.

If you have an app (rather than gem) with a Gemfile.lock checked into repo, the bundler-cache: true feature should be just fine.
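For that app case, enabling it is just one extra key on the setup-ruby step (sketch; the ruby version is an example):

```yaml
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: '2.7'
          bundler-cache: true   # runs bundle install for you, caching keyed off Gemfile.lock
```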

Otherwise, Github has some suggestions for using the generic cache feature for ruby bundler (search for “ruby – bundler” on this page) — but I actually don’t believe they will work right without a Gemfile.lock checked into the repo either.

Starting from that example, and using the restore-keys feature, I think it should be possible to design a use that works much like travis’s bundler cache did, and works fine without a checked-in Gemfile.lock. We’d want it to use a cache from the most recent previous (similar job), and then run bundle install anyway, and then cache the results again at the end always to be available for the next run.
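Here’s a hedged sketch of what I mean, using actions/cache with restore-keys — untested, and the key scheme (including github.run_id so the exact key never matches and a fresh cache gets saved every run) is my own assumption about how to approximate travis’s behavior:

```yaml
      - uses: actions/cache@v2
        with:
          path: vendor/bundle
          # exact key is unique per run, so a freshly updated cache is saved at the end...
          key: bundle-${{ runner.os }}-${{ matrix.ruby }}-${{ github.run_id }}
          # ...while restore falls back to the most recent cache matching this prefix
          restore-keys: |
            bundle-${{ runner.os }}-${{ matrix.ruby }}-
      - run: |
          bundle config path vendor/bundle
          bundle install   # still runs every time, picking up any new dependency releases
```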

But I haven’t had time to work that out, so for now my gem builds are simply not using bundler caching. (my gem builds tend to take around 60 seconds to do a bundle install, so that’s in every build now, could be worse).

update nov 27: The ruby/setup-ruby action should be fixed to properly cache-bust when you don’t have a Gemfile.lock checked in. If you are using a matrix for gemfile, as below, you must set the gemfile by setting the BUNDLE_GEMFILE env variable rather than the way we did it below, and there is a certain way Github Actions requires/provides you to do that, it’s not just export. See the issue in the ruby/setup-ruby project.

Notifications: Not great

Travis has really nice defaults for notifications: The person submitting the PR would get an email generally only on status changes (from pass to fail or fail to pass) rather than on every build. And travis would even figure out what email to send to based on what email you used in your git commits. (Originally perhaps a workaround to lack of Github API at travis’ origin, I found it a nice feature). And then travis has sophisticated notification customization available on a per-repo basis.

Github notifications are unfortunately much more basic and limited. The only notification settings available are for your entire account at https://github.com/settings/notifications, under “GitHub Actions”. So they apply to all github workflows in all projects; there are no workflow- or project-specific settings. You can set to receive notification via web push or email or both or neither. You can receive notifications for all builds or only failed builds. That’s it.

The author of a PR is the one who receives the notifications, same as in travis. You will get notifications for every single build, even repeated successes or failures in a series.

I’m not super happy with the notification options. I may end up just turning off Github Actions notifications entirely for my account.

Hypothetically, someone could probably write a custom Github action to give you notifications exactly how travis offered — after all, travis was using public GH API that should be available to any other author, and I think should be usable from within an action. But when I started to think through it, while it seemed an interesting project, I realized it was definitely beyond the “spare hobby time” I was inclined to give to it at present, especially not being much of a JS developer (the language of custom GH actions, generally). (While you can list third-party actions on the github “marketplace”, I don’t think there’s a way to charge for them.)

There are custom third-party actions available to do things like notify slack for build completion; I haven’t looked too much into any of them, beyond seeing that I didn’t see any that would be “like travis defaults”.

A more complicated gem: postgres, and Rails matrix

Let’s move to a different example workflow file, in a different gem. You can see I called this one ci.yml, matching its name: CI, to have less friction for a developer (including future me) trying to figure out what’s going on.

This gem does have rails as a dependency and does test against it, but isn’t actually a Rails engine as it happens. It also needs to test against Postgres, not just sqlite3.

Scheduled Builds

At one point travis introduced a feature for scheduling (eg) weekly builds even when no PR/commit had been made. I enthusiastically adopted this for my gem projects. Why?

Gem releases are meant to work on a variety of different ruby versions and different exact versions of dependencies (including Rails). Sometimes a new release of ruby or rails will break the build, and you want to know about that and fix it. With CI builds happening only on new code, you find out about this with some random new code that is unlikely to be related to the failure; and you only find out about it on the next “new code” that triggers a build after a dependency release, which for some mature and stable gems could be a long time after the actual dependency release that broke it.

So scheduled builds for gems! (I have no purpose for scheduled test runs on apps).

Github Actions does have this feature. Hooray. One problem is that you will receive no notification of the result of the scheduled build, success or failure. :( I suppose you could include a third-party action to notify a fixed email address or Slack or something else; I'm not sure how you’d configure that to apply only to the scheduled builds and not the commit/PR-triggered builds, if that’s what you wanted. (Or make a custom action to file a GH issue on failure??? But make sure it doesn’t spam you with issues on repeated failures.) I haven’t had the time to investigate this yet.
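For reference, the trigger section for a scheduled build looks something like this (the cron expression here is just an example; times are UTC):

```yaml
on:
  push:
  pull_request:
  schedule:
    # weekly, Mondays at 08:00 UTC, in addition to push/PR triggers
    - cron: "0 8 * * 1"
```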

Also, oops, just noticed this: “In a public repository, scheduled workflows are automatically disabled when no repository activity has occurred in 60 days.” Which poses some challenges for relying on scheduled builds to make sure a stable slow-moving gem isn’t broken by dependency updates. I definitely am a committer on gems that are still in wide use and can go 6-12+ months without a commit, because they are mature/done.

I still have it configured in my workflow; I guess even without notifications it will affect the “badge” on the README, and… maybe I’ll notice? Very far from ideal, work in progress. :(

Rails Matrix

OK, this one needs to test against various ruby versions AND various Rails versions. A while ago I realized that an actual matrix of every ruby combined with every rails was far too many builds. Fortunately, Github Actions supports the same kind of matrix/include syntax as travis, which I use.

    matrix:
      include:
        - gemfile: rails_5_0
          ruby: 2.4

        - gemfile: rails_6_0
          ruby: 2.7

I use the appraisal gem to handle setting up testing under multiple rails versions, which I highly recommend. You could use it for testing variant versions of any dependencies, I use it mostly for varying Rails. Appraisal results in a separate Gemfile committed to your repo for each (in my case) rails version, eg ./gemfiles/rails_5_0.gemfile. So those values I use for my gemfile matrix key are actually portions of the Gemfile path I’m going to want to use for each job.
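For reference, the Appraisals file driving this setup looks something like the following (the version constraints here are illustrative); running `bundle exec appraisal generate` writes out the ./gemfiles/*.gemfile files:

```ruby
# ./Appraisals -- each appraise block becomes a generated gemfile,
# e.g. ./gemfiles/rails_5_0.gemfile (version constraints illustrative)
appraise "rails_5_0" do
  gem "rails", "~> 5.0.0"
end

appraise "rails_6_0" do
  gem "rails", "~> 6.0.0"
end
```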

Then we just need to tell bundler, in a given matrix job, to use the gemfile we specified in the matrix. The old-school way to do this is with the BUNDLE_GEMFILE environment variable, but I found it error-prone to make sure it stayed consistently set in every workflow step. I found that the newer (although not that new!) bundle config set gemfile worked swimmingly! I just set it before the bundle install, and it stays set for the rest of the run, including the actual test run.

steps:
    # [...]
    - name: Bundle install
      run: |
        bundle config set gemfile "${GITHUB_WORKSPACE}/gemfiles/${{ matrix.gemfile }}.gemfile"
        bundle install --jobs 4 --retry 3

Note that single braces are used for ordinary bash syntax to reference the ENV variable ${GITHUB_WORKSPACE}, but double braces for the github actions context value interpolation ${{ matrix.gemfile }}.

Works great! Oh, note how we set the name of the job to include both ruby and rails matrix values, important for it showing up legibly in the Github UI: name: ${{ matrix.gemfile }}, ruby ${{ matrix.ruby }}. Because of how we constructed our gemfile matrix, that shows up with job names like rails_5_0, ruby 2.4.
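Putting those pieces together, the top of the job ends up looking roughly like this (the job id, runner, and the fail-fast setting are my own choices, not anything the matrix requires — fail-fast: false just keeps one failed matrix job from cancelling the rest):

```yaml
jobs:
  tests:
    name: ${{ matrix.gemfile }}, ruby ${{ matrix.ruby }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          - gemfile: rails_5_0
            ruby: 2.4
          - gemfile: rails_6_0
            ruby: 2.7
```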

Still not using bundler caching in this workflow. As before, we’re concerned about the ruby/setup-ruby built-in bundler-cache feature not working as desired without a Gemfile.lock in the repo. This time, I’m also not sure how to get that feature to play nicely with the variant gemfiles and bundle config set gemfile. Github Actions makes you put a lot more of the pieces together yourself compared to travis; there are still things I've just postponed figuring out for now.

update jan 11: the ruby/setup-ruby action now includes a gemfile matrix example in its README. https://github.com/ruby/setup-ruby#matrix-of-gemfiles It does require you to use the BUNDLE_GEMFILE env variable, rather than the bundle config set gemfile command I used here. This should ordinarily be fine, but is something to watch out for in case other instructions you are following try to use bundle config set gemfile instead, whether for good reasons or not.

Postgres

This project needs to build against a real postgres. That is relatively easy to set up in Github Actions.

Postgres normally by default allows connections on localhost without a username/password set, and my past builds (in travis or locally) took advantage of this to not bother setting one, which then the app didn’t have to know about. But the postgres image used for Github Actions doesn’t allow this, you have to set a username/password. So the section of the workflow that sets up postgres looks like:

jobs:
  tests:
    services:
      db:
        image: postgres:9.4
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
        ports: ['5432:5432']

5432 is the default postgres port, we need to set it and map it so it will be available as expected. Note you also can specify whatever version of postgres you want, this one is intentionally testing on one a bit old.

OK, now our Rails app that will be executed under rspec needs to know that username and password to use in its postgres connection, whereas before it connected without a username/password. That env under the postgres service image is not actually available to the job steps. I didn’t find any way to DRY the username/password into one place, I had to repeat it in another env block, which I put at the top level of the workflow so it would apply to all steps.
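So the workflow ends up repeating the credentials in a top-level env block, something like this (the variable names are my own choice, not anything standardized):

```yaml
env:
  DATABASE_USERNAME: postgres
  DATABASE_PASSWORD: postgres
```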

And then I had to alter my database.yml to use those ENV variables in the test environment. On a local dev machine, if your postgres doesn’t have a username/password requirement and you don’t set the ENV variables, it keeps working as before.

I also needed to add host: localhost to the database.yml; before, the absence of the host key meant it used a unix-domain socket (filesystem-located) to connect to postgres, but that won’t work in the Github Actions containerized environment.
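My database.yml test section ended up looking roughly like this (the database name and ENV variable names are illustrative; with the ENV variables unset, it still connects without credentials as before):

```yaml
test:
  adapter: postgresql
  host: localhost
  database: myapp_test
  username: <%= ENV["DATABASE_USERNAME"] %>
  password: <%= ENV["DATABASE_PASSWORD"] %>
```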

Note, there are things you might see in other examples that I don’t believe you need:

  • No need for an apt-get of pg dev libraries. I think everything you need is on the default GH Actions images now.
  • Some examples I’ve seen do a thing with options: --health-cmd pg_isready, my builds seem to be working just fine without it, and less code is less code to maintain.

allow_failures

In travis, I took advantage of the travis allow_failures key in most of my gems.

Why? I am testing against various ruby and Rails versions; I want to test against *future* (pre-release, edge) ruby and rails versions too, because it's useful to know if I'm already passing on them with no effort, and I’d like to keep passing on them — but I don’t want to mandate it, or prevent PR merges if the build fails on a pre-release dependency. (After all, it could very well be a bug in the dependency too!)

There is no great equivalent to allow_failures in Github Actions. (Note again, continue-on-error just makes failed jobs look identical to successful jobs, and isn’t very helpful here).

I investigated some alternatives, which I may go into more detail on in a future post, but on one project I am trying a separate workflow just for “future ruby/rails allowed failures”, which only checks master commits (not PRs), and has a separate badge on the README (which is actually pretty nice for advertising to potential users “Yeah, we ALREADY work on rails edge/6.1.rc1!”). The main downside there is having to keep two copies of what’s really the same workflow synchronized across two files.
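The trigger section of that separate workflow simply omits pull_request, something like this (file name and branch are illustrative):

```yaml
# .github/workflows/future_versions.yml (file name illustrative)
name: Future ruby/rails
on:
  push:
    branches: [ master ]
```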

A Rails app

More of the projects I’m a committer on are gems, but I spend more of my time on apps, one app in particular.

So here’s an example Github Actions CI workflow for a Rails app.

It mostly remixes the features we’ve already seen. It doesn’t need any matrix. It does need a postgres.

It does need some “OS-level” dependencies — the app does some shell-out to media utilities like vips and ffmpeg, and there are integration tests that utilize this. Easy enough to just install those with apt-get, works swimmingly.

        - name: Install apt dependencies
          run: |
            sudo apt-get -y install libvips-tools ffmpeg mediainfo

Update 25 Nov: My apt-get line that had worked for a couple weeks started failing for some reason while trying to install a libpulse0 dependency of one of those packages; the solution was doing a sudo apt-get update before the sudo apt-get install. I guess this is always good practice? (The forum post where I found the fix uses apt install and apt update instead of apt-get install and apt-get update; I can’t tell you much about that difference, I’m really not a linux admin).

In addition to the bundle install, a modern Rails app using webpacker needs yarn install. This just worked for me — no need to include lines for installing npm itself or yarn or any yarn dependencies, although some examples I find online have them. (My yarn installs seem to happen in ~20 seconds, so I’m not motivated to try to figure out caching for yarn).

And we need to create the test database in the postgres, which I do with RAILS_ENV=test bundle exec rails db:create — typical Rails test setup will then automatically run migrations if needed. There might be other (better?) ways to prep the database, but I was having trouble getting rake db:prepare to work, and didn’t spend the time to debug it, just went with something that worked.

    - name: Set up app
      run: |
        RAILS_ENV=test bundle exec rails db:create
        yarn install

I think this worked on its own because typical Rails test setup ends up running migrations automatically, but you could also throw in a RAILS_ENV=test bundle exec rake db:schema:load if you wanted.

Under travis I had to install chrome with addons: chrome: stable to have it available to use with capybara via the webdrivers gem. No need for installing chrome in Github Actions, some (recent-ish?) version of it is already there as part of the standard Github Actions build image.

In this workflow, you can also see a custom use of the github “cache” action to cache a Solr install that the test setup automatically downloads and sets up. In this case the cache doesn’t actually save us any build time, but is kinder on the apache foundation servers we are downloading from with every build otherwise (and have gotten throttled from in the past).

Conclusion

Github Actions is a really impressively powerful product. And it’s totally going to work to replace travis for me.

It’s also probably going to take more of my time to maintain. The trade-off of more power/flexibility and focusing on almost limitless use cases is that there are more things the individual project has to get right for its use case. For instance, figuring out the right configuration to get caching for bundler or yarn right, instead of just writing cache: { yarn: true, bundler: true }. And when you have to figure it out yourself, you can get it wrong, which when you are working on many projects at once means you have a bunch of places to fix.

The amazingness of third-party action “marketplace” means you have to figure out the right action to use (the third-party ruby/setup-ruby instead of the vendor’s actions/setup-ruby), and again if you change your mind about that you have a bunch of projects to update.

Anyway, it is what it is — and I’m grateful to have such a powerful and in fact relatively easy to use service available for free! I could not really live without CI anymore, and won’t have to!

Oh, and Github Actions is giving me way more (free) simultaneous parallel workers than travis ever did, for my many-job builds!

Unexpected performance characteristics when exploring migrating a Rails app to Heroku

I work at a small non-profit research institute. I work on a Rails app that is a “digital collections” or “digital asset management” app. Basically it manages and provides access (public as well as internal) to lots of files and description about those files, mostly images.

It’s currently deployed on some self-managed Amazon EC2 instances (one for web, one for bg workers, one on which postgres is installed, etc). It gets pretty low traffic in general web/ecommerce/Rails terms. The app is definitely not very optimized — we know it’s kind of a RAM hog, and we know it has many actions whose response time is undesirable. But it works “good enough” on its current infrastructure for current use, such that optimizing it hasn’t been the highest priority.

We are considering moving it from self-managed EC2 to heroku, largely because we don’t really have the capacity to manage the infrastructure we currently have, especially after some recent layoffs.

Our Rails app is currently served by passenger on an EC2 t2.medium (4G of RAM).

I expected the performance characteristics moving to heroku “standard” dynos would be about the same as they are on our current infrastructure. But was surprised to see some degradation:

  • Responses seem much slower to come back when deployed, mainly for our slowest actions. Quick actions are just as quick on heroku, but slower ones (or perhaps actions that involve more memory allocations?) are much slower on heroku.
  • The application instances seem to take more RAM running on heroku dynos than they do on our EC2 (this one in particular mystifies me).

I am curious if anyone with more heroku experience has any insight into what’s going on here. I know how to do profiling and performance optimization (I’m more comfortable with profiling CPU time with ruby-prof than I am with trying to profile memory allocations with say derailed_benchmarks). But it’s difficult work, and I wasn’t expecting to have to do more of it as part of a migration to heroku, when performance characteristics were acceptable on our current infrastructure.

Response Times (CPU)

Again, yep, I know these are fairly slow response times. But they are “good enough” on current infrastructure (EC2 t2.medium); I wasn’t expecting them to get worse on heroku (standard-1x dyno, backed by heroku pg standard-0).

Fast pages are about the same, but slow pages (that create a lot of objects in memory?) are a lot slower.

This is not load testing; I am not testing under high traffic or concurrent requests. This is just accessing demo versions of the app manually, one page at a time, to see response times when the app is only handling one response at a time. So it’s not about how many web workers are running or fit into RAM or anything; one is sufficient.

Each action below lists the allocations reported by Rails logging, then response time on the existing EC2 t2.medium vs. a Heroku standard-1x dyno:

  • Slow reporting page that does a few very expensive SQL queries, but they do not return a lot of objects (Allocations: 8704). EC2: ~3800ms; Heroku: ~3200ms (faster pg?)
  • Fast page with a few AR/SQL queries returning just a few objects each, a few partials, etc. (Allocations: 8205). EC2: 81-120ms; Heroku: ~120ms
  • A fairly small “item” page (Allocations: 40210). EC2: ~200ms; Heroku: ~300ms
  • A medium size item page, loads a lot more AR models, has a larger byte size page response (Allocations: 361292). EC2: ~430ms; Heroku: 600-700ms
  • One of our largest pages, fetches a lot of AR instances, does a lot of allocations, returns a very large page response (Allocations: 1983733). EC2: 3000-4000ms; Heroku: 5000-7000ms

Fast-ish responses (and from this limited sample, actually responses with few allocations even if slow waiting on IO?) are about the same. But our slowest/highest allocating actions are ~50% slower on heroku? Again, I know these allocations and response times are not great even on our existing infrastructure; but why do they get so much worse on heroku? (No, there were no heroku memory errors or swapping happening).

RAM use of an app instance

We currently deploy with passenger (free), running 10 workers on our 4GB t2.medium.

To compare apples to apples, I deployed using passenger on a heroku standard-1x. Just one worker instance (because that’s actually all I can fit on a standard-1x!), to compare the size of a single worker from one infrastructure to the other.

On our legacy infrastructure, on a server that’s been up for 8 days of production traffic, passenger-status looks something like this:

  Requests in queue: 0
  * PID: 18187   Sessions: 0       Processed: 1074398   Uptime: 8d 23h 32m 12s
    CPU: 7%      Memory  : 340M    Last used: 1s
  * PID: 18206   Sessions: 0       Processed: 78200   Uptime: 8d 23h 32m 12s
    CPU: 0%      Memory  : 281M    Last used: 22s
  * PID: 18225   Sessions: 0       Processed: 2951    Uptime: 8d 23h 32m 12s
    CPU: 0%      Memory  : 197M    Last used: 8m 8
  * PID: 18244   Sessions: 0       Processed: 258     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 161M    Last used: 1h 2
  * PID: 18261   Sessions: 0       Processed: 127     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 158M    Last used: 1h 2
  * PID: 18278   Sessions: 0       Processed: 105     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 169M    Last used: 3h 2
  * PID: 18295   Sessions: 0       Processed: 96      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 163M    Last used: 3h 2
  * PID: 18312   Sessions: 0       Processed: 91      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 169M    Last used: 13h
  * PID: 18329   Sessions: 0       Processed: 92      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 163M    Last used: 13h
  * PID: 18346   Sessions: 0       Processed: 80      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 162M    Last used: 13h

We can see, yeah, this app is low traffic; most of those workers don’t see a lot of use. The first worker, which has handled by far the most traffic, has a Private RSS of 340M. (The other workers, having handled fewer requests, are much slimmer.) Kind of overweight, not sure where all that RAM is going, but it is what it is. I could maybe hope to barely fit 3 workers on a heroku standard-2x (1024MB) dyno, if these sizes were the same on Heroku.

This is after a week of production use — if I restart passenger on a staging server, and manually access some of my largest, hungriest, most-allocating pages a few times, I can only see Private RSS use of like 270MB.

However, on the heroku standard-1x, with one passenger worker, using the heroku log-runtime-metrics feature to look at memory… private RSS is, I believe, what should correspond to passenger’s report, and is what heroku uses for memory capacity limiting…

Immediately after restarting my app, it’s at sample#memory_total=184.57MB sample#memory_rss=126.04MB. After manually accessing a few of my “hungriest” actions, I see: sample#memory_total=511.36MB sample#memory_rss=453.24MB . Just a few manual requests not a week of production traffic, and 33% more RAM than on my legacy EC2 infrastructure after a week of production traffic. Actually approaching the limits of what can fit in a standard-1x (512MB) dyno as just one worker.

Now, are heroku’s memory measurements being done differently than passenger-status does them? Possibly. It would be nice to compare apples to apples, and passenger hypothetically has a service that would let you access passenger-status results from heroku… but unfortunately I have been unable to get it to work. (Ideas welcome.)

Other variations tried on heroku

Trying the gaffneyc/jemalloc heroku buildpack with heroku config:set JEMALLOC_ENABLED=true (still with passenger, one worker instance) doesn’t seem to have made any significant difference, maybe 5% RAM savings, or maybe it’s just a fluke.

Switching to puma (puma 5 with the experimental possibly memory-saving features turned on; just one worker with one thread) doesn’t make any difference in response time performance (none expected), but… maybe does reduce RAM usage somehow? After a few sample requests of some of my hungriest pages, I see sample#memory_total=428.11MB sample#memory_rss=371.88MB, still more than my baseline, but not drastically so. (With or without the jemalloc buildpack seems to make no difference.) Odd.

So what should I conclude?

I know this app could use a fitness regime; but it performs acceptably on current infrastructure.

We are exploring heroku because of staffing capacity issues, hoping not to have to do so much ops. But if we trade ops for having to spend much time on challenging (not really suitable for a junior dev) performance optimization…. that’s not what we were hoping for!

But perhaps I don’t know what I’m doing, and this haphazard anecdotal comparison is not actually data and I shouldn’t conclude much from it? Let me know, ideally with advice on how to do it better?

Or… are there reasons to expect different performance characteristics from heroku? Might it be running on underlying AWS infrastructure that has fewer resources than my t2.medium?

Or, starting to make guess hypotheses: maybe the fact that heroku standard tier does not run on “dedicated” compute resources means I should expect a lot more variance compared to my own t2.medium, and as a result when deploying on heroku you need to optimize more (so the worst case of variance isn’t so bad) than when running on your own EC2? Maybe that’s just part of what you get with heroku: unless you pay for performance dynos, it is even more important to have a well-performing app? (Yeah, I know I could use more caching, but that of course brings its own complexities; I wasn’t expecting to have to add it as part of a heroku migration.)

Or… I find it odd that it seems like slower (or more allocating?) actions are the ones that are worse. Is there any reason that memory allocations would be even more expensive on a heroku standard dyno than on my own EC2 t2.medium?

And why would the app workers seem to use so much more RAM on heroku than on my own EC2 anyway?

Any feedback or ideas welcome!

faster_s3_url: Optimized S3 url generation in ruby

Subsequent to my previous investigation about S3 URL generation performance, I ended up writing a gem with optimized implementations of S3 URL generation.

github: faster_s3_url

It has no dependencies (not even aws-sdk). It can speed up both public and presigned URL generation by around an order of magnitude. In benchmarks on my 2015 MacBook compared to aws-sdk-s3: public URLs from 180 in 10ms to 2200 in 10ms; presigned URLs from 10 in 10ms to 300 in 10ms (!!).

While if you are only generating a couple S3 URLs at a time you probably wouldn’t notice aws-sdk-ruby’s poor performance, if you are generating even just hundreds at a time, and especially for presigned URLs, it can really make a difference.

faster_s3_url supports the full API for aws-sdk-s3 presigned URLs, including custom params like response_content_disposition. Its tests actually verify that results match what aws-sdk-s3 would generate.

For shrine users, faster_s3_url includes a Shrine storage sub-class that can be a drop-in replacement for Shrine::Storage::S3, so all your S3 URL generation via shrine uses the optimized implementation.

Key in giving me the confidence to think I could pull off an independent S3 presigned URL implementation was seeing WeTransfer’s wt_s3_signer gem be successful. wt_s3_signer makes some assumptions/restrictions to get even higher performance than faster_s3_url (two or three times as fast) — but the restrictions/assumptions and API required to get that performance weren’t suitable for my use cases, so I implemented my own.

Delete all S3 key versions with ruby AWS SDK v3

If your S3 bucket is versioned, then deleting an object from s3 will leave a previous version there, as a sort of undo history. You may have a “noncurrent expiration lifecycle policy” set which will delete the old versions after so many days, but within that window, they are there.

What if you were deleting something that accidentally included some kind of sensitive or confidential information, and you really want it gone?

To make matters worse, if your bucket is public, the version is public too, and can be requested by an unauthenticated user who has the URL including a versionId, a URL that looks something like: https://mybucket.s3.amazonaws.com/path/to/something.pdf?versionId=ZyxTgv_pQAtUS8QGBIlTY4eKmANAYwHT To be fair, it would be pretty hard to “guess” this versionId! But if it’s really sensitive info, that might not be good enough.

It was a bit tricky for me to figure out how to do this with the latest version of ruby SDK (as I write, “v3“, googling sometimes gave old versions).

It turns out you need to first retrieve a list of all versions with bucket.object_versions. With no arg, that will return ALL the versions in the bucket, which could be a lot to retrieve, and not what you want when focused on just deleting certain things.

If you wanted to delete all versions in a certain “directory”, that’s actually easiest of all:

s3_bucket.object_versions(prefix: "path/to/").batch_delete!

But what if you want to delete all versions from a specific key? As far as I can tell, this is trickier than it should be.

# danger! This may delete more than you wanted
s3_bucket.object_versions(prefix: "path/to/something.doc").batch_delete!

Because of how S3 “paths” (which are really just prefixes) work, that will ALSO delete all versions for path/to/something.doc2 or path/to/something.docdocdoc etc, for anything else with that as a prefix. There probably aren’t keys like that in your bucket, but that seems dangerously sloppy to assume, that’s how we get super weird bugs later.
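To make the danger concrete: S3 prefix filtering is plain string-prefix matching, which you can sanity-check in plain ruby:

```ruby
key = "path/to/something.doc"

# Keys we did NOT intend to touch still match our key as a prefix:
["path/to/something.doc2", "path/to/something.docdocdoc"].each do |other_key|
  puts other_key.start_with?(key) # prints "true" both times
end
```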

I guess there’d be no better way than this?

key = "path/to/something.doc"
s3_bucket.object_versions(prefix: key).each do |object_version|
  object_version.delete if object_version.object_key == key
end

Is there anyone reading this who knows more about this than me, and can say if there’s a better way, or confirm if there isn’t?

Github Actions tutorial for ruby CI on Drifting Ruby

I’ve been using travis for free automated testing (“continuous integration”, CI) on my open source projects for a long time. It works pretty well. But it’s got some little annoyances here and there, including with github integration, that I don’t really expect to get fixed after its acquisition by private equity. They also seem to have cut off actual support channels (other than ‘forums’) for free use; I used to get really good (if not rapid) support when having troubles, now I kinda feel on my own.

So after hearing about the pretty flexible and powerful newish Github Actions feature, I was interested in considering that as an alternative. It looks like it should be free for public/open source projects on github. And will presumably have good integration with the rest of github and few kinks. Yeah, this is an example of how a platform getting an advantage starting out by having good third-party integration can gradually come to absorb all of that functionality itself; but I just need something that works (and, well, is free for open source), I don’t want to spend a lot of time on CI, I just want it to work and get out of the way. (And Github clearly introduced this feature to try to avoid being overtaken by Gitlab, which had integrated flexible CI/CD).

So anyway. I was interested in it, but having a lot of trouble figuring out how to set it up. Github Actions is a very flexible tool, a whole platform really, which you can use to set up almost any kind of automated task you want, in many different languages. Which made it hard for me to figure out “Okay, I just want tests to run on all PR commits, and report back to the PR if it’s mergeable”.

And it was really hard to figure this out from the docs, it’s such a flexible abstract tool. And I have found it hard to find third party write-ups and tutorials and blogs and such — in part because Github Actions was in beta development for so long, that some of the write-ups I did find were out of date.

Fortunately Drifting Ruby has provided a great tutorial, which gets you started with basic ruby CI testing. It looks pretty straightforward to, for instance, figure out how to swap in rspec for rake test. And I always find it easier to google for solutions to additional fancy things I want to do, finding results either in official docs or third-party blogs, when I have the basic skeleton in place.

I hope to find time to experiment with Github Actions in the future. I am writing this blog post in part to log for myself the Drifting Ruby episode so I don’t lose it! The show summary has this super useful template:

.github/workflows/main.yml

name: CI
on:
  push:
    branches: [ master, develop ]
  pull_request:
    branches: [ master, develop ]
jobs:
  test:
    # services:
    #   db:
    #     image: postgres:11
    #     ports: ['5432:5432']
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Ruby
        uses: ruby/setup-ruby@v1.45.0
        with:
          ruby-version: 2.7.1
      - uses: Borales/actions-yarn@v2.3.0
        with:
          cmd: install
      - name: Install Dependencies
        run: |
          # sudo apt install -yqq libpq-dev
          gem install bundler
      - name: Install Gems
        run: |
          bundle install
      - name: Prepare Database
        run: |
          bundle exec rails db:prepare
      - name: Run Tests
        # env:
        #   DATABASE_URL: postgres://postgres:@localhost:5432/databasename
        #   RAILS_MASTER_KEY: ${{secrets.RAILS_MASTER_KEY}}
        run: |
          bundle exec rails test
      - name: Create Coverage Artifact
        uses: actions/upload-artifact@v2
        with:
          name: code-coverage
          path: coverage/
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Ruby
        uses: ruby/setup-ruby@v1.45.0
        with:
          ruby-version: 2.7.1
      - name: Install Brakeman
        run: |
          gem install brakeman
      - name: Run Brakeman
        run: |
          brakeman -f json > tmp/brakeman.json || exit 0
      - name: Brakeman Report
        uses: devmasx/brakeman-linter-action@v1.0.0
        env:
          REPORT_PATH: tmp/brakeman.json
          GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}

More benchmarking optimized S3 presigned_url generation

In a recent post, I explored profiling and optimizing S3 presigned_url generation in ruby to be much faster. In that post, I got down to using a Aws::Sigv4::Signer instance from the AWS SDK, but wondered if there was a bunch more optimization to be done within that black box.

Julik posted a comment on that post, letting us know that they at WeTransfer have already spent some time investigating and creating an optimized S3 URL signer, at https://github.com/WeTransfer/wt_s3_signer . Nice! It is designed for somewhat different use cases than mine, but is still useful — and can be benchmarked against the other techniques.

Some things to note:

  • wt_s3_signer does not presently do URI escaping; it may or may not be noticeably slower if it did; it will not generate working URLs if your S3 keys include any characters that need to be escaped in the URI
  • wt_s3_signer gets ultimate optimizations by having you re-use a signer object, that has a fixed/common datestamp and expiration and other times. That doesn’t necessarily fit into the APIs I want to fit it into — but can we still get performance advantage by re-creating the object each time with those things? (Answer, yes, although not quite as much. )
  • wt_s3_signer does not let you supply additional query parameters, such as response_content_disposition or response_content_type. I actually need this feature; and need it to be different for each presigned_url created even in a batch.
  • wt_s3_signer’s convenience for_s3_bucket constructor does do at least one network request to S3… to look up the appropriate AWS region I guess? That makes it far too expensive to re-use for_s3_bucket convenience constructor once-per-url, but I don’t want to do this anyway, I’d rather just pass in the known region, as well as the known bucket base URL, etc. Fortunately the code is already factored well to give us a many-argument plain constructor where we can just pass that all in, with no network lookups triggered.
  • wt_s3_signer insists on looking up AWS credentials from the standard locations; there’s no way to actually pass in an access_key_id and secret_access_key explicitly, which is a problem for some use cases where an app needs to use more than one set of credentials.

Benchmarks

So the benchmarks! This time I switched to benchmark-ips, cause, well, it’s just better. I am benchmarking 1200 URL generations again.

I am benchmarking re-using a single WT::S3Signer object for all 1200 URLs, as the gem intends. But also compared to instantiating a WT::S3Signer for each URL generation — using WT::S3Signer.new, not WT::S3Signer.for_s3_bucket — the for_s3_bucket version cannot be used instantiated once per URL generation without crazy bad performance (I did try, although it’s not included in these benchmarks).

I include all the presigned_url techniques I demo’d in the last post, but for clarity took any public url techniques out.

Calculating -------------------------------------
   sdk presigned_url      1.291  (± 0.0%) i/s -      7.000  in   5.459268s
use inline instantiated Aws::Sigv4::Signer directly for presigned url (with escaping)
                          4.950  (± 0.0%) i/s -     25.000  in   5.082505s
Re-use Aws::Sigv4::Signer for presigned url (with escaping)
                          5.458  (±18.3%) i/s -     27.000  in   5.037205s
Re-use Aws::Sigv4::Signer for presigned url (without escaping)
                          5.751  (±17.4%) i/s -     29.000  in   5.087846s
wt_s3_signer re-used     45.925  (±15.2%) i/s -    228.000  in   5.068666s
wt_s3_signer instantiated each time
                         15.924  (±18.8%) i/s -     75.000  in   5.016276s

Comparison:
wt_s3_signer re-used:       45.9 i/s
wt_s3_signer instantiated each time:       15.9 i/s - 2.88x  (± 0.00) slower
Re-use Aws::Sigv4::Signer for presigned url (without escaping):        5.8 i/s - 7.99x  (± 0.00) slower
Re-use Aws::Sigv4::Signer for presigned url (with escaping):        5.5 i/s - 8.41x  (± 0.00) slower
use inline instantiated Aws::Sigv4::Signer directly for presigned url (with escaping):        5.0 i/s - 9.28x  (± 0.00) slower
   sdk presigned_url:        1.3 i/s - 35.58x  (± 0.00) slower

Wow! Re-using a single WT::S3Signer, as the gem intends, is a LOT LOT faster than anything else — 35x faster than the built-in AWS SDK presigned_url method!

But even instantiating a new WT::S3Signer for each URL — while significantly slower than re-use — is still significantly faster than any of the methods using an AWS SDK Aws::Sigv4::Signer directly, and still a lot lot faster than the AWS SDK presigned_url method.

So this has promise; even if you don’t re-use the signer object, it’s better than any other option. I may try to PR and/or fork to get some of the features I’d need in there… although the license is problematic for many projects I work on. With the benchmarking showing the value of this approach, I could also just try to reimplement from scratch based on the Amazon instructions/example code that wt_s3_signer itself used, and/or the ruby AWS SDK implementation.

Delivery patterns for non-public resources hosted on S3

I work at the Science History Institute on our Digital Collections app (written in Rails), which is kind of a “digital asset management” app combined with a public catalog of our collection.

We store many high-resolution TIFF images that can be 100MB+ each, as well as, currently, a handful of PDFs and audio files. We have around 31,000 digital assets, which make up about 1.8TB. In addition to originals, we have “derivatives” for each file (JPG conversions of a TIFF original at various sizes; MP3 conversions of FLAC originals; etc) — around 295,000 derivatives (~10 per original) taking up around 205GB. Not a huge amount of data compared to some, but big enough to be something to deal with, and we expect it could grow by an order of magnitude in the next couple years.

We store them all — originals and derivatives — in S3, which generally works great.

We currently store them all in public S3 buckets, and when we need an image thumb url for an <img src>, we embed a public S3 URL (as opposed to pre-signed URLs) right in our HTML source. Having the user-agent get the resources directly from S3 is awesome, because our app doesn’t have to worry about handling that portion of the “traffic”, something S3 is quite good at (and there are CDN options which work seamlessly with S3 etc; although our traffic is currently fairly low and we aren’t using a CDN).

But this approach stops working if some of your assets cannot be public, and need to be access-controlled with some kind of authorization. And we are about to start hosting a class of assets like that.

Another notable part of our app is that in its current design it can have a LOT of img src thumbs on a page. Maybe 600 small thumbs (one for each scanned page of a book), each of which might use an img srcset to deliver multiple resolutions. We use Javascript lazy load code so the browser doesn’t actually try to load all these img src unless they are put in the viewport, but it’s still a lot of URLs generated on the page, and potentially a lot of image loads. While this might be excessive and a design in need of improvement, a 10×10 grid of postage-stamp-sized thumbs on a page (each of which could use a srcset) does not seem unreasonable, right? There can be a lot of URLs on a page in an “asset management” type app; it’s how it is.

As I looked around for advice on this or analysis of the various options, I didn’t find much. So, in my usual verbose style, I’m going to outline my research and analysis of the various options here. None of the options are as magically painless as using public bucket public URL on S3, alas.

All public-read ACLs, Public URLs

What we’re doing now. The S3 bucket is set as public, all files have S3 public-read ACL set, and we use S3 “public” URLs as <img src> in our app. Which might look like https://my-bucket.s3.us-west-2.amazonaws.com/path/to/thumb.jpg .

For actual downloads, we might still use an S3 presigned URL, not for access control (the object is already public), but to specify a content-disposition response header for S3 to use on the fly.

Pro

  • URLs are persistent and stable and can be bookmarked, or indexed by search engines. (We really want our images in Google Images for visibility.) And since the URLs are permanent and good indefinitely, HTML that includes these URLs in source can itself be cached indefinitely (as long as you don’t move your stuff around in your S3 buckets).
  • S3 public URLs are much cheaper to generate than the cryptographically presigned URLs, so it’s less of a problem generating 1200+ of them in a page response. (And can be optimized an order of magnitude further beyond the ruby SDK implementation).
  • S3 can scale to handle a lot of traffic, and Cloudfront or another CDN can easily be used to scale further. Putting a CDN on top of a public bucket is trivial. Our Rails app is entirely uninvolved in delivering the actual images, so we don’t need to use precious Rails workers on delivering images.

Con

  • Some of our materials are still being worked on by staff, and haven’t actually been published yet. But they are still in S3 with a public-read ACL. They have hard-to-guess URLs that shouldn’t be referred to on any publicly viewable web page — but we know that shouldn’t be relied upon for anything truly confidential.
    • That has been an acceptable design so far, as none of these materials are truly confidential, even if not yet published to our site. But this is about to stop being acceptable as we include more truly confidential materials.

All protected ACL, REDIRECT to presigned URL

This is the approach Rails’ ActiveStorage takes in its standard setup/easy path. It assumes all resources will be stored to S3 without public ACL; a random user can’t access them via S3 without a time-limited presigned URL being supplied by the app.

ActiveStorage’s standard implementation will give you a URL to your Rails app itself when you ask for a URL for an S3-stored asset — a rails URL is what might be in your <img src> urls. That Rails URL will redirect to a unique temporary S3 presigned URL that allows access to the non-public resource.
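As a sketch of the shape of this pattern (not ActiveStorage’s actual code; the authorization check and presigned-URL generation are hypothetical stand-ins, stubbed so the example is self-contained):

```ruby
# A bare Rack-style endpoint illustrating the redirect pattern. In a real
# app, the stubs below would consult the logged-in user and the AWS SDK.

def authorized?(env, asset_id)
  # stand-in authorization check: real logic would compare the current
  # user/session against the asset's access rules
  !env["rack.session"].nil?
end

def presigned_s3_url_for(asset_id)
  # stand-in: real logic would produce a time-limited presigned S3 URL
  "https://my-bucket.s3.amazonaws.com/#{asset_id}?X-Amz-Signature=stub"
end

# The stable URL your HTML embeds points here; the volatile presigned URL
# only ever appears in the 302 response, never in cacheable HTML.
FILE_REDIRECT = lambda do |env|
  asset_id = env["PATH_INFO"].delete_prefix("/files/")
  if authorized?(env, asset_id)
    [302, { "location" => presigned_s3_url_for(asset_id) }, []]
  else
    [403, {}, ["forbidden"]]
  end
end
```

The point of the lambda shape is just to show where the per-request authorization decision happens before any presigned URL is minted.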

Pro

  • This pattern allows your app to decide, based on current request/logged-in-user and asset, whether to grant access, on a case by case basis. (Although it’s not clear to me where the hooks are in ActiveStorage for this; I don’t actually use ActiveStorage, and it’s easy enough to implement this pattern generally, with authorization logic).
  • S3 is still delivering assets directly to your users, so scaling issues are still between S3 and the requestor, and your app doesn’t have to get involved.
  • The URLs that show up in your delivered HTML pages, say as <img src> or <a href> URLs — point at your app, and are still persistent and indefinitely valid — so the HTML is still indefinitely cacheable by any HTTP cache. They will redirect to a unique-per-user and temporary presigned URL, but that’s not what’s in the HTML source.
    • You can even move your images around (to different buckets/locations or entirely different services) without invalidating the cache of the HTML; the URLs in your cached HTML don’t change, only where they redirect to does. (This may be ActiveStorage’s motivation for this design?)

Cons

  • Might this interfere with Google Images indexing? While it’s hard (for me) to predict what might affect Google Images indexing, my own current site’s experience seems to say it’s actually fine. Google is willing to index an image “at” a URL that actually HTTP 302 redirects to a presigned S3 URL. Even though on every access the redirect will be to a different URL, Google doesn’t seem to think this is fishy. Seems to be fine.
  • Makes figuring out how to put a CDN in the mix more of a chore. You can’t just put one in front of your S3 bucket, since you only want to CDN/cache public URLs; you may need more sophisticated CDN features, setup, or choices.
  • The asset responses themselves, at presigned URLs, are not cacheable by an HTTP cache, either browser caching or intermediate. (Or at least not for more than a week, the maximum expiry of presigned urls).
  • This is the big one. Let’s say you have 40 <img src> thumbnails on a page, and use this method. Every browser page load will result in an additional 40 requests to your app. This potentially requires you to scale your app much larger to handle the same number of actual page requests, because each page request now triggers (eg) 40 more.
    • This has been reported as an actual problem by Rails ActiveStorage users. An app can suddenly handle far less traffic because it’s spending so much time doing all these redirects.
    • Therefore, ActiveStorage users/developers then tried to figure out how to get ActiveStorage to instead use the “All public-read ACLs, Public URLs delivered directly” model we listed above. It is now possible to do that with ActiveStorage (some answers in that StackOverflow), which is great, because it’s a great model when all your assets can be publicly available… but that was already easy enough to do without AS; we’re here because that’s not my situation and I need something else!
    • On another platform that isn’t Rails, the performance concerns might be less, but Rails can be, well, slow here. In my app, a response that does nothing but redirect to https://example.com can still take 100ms to return! I think an out-of-the-box Rails app would be a bit faster, I’m not sure what is making mine so slow. But even at 50ms, an extra (eg) 40x50ms == 2000ms of worker time for every page delivery is a price to pay.
    • In my app, where many pages may actually have not 40 but 600+ thumbs on them… this can be really bad. Even if JS lazy-loading is used, it just seems like asking for trouble.

All protected ACL, PROXY to presigned URL

Okay, just like above, but the app action, instead of redirecting to S3… actually reads the bytes from S3 on the back-end, and delivers them to the user-agent directly, as a sort of proxy.

The pros/cons are pretty similar to redirect solution, but mostly with a lot of extra cons….

Extra Pro

  • I guess it’s an extra pro that the fact it’s on S3 is completely invisible to the user-agent, so it can’t possibly mess up Google Images indexing or anything like that.

Extra Cons

  • If you were worried about the scaling implications of tying up extra app workers with the redirect solution, this is so much worse, as app workers are now tied up for as long as it takes to proxy all those bytes from S3 (hopefully the nginx or passenger you have in front of your web app means you aren’t worried about slow clients, but that byte shuffling from S3 will still add up).
  • For very large assets, such as I have, this is likely incompatible with a heroku deploy, because of heroku’s 30s request timeout.

One reason I mention this option is that I believe it is basically what a hyrax app (some shared code used in our business domain) does. Hyrax isn’t necessarily using S3, but I believe it does have the Rails app involved in proxying and delivering bytes for all files (including derivatives), including for <img src>. So that approach is working for them well enough that maybe it shouldn’t be totally dismissed. But it doesn’t seem right to me — I really liked the much better scaling curve of our app when we moved it away from sufia (a hyrax predecessor) and got it to stop proxying bytes like this. Plus I think this is probably a barrier to deploying hyrax apps to heroku, and we are interested in investigating heroku with our app.

All protected ACL, have nginx proxy to presigned URL?

OK, like the above “proxy” solution, but with a twist. A Rails app is not the right technology for proxying lots of bytes.

But nginx is; that’s honestly its core use case, it’s literally built for proxying, right? It should be able to handle lots of them concurrently with reasonable CPU/memory resources. If we can get nginx doing the proxying, we don’t need to worry about tying up Rails workers doing it.

I got really excited about this for a second… but it’s kind of a confusing mess. What URLs are we actually delivering in <img src> in HTML source? If they are Rails app URLs that will then trigger an nginx proxy, using something like nginx x-accel but to a remote (presigned S3) URL instead of a local file, we have all the same downsides as the REDIRECT option above, without any real additional benefit (unless you REALLY want to hide that it’s from S3).

If instead we want to embed URLs in the HTML source that will end up being handled directly by nginx without touching the Rails app… it’s just really confusing to figure out how to set nginx up to proxy non-public content from S3. nginx has to be creating signed requests… but we also want to access-control it somehow; it should only be creating these when the app has given it permission on a per-request basis… there are a variety of nginx third-party modules that look like they could maybe be useful to put this together, some more maintained/documented than others… and it just gets really confusing.

PLUS if you want to deploy to heroku (which we are considering), this nginx still couldn’t be running on heroku, because of that 30s limit; it would have to be running on your own non-heroku host somewhere.

I think if I were a larger commercial company with a product involving lots and lots of user-submitted images that I needed to access control and wanted to store on S3…. I might do some more investigation down this path. But for my use case… I think this is just too complicated for us to maintain, if it can be made to work at all.

All Protected ACL, put presigned URLs in HTML source

Protect all your S3 assets with non-public ACLs, so they can only be accessed after your app decides the requester has privileges to see them, via a presigned URL. But instead of using a redirect or proxy, just generate presigned URLs and use them directly in <img src> for thumbs or <a href> for downloads etc.

Pro

  • We can control access at the app level
  • No extra requests for redirects or proxies, we aren’t requiring our app to have a lot more resources to handle an additional request per image thumb loaded.
  • Simple.

Con

  • HTML source now includes limited-time-expiring URLs in <img src> etc, so can’t be cached indefinitely, even for public pages. (Although can be cached for up to a week, the maximum expiry of S3 presigned URLs, which might be good enough).
  • Presigned S3 URLs are really expensive to generate. It’s actually infeasible to include hundreds of them on a page, can take almost 1ms per URL generated. This can be optimized somewhat with custom code, but still really expensive. This is the main blocker here I think, for what otherwise might be “simplest thing that will work”.

Different S3 ACLs for different resources

OK, so the “public bucket” approach I am using now will work fine for most of my assets. It is a minority that actually need to be access controlled.

While “access them all with presigned URLs so the app is the one deciding if a given request gets access” has a certain software engineering consistency appeal — the performance and convenience advantages of public_read S3 ACL are maybe too great to give up when 90%+ of my assets work fine with it.

Really, this whole long post is probably to convince myself that this needs to be done, because it seems like such a complicated mess… but it is, I think the lesser evil.

What makes this hard is that the management interface needs to let a manager CHANGE the public-readability status of an asset. And each of my assets might have 12 derivatives, so that’s 13 files to change, which can’t be done instantaneously if you wait for S3 to confirm, which probably means a background job. And you open yourself up to making a mistake and having a resource in the wrong state.

It might make sense to have an architecture that minimizes the number of times state has to be changed. All of our assets start out in a non-published draft state, then are later published; but for most of our resources destined for publication, it’s okay if they have public_read ACL in ‘draft’ state. Maybe there’s another flag for whether to really protect/restrict access securely, that can be set on ingest/creation only for the minority of assets that need it? So it only needs to be changed if a mistake were made, or a decision changed?

Changing “access state” on S3 could be done by one of two methods. You could have everything in the same bucket, and actually change the S3 ACL. Or you could have two separate buckets, one for public files and one for securely protected files. Then, changing the ‘state’ requires a move (copy then delete) of the file from one bucket to another. While the copy approach seems more painful, it has a lot of advantages: you can easily see if an object has the ‘right’ permissions by just seeing what bucket it is in (while using S3’s “block public access” features on the non-public bucket), making it easier to audit manually or automatically; and you can slap a CDN on top of the “public” bucket just as simply as ever, rather than having mixed public/nonpublic content in the same bucket.
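At the URL-generation layer, the two-bucket variant might look something like this (a sketch; every name here is hypothetical, and presigned_url_for stands in for whichever presigned-URL technique the app actually uses):

```ruby
PUBLIC_BUCKET    = "my-app-public"
PROTECTED_BUCKET = "my-app-protected"

# stand-in for a real presigned-URL implementation (AWS SDK, or an
# optimized signer), just so the sketch is self-contained
def presigned_url_for(bucket, key)
  "https://#{bucket}.s3.amazonaws.com/#{key}?X-Amz-Signature=stub"
end

# Which bucket an asset lives in *is* its access state, so URL
# generation just branches on the app's own flag for that state.
def delivery_url(asset)
  if asset.publicly_readable?
    # public bucket: cheap to generate, cacheable forever, CDN-friendly
    "https://#{PUBLIC_BUCKET}.s3.amazonaws.com/#{asset.s3_key}"
  else
    # protected bucket: needs a time-limited presigned URL
    presigned_url_for(PROTECTED_BUCKET, asset.s3_key)
  end
end
```

The branch also makes the auditability argument concrete: a public-looking URL for an asset whose flag says protected (or vice versa) is immediately detectable.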

Pro

  • The majority of our files that don’t need to be secured can still benefit from the convenience and performance advantages of public_read ACL.
  • Including still being able to use a straightforward CDN on top of the public bucket, and to HTTP cache-forever these files too.
  • Including no major additional load put on our app for serving the majority of assets that are public

Con

  • Additional complexity for app. It has to manage putting files in two different buckets with different ACLs, and generating URLs to the two classes differently.
  • Opportunity for bugs where an asset is in the ‘wrong’ bucket/ACL. Probably need a regular automated audit of some kind — making sure you didn’t leave behind a file in ‘public’ bucket that isn’t actually pointed to by the app is a pain to audit.
  • It is expensive to switch the access state of an asset. A book with 600 pages each with 12 derivatives, is over 7K files that need to have their ACLs changed and/or copied to another bucket if the visibility status changes.
  • If we try to minimize need to change ACL state, by leaving files destined to be public with public_read even before publication and having separate state for “really secure on S3” — this is a more confusing mental model for staff asset managers, with more opportunity for human error. Should think carefully of how this is exposed in staff UI.
  • For protected things on S3, you still need to use one of the above methods of giving users access, if any users are to be given access after an auth check.

I don’t love this solution, but this post is a bunch of words to basically convince myself that it is the lesser evil nonetheless.

Speeding up S3 URL generation in ruby

It looks like the AWS SDK is very slow at generating S3 URLs, both public and presigned, and that you can generate around an order of magnitude faster in both cases. This can matter if you are generating hundreds of S3 URLs at once.

My app

The app I work on is a “digital collections” or “digital asset management” app. It is about displaying lists of files, so it displays a LOT of thumbnails. The thumbnails are all stored in S3, and at present we generate URLs directly to S3 in src‘s on the page.

Some of our pages can have 600 thumbnails. (Say, a digitized medieval manuscript with 600 pages). Also, we use srcset to offer the browser two resolutions for each image, so that’s 1200 URLs.
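For concreteness, the srcset pattern in question is roughly this (a simplified sketch; the helper and its URL arguments are placeholders for whatever URL-generation strategy is in use):

```ruby
# Builds an <img> tag offering the browser 1x and 2x resolutions, which
# is why each thumb costs two generated URLs.
def thumb_img_tag(src_1x, src_2x, alt:)
  %Q{<img src="#{src_1x}" srcset="#{src_1x} 1x, #{src_2x} 2x" alt="#{alt}">}
end

thumb_img_tag(
  "https://my-bucket.s3.amazonaws.com/thumb.jpg",
  "https://my-bucket.s3.amazonaws.com/thumb_2x.jpg",
  alt: "page 1"
)
```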

Is this excessive, should we not put 600 URLs on a page? Maybe, although it’s what our app does at present. But 100 thumbnails on a page does not seem excessive; imagine a 10×10 grid of postage-stamp-sized thumbs, why not? And they each could have multiple URLs in a srcset.

It turns out that S3 URL generation can be slow enough to be a bottleneck with 1200 generations in a page, or in some cases even 100. But it can be optimized.

On Benchmarking

It’s hard to do benchmarking in a reliable way. I just used Benchmark.bmbm here; it is notable that on different runs of my comparisons, I could see results differ by 10-20%. But this should be sufficient for relative comparisons and basic orders of magnitude. Exact numbers will of course differ on different hardware/platform anyway. (benchmark-ips might possibly be a way to get somewhat more reliable results, but I didn’t remember it until I was well into this. There may be other options?).
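For reference, the harness shape is just stdlib Benchmark.bmbm, something like this (with throwaway string-building workloads standing in for the real URL generators):

```ruby
require 'benchmark'

BASE = "https://my-bucket.s3.amazonaws.com"

# bmbm runs a rehearsal pass before the measured pass, which helps
# reduce (but, as noted, doesn't eliminate) run-to-run variation
results = Benchmark.bmbm do |x|
  x.report("interpolate") { 1200.times { "#{BASE}/path/to/key.jpg" } }
  x.report("array join")  { 1200.times { [BASE, "path/to/key.jpg"].join("/") } }
end
```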

I ran benchmarks on my 2015 Macbook 2.9 GHz Dual-Core Intel Core i5.

I’m used to my MacBook being faster than our deployed app on an EC2 instance, but in this case running benchmarks on EC2 had very similar results. (Of course, EC2 instance CPU performance can be quite variable).

Public S3 URLs

A public S3 URL might look like https://bucket_name.s3.amazonaws.com/path/to/my/object.rb . Or it might have a custom domain name, possibly to a CDN. Pretty simple, right?

Using shrine, you might generate it like model.image_url(public: true). Which calls Aws::S3::Object#public_url . Other dependencies or your own code might call the AWS SDK method as well.

I had noticed in earlier profiling that generating S3 URLs seemed to be taking much longer than I expected, looking like a bottleneck for my app. We use shrine, but shrine doesn’t add much overhead here, it’s pretty much just calling out to the AWS SDK public_url or presigned_url methods.

It seems like generating these URLs should be very simple, right? Here’s a “naive” implementation based on a shrine UploadedFile argument. Obviously it would be easy to use a custom or CDN hostname in this implementation alternately.

def naive_public_url(shrine_file)
  "https://#{["#{shrine_file.storage.bucket.name}.s3.amazonaws.com", *shrine_file.storage.prefix, shrine_file.id].join('/')}"
end

naive_public_url(model.image)
#=> "https://somebucket.s3.amazonaws.com/path/to/image.jpg"

Benchmark generating 1200 URLs with naive implementation vs a straight call of S3 AWS SDK public_url…

original AWS SDK public_url implementation 0.053043 0.000275 0.053318 ( 0.053782)
naive implementation 0.004730 0.000016 0.004746 ( 0.004760)

53ms vs 5ms, it’s an order of magnitude slower indeed.

53ms is not peanuts when you are trying to keep a web response under 200ms, although it may not be terrible. But let’s see if we can figure out why it’s so slow anyway.

Examining with ruby-prof points to what we could see in the basic implementation in AWS SDK source code, no need to dig down the stack. The most expensive elements are the URI.parse and the URI-safe escaping. Are we missing anything from our naive implementation then?

Well, the URI.parse is just done to make sure we are operating only on the path portion of the URL. But I can’t figure out any way bucket.url would return anything but a hostname-only URL with an empty path anyway; all the examples in the docs are such. Maybe it could somehow include a path, but I can’t figure out any way the URL being parsed would have a ? query component or # fragment, and without those it’s safe to just append things without a parse. (Even without that assumption, there will be faster ways than a parse, which is quite slow!) Also, just calling bucket.url is a bit expensive, and can involve some live arn: lookups we won’t be using.
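A quick stdlib illustration of the relative cost (the exact ratio varies by machine, but parse-and-serialize is consistently the slower path):

```ruby
require 'benchmark'
require 'uri'

base = "https://my-bucket.s3.us-west-2.amazonaws.com"
n = 1200

# roughly what public_url does: parse the endpoint, then set the path
parse_time = Benchmark.realtime do
  n.times do
    uri = URI.parse(base)
    uri.path = "/some/key.jpg"
    uri.to_s
  end
end

# treating the endpoint as a string and appending
concat_time = Benchmark.realtime do
  n.times { "#{base}/some/key.jpg" }
end
```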

URI Escaping, the pit of confusing alternatives

What about escaping? Escaping can be such a confusing topic with S3, with different libraries at different times handling it differently/wrongly, that it might be sane to just never use any characters in an S3 key that need any escaping, maybe putting some validation on your setters to ensure this. And then you don’t need to take the performance hit of escaping.

But okay, maybe we really need/want escaping to ensure any valid S3 key is turned into a valid S3 URL. Can we do escaping more efficiently?

The original implementation splits the path on / and then runs each component through the SDK’s own Seahorse::Util.uri_escape(s). That method’s implementation uses CGI.escape, but then does two gsub‘s to alter the value somewhat, not being happy with CGI.escape. Those extra gsubs are more performance hit. I think we can use ERB::Util.url_encode instead of CGI.escape + gsubs to get the same behavior, which might get us some speed-up.
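A quick spot check of that swap on a sample key (this is not a proof of equivalence; the two differ on a handful of characters such as *, which CGI.escape leaves alone but ERB::Util.url_encode percent-encodes):

```ruby
require 'cgi'
require 'erb'

key = "dir with spaces/report (final)~v2.pdf"

# what the SDK's Seahorse::Util.uri_escape effectively does:
# CGI.escape, then patch up ' ' (escaped as '+') and '~' (over-escaped).
# (Seahorse applies this per path component after splitting on '/';
# here both calls also escape the '/', so the comparison still holds.)
seahorse_style = CGI.escape(key).gsub('+', '%20').gsub('%7E', '~')

# the proposed single-call replacement
erb_style = ERB::Util.url_encode(key)
```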

But we also seem to be escaping more than is necessary. For instance it will escape any ! in a key to %21, and it turns out this isn’t at all necessary; the URL resolves just fine without escaping this. If we escape only what is needed, can we go even faster?

I think what we actually need is what URI.escape does — and since URI.escape doesn’t escape /, we don’t need to split on / first, saving us even more time. Annoyingly, URI.escape is marked obsolete/deprecated! But its stdlib implementation is relatively simple pure ruby; it would be easy enough to copy it into our codebase.
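One way to do that copy without vendoring the whole method: URI.escape delegated to the stdlib's RFC 2396 parser, which is still present, so we can instantiate that parser directly (a sketch, assuming its default "unsafe" character set is the escaping we want):

```ruby
require 'uri'

# The parser URI.escape used under the hood; calling it directly avoids
# the deprecated (and, in ruby 3.0+, removed) URI.escape entry point.
S3_KEY_ESCAPER = URI::RFC2396_Parser.new

def escape_s3_key(key)
  # escapes spaces etc., but leaves `/` and `!` alone, so there is no
  # need to split the key on `/` first
  S3_KEY_ESCAPER.escape(key)
end

escape_s3_key("path/to/my file!.jpg")
#=> "path/to/my%20file!.jpg"
```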

Even faster? The somewhat maintenance-neglected but still working at present escape_utils gem has a C implementation of some escaping routines. It’s hard when many implementations aren’t clear on exactly what they are escaping, but I think the escape_uri (note i on the end not l) is doing the same thing as URI.escape. Alas, there seems to be no escape_utils implementation that corresponds to CGI.escape or ERB::Util.url_encode.

So now we have a bunch of possibilities, depending on if we are willing to change escaping semantics and/or use our naive implementation of hostname-supplying.

(times shown relative to the original implementation)

  • Original AWS SDK public_url: 100%
  • Optimized AWS SDK public_url (avoid the URI.parse, use ERB::Util.url_encode; should be functionally identical, same output, I think!): 60%
  • Naive implementation (no escaping of S3 key for URL at all): 7.5%
  • Naive + ERB::Util.url_encode (should be functionally identical escaping to the original implementation, ie over-escaping): 28%
  • Naive + URI.escape (we think sufficient escaping, can be done much faster): 15%
  • Naive + EscapeUtils.escape_uri (we think identical to URI.escape but a faster C implementation): 11%

We have a bunch of opportunities for much faster implementations, even with existing over-escaping implementation. Here’s the file I used to benchmark.

Presigned S3 URLs

A Presigned URL is used to give access to non-public content, and/or to specify response headers you’d like S3 to include with the response, such as Content-Disposition. Presigned S3 URLs all have an expiration (max one week), and involve a cryptographic signature.

I expect most people are using the AWS SDK for these, rather than reinvent an implementation of the cryptographic signing protocol.

And we’d certainly expect these to be slower than public URLs, because of the crypto signature involved. But can they be optimized? It looks like yes, again by about an order of magnitude.

Benchmarking with AWS SDK presigned_url, 1200 URL generations can take around 760-900ms. Wow, that’s a lot — this is definitely enough to matter, especially in a web app response you’d like to keep under 200ms, and this is likely to be a bottleneck.

We do expect the signing to take longer than a public url, but can we do better?

Look at what the SDK is doing, re-implement a quicker path

The presigned_url method just instantiates and calls out to an Aws::S3::Presigner. First idea: what if we create a single Aws::S3::Presigner and re-use it 1200 times, instead of instantiating it 1200 times, passing it the same args #presigned_url would? Tried that; it was only a minor performance improvement.

OK, let’s look at the Aws::S3::Presigner implementation. It’s got kind of a convoluted way of getting a URL: building a Seahorse::Client::Request, and then doing something weird with it… modifying it to not actually go to the network, but just act as if it had… returning headers and a signed URL, and then we throw out the headers and just use the signed URL… phew! Ultimately though it does the actual signing work with another object, an Aws::Sigv4::Signer.

What if we just instantiate one of these ourselves, with the same arguments the Presigner would use for our use cases, and then call presign_url on it with the same args the Presigner would? Let’s re-use a Signer object 1200 times instead of instantiating it each time, in case that matters.

We still need to create the public_url in order to sign it. Let’s use our replacement naive implementation with URI.escape escaping.

AWS_SIG4_SIGNER = Aws::Sigv4::Signer.new(
  service: 's3',
  region: AWS_CLIENT.config.region,
  credentials_provider: AWS_CLIENT.config.credentials,
  unsigned_headers: Aws::S3::Presigner::BLACKLISTED_HEADERS,
  uri_escape_path: false
)

def naive_with_uri_escape_escaping(shrine_file)
  # because URI.escape does NOT escape `/`, we don't need to split on it,
  # which is what actually saves us the time.
  path = URI.escape(shrine_file.id)
  "https://#{["#{shrine_file.storage.bucket.name}.s3.amazonaws.com", *shrine_file.storage.prefix, path].join('/')}"
end

# not yet handling custom query params eg for content-disposition
def direct_aws_sig4_signer(url)
  AWS_SIG4_SIGNER.presign_url(
    http_method: "GET",
    url: url,
    headers: {},
    body_digest: 'UNSIGNED-PAYLOAD',
    expires_in: 900, # seconds
    time: nil
  ).to_s
end
direct_aws_sig4_signer( naive_with_uri_escape_escaping( shrine_uploaded_file ) )
# => presigned S3 url

Yes, it’s much faster!

Bingo! Now I measure 1200 URLs in 170-220ms, around 25% of the time. Still too slow to want to do 1200 of them on a single page, and around 4x slower than SDK public_url.

Interestingly, while we expect the cryptographic signature to take some extra time… that seems to be at most 10% of the overhead that the logic to sign a URL was adding? We experimented with re-using an Aws::Sigv4::Signer vs instantiating one each time; and applying URI-escaping or not. These did make noticeable differences, but not astounding ones.

This optimized version would have to be enhanced to be able to handle additional query param options such as specified content-disposition, I optimistically hope that can be done without changing the performance characteristics much.

Could it be optimized even more, by profiling within the Aws::Sigv4::Signer implementation? Maybe, but it doesn’t really seem worth it — we are already introducing some fragility into our code by using lower-level APIs and hoping they will remain valid even if AWS changes some things in the future. I don’t really want to re-implement Aws::Sigv4::Signer, just glad to have it available as a tool I can use like this already.

The Numbers

The script I used to compare performance in different ways of creating presigned S3 URLs (with a couple public URLs for comparison) is available in a gist, and here is the output of one run:

                                                                                           user     system      total        real
sdk public_url                                                                         0.054114   0.000335   0.054449 (  0.054802)
naive S3 public url                                                                    0.004575   0.000009   0.004584 (  0.004582)
naive S3 public url with URI.escape                                                    0.009892   0.000090   0.009982 (  0.011209)
sdk presigned_url                                                                      0.756642   0.005855   0.762497 (  0.789622)
re-use instantiated SDK Presigner                                                      0.817595   0.005955   0.823550 (  0.859270)
use inline instantiated Aws::Sigv4::Signer directly for presigned url (with escaping)  0.216338   0.001941   0.218279 (  0.226991)
Re-use Aws::Sigv4::Signer for presigned url (with escaping)                            0.185855   0.001124   0.186979 (  0.188798)
Re-use Aws::Sigv4::Signer for presigned url (without escaping)                         0.178457   0.001049   0.179506 (  0.180920)
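For reference, the shape of that comparison harness, as a minimal self-contained sketch; the strategy bodies here are stand-ins for illustration only, while the real script in the gist benchmarks the actual SDK and Sigv4 code paths.

```ruby
require "benchmark"
require "uri"

# Minimal sketch of the benchmark harness shape: each URL-generation
# strategy runs N times inside a Benchmark report block. The strategies
# below are illustrative stand-ins, not the real signers.
N = 1200
STRATEGIES = {
  "naive string concat" => ->(id) { "https://somebucket.s3.amazonaws.com/#{id}" },
  "with URI escaping"   => ->(id) { "https://somebucket.s3.amazonaws.com/#{URI::DEFAULT_PARSER.escape(id)}" }
}

Benchmark.bm(25) do |x|
  STRATEGIES.each do |label, strategy|
    x.report(label) { N.times { |i| strategy.call("asset/#{i} name.jpg") } }
  end
end
```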

So what to do?

Possibly there are optimizations that would make sense in the AWS SDK gem itself? But it would actually take a lot more work to be sure what can be done without breaking some use cases.

I think there is no need to use URI.parse in public_url; the URLs can just be treated as strings and concatenated. But is there an edge case I'm missing?

Using a different URI escaping method definitely helps in public_url. But how many people other than me care about optimizing public_url? What escaping method is actually required/expected, and would changing it be a backwards-compatibility problem? And is it okay, maintenance-wise, for the S3 object to use a different escaping mechanism than the SDK's common Seahorse::Util.uri_escape workhorse, which might be used in places with different escaping requirements?
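To make the escaping difference concrete, here is a comparison using stdlib escapers. (Note URI.escape was removed in Ruby 3.0; URI::DEFAULT_PARSER.escape behaves the same way for our purposes here.)

```ruby
require "uri"
require "cgi"

path = "some prefix/file name.jpg"

# URI-style escaping leaves `/` alone, so the whole path can be escaped
# in one pass, without splitting into segments:
URI::DEFAULT_PARSER.escape(path)
# => "some%20prefix/file%20name.jpg"

# CGI-style escaping (closer in spirit to per-segment escaping like
# Seahorse::Util.uri_escape) escapes `/` too, and uses `+` for spaces,
# so the path has to be split and each segment escaped separately:
CGI.escape(path)
# => "some+prefix%2Ffile+name.jpg"
```

Skipping the split-escape-rejoin dance on every segment is where the one-pass approach saves its time.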

For presigned_urls, cutting out a lot of the wrapper code and using an Aws::Sigv4::Signer directly seems to have significant performance benefits. But what edge cases break there, do they matter, and can a regression be avoided with alternate code that is both performant and maintainable?

Figuring this all out would take a lot more research (and figuring out how to use the test suite for the ruby SDK more facilely than I currently can; it's a test suite for the whole SDK, and it's a bear to run the whole thing).

Although if any Amazon maintainers of the ruby SDK, or other experts in its internals, see this and have an opinion, I am curious as to their thoughts.

But I am a lot more confident that some of these optimizations will work fine for my use cases. One of the benefits of using shrine is that all of my code already accesses S3 URL generation via shrine API. So I could easily swap in a locally optimized version, either with a shrine plugin, or just a local sub-class of the shrine S3 storage class. So I may consider doing that.