I develop a “digital collections” or “asset management” app, which manages and makes digitized historical objects and their descriptions available to the public, from the collections here at the Science History Institute.
The app receives relatively low level of traffic (according to Google Analytics, around 25K pageviews a month), although we want it to be able to handle spikes without falling down. It is not the most performance-optimized app, it does have some relatively slow responses and can be RAM-hungry. But it works adequately on our current infrastructure: Web traffic is handled on a single AWS EC2 t2.medium instance, with 10 passenger processes (free version of passenger, so no multi-threading).
We are currently investigating the possibility of moving our infrastructure to heroku. After realizing that heroku standard dynos did not seem to have the performance characteristics I had expected, I decided to approach performance testing more methodically, to compare different heroku dyno formations to each other and to our current infrastructure. Our basic research question is probably What heroku formation do we need to have similar performance to our existing infrastructure?
I am not an expert at doing this — I did some research, read some blog posts, did some thinking, and embarked on this. I am going to lead you through how I approached this and what I found. Feedback or suggestions are welcome. The most surprising result I found was much poorer performance from heroku standard dynos than I expected, and specifically that standard dynos would not match performance of present infrastructure.
What URLs to use in test
Some older load-testing tools only support testing one URL over and over. I decided I wanted to test a larger sample list of URLs — to be a more “realistic” load, and also because repeatedly requesting only one URL might accidentally use caches in ways you aren’t expecting giving you unrepresentative results. (Our app does not currently use fragment caching, but caches you might not even be thinking about include postgres’s built-in automatic caches, or passenger’s automatic turbocache (which I don’t think we have turned on)).
My initial thought to get a list of such URLs from our already-in-production app from production logs, to get a sample of what real traffic looks like. There were a couple barriers for me to using production logs as URLs:
- Some of those URLs might require authentication, or be POST requests. The bulk of our app’s traffic is GET requests available without authentication, and I didn’t feel like the added complexity of setting up anything else in a load traffic was worthwhile.
- Our app on heroku isn’t fully functional yet. Without having connected it to a Solr or background job workers, only certain URLs are available.
In fact, a large portion of our traffic is an “item” or “work” detail page like this one. Additionally, those are the pages that can be the biggest performance challenge, since the current implementation includes a thumbnail for every scanned page or other image, so response time unfortunately scales with number of pages in an item.
So I decided a good list of URLs was simply a representative same of those “work detail” pages. In fact, rather than completely random sample, I took the 50 largest/slowest work pages, and then added in another 150 randomly chosen from our current ~8K pages. And gave them all a randomly shuffled order.
In our app, every time a browser requests a work detail page, the JS on that page makes an additional request for a JSON document that powers our page viewer. So for each of those 200 work detail pages, I added the JSON request URL, for a more “realistic” load, and 400 total URLs.
Performance: “base speed” vs “throughput under load”
Thinking about it, I realized there were two kinds of “performance” or “speed” to think about.
You might just have a really slow app, to exagerate let’s say typical responses are 5 seconds. That’s under low/no-traffic, a single browser is the only thing interacting with the app, it makes a single request, and has to wait 5 seconds for a response.
That number might be changed by optimizations or performance regressions in your code (including your dependencies). It might also be changed by moving or changing hardware or virtualization environment — including giving your database more CPU/RAM resources, etc.
But that number will not change by horizontally scaling your deployment — adding more puma or passenger processes or threads, scaling out hosts with a load balancer or heroku dynos. None of that will change this base speed because it’s just how long the app takes to prepare a response when not under load, how slow it is in a test only one web worker , where adding web workers won’t matter because they won’t be used.
Then there’s what happens to the app actually under load by multiple users at once. The base speed is kind of a lower bound on throughput under load — page response time is never going to get better than 5s for our hypothetical very slow app (without changing the underlying base speed). But it can get a lot worse if it’s hammered by traffic. This throughput under load can be effected not only by changing base speed, but also by various forms of horizontal scaling — how many puma or passenger processes you have with how many threads each, and how many CPUs they have access to, as well as number of heroku dynos or other hosts behind a load balancer.
(I had been thinking about this distinction already, but Nate Berkopec’s great blog post on scaling Rails apps gave me the “speed” vs “throughout” terminology to use).
For my condition, we are not changing the code at all. But we are changing the host architecture from a manual EC2 t2.medium to heroku dynos (of various possible types) in a way that could effect base speed, and we’re also changing our scaling architecture in a way that could change throughput under load on top of that — from one t2.medium with 10 passenger process to possibly multiple heroku dynos behind heroku’s load balancer, and also (for Reasons) switching from free passenger to trying puma with multiple threads per process. (we are running puma 5 with new experimental performance features turned on).
So we’ll want to get a sense of base speed of the various host choices, and also look at how throughput under load changes based on various choices.
Benchmarking tool: wrk
We’re going to use wrk.
There are LOTS of choices for HTTP benchmarking/load testing, with really varying complexity and from different eras of web history. I got a bit overwhelmed by it, but settled on wrk. Some other choices didn’t have all the features we need (some way to test a list of URLs, with at least some limited percentile distribution reporting). Others were much more flexible and complicated and I had trouble even figuring out how to use them!
wrk does need a custom lua script in order to handle a list of URLs. I found a nice script here, and modified it slightly to take filename from an ENV variable, and not randomly shuffle input list.
It’s a bit confusing understanding the meaning of “threads” vs “connections” in
wrk arguments. This blog post from appfolio clears it up a bit. I decided to leave threads set to 1, and vary connections for load — so
-c1 -t1 is a “one URL at a time” setting we can use to test “base speed”, and we can benchmark throughput under load by increasing connections.
We want to make sure we run the test for long enough to touch all 400 URLs in our list at least once, even in the slower setups, to have a good comparison — ideally it would be go through the list more than once, but for my own ergonomics I had to get through a lot of tests so ended up less tha ideal. (Should I have put fewer than 400 URLs in? Not sure).
Conclusions in advance
As benchmarking posts go (especially when I’m the one writing them), I’m about to drop a lot of words and data on you. So to maximize the audience that sees the conclusions (because they surprise me, and I want feedback/pushback on them), I’m going to give you some conclusions up front.
Our current infrastructure has web app on a single EC2 t2.medium, which is a burstable EC2 type — our relatively low-traffic app does not exhaust it’s burst credits. Measuring base speed (just one concurrent request at a time), we found that performance dynos seem to have about the CPU speed of a bursting t2.medium (just a hair slower).
But standard dynos are as a rule 2 to 3 times slower; additionally they are highly variable, and that variability can be over hours/days. A 3 minute period can have measured response times 2 or more times slower than another 3 minute period a couple hours later. But they seem to typically be 2-3x slower than our current infrastructure.
Under load, they scale about how you’d expect if you knew how many CPUs are present, no real surprises. Our existing t2.medium has two CPUs, so can handle 2 simultaneous requests as fast as 1, and after that degrades linearly.
A single performance-L ($500/month) has 4 CPUs (8 hyperthreads), so scales under load much better than our current infrastructure.
A single performance-M ($250/month) has only 1 CPU (!), so scales pretty terribly under load.
Testing scaling with 4 standard-2x’s ($200/month total), we see that it scales relatively evenly. Although lumpily because of variability, and it starts out so much worse performing that even as it scales “evenly” it’s still out-performed by all other arcchitectures. :( (At these relatively fast median response times you might say it’s still fast enough who cares, but in our fat tail of slower pages it gets more distressing).
Let’s compare base speed
OK, let’s get to actual measurements! For “base speed” measurements, we’ll be telling
wrk to use only one connection and one thread.
Existing t2.medium: base speed
Our current infrastructure is one EC2 t2.medium. This EC2 instance type has two vCPUs and 4GB of RAM. On that single EC2 instance, we run passenger (free not enterprise) set to have 10 passenger processes, although the base speed test with only one connection should only touch one of the workers. The
t2 is a “burstable” type, and we do always have burst credits (this is not a high traffic app; verified we never exhausted burst credits in these tests), so our test load may be taking advantage of burst cpu.
$ URLS=./sample_works.txt wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://[current staging server] multiplepaths: Found 400 paths multiplepaths: Found 400 paths Running 3m test @ https://staging-digital.sciencehistory.org 1 threads and 1 connections Thread Stats Avg Stdev Max +/- Stdev Latency 311.00ms 388.11ms 2.37s 86.45% Req/Sec 11.89 8.96 40.00 69.95% Latency Distribution 50% 90.99ms 75% 453.40ms 90% 868.81ms 99% 1.72s 966 requests in 3.00m, 177.43MB read Requests/sec: 5.37 Transfer/sec: 0.99MB
I’m actually feeling pretty good about those numbers on our current infrastructure! 90ms median, not bad, and even 453ms 75th percentile is not too bad. Now, our test load involves some JSON responses that are quicker to deliver than corresponding HTML page, but still pretty good. The 90th/99th/and max request (2.37s) aren’t great, but I knew I had some slow pages, this matches my previous understanding of how slow they are in our current infrastructure.
90th percentile is ~9 times 50th percenile.
I don’t have an understanding of why the two different
Requests/Sec values are so different, and don’t totally understand what to do with the
+/- Stdev values, so I’m just going to be sticking to looking at the latency percentiles, I think “latency” could also be called “response times” here.
But ok, this is our baseline for this workload. And doing this 3 minute test at various points over the past few days, I can say it’s nicely regular and consistent, occasionally I got a slower run, but 50th percentile was usually 90ms–105ms, right around there.
Heroku standard-2x: base speed
From previous mucking about, I learned I can only reliably fit one puma worker in a standard-1x, and heroku says “we typically recommend a minimum of
2 processes, if possible” (for routing algorithmic reasons when scaled to multiple dynos), so I am just starting at a standard-2x with two puma workers each with 5 threads, matching heroku recommendations for a standard-2x dyno.
So one thing I discovered is that bencharks from a heroku
standard dyno are really variable, but here are typical ones:
$ heroku dyno:resize type size qty cost/mo ─────── ─────────── ─── ─────── web Standard-2X 1 50 $ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS WEB_CONCURRENCY=2 RAILS_MAX_THREADS=5 $ URLS=./sample_works.txt wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/ multiplepaths: Found 400 paths multiplepaths: Found 400 paths Running 3m test @ https://scihist-digicoll.herokuapp.com/ 1 threads and 1 connections Thread Stats Avg Stdev Max +/- Stdev Latency 645.08ms 768.94ms 4.41s 85.52% Req/Sec 5.78 4.36 20.00 72.73% Latency Distribution 50% 271.39ms 75% 948.00ms 90% 1.74s 99% 3.50s 427 requests in 3.00m, 74.51MB read Requests/sec: 2.37 Transfer/sec: 423.67KB
I had heard that heroku
standard dynos would have variable performance, because they are shared multi-tenant resources. I had been thinking of this like during a 3 minute test I might see around the same median with more standard deviation — but instead, what it looks like to me is that running this benchmark on Monday at 9am might give very different results than at 9:50am or Tuesday at 2pm. The variability is over a way longer timeframe than my 3 minute test — so that’s something learned.
Running this here and there over the past week, the above results seem to me typical of what I saw. (To get better than “seem typical” on this resource, you’d have to run a test, over several days or a week I think, probably not hammering the server the whole time, to get a sense of actual statistical distribution of the variability).
I sometimes saw tests that were quite a bit slower than this, up to a 500ms median. I rarely if ever saw results too much faster than this on a standard-2x. 90th percentile is ~6x median, less than my current infrastructure, but that still gets up there to 1.74 instead of 864ms.
This typical one is quite a bit slower than than our current infrastructure, our median response time is 3x the latency, with 90th and max being around 2x. This was worse than I expected.
Heroku performance-m: base speed
Although we might be able to fit more puma workers in RAM, we’re running a single-connection base speed test, so it shouldn’t matter to, and we won’t adjust it.
$ heroku dyno:resize type size qty cost/mo ─────── ───────────── ─── ─────── web Performance-M 1 250 $ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS WEB_CONCURRENCY=2 RAILS_MAX_THREADS=5 $ URLS=./sample_works.txt wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/ multiplepaths: Found 400 paths multiplepaths: Found 400 paths Running 3m test @ https://scihist-digicoll.herokuapp.com/ 1 threads and 1 connections Thread Stats Avg Stdev Max +/- Stdev Latency 377.88ms 481.96ms 3.33s 86.57% Req/Sec 10.36 7.78 30.00 37.03% Latency Distribution 50% 117.62ms 75% 528.68ms 90% 1.02s 99% 2.19s 793 requests in 3.00m, 145.70MB read Requests/sec: 4.40 Transfer/sec: 828.70KB
This is a lot closer to the ballpark of our current infrastructure. It’s a bit slower (117ms median intead of 90ms median), but in running this now and then over the past week it was remarkably, thankfully, consistent. Median and 99th percentile are both 28% slower (makes me feel comforted that those numbers are the same in these two runs!), that doesn’t bother me so much if it’s predictable and regular, which it appears to be. The
max appears to me still a little bit less regular on heroku for some reason, since performance is supposed to be non-shared AWS resources, you wouldn’t expect it to be, but slow requests are slow, ok.
90th percentile is ~9x median, about the same as my current infrastructure.
heroku performance-l: base speed
$ heroku dyno:resize type size qty cost/mo ─────── ───────────── ─── ─────── web Performance-L 1 500 $ heroku config:get --shell WEB_CONCURRENCY RAILS_MAX_THREADS WEB_CONCURRENCY=2 RAILS_MAX_THREADS=5 URLS=./sample_works.txt wrk -c 1 -t 1 -d 3m --timeout 20s --latency -s load_test/multiplepaths.lua.txt https://scihist-digicoll.herokuapp.com/ multiplepaths: Found 400 paths multiplepaths: Found 400 paths Running 3m test @ https://scihist-digicoll.herokuapp.com/ 1 threads and 1 connections Thread Stats Avg Stdev Max +/- Stdev Latency 471.29ms 658.35ms 5.15s 87.98% Req/Sec 10.18 7.78 30.00 36.20% Latency Distribution 50% 123.08ms 75% 635.00ms 90% 1.30s 99% 2.86s 704 requests in 3.00m, 130.43MB read Requests/sec: 3.91 Transfer/sec: 741.94KB
No news is good news, it looks very much like
performance-m, which is exactly what we expected, because this isn’t a load test. It tells us that performance-m and performance-l seem to have similar CPU speeds and similar predictable non-variable regularity, which is what I find running this test periodically over a week.
90th percentile is ~10x median, about the same as current infrastructure.
The higher Max speed is just evidence of what I mentioned, the speed of slowest request did seem to vary more than on our manual t2.medium, can’t really explain why.
Summary: Base speed
Not sure how helpful this visualization is, charting 50th, 75th, and 90th percentile responses across architectures.
But basically: performance dynos perform similarly to my (bursting) t2.medium. Can’t explain why performance-l seems slightly slower than performance-m, might be just incidental variation when I ran the tests.
The standard-2x is about twice as slow as my (bursting) t2.medium. Again recall standard-2x results varied a lot every time I ran them, the one I reported seems “typical” to me, that’s not super scientific, admittedly, but I’m confident that standard-2x are a lot slower in median response times than my current infrastructure.
Throughput under load
Ok, now we’re going to test using wrk to use more connections. In fact, I’ll test each setup with various number of connections, and graph the result, to get a sense of how each formation can handle throughput under load. (This means a lot of minutes to get all these results, at 3 minutes per number of connection test, per formation!).
An additional thing we can learn from this test, on heroku we can look at how much RAM is being used after a load test, to get a sense of the app’s RAM usage under traffic to understand the maximum number of puma workers we might be able to fit in a given dyno.
Existing t2.medium: Under load
A t2.medium has 4G of RAM and 2 CPUs. We run 10 passenger workers (no multi-threading, since we are free, rather than enterprise, passenger). So what do we expect? With 2 CPUs and more than 2 workers, I’d expect it to handle 2 simultaneous streams of requests almost as well as 1; 3-10 should be quite a bit slower because they are competing for the 2 CPUs. Over 10, performance will probably become catastrophic.
2 connections are exactly flat with 1, as expected for our two CPUs, hooray!
Then it goes up at a strikingly even line. Going over 10 (to 12) simultaneous connections doesn’t matter, even though we’ve exhausted our workers, I guess at this point there’s so much competition for the two CPUs already.
The slope of this curve is really nice too actually. Without load, our median response time is 100ms, but even at a totally overloaded 12 overloaded connections, it’s only 550ms, which actually isn’t too bad.
We can make a graph that in addition to median also has 75th, 90th, and 99th percentile response time on it:
It doesn’t tell us too much; it tells us the upper percentiles rise at about the same rate as the median. At 1 simultaneous connection 90th percentile of 846ms is about 9 times the median of 93ms; at 10 requests the 90th percentile of 3.6 seconds is about 8 times the median of 471ms.
This does remind us that under load when things get slow, this has more of a disastrous effect on already slow requests than fast requests. When not under load, even our 90th percentile was kind of sort of barley acceptable at 846ms, but under load at 3.6 seconds it really isn’t.
Single Standard-2X dyno: Under load
A standard-2X dyno has 1G of RAM. The (amazing, excellent, thanks schneems) heroku puma guide suggests running two puma workers with 5 threads each. At first I wanted to try running three workers, which seemed to fit into available RAM — but under heavy load-testing I was getting Heroku
R14 Memory Quota Exceeded errors, so we’ll just stick with the heroku docs recommendations. Two workers with 5 threads each fit with plenty of headroom.
A standard-2x dyno is runs on shared (multi-tenant) underlying Amazon virtual hardware. So while it is running on hardware with 4 CPUs (each of which can run two “hyperthreads“), the puma doc suggests “it is best to assume only one process can execute at a time” on
What do we expect? Well, if it really only had one CPU, it would immediately start getting bad at 2 simulataneous connections, and just get worse from there. When we exceed the two worker count, will it get even worse? What about when we exceed the 10 thread (2 workers * 5 threads) count?
You’d never run just one dyno if you were expecting this much traffic, you’d always horizontally scale. This very artificial test is just to get a sense of it’s characteristics.
Also, we remember that standard-2x’s are just really variable; I could get much worse or better runs than this, but graphed numbers from a run that seemed typical.
Well, it really does act like 1 CPU, 2 simultaneous connections is immediately a lot worse than 1.
The line isn’t quite as straight as in our existing t2.medium, but it’s still pretty straight; I’d attribute the slight lumpiness to just the variability of shared-architecture standard dyno, and figure it would get perfectly straight with more data.
It degrades at about the same rate of our baseline t2.medium, but when you start out slower, that’s more disastrous. Our t2.medium at an overloaded 10 simultaneous requests is 473ms (pretty tolerable actually), 5 times the median at one request only. This standard-2x has a median response time of 273 ms at only one simultaneous request, and at an overloaded 10 requests has a median response time also about 5x worse, but that becomes a less tolerable 1480ms.
Does also graphing the 75th, 90th, and 99th percentile tell us much?
Eh, I think the lumpiness is still just standard shared-architecture variability.
The rate of “getting worse” as we add more overloaded connections is actually a bit better than it was on our t2.medium, but since it already starts out so much slower, we’ll just call it a wash. (On t2.medium, 90th percentile without load is 846ms and under an overloaded 10 connections 3.6s. On this single standard-2x, it’s 1.8s and 5.2s).
I’m not sure how much these charts with various percentiles on them tell us, I’ll not include them for every architecture hence.
standard-2x, 4 dynos: Under load
OK, realistically we already know you shouldn’t have just one standard-2x dyno under that kind of load. You’d scale out, either manually or perhaps using something like the neat Rails Autoscale add-on.
Let’s measure with 4 dynos. Each is still running 2 puma workers, with 5 threads each.
What do we expect? Hm, treating each dyno as if it has only one CPU, we’d expect it to be able to handle traffic pretty levelly up to 4 simultenous connections, distributed to 4 dynos. It’s going to do worse after that, but up to 8 there is still one puma worker per connection so it might get even worse after 8?
Well… I think that actually is relatively flat from 1 to 4 simultaneous connections, except for lumpiness from variability. But lumpiness from variability is huge! We’re talking 250ms median measured at 1 connection, up to 369ms measured median at 2, down to 274ms at 3.
And then maybe yeah, a fairly shallow slope up to 8 simutaneous connections than steeper.
But it’s all fairly shallow slope compared to our base t2.medium. At 8 connections (after which we pretty much max out), the standard-2x median of 464ms is only 1.8 times the median at 1 conection. Compared to the t2.median increase of 3.7 times.
As we’d expect, scaling out to 4 dynos (with four cpus/8 hyperthreads) helps us scale well — the problem is the baseline is so slow to begin (with very high bounds of variability making it regularly even slower).
performance-m: Under load
By testing under load, I demonstrated I could actually fit 12 workers on there without any memory limit errors. But is there any point to doing that with only 1/2 CPUs? Under a bit of testing, it appeared not.
The heroku puma docs recommend only 2 processes with 5 threads. You could do a whole little mini-experiment just trying to measure/optimize process/thread count on performance-m! We’ve already got too much data here, but in some experimentation it looked to me like 5 processes with 2 threads each performed better (and certainly no worse) than 2 processes with 5 threads — if you’ve got the RAM just sitting there anyway (as we do), why not?
I actually tested with 6 puma processes with 2 threads each. There is still a large amount of RAM headroom we aren’t going to use even under load.
What do we expect? Well, with the 2 “hyperthreads” perhaps it can handle 2 simultaneous requests nearly as well as 1 (or not?); after that, we expect it to degrade quickly same as our original t2.medium did.
It an handle 2 connections slightly better than you’d expect if there really was only 1 CPU, so I guess a hyperthread does give you something. Then the slope picks up, as you’d expect; and it looks like it does get steeper after 4 simultaneous connections, yup.
performance-l: Under load
A performance-l ($500/month) costs twice as much as a performance-m ($250/month), but has far more than twice as much resources. performance-l has a whopping 14GB of RAM compared to performance-m’s 2.5GB; and performance-l has 4 real CPUs/hyperthreads available to use (visible using the nproc technique in the heroku puma article.
Because we have plenty of RAM to do so, we’re going to run 10 worker processes to match our original t2.medium’s. We still ran with 2 threads, just cause it seems like maybe you should never run a puma worker with only one thread? But who knows, maybe 10 workers with 1 thread each would perform better; plenty of room (but not plenty of my energy) for yet more experimentation.
What do we expect? The graph should be pretty flat up to 4 simultaneous connections, then it should start getting worse, pretty evenly as simultaneous connections rise all the way up to 12.
It is indeed pretty flat up to 4 simultaneous connections. Then up to 8 it’s still not too bad — median at 8 is only ~1.5 median at 1(!). Then it gets worse after 8 (oh yeah, 8 hyperthreads?).
But the slope is wonderfully shallow all the way. Even at 12 simultaneous connections, the median response time of 266ms is only 2.5x what it was at one connection. (In our original t2.medium, at 12 simultaneous connections median response time was over 5x what it was at 1 connection).
This thing is indeed a monster.
Summary Comparison: Under load
We showed a lot of graphs that look similar, but they all had different sclaes on the y-axis. Let’s plot median response times under load of all architectures on the same graph, and see what we’re really dealing with.
The blue t2.medium is our baseline, what we have now. We can see that there isn’t really a similar heroku option, we have our choice of better or worse.
The performance-l is just plain better than what we have now. It starts out performing about the same as what we have now for 1 or 2 simultaneous connections, but then scales so much flatter.
The performance-m also starts out about thesame, but sccales so much worse than even what we have now. (it’s that 1 real CPU instead of 2, I guess?).
The standard-2x scaled to 4 dynos… has it’s own characteristics. It’s baseline is pretty terrible, it’s 2 to 3 times as slow as what we have now even not under load. But then it scales pretty well, since it’s 4 dynos after all, it doesn’t get worse as fast as performance-m does. But it started out so bad, that it remains far worse than our original t2.medium even under load. Adding more dynos to standard-2x will help it remain steady under even higher load, but won’t help it’s underlying problem that it’s just slower than everyone else.
Discussion: Thoughts and Surprises
- I had been thinking of a t2.medium (even with burst) as “typical” (it is after all much slower than my 2015 Macbook), and has been assuming (in retrospect with no particular basis) that a heroku standard dyno would perform similarly.
- Most discussion and heroku docs, as well as the naming itself, suggest that a ‘standard’ dyno is, well, standard, and performance dynos are for “super scale, high traffic apps”, which is not me.
- But in fact, heroku standard dynos are much slower and more variable in performance than a bursting t2.medium. I suspect they are slower than other options you might consider non-heroku “typical” options.
- My conclusion is honestly that “standard” dynos are really “for very fast, well-optimized apps that can handle slow and variable CPU” and “performance” dynos are really “standard, matching the CPU speeds you’d get from a typical non-heroku option”. But this is not how they are documented or usually talked about. Are other people having really different experiences/conclusions than me? If so, why, or where have I gone wrong?
- This of course has implications for estimating your heroku budget if considering switching over. :(
- If you have a well-optimized fast app, say even 95th percentile is 200ms (on bursting t2.medium), then you can handle standard slowness — so what your 95th percentile is now 600ms (and during some time periods even much slower, 1s or worse, due to variability). That’s not so bad for a 95th percentile.
- One way to get a very fast is of course caching. There is lots of discussion of using caching in Rails, sometimes the message (explicit or implicit) is “you have to use lots of caching to get reasonable performance cause Rails is so slow.” What if many of these people are on heroku, and it’s really you have to use lots of caching to get reasonable performance on heroku standard dyno??
- I personally don’t think caching is maintenance free; in my experience properly doing cache invalidation and dealing with significant processing spikes needed when you choose to invalidate your entire cache (cause cached HTML needs to change) lead to real maintenance/development cost. I have not needed caching to meet my performance goals on present architecture.
- Everyone doesn’t necessarily have the same performance goals/requirements. Mine of a low-traffic non-commercial site are are maybe more modest, I just need users not to be super annoyed. But whatever your performance goals, you’re going to have to spend more time on optimization on a heroku standard than something with much faster CPU — like a standard affordable mid-tier EC2. Am I wrong?
- One significant factor on heroku standard dyno performance is that they use shared/multi-tenant infrastructure. I wonder if they’ve actually gotten lower performance over time, as many customers (who you may be sharing with) have gotten better at maximizing their utilization, so the shared CPUs are typically more busy? Like a frog boiling, maybe nobody noticed that standard dynos have become lower performance? I dunno, brainstorming.
- Or maybe there are so many apps that start on heroku instead of switcching from somewhere else, that people just don’t realize that standard dynos are much slower than other low/mid-tier options?
- I was expecting to pay a premium for heroku — but even standard-2x’s are a significant premium over paying for t2.medium EC2 yourself, one I found quite reasonable…. performance dynos are of course even more premium.
- I had a sort of baked-in premise that most Rails apps are “IO-bound”, they spend more time waiting on IO than using CPU. I don’t know where I got that idea, I heard it once a long time ago and it became part of my mental model. I now do not believe this is true true of my app, and I do not in fact believe it is true of most Rails apps in 2020. I would hypothesize that most Rails apps today are in fact CPU-bound.
- The performance-m dyno only has one CPU. I had somehow also been assuming that it would have two CPUs — I’m not sure why, maybe just because at that price! It would be a much better deal with two CPUs.
- Instead we have a huge jump from $250 performance-m to $500 performance-l that has 4x the CPUs and ~5x the RAM.
- So it doesn’t make financial sense to have more than one performance-m dyno, you might as well go to performance-l. But this really complicates auto-scaling, whether using Heroku’s feature , or the awesome Rails Autoscale add-on. I am not sure I can afford a performance-l all the time, and a performance-m might be sufficient most of the time. But if 20% of the time I’m going to need more (or even 5%, or even unexpectedly-mentioned-in-national-media), it would be nice to set things up to autoscale up…. I guess to financially irrational 2 or more performance-m’s? :(
- The performance-l is a very big machine, that is significantly beefier than my current infrastructure. And has far more RAM than I need/can use with only 4 physical cores. If I consider standard dynos to be pretty effectively low tier (as I do), heroku to me is kind of missing mid-tier options. A 2 CPU option at 2.5G or 5G of RAM would make a lot of sense to me, and actually be exactly what I need… really I think performance-m would make more sense with 2 CPUs at it’s existing already-premium price point, and to be called a “performance” dyno. . Maybe heroku is intentionally trying set options to funnel people to the highest-priced performance-l.
Conclusion: What are we going to do?
In my investigations of heroku, my opinion of the developer UX and general service quality only increases. It’s a great product, that would increase our operational capacity and reliability, and substitute for so many person-hours of sysadmin/operational time if we were self-managing (even on cloud architecture like EC2).
But I had originally been figuring we’d use standard dynos (even more affordably, possibly auto-scaled with Rails Autoscale plugin), and am disappointed that they end up looking so much lower performance than our current infrastructure.
Could we use them anyway? Response time going from 100ms to 300ms — hey, 300ms is still fine, even if I’m sad to lose those really nice numbers I got from a bit of optimization. But this app has a wide long-tail ; our 75th percentile going from 450ms to 1s, our 90th percentile going from 860ms to 1.74s and our 99th going from 2.3s to 4.4s — a lot harder to swallow. Especially when we know that due to standard dyno variability, a slow-ish page that on my present architecture is reliably 1.5s, could really be anywhere from 3 to 9(!) on heroku.
I would anticipate having to spend a lot more developer time on optimization on heroku standard dynos — or, i this small over-burdened non-commercial shop, not prioritizing that (or not having the skills for it), and having our performance just get bad.
So I’m really reluctant to suggest moving our app to heroku with standard dynos.
A performance-l dyno is going to let us not have to think about performance any more than we do now, while scaling under high-traffic better than we do now — I suspect we’d never need to scale to more than one performance-l dyno. But it’s pricey for us.
A performance-m dyno has a base-speed that’s fine, but scales very poorly and unaffordably. Doesn’t handle an increase in load very well as one dyno, and to get more CPUs you have to pay far too much (especially compared to standard dynos I had been assuming I’d use).
So I don’t really like any of my options. If we do heroku, maybe we’ll try a performance-m, and “hope” our traffic is light enough that a single one will do? Maybe with Rails autoscale for traffic spikes, even though 2 performance-m dynos isn’t financially efficient? If we are scaling to 2 (or more!) performance-m’s more than very occasionally, switch to performance-l, which means we need to make sure we have the budget for it?