ActiveRecord Concurrency in Rails4: Avoid leaked connections!

My past long posts about multi-threaded concurrency in Rails ActiveRecord are some of the most visited posts on this blog, so I guess I’ll add another one here; if you’re a “tl;dr” type, you should probably bail now, but past long posts have proven useful to people over the long-term, so here it is.

I’m in the middle of updating my app that uses multi-threaded concurrency in unusual ways to Rails4.   The good news is that the significant bugs I ran into in Rails 3.1 etc, reported in the earlier post have been fixed.

However, the ActiveRecord concurrency model has always made it too easy to accidentally leak orphaned connections, and in Rails4 there’s no good way to recover these leaked connections. Later in this post, I’ll give you a monkey patch to ActiveRecord that will make it much harder to accidentally leak connections.

Background: The ActiveRecord Concurrency Model

Is pretty much described in the header docs for ConnectionPool, and the fundamental architecture and contract hasn’t changed since Rails 2.2.

Rails keeps a ConnectionPool of individual connections (usually network connections) to the database. Each connection can only be used by one thread at a time, and needs to be checked out and then checked back in when done.

You can check out a connection explicitly using `checkout` and `checkin` methods. Or, better yet use the `with_connection` method to wrap database use.  So far so good.

But ActiveRecord also supports an automatic/implicit checkout. If a thread performs an ActiveRecord operation, and that thread doesn’t already have a connection checked out to it (ActiveRecord keeps track of whether a thread has a checked out connection in Thread.current), then a connection will be silently, automatically, implicitly checked out to it. It still needs to be checked back in.

And you can call `ActiveRecord::Base.clear_active_connections!`, and all connections checked out to the calling thread will be checked back in. (Why might there be more than one connection checked out to the calling thread? Mostly only if you have more than one database in use, with some models in one database and others in others.)

And that’s what ordinary Rails use does, which is why you haven’t had to worry about connection checkouts before.  A Rails action method begins with no connections checked out to it; if and only if the action actually tries to do some ActiveRecord stuff, does a connection get lazily checked out to the thread.

And after the request had been processed and the response delivered, Rails itself will call `ActiveRecord::Base.clear_active_connections!` inside the thread that handled the request, checking back connections, if any, that were checked out.

The danger of leaked connections

So, if you are doing “normal” Rails things, you don’t need to worry about connection checkout/checkin. (modulo any bugs in AR).

But if you create your own threads to use ActiveRecord (inside or outside a Rails app, doesn’t matter), you absolutely do.  If you proceed blithly to use AR like you are used to in Rails, but have created Threads yourself — then connections will be automatically checked out to you when needed…. and never checked back in.

The best thing to do in your own threads is to wrap all AR use in a `with_connection`. But if some code somewhere accidentally does an AR operation outside of a `with_connection`, a connection will get checked out and never checked back in.

And if the thread then dies, the connection will become orphaned or leaked, and in fact there is no way in Rails4 to recover it.  If you leak one connection like this, that’s one less connection available in the ConnectionPool.  If you leak all the connections in the ConnectionPool, then there’s no more connections available, and next time anyone tries to use ActiveRecord, it’ll wait as long as the checkout_timeout (default 5 seconds; you can set it in your database.yml to something else) trying to get a connection, and then it’ll give up and throw a ConnectionTimeout. No more database access for you.

In Rails 3.x, there was a method `clear_stale_cached_connections!`, that would  go through the list of all checked out connections, cross-reference it against the list of all active threads, and if there were any checked out connections that were associated with a Thread that didn’t exist anymore, they’d be reclaimed.   You could call this method from time to time yourself to try and clean up after yourself.

And in fact, if you tried to check out a connection, and no connections were available — Rails 3.2 would call clear_stale_cached_connections! itself to see if there were any leaked connections that could be reclaimed, before raising a ConnectionTimeout. So if you were leaking connections all over the place, you still might not notice, the ConnectionPool would clean em up for you.

But this was a pretty expensive operation, and in Rails4, not only does the ConnectionPool not do this for you, but the method isn’t even available to you to call manually.  As far as I can tell, there is no way using public ActiveRecord API to clean up a leaked connection; once it’s leaked it’s gone.

So this makes it pretty important to avoid leaking connections.

(Note: There is still a method `clear_stale_cached_connections` in Rails4, but it’s been redefined in a way that doesn’t do the same thing at all, and does not do anything useful for leaked connection cleanup.  That it uses the same method name, I think, is based on misunderstanding by Rails devs of what it’s doing. See Fear the Reaper below. )

Monkey-patch AR to avoid leaked connections

I understand where Rails is coming from with the ‘implicit checkout’ thing.  For standard Rails use, they want to avoid checking out a connection for a request action if the action isn’t going to use AR at all. But they don’t want the developer to have to explicitly check out a connection, they want it to happen automatically. (In no previous version of Rails, back from when AR didn’t do concurrency right at all in Rails 1.0 and Rails 2.0-2.1, has the developer had to manually check out a connection in a standard Rails action method).

So, okay, it lazily checks out a connection only when code tries to do an ActiveRecord operation, and then Rails checks it back in for you when the request processing is done.

The problem is, for any more general-purpose usage where you are managing your own threads, this is just a mess waiting to happen. It’s way too easy for code to ‘accidentally’ check out a connection, that never gets checked back in, gets leaked, with no API available anymore to even recover the leaked connections. It’s way too error prone.

That API contract of “implicitly checkout a connection when needed without you realizing it, but you’re still responsible for checking it back in” is actually kind of insane. If we’re doing our own `Thread.new` and using ActiveRecord in it, we really want to disable that entirely, and so code is forced to do an explicit `with_connection` (or `checkout`, but `with_connection` is a really good idea).

So, here, in a gist, is a couple dozen line monkey patch to ActiveRecord that let’s you, on a thread-by-thread basis, disable the “implicit checkout”.  Apply this monkey patch (just throw it in a config/initializer, that works), and if you’re ever manually creating a thread that might (even accidentally) use ActiveRecord, the first thing you should do is:

Thread.new do 
   ActiveRecord::Base.forbid_implicit_checkout_for_thread!

   # stuff
end

Once you’ve called `forbid_implicit_checkout_for_thread!` in a thread, that thread will be forbidden from doing an ‘implicit’ checkout.

If any code in that thread tries to do an ActiveRecord operation outside a `with_connection` without a checked out connection, instead of implicitly checking out a connection, you’ll get an ActiveRecord::ImplicitConnectionForbiddenError raised — immediately, fail fast, at the point the code wrongly ended up trying an implicit checkout.

This way you can enforce your code to only use `with_connection` like it should.

Note: This code is not battle-tested yet, but it seems to be working for me with `with_connection`. I have not tried it with explicitly checking out a connection with ‘checkout’, because I don’t entirely understand how that works.

DO fear the Reaper

In Rails4, the ConnectionPool has an under-documented thing called the “Reaper”, which might appear to be related to reclaiming leaked connections.  In fact, what public documentation there is says: “the Reaper, which attempts to find and close dead connections, which can occur if a programmer forgets to close a connection at the end of a thread or a thread dies unexpectedly. (Default nil, which means don’t run the Reaper).”

The problem is, as far as I can tell by reading the code, it simply does not do this.

What does the reaper do?  As far as I can tell trying to follow the code, it mostly looks for connections which have actually dropped their network connection to the database.

A leaked connection hasn’t necessarily dropped it’s network connection. That really depends on the database and it’s settings — most databases will drop unused connections after a certain idle timeout, by default often hours long.  A leaked connection probably hasn’t yet had it’s network connection closed, and a properly checked out not-leaked connection can have it’s network connection closed (say, there’s been a network interruption or error; or a very short idle timeout on the database).

The Reaper actually, if I’m reading the code right, has nothing to do with leaked connections at all. It’s targeting a completely different problem (dropped network, not checked out but never checked in leaked connections). Dropped network is a legit problem you want to be handled gracefullly; I have no idea how well the Reaper handles it (the Reaper is off by default, I don’t know how much use it’s gotten, I have not put it through it’s paces myself). But it’s got nothing to do with leaked connections.

Someone thought it did, they wrote documentation suggesting that, and they redefined `clear_stale_cached_connections!` to use it. But I think they were mistaken. (Did not succeed at convincing @tenderlove of this when I tried a couple years ago when the code was just in unreleased master; but I also didn’t have a PR to offer, and I’m not sure what the PR should be; if anyone else wants to try, feel free!)

So, yeah, Rails4 has redefined the existing `clear_stale_active_connections!` method to do something entirely different than it did in Rails3, it’s triggered in entirely different circumstance. Yeah, kind of confusing.

Oh, maybe fear ruby 1.9.3 too

When I was working on upgrading the app, I’m working on, I was occasionally getting a mysterious deadlock exception:

ThreadError: deadlock; recursive locking:

In retrospect, I think I had some bugs in my code and wouldn’t have run into that if my code had been behaving well. However, that my errors resulted in that exception rather than a more meaningful one, maybe possibly have been a bug in ruby 1.9.3 that’s fixed in ruby 2.0. 

If you’re doing concurrency stuff, it seems wise to use ruby 2.0 or 2.1.

Can you use an already loaded AR model without a connection?

Let’s say you’ve already fetched an AR model in. Can a thread then use it, read-only, without ever trying to `save`, without needing a connection checkout?

Well, sort of. You might think, oh yeah, what if I follow a not yet loaded association, that’ll require a trip to the db, and thus a checked out connection, right? Yep, right.

Okay, what if you pre-load all the associations, then are you good? In Rails 3.2, I did this, and it seemed to be good.

But in Rails4, it seems that even though an association has been pre-loaded, the first time you access it, some under-the-hood things need an ActiveRecord Connection object. I don’t think it’ll end up taking a trip to the db (it has been pre-loaded after all), but it needs the connection object. Only the first time you access it. Which means it’ll check one out implicitly if you’re not careful. (Debugging this is actually what led me to the forbid_implicit_checkout stuff again).

Didn’t bother trying to report that as a bug, because AR doesn’t really make any guarantees that you can do anything at all with an AR model without a checked out connection, it doesn’t really consider that one way or another.

Safest thing to do is simply don’t touch an ActiveRecord model without a checked out connection. You never know what AR is going to do under the hood, and it may change from version to version.

Concurrency Patterns to Avoid in ActiveRecord?

Rails has officially supported multi-threaded request handling for years, but in Rails4 that support is turned on by default — although there still won’t actually be multi-threaded request handling going on unless you have an app server that does that (Puma, Passenger Enterprise, maybe something else).

So I’m not sure how many people are using multi-threaded request dispatch to find edge case bugs; still, it’s fairly high profile these days, and I think it’s probably fairly reliable.

If you are actually creating your own ActiveRecord-using threads manually though (whether in a Rails app or not; say in a background task system), from prior conversations @tenderlove’s preferred use case seemed to be creating a fixed number of threads in a thread pool, making sure the ConnectionPool has enough connections for all the threads, and letting each thread permanently check out and keep a connection.

I think you’re probably fairly safe doing that too, and is the way background task pools are often set up.

That’s not what my app does.  I wouldn’t necessarily design my app the same way today if I was starting from scratch (the app was originally written for Rails 1.0, gives you a sense of how old some of it’s design choices are; although the concurrency related stuff really only dates from relatively recent rails 2.1 (!)).

My app creates a variable number of threads, each of which is doing something different (using a plugin system). The things it’s doing generally involve HTTP interactions with remote API’s, is why I wanted to do them in concurrent threads (huge wall time speedup even with the GIL, yep). The threads do need to occasionally do ActiveRecord operations to look at input or store their output (I tried to avoid concurrency headaches by making all inter-thread communications through the database; this is not a low-latency-requirement situation; I’m not sure how much headache I’ve avoided though!)

So I’ve got an indeterminate number of threads coming into and going out of existence, each of which needs only occasional ActiveRecord access. Theoretically, AR’s concurrency contract can handle this fine, just wrap all the AR access in a `with_connection`.  But this is definitely not the sort of concurrency use case AR is designed for and happy about. I’ve definitely spent a lot of time dealing with AR bugs (hopefully no longer!), and just parts of AR’s concurrency design that are less than optimal for my (theoretically supported) use case.

I’ve made it work. And it probably works better in Rails4 than any time previously (although I haven’t load tested my app yet under real conditions, upgrade still in progress). But, at this point,  I’d recommend avoiding using ActiveRecord concurrency this way.

What to do?

What would I do if I had it to do over again? Well, I don’t think I’d change my basic concurrency setup — lots of short-lived threads still makes a lot of sense to me for a workload like I’ve got, of highly diverse jobs that all do a lot of HTTP I/O.

At first, I was thinking “I wouldn’t use ActiveRecord, I’d use something else with a better concurrency story for me.”  DataMapper and Sequel have entirely different concurrency architectures; while they use similar connection pools, they try to spare you from having to know about it (at the cost of lots of expensive under-the-hood synchronization).

Except if I had actually acted on that when I thought about it a couple years ago, when DataMapper was the new hotness, I probably would have switched to or used DataMapper, and now I’d be stuck with a large unmaintained dependency. And be really regretting it. (And yeah, at one point I was this close to switching to Mongo instead of an rdbms, also happy I never got around to doing it).

I don’t think there is or is likely to be a ruby ORM as powerful, maintained, and likely to continue to be maintained throughout the life of your project, as ActiveRecord. (although I do hear good things about Sequel).  I think ActiveRecord is the safe bet — at least if your app is actually a Rails app.

So what would I do different? I’d try to have my worker threads not actually use AR at all. Instead of passing in an AR model as input, I’d fetch the AR model in some other safer main thread, convert it to a pure business object without any AR, and pass that in my worker threads.  Instead of having my worker threads write their output out directly using AR, I’d have a dedicated thread pool of ‘writers’ (each of which held onto an AR connection for it’s entire lifetime), and have the indeterminate number of worker threads pass their output through a threadsafe queue to the dedicated threadpool of writers.

That would have seemed like huge over-engineering to me at some point in the past, but at the moment it’s sounding like just the right amount of engineering if it lets me avoid using ActiveRecord in the concurrency patterns I am, that while it officially supports, it isn’t very happy about.

17 thoughts on “ActiveRecord Concurrency in Rails4: Avoid leaked connections!

  1. That’s very odd vigram, I think something else must be going on in your application. I have never seen a case where exclusively using `with_connection` resulted in leaked connections that an explicit close would fix. Is it possible some other part of your code was using AR outside of a `with_connection`? The AR monkey patch I provide in this post can help catch and forbid such implicit use.

  2. Regarding the Reaper; It looks to me like this has either been patched recently or the above information is inaccurate. The connection pool’s reap method will gather the connections for which the owner is not alive (the owner is the thread that created the connection). These stale connections are then either returned to the pool (if they can access the database) or disconnected (Not sure exactly what the behavior here is, but it looks like it will return the connection to the databases global pool). In any case, The behavior appears to be similar to that of Rails3.

  3. Thanks Jordan. Can you provide URLs to source code that you believe demonstrates the Reaper working in that way? This stuff gets very confusing, and the ‘contract’ of AR stuff is not well documented. Last I looked you had to chase through several nested method invocations (including both abstract and database-specific classes) to figure out what the Reaper was really doing — but once you did so, it appeared that the Reaper was reaping connections only if the underlying network connection to the db was no longer valid, which is a different thing than the owning thread not being alive.

    But maybe it’s changed, which would be nice. Some hints as to where to look in the source and how to read it would be quite helpful!

  4. https://github.com/rails/rails/blob/master/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb

    lines 388-392: Collects connections whose owner is not alive (thread has exited)

    The rest of the function goes through the stale connections and either adds them back to the pool (checkin) or returns the connection to the global pool (disconnect). Returning to the global pool only seems to happen in the case of a connectivity issue.

    Disclaimer: I haven’t run any tests against this to be sure, but I have played around with most of these classes in other versions of rails.

  5. Nice, thanks Jordan!

    The change since I last looked seems to be making the test: `conn.in_use? && !conn.owner.alive?`.

    That does *appear* to do the right thing as far as actually reaping “orphaned” connections, although this stuff is so confusing and error prone I’d want to test it too to be sure.

    git logs suggest that line appeared first in Rails 4.2.0. https://github.com/rails/rails/commit/9e457a8654fa89fe329719f88ae3679aefb21e56

    In Rails 4.1, it used a different check, involving `active?`, which ended up checking the network health of the connection to the db instead. https://github.com/rails/rails/blob/4-1-stable/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L396

    So, it looks like this has been improved/fixed in 4.2, awesome! Thanks for the heads-up Jordan!

  6. Haha, this is pretty hilarious, from the commit message that adds the correct check in the reaper:

    “As a no-op reap is now trivial (only entails checking a thread status
    per connection), we can also perform one in-line any time we decide to
    sleep for a connection.”

    That is actually exactly what Rails 3.2 did, tenderlove was adamant about removing that check in Rails 4.0 — it was bad for performance, and he believed it was unneccesary. I understood it was bad for performance, but was worried about it’s removal, it seemed neccesary. But I guess a solution was eventually found that didn’t have the performance problems, and it was returned. Great! This remove a significant source of errors in multi-threaded AR use. Although you still don’t really want to be letting the reaper reap your orphaned connections instead of cleaning them up yourself (it’s still possible to run out of connections before the reaper reaps), this makes the cost of a mistake a lot less dire.

  7. forbid_implicit_checkout_for_thread! would be a great addition to rails, exactly for the concurrent api requests you mention

  8. This post helped me out a ton, but I’m pleased to report that Rails 4.2 appears to be a lot better about retrieving leaked connections than 4.1 was.

    Here’s two example projects using the same layout and copies of the same database:
    https://asciinema.org/a/6xscqvehm8gh3cnhidxza8a47

    That said, this is still vital to stay aware of if you’re using long-running threads that may not always need access to AR the entire time (a thread pool for scheduled jobs, for example).

  9. Thank you for the detailed explanation and a nice fix. I was having problem with leaked connections on my non-Rails application which uses ActiveRecord and spawns threads all over the place, and this monkey-patch helped me track it down and fix things quickly.

  10. Hi,
    thanks for the post! It´s really nice. I´m facing to connection leak errors and after when no more connections left in the pool I got ActiveRecord::ConnectionTimeoutError (could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)) exception.

    I´m trying to fix it with no success so far.

    There are some things that doesn´t match for me. I understand the problem is always in a multithread scenario. But, I´m not sure if I´m using multithread, I mean, I´m not creating threads explicitly. I just load the webapp in a puma multithread server, and 15 threads are created when server is startup.

    I have also a 5 connections db pool. But, I understood, Rails is thread safe with connections…

    Kind regards.

  11. @Roberto, you are using a thread pool of size 15 with puma. Therfore you need to have 15 connections with ActiveRecord.
    This connection pool size is for every process this means if you would also have sidekiq (default 25 workers) you should have 25 as db connection setting, even if you only need 15 in your puma process.

    Hope this helps.

Leave a comment