ruby Monitor bug? AR concurrency bug? Concurrency is hard.

So I’ve been messing with multi-threaded concurrency in ActiveRecord for some time. Lately I’ve been really digging into my app to figure out why it’s not performing exactly like I’d expect.  I even found a bug in Rails 3.2.0-3.2.2 — thanks very much to Aaron Patterson/tenderlove for dealing with my questions, spurious error reports, one real error report, and generally explaining things to me in the rails issue tracker. I owe him a couple cases of beer.

But now I’m running into something I just can’t figure out; it seems like it might actually be a bug in MRI’s Monitor implementation; or it might be a fundamental flaw in ActiveRecord’s use of Monitor for its concurrency model; or it might be something else entirely that I can’t figure out.

Things we could debate, but let’s consider ‘out of scope’ at present

First, let’s get a couple things out of the way:

  • Yes, I am using multi-threaded ActiveRecord.  Yes, I know there are various other designs I could be using (“can’t you just use redis?”). I have reasons for choosing this one. I’ll admit that lately I’ve started thinking that no matter how much harder it seems to implement a model without threading, it’s better than threading; but for the moment this is what I’ve got, and ActiveRecord claims to be able to support it. Let’s leave debate/analysis about whether multi-threaded code is the right approach for a different post. Also note that at present I’m not actually using multi-threaded request handling in Rails, just multi-threaded access to AR.
  • Yes, I am aware of the MRI GIL.  It is not a problem for my particular use cases/contexts/needs, and as far as I can tell has nothing to do with the particular issue I’m seeing here. In fact, I’d think if anything the GIL would make the kind of problem I’m seeing less likely, but I’m still seeing it, in MRI 1.9.3.

The Mystery

Hard to summarize without leading you through it, but let’s review the basics of ActiveRecord’s ConnectionPool. It keeps a pool of actual connections to the db; by default the pool holds at most 5 connections. A given connection can only be used by one thread at a time. When a thread wants a connection, it asks for it — if one isn’t available (they’re all checked out), it should block waiting for one, until one is available (having been checked in by another thread), or a timeout (by default 5 seconds) occurs. If the timeout occurs and no connection has become available, it should raise a ConnectionTimeoutError.
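To keep the moving parts straight, here’s a toy version of that checkout/checkin cycle, built directly on Ruby’s stdlib Monitor. This is a sketch modeled on the description above, not ActiveRecord’s actual code; `ToyPool` and its method names are mine. Note that checkout waits at most once and then assumes a connection is free, which is the same assumption this post is about to question.

```ruby
require 'monitor'

# A toy pool modeled on the behavior described above -- NOT ActiveRecord's
# actual implementation; the class and method names here are mine.
class ToyPool
  class TimeoutError < StandardError; end

  def initialize(size = 5, timeout = 5)
    @timeout   = timeout
    @available = Array.new(size) { |i| "conn-#{i}" }
    @lock      = Monitor.new
    @cond      = @lock.new_cond
  end

  # Block until a connection is available, or raise after @timeout seconds.
  # Note the single wait: this assumes a signal means "a connection is now
  # free for *you*" -- exactly the assumption examined later in the post.
  def checkout
    @lock.synchronize do
      @cond.wait(@timeout) if @available.empty?
      raise TimeoutError, "no connection within #{@timeout}s" if @available.empty?
      @available.pop
    end
  end

  def checkin(conn)
    @lock.synchronize do
      @available.push(conn)
      @cond.signal # wake one thread blocked in checkout
    end
  end

  # Mirrors the checkout/checkin pairing that AR's with_connection provides.
  def with_connection
    conn = checkout
    yield conn
  ensure
    checkin(conn) if conn
  end
end
```

The `with_connection` helper is just the checkout/checkin pairing wrapped in an ensure block, so a connection goes back to the pool even if the caller raises.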

I have some code in a test case where a whole bunch of threads are competing for connections, more than 5 threads. However, each thread only holds onto the connection for a brief moment (0.01 second), and then returns it to the pool. They all oughta be able to get connections within the 5 second timeout and be happy (if not fast).

Instead, some threads somehow end up giving up when trying to get a connection, long before the 5 seconds are up. As far as I can tell, there is somehow a gap between when MonitorMixin::ConditionVariable#signal is sent on checkin in one thread, and when MonitorMixin::ConditionVariable#wait returns in a second thread as a result of that signal. In that gap, somehow a third thread manages to grab the just-checked-in connection.

The Demo

Running against rails 3-2-stable, because there is at least one concurrency-related bug fixed there compared to 3.2.2, although I’m not sure if it’s relevant to what’s going on here. Using ruby 1.9.3-p0, because my system is mysteriously unable to compile -p125 and I haven’t managed to get to the bottom of it yet.

    def test_wait_on_connection
      threads = []

      20.times do |i|
        threads << Thread.new do
          pool.with_connection do |connection|
            sleep 0.01
          end
          sleep 0.5
        end
      end

      threads.each { |t| t.join }
    end

So you see, we have a bunch of threads wanting to check out a connection for a brief period of time — 0.01 seconds. I’d think they should be able to take turns sharing the 5 connections in the pool without ever hitting the timeout.

Sometimes this test succeeds fine, other times it doesn’t — classic hard-to-reproduce race condition. (The more threads I add to the 20 you see above, the more often it fails. Incidentally, raising or lowering the number of connections in the pool does not seem to affect how frequently the failure happens.) When it fails, it looks something like this:

Finished tests in 0.572996s, 1.7452 tests/s, 0.0000 assertions/s.

  1) Error:
ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5 seconds. The max pool size is currently 5; consider increasing it.

The whole test only took 0.57 seconds to run, yet an exception was raised that it timed out waiting 5 seconds for a connection, what?

It is hard to debug multi-threaded race conditions. Using a debugger is right out, and even putting in debugging ‘puts’ statements sometimes changes the timing enough to make it reproduce differently. But as far as I can tell from my hacky puts statements and such, indeed what’s happening is:

  1. Thread1 goes to check out a connection; none is available, so it blocks on ConditionVariable#wait.
  2. Before the timeout is reached, Thread2 checks in a connection, and sends #signal to the ConditionVariable.
  3. What we EXPECT to happen at this point is that Thread1 immediately returns from its `@queue.wait`, and continues on with its processing, inside the critical section so no other thread can be in its own critical section, and completes the checkout of that connection.
  4. What seems to happen instead: yes, Thread1 returns from its `@queue.wait` and continues on with its processing, but by the time it gets to its check that a connection is available, somehow another thread has already checked out that connection. So Thread1 assumes the reason it woke up from @queue.wait is that the timeout was reached, because otherwise, when it wakes up, a connection is supposed to be available.


Huh, that shouldn’t be possible, right?  As far as I can tell from how Monitor and its ConditionVariable are supposed to work….

  • The @queue.signal in #checkin is in a ‘synchronized’ critical section.
  • The @queue.wait in #checkout is in a ‘synchronized’ critical section.
  • Any other thread checking out a connection in between these two points would also have to do so in a synchronized critical section.
  • Only one thread should be in a synchronized critical section at once.
  • When @queue.signal is called, exclusive access to the synchronized critical section should pass to the thread woken up from @queue.wait, with no opportunity for any other thread to enter a critical section in between. The woken thread should have to finish going through its critical section (in the process checking out the newly available connection) before any other thread gets access to a critical section; and with no other thread having access to a critical section, no other thread should be able to check out that connection first.
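As a baseline, the intended wait/signal mechanics can be demonstrated standalone with the stdlib Monitor (variable names here are mine). This sketch uses the predicate-guarded `wait_while` form, which re-checks its condition on every wakeup rather than trusting a single signal:

```ruby
require 'monitor'

lock  = Monitor.new
cond  = lock.new_cond
ready = false
woke  = false

waiter = Thread.new do
  lock.synchronize do
    # wait releases the monitor while blocked; wait_while additionally
    # re-checks its predicate every time it wakes, so it is safe even if
    # the wakeup and the state change do not arrive as an atomic pair.
    cond.wait_while { !ready }
    woke = true
  end
end

sleep 0.05 # give the waiter a chance to block first (illustration only)
lock.synchronize do
  ready = true
  cond.signal # the waiter must re-acquire the monitor before wait returns
end
waiter.join
```

Either ordering works here: if the waiter hasn’t blocked yet when the signal fires, `wait_while` sees `ready` already true and never waits at all.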

Why isn’t it working that way?

  • MRI’s Monitor/ConditionVariable is busted and has a race condition? (Note well: I think I have been able to reproduce this on JRuby 1.6.7 too, but it takes a lot more threads to do so; 100 will do it. The tests finish in 1.1 seconds, with an exception raised about ‘could not obtain database connection in 5 seconds’, so it looks like the same thing is happening. Keep in mind it’s hard to reproduce this apparent race condition on demand in any ruby.)
  • I’m not understanding how Monitor/ConditionVariable is supposed to work?  But if I’m not understanding how it’s supposed to work, apparently ActiveRecord::ConnectionPool isn’t either, since it’s not behaving as it’s supposed to under concurrency.
    • ConnectionPool has some bugs/race condition in it with regard to this kind of concurrency?
    • ConnectionPool’s design is fundamentally flawed, you can’t use Monitor/ConditionVariable to do what it’s trying to do?
    • Its basic design is okay, but it needs more `synchronized` protection around more things; somehow it’s letting something in that’s messing things up? (I can’t figure out what.)
  • Something else entirely I can’t think of?

I have no idea. Anyone else?

Main conclusion: oh my, multi-threaded concurrency is hard; you probably don’t want to do it. (Although I still can’t find an alternate solution that’s satisfactory for my particular use patterns, but that’s another topic.)

Postscript: Things that don’t seem to fix the problem

Wrapping nearly every method in ConnectionPool in a `synchronize do` doesn’t seem to help; the race condition can still be triggered.

Changing the `@queue.wait(@timeout)` to just `@queue.wait` doesn’t help! If the wait really were timing out, removing the timeout would make it wait a really long time, or even forever. But with the timeout removed, the test still completes just as quickly as usual, and still fails with a “could not obtain a database connection within 5 seconds” exception when the race condition exhibits itself. The @queue.wait is not timing out; it’s actually getting a signal, but finding no connection available after the signal. Huh?
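For what it’s worth, the classic condition-variable idiom treats a wakeup as a hint rather than a guarantee: re-check the predicate in a loop, and measure the timeout against an absolute deadline, so a stolen signal registers as ‘keep waiting’ rather than ‘timed out’. Here’s a sketch of a loop-guarded checkout under that assumption (the `GuardedPool` name is hypothetical; this is not a claimed fix for ActiveRecord):

```ruby
require 'monitor'

# Hypothetical loop-guarded pool: a wakeup from wait is treated only as a
# hint, the "is a connection available?" predicate is re-checked every time,
# and the timeout is measured against an absolute deadline.
class GuardedPool
  class TimeoutError < StandardError; end

  def initialize(conns, timeout = 5)
    @available = conns.dup
    @timeout   = timeout
    @lock      = Monitor.new
    @cond      = @lock.new_cond
  end

  def checkout
    deadline = Time.now + @timeout
    @lock.synchronize do
      while @available.empty?
        remaining = deadline - Time.now
        # Only report a timeout when the deadline has really passed, not
        # merely because another thread grabbed the connection first.
        raise TimeoutError, "no connection within #{@timeout}s" if remaining <= 0
        @cond.wait(remaining)
      end
      @available.pop
    end
  end

  def checkin(conn)
    @lock.synchronize do
      @available.push(conn)
      @cond.signal
    end
  end
end
```

With this shape, a thread that wakes up to find the connection already gone simply goes back to waiting for whatever time remains, instead of concluding it timed out.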
