Rails threading nightmare

So, concurrent programming can certainly be tricky no matter what. But if you keep it simple, and keep inter-task communication and shared data to a minimum, it should be do-able.

I thought I had figured out a way to put some concurrency into a Rails app, using Ruby’s built-in green threads, but lately I’ve made some observations that lead me to believe I haven’t figured it out at all, and that it’s not working. What follows are my lengthy findings and ideas.

Background

It used to be the conventional wisdom that “you can’t do threads in Rails”, but with Rails 2.2 that’s changed somewhat–with some people claiming that conventional wisdom was always wrong, you just had to be careful before,  but now it’s even more robustly supported in Rails. Theoretically. Thing is, the focus and excitement around ‘concurrency’ in Rails 2.2 is about concurrent request handling.

I have no need of concurrent request handling; I’m happy to have requests handled one at a time in a queue. But I still need concurrency, both within a request (do some things in parallel before returning a result) and to execute some things in the background (after the HTTP response is returned to the browser), hopefully concurrent with the request queue continuing to be serviced.

People don’t seem to talk about this much. Here’s an attempt to describe my domain.

Problem Statement

My app (Umlaut) talks to lots and lots of external web services. (With “web service” understood broadly — it could be a nice REST service, it could essentially be a ‘screen-scrape’ of a web site intended for human interaction. But it’s interaction with external HTTP servers).

Tasks

Consider a ‘task’ in my application to be something that:  uses ActiveRecord to get some data describing what it must do;  talks to an external web service (perhaps with more than one HTTP request), using either Net::HTTP or open-uri; processes the results (probably using Hpricot); and then writes some results to the database using ActiveRecord again.

Waves

‘Tasks’ are organized in ‘waves’.  Within a wave, tasks need to be run concurrently. It doesn’t make sense to send a request to (eg) the Amazon API, wait for it to return a result, process that result, and only then send a request to (eg) the Google API.  Especially when some of the APIs involved are really slow. It makes sense to send out these requests concurrently, in parallel.

However, for business logic purposes, we don’t want to do all of our tasks concurrently. Thus, the ‘waves’.  All the tasks in a given wave are done concurrently, and then when all of these tasks have completed, on to the next wave.

Foreground vs. Background

Some of these waves are ‘foreground’ waves. That means that before the Rails action returns a response to the browser, all of these foreground waves must complete. We need the things they calculate and write to the db with ActiveRecord in order to return a response.

Others of these waves are ‘background’ waves, and the response should be returned even before they’ve completed. The response should be returned, and then the background waves will go about their merry business doing stuff and writing to the db with ActiveRecord, and the browser will occasionally check back with AJAX or another technique, and check the db with AR to get the ‘new’ stuff created by the background waves.

Little to no inter-task communication

There needs to be little to no inter-task communication or shared data. When a task is started, it can get an ActiveRecord id, and look up its own copy of the particular model object itself. Then it writes what it’s got (including its ‘completed’ status as a task) back to the db with AR. There doesn’t need to be any shared data, and there doesn’t need to be any inter-process communication (‘process’ used broadly to mean a concurrent task, whatever the implementation).

So that sounds fairly simple as concurrent programming goes, should be do-able, right? Well, not so much.

Initial Approach, Ruby threads

So my initial approach was simply to use ruby Threads.  When executing a ‘wave’, start a Thread for each ‘task’, then use #join to wait on them, so you know the ‘wave’ is over and the next ‘wave’ can be started.
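Here is a minimal sketch of that wave loop in plain Ruby, with no Rails involved. The `run_waves` name and the lambdas standing in for tasks are made up for illustration; real tasks would do the ActiveRecord and HTTP work described above.

```ruby
# Hypothetical helper: `waves` is an array of waves, and each wave is an
# array of callables (the 'tasks').
def run_waves(waves)
  waves.map do |tasks|
    # Start one Thread per task in this wave.
    threads = tasks.map { |task| Thread.new { task.call } }
    # Wait for every task in the wave before moving on to the next wave.
    threads.each(&:join)
    # Thread#value returns whatever the thread's block returned.
    threads.map(&:value)
  end
end

waves = [
  [-> { sleep 0.1; :amazon }, -> { sleep 0.1; :google }], # run side by side
  [-> { :cleanup }]                                        # runs only afterwards
]
p run_waves(waves)
```

The two sleeping tasks in the first wave overlap, so the wave should take roughly as long as its slowest task rather than the sum of all of them.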

Wait for all foreground waves to complete before returning the initial response from the Rails action.

Then, right before actually returning the response (the end of the Rails controller action method), start up a Thread to manage the background waves. So the idea goes, this thread shouldn’t interfere with the response being returned: the response can be returned while this background wave controller thread does its thing, firing off each of the background ‘waves’. Meanwhile, while the bg thread is doing its thing, other Rails requests should be able to be processed (even without concurrent request handling).
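Roughly, the shape of that idea looks like the sketch below. This is not the actual Umlaut code: `handle_request` is a hypothetical stand-in for a controller action, and a core `Queue` stands in for the database that the background tasks would really write to.

```ruby
# Hypothetical stand-in for a controller action. `background_waves` is an
# array of waves, each wave an array of callables; `done` collects results.
def handle_request(background_waves, done)
  response = "rendered page" # placeholder for the foreground work

  # Start the manager thread last and deliberately do not join it here.
  manager = Thread.new do
    background_waves.each do |tasks|
      tasks.map { |task| Thread.new { done << task.call } }.each(&:join)
    end
  end

  [response, manager] # the thread is returned only so the demo can wait on it
end

done = Queue.new # thread-safe queue from Ruby core
response, manager = handle_request([[-> { :a }, -> { :b }], [-> { :c }]], done)
puts response # available without waiting for the background waves
manager.join  # a real action would not do this; the demo just avoids exiting early
puts done.size
```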

I originally wrote about this approach here. Theoretically, it should have even gotten slightly easier in Rails 2. That monkey patching is no longer necessary in Rails 2.x, as the bug was fixed. In Rails 2.2.x, allow_concurrency=true should no longer be necessary; ActiveRecord 2.2 is (in theory, I think?) always prepared to handle concurrency (although ActionController won’t do concurrent request handling unless you call config.threadsafe!, but recall I don’t need concurrent request handling).

And the fact that Rails core team considers some concurrency a valid use in Rails 2.2 (albeit concentrated on concurrent request handling, but our case should be even simpler than that… right?) should make us even more comfortable with this — indeed, the fact that you need to make sure that dynamic class reloading is turned off if you want to do concurrency, that I discovered on my own, is now mentioned explicitly for people who want concurrency in Rails.

But.

Trouble in River City

But.  I’ve noticed some things that don’t make any sense, and give me pause with my whole setup.

Those background tasks in their own thread(s) shouldn’t, I didn’t think, keep the initial Rails response from being returned (let alone keep other requests from being processed). And yet, they do. Kind of.  The response doesn’t wait for all of the background tasks to complete (usually), but it is significantly slowed by the presence of those background tasks.   I don’t really know why, nothing in the background task, so far as I know, should keep the actual Rails request-response event from completing.

When I look at the logs in console in debug mode, sometimes Rails _says_ it’s completed rendering right when it should (by the timestamp on the log line), but then those lines don’t actually show up in the console until many seconds later. And the browser doesn’t actually get the response sometimes until many seconds after that! And this is just in development, with only me using it, not under load!

Certainly many threads executing will slow down my system, as each thread takes a little time slice of its own (there aren’t magic CPUs coming into existence to handle each one), but this isn’t just a slowdown, it’s a block of some kind. But why?

Net::HTTP?

It shouldn’t be Net::HTTP. As far as I can tell, Net::HTTP should be quite happy to run in a concurrent environment, and Net::HTTP work going on in one thread shouldn’t stop another thread from doing its thing (whether or not Net::HTTP is blocked waiting for its HTTP response).

Hpricot?

I can’t think of any reason it would be Hpricot.

ActiveRecord?

ActiveRecord itself? Certainly possible. But ActiveRecord isn’t supposed to do this. If you’re running with allow_concurrency=true (or in ActiveRecord 2.2), each thread should get its own ActiveRecord connection, and one connection shouldn’t need to wait on another; each one should be able to do its thing simultaneously.

Except that I discovered while researching this stuff that the actual Rails mysql adapter does block your whole application: when the mysql adapter is waiting for a response from the db, it monopolizes the thread scheduler until the response comes back, not allowing its thread to be switched out. Okay, fine, that’s unfortunate, but I don’t think it’s my problem. The amount of time the response is delayed is way more than the time of any individual db response. We’re talking 10 seconds or more.

ruby thread scheduler?

Is it the ruby thread scheduler itself? Maybe. Trying desperately to find any info on concurrent thread programming in ruby (there ain’t much out there), I found some references to the fact that the ruby thread scheduler is pretty stupid, and not so good at time slicing threads.

Is the thread scheduler letting these background threads monopolize all the CPU time, and not letting the actual Rails thread get a slice to finish returning the response?  Maybe, I’m not sure how to tell. Putting a bunch of Thread.pass’s into the background threads, as well as playing with setting Thread.current.priority on background threads — does seem to have an effect, lessening the problem, but not entirely getting rid of it.
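For reference, this is roughly what those two tweaks look like in code. It is a sketch, not the real task code: `start_background` and the doubling block are invented for the example, and both `Thread#priority=` and `Thread.pass` are only hints, so how much they help depends on the Ruby implementation and version.

```ruby
# Hypothetical helper: run each item through `work` in a low-priority thread.
def start_background(items, &work)
  Thread.new do
    # Below the default of 0, so the scheduler should favour other threads.
    Thread.current.priority = -1
    items.map do |item|
      result = work.call(item)
      Thread.pass # offer the scheduler a chance to run another thread
      result
    end
  end
end

bg = start_background([1, 2, 3]) { |n| n * 2 }
# ... the request thread would carry on with its own work here ...
p bg.value # Thread#value joins the thread and returns the block's result
```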

If I put some random sleep(0.5) calls in the background thread code, the problem seems to mostly go away. This is not a good sign, random sleeps fixing up your problems is a classic case of out of control concurrency, where threads are competing for resources in a non-predictable way. But these threads shouldn’t really be competing for any resources.  Except a slice of CPU time from the ruby scheduler.  Or maybe that silly monopolistic mysql adapter’s time. Or maybe…

mongrel?

I am fronting the app with mongrel. Mongrel documentation is pretty silent on concurrency issues; looking through the listserv, the original mongrel developer’s response seemed to be “don’t do that, just don’t!”. There also seems to be an assumption that the only reason you’d want to deal with any concurrency at all is for concurrent request handling (which is not what I want). That potentially means mongrel assumes that if you have active_record.allow_concurrency=true, concurrent request handling is what you’re doing (no I’m not!), and possibly does weird things as a result.

Could mongrel somehow be noticing these threads, and refusing to return the original response until they’re done? That actually seems pretty unlikely.

Next Steps

My money is on the ruby thread scheduler being really duncey and not properly letting threads share CPU slices, even when they aren’t competing for resources.

But I’m not actually sure how to verify this in practice. Debugging concurrency is a pain. I guess my next step is to really pare down my app’s logic to make a really simple demonstration case, and see whether the problem still shows up even if I take out all of the ActiveRecord, Net::HTTP and Hpricot code. If it does, that simply provides more evidence that it’s a duncey ruby thread scheduler.
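A pared-down demonstration case might start out something like the sketch below: no ActiveRecord, no Net::HTTP, no Hpricot, just a busy background thread and a timer around some stand-in foreground work. The method names and the amount of work are arbitrary assumptions, and the timings will differ between machines and Ruby versions, so it only gives a rough comparison.

```ruby
# Hypothetical stand-in for "finish rendering and return the response".
def foreground_work
  (1..200_000).reduce(:+)
end

# Time the foreground work, optionally with a busy thread competing for CPU.
def time_foreground(with_background:)
  stop = false
  busy = with_background ? Thread.new { loop { break if stop } } : nil

  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  foreground_work
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  stop = true # lets the busy loop exit
  busy.join if busy
  elapsed
end

alone     = time_foreground(with_background: false)
contended = time_foreground(with_background: true)
puts format("alone: %.4fs, with a busy background thread: %.4fs", alone, contended)
```

If the second number were wildly larger than the first, that would point at the scheduler; if not, the pieces that were taken out become the suspects again.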

Or try to figure out how to actually debug this in such a way that I can observe which threads are getting stopped and started when. But that would still require paring down a demonstration case to figure out why.

Assuming it is the duncey ruby thread scheduler, that means finding a non-thready way of accomplishing this problem case.

There are a BUNCH of methods listed on this Rails wiki page, which demonstrates that no solution has garnered a consensus as actually working and being non-painful to implement. You can’t really trust everything on that wiki page; some of it is just plain wrong (or out of date). It’s just kind of the collective notes of a bunch of people trying to figure this out.

None of those solutions are revolutionary. There are fewer solutions than it looks like, because some of the listed solutions actually use others of the listed solutions. After spending way too much time going through them all and reading all available docs and sometimes source, I think it really breaks down to three main methods.

1) Use threads (see above)

2) Fork new processes at an OS level

3) Offload tasks to some external app, through some kind of inter-process communication.

Spawn

Tom Anderson’s Spawn plugin continues to look like the best (simplest, most flexible, most robust) package to me for forking new OS-level processes.

Advantage of forking with spawn: Pretty much the same semantics I’m using now for threading.

Disadvantages of forking:  Makes debugging harder, can’t easily (or at all?) use ruby-debug anymore, which I’ve grown to love.  (But this is going to be a problem with any non-thread solution to concurrency).  Potential memory hit to having all those forks in memory — people say it’s not as bad as you’d think, but my problem case has a lot of concurrent tasks (forked processes) at once.

BackgroundRB

BackgroundRB looks like the best bet to me out of the “offload concurrent processes to some other external non-Rails Web app process” options, as far as combining ease of use with robustness. And that’s saying something, because I do not think BackgroundRB looks particularly easy to set up/use/debug. But at least it seems to be fairly mature, and its web page no longer warns you that it’s beta software you shouldn’t use.

Some people seem to like Workling/Starling as an ‘easier’ version of the ‘offload to external process’ technique than BackgroundRB. But looking at the docs, I can’t figure out what about it is supposed to be any easier to set up or maintain or develop for. It looks just as complicated, if not more complicated, to me. But maybe I’m missing something? Or maybe it’s people who are already familiar with Starling who like it. At any rate, it seems to be less mature, less widely tested, and less used than BackgroundRB at the moment.

BackgroundRB also is going to make debugging harder, and make me give up ruby-debug, but that seems to be inevitable at this point.

Sigh

So that was a really long post. But I thought it would be welcome, because there is so little information on concurrency in Rails on the net (and most of the tiny amount that is there is about concurrent request handling. I can’t be the only one that needs concurrency for dealing with a large number of requests to external resources, can I? It’s WEB 2.0, man!).

So this is what I’ve painstakingly figured out (mostly figured out what I don’t know, rather than what I do). Any additional clarifications or information are much appreciated.

26 thoughts on “Rails threading nightmare”

  1. Personally, I avoided BackgroundRB. It seems like a nifty utility, but I never really was able to get the hang of it. I found Spawn to be much more intuitive and works great in LF. You do take a memory hit and with Mongrel — occasionally, mongrel wigs out — but moving to mod_rails seems to have solved that issue.

    However, without having looked at Umlaut, I’d bet that you are seeing two things:
    1) the thread scheduler. Unless it has changed, the thread.join function allows threads to complete — but stops processing till all threads are joined.
    2) ActiveRecord: even with concurrency set to allow threading, my experience is that in practice, ActiveRecord is a blocking object.

    –TR

  2. Terry, I’m not sure what you mean by “thread.join allows threads to complete, but stops processing till all threads are joined” — stops processing where? It obviously doesn’t stop processing of those threads, or they’d never join. I mean, the whole point of join, seems to me, is that the thread that calls it should stop, give up the CPU, allowing any other threads to do their thing, until the join returns.

    But I could try an alternative to ‘join’ — like writing my own loop that sleeps for a couple hundred ms and then checks the thread status, before going to sleep again. That _ought_ not to make any difference, but who knows.
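A sketch of what that polling alternative could look like follows. The `wait_by_polling` name and the 0.2 second default are assumptions for illustration. One real difference from `join`: `join` re-raises an exception that killed the thread, while this loop only notices that the thread is no longer alive.

```ruby
# Hypothetical replacement for threads.each(&:join): sleep, check, repeat.
def wait_by_polling(threads, interval = 0.2)
  # Thread#alive? is false once a thread has finished or died.
  sleep(interval) while threads.any?(&:alive?)
  threads
end

workers = [0.1, 0.3].map { |secs| Thread.new { sleep secs; secs } }
wait_by_polling(workers)
p workers.map(&:value) # safe to read now, every thread has finished
```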

    But yeah, I suspect there’s something funky going on with AR.

    So you have spawn working even with mod_rails? That’s one thing I was wondering about.

    I was also wondering what your experience is with spawn, and how _many_ spawns you do in a given process. One per each resource searched? Which could be how many? In Umlaut, a given openurl request could result in a dozen or more different ‘tasks’ (ie threads, or spawned processes), so I’m worried about the memory/performance (not to mention debugging) implications of sending off a dozen or more spawns per request.

    Very curious as to your experience with that stuff. I actually sent you an email hoping to talk to you about it?

  3. PS: And my mongrels are already occasionally wigging out — specifically they sometimes die for no reason, leaving no log. Curious what kind of ‘wig outs’ you were seeing. But I’m not too happy with mongrel in general, and like the idea of switching to mod_rails.

  4. My experience is that the green threads are simply children off the main. So, when you join, the main thread will stop until all children complete. At least, that’s been my experience.

    –TR

  5. I’m not sure why mongrel occasionally goes nuts. In my case, it manifests itself by essentially consuming all system memory. I’ve had alerts where our server essentially is panicking because one mongrel (which should only consume ~40 MB) is consuming 2 GB. Since switching to mod_rails, I’ve seen none of that.

    –TR

  6. Well, yeah, the thread that called the “join” is _supposed_ to stop until the threads it’s waiting on with ‘join’ come back. I counted on that, that’s not surprising, or a problem.

    But here’s an interesting statement:

    “In particular, initiating network connections and/or a broken or slow DNS server will typically block the whole Ruby process while the call completes.”

    http://ph7spot.com/articles/system_timer

    I’m not sure what it means by ‘initiating’ network connections specifically, but this sounds like it _could_ be disastrous.

    I am rather unpleased with ruby right now.

    So, Terry, you’re having no problems with the spawn and mod_rails combination? Any thoughts on my worries about a dozen spawns per request being a problem? Is that similar to the situation you have with LF?

  7. >>So, Terry, you’re having no problems
    >>with the spawn and mod_rails
    >>combination? Any thoughts on
    >>my worries about a dozen spawns
    >>per request being a problem?

    No. At present, it seems as though mod_rails does a good job managing memory, and for libraryfind I have few long-running threads, so things finish quickly (in addition to doing some throttling through the use of a jobs queue).

    —TR

  8. Interesting, I had thought you’d have one async task (whether thread or forked process) for each external resource being searched by LF, and that each LF search request would search many external resources, leading to many async tasks.

    Which part of my assumption were wrong? That you have one async task per external resource searched, or that an LF search searches many external resources? Or ‘other’?

    Thanks for the feedback, it’s very helpful to compare with another project facing similar issues.

  9. Each search initiates a new fork; however, async processes are throttled through a background job queue that helps to manage how many threads are being forked at any given time. It worked well with Mongrel (it has worked this way for about a year), but is working much better with mod_rails (and I’m curious to see if I see even better performance with enterprise ruby to take advantage of some of the enhancements created just for mod_rails).

    –TR

  10. How many async tasks does your throttling allow at once?

    I’d think that throttling like that would end up leading to a very slow experience for the user.

    But I also just realized that I can probably get away with _one_ process fork per request. First fork for the ‘background manager’, then the background manager (in its own process) can still use Threads to manage its ‘waves’, confident that that bg stuff can’t interfere with any other foreground processes or requests.

    If there’s a high rate of requests, that could still lead to a lot of forked processes existing at once (I would NOT want to pool or throttle them), but not as ridiculously many as if I actually gave each task its own forked process. Hmm.
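A rough sketch of the one-fork-per-request idea, using plain `fork` rather than the Spawn plugin. `fork_background_manager` is a made-up name, `fork` is not available on every platform, and in a real Rails app the child would also need its own database connection, since a forked child inherits the parent’s open sockets.

```ruby
# Hypothetical helper: fork once, then run the waves with threads in the child.
def fork_background_manager(waves)
  pid = fork do
    waves.each do |tasks|
      tasks.map { |task| Thread.new { task.call } }.each(&:join)
    end
    exit!(0) # leave the child at once, skipping the parent's at_exit hooks
  end
  # Process.detach returns a thread that reaps the child when it exits,
  # so the parent never has to block on it.
  Process.detach(pid)
end

waiter = fork_background_manager([[-> { sleep 0.1 }, -> { sleep 0.1 }], [-> { sleep 0.1 }]])
# The parent is free to return its response here.
puts "child exited cleanly: #{waiter.value.success?}" # the demo waits just to show the result
```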

  11. ~10. In practice, most forked processes in LF end in 0.2 seconds (save for true federated processes), so the throttle rarely is invoked as processes end before others ever initiate.

    –TR

  12. Wait, how can you get a response from an external search query in 0.2 seconds?

    Terry, you sure you don’t have any time to have a 20 minute phone conversation with me on this?

    You going to be at code4lib? I might try to kidnap you there.

  13. Well, LF harvests more materials than it queries remotely, and since each collection is searched individually (for caching purposes), this is where you get the 0.2 response time (items being retrieved from solr/lucene). However, of the items that are federated (8-10 at a time), many of these resources have relatively fast response times as well. The ILS, anything over http (by and large): these all have response times (in general) of under a second. So, long-running processes tend to be reserved primarily for many of the journal aggregators, some of which I’m correcting by moving away from Z39.50 to XML gateway API when possible.

    Anyway, I’ve had some time free up on Friday. I’ll contact you offline.

    And yes, I will be at C4L.

  14. Spawn looks like the easiest option for me too. I was getting trouble with mysql and the background plugin.

    Very nice post, congrats.

  15. Ah yes, I mean to write a follow up to this, but lost the energy.

    So I’m still using plain old ordinary ruby green threads.

    The problem seemed to be that ruby threads would block each other on occasions where I did not expect them to.

    For instance, I have my Rails action fire off a Thread, and just leave it going, with no #join on it. The intent was that the Rails action should complete and return a response to the browser, while the Thread continues to operate. However, that thread in the background seemed to somehow be monopolizing the CPU; it wasn’t actually giving the ‘main’ Rails thread a chance to finish up and return the response. The main Rails thread didn’t necessarily wait until the ‘background’ thread was done to return, but was significantly slowed.

    I’m really not sure why; first guess was it was doing something somewhere that didn’t allow the ruby thread scheduler to schedule it out, that blocked on the OS-level, as ruby threads can do. But that doesn’t really explain my solution.

    Which was simply to explicitly set the thread priority of these ‘background’ threads to something less than the 0 default. thread.priority = -1. Meanwhile the main Rails thread was still presumably at its 0 default. This seemed to result in the main Rails thread getting enough CPU time to finish up and return the response to the browser in a timely manner, about the same time it would take if that longer-running thread wasn’t being started.

    Can’t completely explain it, but it worked.

    I also sprinkled some Thread.pass statements into my ‘background’ threads, just in case — I’m not exactly sure what this does, it might end up making my ‘background’ threads take longer to finish than they should — but I really wanted to prioritize making sure the foreground actual Rails threads got the CPU they needed, with these ‘background’ threads just taking what’s left.

    It seems to be working out.

  16. You should give JRuby a try. Native threads, faster execution speed, etc.

    Also, I’m working on a more robust Producer/Consumer framework called CommonThread(http://github.com/awksedgreep/commonthread). Just toss your jobs in a queue and work them in a separate process. It’s activerecord aware so you can just symlink the models from your rails app.

  17. I am new to rails and landed up on this article. I am designing a system which exactly has the same problem – i.e. calling a bunch of services in parallel. It was good to know your experience, It does look like ruby and rails are not good at this…

    I do have some experience in webapps with long running tasks. I think your background thread solution is a bad idea. Whenever you need tasks that need to complete in the background (usually because they are slow), You should separate the task out from the process which is serving requests. (which is designed to be stateless and fast)

    This adds reliability to both your foreground request serving and backend tasks at the cost of complexity. Depending on the reliability you need, You can choose from a backgroundrb to a kestrel like message queue. The solution you have chosen is the simplest (and so most unreliable – What if your process is killed, What if other requests to the same process slow down because of the background wave etc)

    In general all web request/replies are meant to be fast. Slow replies (ie rails actions) are a bad idea because users do not know what is taking so long … they might assume it’s a network connectivity glitch and hit stop/reload etc. Thus when you are going to do a slow job, it is better to have a spinner / progress bar style interface … basically making the job async. The only disadvantages of doing this are again 1) complexity and 2) you need to use the db to store state.

    You are doing 2 anyway :)

    Thanks for the info again. I (never thought I would say this ) miss Java where I could just give this request to a thread pool .. Rails/Ruby concurrency needs some work

  18. Yeah, the thing with offloading it to a separate process is that I need very low latency in queueing the jobs and getting the jobs back. The whole point of the concurrency here is to get my response(s) back to the browser as quickly as possible. The concurrent processes are only going to take a few seconds each at most. I can’t have the addition of a separate process add more than a couple hundred ms latency to the total interaction, at most.

    Additionally, for those processes that I need to run concurrently, but want to wait for completion before even returning the initial browser request — I couldn’t figure out any system that would provide this, very low latency, and the ability to wait on completion. Since then, I see that _maybe_ a Redis-based queue will do it, but it’s not entirely clear.

    So basically, the added complexity swamped me, I couldn’t figure out a way to do it that would actually do what I needed, and not be a nightmare. It was beyond my capacities, I guess.

    But if you’ve got one, please do share your code!

    My current solution seems to be working reasonably well, after the changes mentioned in comment 16.

  19. I hit this same thing and isolated it to ActiveRecord. My multithreaded code works fine so long as I don’t touch AR in the background threads. As soon as I do I start getting random multi-second delays. Incidentally, it’s not even just blocking on join; even if I comment out the join altogether (throwing away the result) I still get the same delay. I’d put my money on AR having problems unlocking things after concurrent access.

  20. Low enough latency to put three or four things in the queue, and get a response back, _before_ returning the http response to the user? Do you do that, James?

    But certainly that sort of external system is one solution. I don’t think ‘case is closed’ though, there are trade-offs to it (added complexity of install and deploy for this app that other people download and use themselves, it is a shared open source app), and Rails is _supposed_ to be fine with threads. I was thinking that when I get around to migrating to Rails3, it would be — anyone have experience with this kind of threading in Rails3?

    However, I’d say again, that my Rails2 application using this approach to threading actually IS working in production at the moment, after I took care to set the Thread.priority less than 0 on all threads that weren’t the request-response thread. It’s good enough to work. And this is using the old MySQL adapter, I’d expect it would perform even better with the new thread-friendly MySQL2 adapter.

  21. We are using activemq in a production environment and it’s certainly fast enough for being used within an http request. It was designed to support high-volume stock trading on exchange floors, several thousand transactions a second, so it should be good enough to be used here.
