So, concurrent programming can certainly be tricky no matter what. But if you keep it simple, and keep inter-task communication and shared data to a minimum, it should be do-able.
I thought I had figured out a way to put some concurrency into a Rails app, using ruby’s built in green threads, but lately I’ve had some observations leading me to believe I haven’t at all, and it’s not working. What follows are my lengthy findings and ideas.
It used to be the conventional wisdom that “you can’t do threads in Rails”, but with Rails 2.2 that’s changed somewhat–with some people claiming that conventional wisdom was always wrong, you just had to be careful before, but now it’s even more robustly supported in Rails. Theoretically. Thing is, the focus and excitement around ‘concurrency’ in Rails 2.2 is about concurrent request handling.
I have no need of concurrent request handling, I’m happy to have requests handled one at a time in a queue. But I still need concurrency — both within a request process (do some things in parallel before returning a result), and to execute some things in the background (after the HTTP response is returned to the browser), hopefully concurrent with the request queue continuing to be serviced.
People don’t seem to talk about this much. Here’s an attempt to describe my domain.
My app (Umlaut) talks to lots and lots of external web services. (With “web service” understood broadly — it could be a nice REST service, it could essentially be a ‘screen-scrape’ of a web site intended for human interaction. But it’s interaction with external HTTP servers).
Consider a ‘task’ in my application to be something that: uses ActiveRecord to get some data describing what it must do; talks to an external web service (perhaps with more than one HTTP request), using either Net::HTTP or open-uri; processes the results (probably using Hpricot); and then writes some results to the database using ActiveRecord again.
‘Tasks’ are organized in ‘waves’. Within a wave, tasks need to be run concurrently. It doesn’t make sense to send a request to (eg) the Amazon API, wait for it to return a result, process that result, and only then send a request to (eg) the Google API. Especially when some of the APIs involved are really slow. It makes sense to send out these requests concurrently, in parallel.
However, for business logic purposes, we don’t want to do all of our tasks concurrently. Thus, the ‘waves’. All the tasks in a given wave are done concurrently, and then when all of these tasks have completed, on to the next wave.
Foreground vs. Background
Some of these waves are ‘foreground’ waves. That means that before the Rails action returns a response to the browser, all of these foreground waves must complete. We need the things they calculate and write to the db with ActiveRecord in order to return a response.
Others of these waves are ‘background’ waves, and the response should be returned even before they’ve completed. The response should be returned, and then the background waves will go about their merry business doing stuff and writing to the db with ActiveRecord, and the browser will occasionally check back with AJAX or another technique, and check the db with AR to get the ‘new’ stuff created by the background waves.
Little to no inter-task communication
There needs to be little to no inter-task communication or shared data. When a task is started, it can get an ActiveRecord id, and look up its own copy of the particular model object itself. Then it writes what it’s got (including its ‘completed’ status as a task) back to the db with AR. There doesn’t need to be any shared data, and there doesn’t need to be any inter-process communication (‘process’ used broadly to mean a concurrent task, whatever the implementation).
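A minimal sketch of that no-shared-data discipline (names are mine, not Umlaut’s): each ‘task’ receives only an id, makes its own copy of the record, and writes results (including completion status) back. A plain Hash plus a Mutex stands in for the database here.

```ruby
# Stand-in for the database: id => record. The Mutex plays the role the
# per-thread ActiveRecord connection would play in the real app.
DB = { 1 => { status: :pending }, 2 => { status: :pending } }
DB_LOCK = Mutex.new

def run_task(id)
  record = DB_LOCK.synchronize { DB[id].dup } # task's own copy, no sharing
  record[:value]  = id * 10                   # stand-in for real work
  record[:status] = :completed
  DB_LOCK.synchronize { DB[id] = record }     # write results back
end

threads = DB.keys.map { |id| Thread.new { run_task(id) } }
threads.each(&:join)
```

The point is that the only thing crossing the task boundary is an id; everything else round-trips through the (stand-in) database.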
So that sounds fairly simple as concurrent programming goes, should be do-able, right? Well, not so much.
Initial Approach, Ruby threads
So my initial approach was simply to use ruby Threads. When executing a ‘wave’, start a Thread for each ‘task’, then use #join to wait on them, so you know the ‘wave’ is over and the next ‘wave’ can be started.
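That thread-per-task, join-per-wave scheme might look like this (a minimal sketch; the method and task names are hypothetical):

```ruby
# Run one 'wave': a thread per task, then join them all, so the wave
# is known to be over before the next wave starts.
def run_wave(tasks)
  threads = tasks.map { |task| Thread.new { task.call } }
  threads.each(&:join) # the wave is over only when every task is done
end

results = Queue.new # thread-safe collector, just for the demo
run_wave([
  lambda { sleep 0.05; results << :amazon_api }, # slow external call
  lambda { sleep 0.05; results << :google_api }  # runs concurrently
])
```

With green threads the sleeps overlap, so the wave takes roughly as long as the slowest task rather than the sum of all of them.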
Wait for all foreground waves to complete before returning the initial response from the Rails action.
Then, right before actually returning the response (the end of the Rails controller action method), start up a Thread to manage the background waves. So the idea goes, this thread shouldn’t interfere with the response being returned; the response can be returned and meanwhile this background wave controller thread can do its thing, firing off each of the background ‘waves’. Meanwhile, while the bg thread is doing its thing, other Rails requests should be able to be processed (even without concurrent request handling).
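The background-wave manager idea, sketched with hypothetical names: one manager thread fired off just before the action returns, running each wave serially, with each wave’s tasks concurrent.

```ruby
# Start one manager thread for all background waves; the controller
# action would return immediately after calling this.
def start_background_waves(waves)
  Thread.new do
    waves.each do |wave|
      threads = wave.map { |task| Thread.new { task.call } }
      threads.each(&:join) # finish this wave before starting the next
    end
  end
end

log = Queue.new
manager = start_background_waves([
  [lambda { log << :wave_1_done }],
  [lambda { log << :wave_2_done }]
])
manager.join # demo only; a real controller action would NOT join here
```

In the real app the action never joins the manager thread; that’s the whole point, and also (as described below) where things go wrong.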
I originally wrote about this approach here. Theoretically, it should have even gotten slightly easier in Rails 2. That monkey patching is no longer necessary in Rails 2.x, as the bug was fixed. In Rails 2.2.x, allow_concurrency=true should no longer be necessary; ActiveRecord 2.2 is (in theory, I think?) always prepared to handle concurrency (although ActionController won’t do concurrent request handling unless you config.threadsafe! — but recall I don’t need concurrent request handling).
And the fact that the Rails core team considers some concurrency a valid use in Rails 2.2 (albeit concentrated on concurrent request handling, but our case should be even simpler than that… right?) should make us even more comfortable with this. Indeed, the fact that you need to make sure dynamic class reloading is turned off if you want to do concurrency, which I had discovered on my own, is now mentioned explicitly for people who want concurrency in Rails.
Trouble in River City
But. I’ve noticed some things that don’t make any sense, and give me pause with my whole setup.
Those background tasks in their own thread(s) shouldn’t, I thought, keep the initial Rails response from being returned (let alone keep other requests from being processed). And yet, they do. Kind of. The response doesn’t wait for all of the background tasks to complete (usually), but it is significantly slowed by the presence of those background tasks. I don’t really know why; nothing in the background tasks, so far as I know, should keep the actual Rails request-response cycle from completing.
When I look at the logs in console in debug mode, sometimes Rails _says_ it’s completed rendering right when it should (by the timestamp on the log line) — but then those lines don’t actually show up in the console until many seconds later. And the browser doesn’t actually get the response sometimes until many seconds after that! And this is just in development, with only me using it, not under load!
Certainly many threads executing will slow down my system, as each thread gets only a little time slice of its own (there aren’t magic CPUs coming into existence to handle each one), but this isn’t just a slowdown, it’s a block of some kind. But why?
It shouldn’t be Net::HTTP. As far as I can tell, Net::HTTP should be quite happy to run in a concurrent environment, and Net::HTTP requests going on in one thread shouldn’t stop another thread from doing its thing (whether or not Net::HTTP is blocked waiting for its HTTP response).
I can’t think of any reason it would be Hpricot.
ActiveRecord itself? Certainly possible. But ActiveRecord isn’t supposed to do this — if you’re running with allow_concurrency=true (or on ActiveRecord 2.2), each thread should get its own ActiveRecord connection, and one connection shouldn’t need to wait on another; each one should be able to do its thing simultaneously.
Except that, researching this stuff, I discovered that the actual Rails mysql adapter does block out the rest of your application: when the mysql adapter is waiting for a response from the db, it monopolizes the thread scheduler until the response comes back, not allowing its thread to be switched out. Okay, fine, that’s unfortunate, but I don’t think it’s my problem. The amount of time the response is delayed is way more than the time of any individual db response. We’re talking 10 seconds or more.
ruby thread scheduler?
Is it the ruby thread scheduler itself? Maybe. Trying desperately to find any info on concurrent thread programming in ruby (there ain’t much out there), I found some references to the fact that the ruby thread scheduler is pretty stupid, and not so good at time slicing threads.
Is the thread scheduler letting these background threads monopolize all the CPU time, and not letting the actual Rails thread get a slice to finish returning the response? Maybe; I’m not sure how to tell. Putting a bunch of Thread.pass calls into the background threads, as well as playing with setting Thread.current.priority on background threads, does seem to have an effect, lessening the problem, but not entirely getting rid of it.
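The mitigations I tried look roughly like this: drop the background thread’s priority and explicitly yield between chunks of work, hoping the scheduler switches back to the foreground thread.

```ruby
# A background thread that tries to be a good citizen under ruby's
# green-thread scheduler.
background = Thread.new do
  Thread.current.priority = -1 # hint that foreground work matters more
  100.times do
    Math.sqrt(rand)            # stand-in for a chunk of background work
    Thread.pass                # explicitly offer up the time slice
  end
  :done
end
background.join
```

Both Thread#priority= and Thread.pass are only hints to the scheduler, which is presumably why this lessened the problem without eliminating it.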
If I put some random sleep(0.5) calls in the background thread code, the problem seems to mostly go away. This is not a good sign, random sleeps fixing up your problems is a classic case of out of control concurrency, where threads are competing for resources in a non-predictable way. But these threads shouldn’t really be competing for any resources. Except a slice of CPU time from the ruby scheduler. Or maybe that silly monopolistic mysql adapter’s time. Or maybe…
I am fronting the app with mongrel. Mongrel documentation is pretty silent on concurrency issues — looking through the listserv, the original mongrel developer’s response seemed to be “don’t do that, just don’t!”. The docs also assume that the only reason you’d want to deal with any concurrency at all is concurrent request handling (which is not what I want) — which potentially means mongrel itself assumes that if you have active_record.allow_concurrency=true, that is what you’re doing (no I’m not!), and possibly does weird things as a result.
Could mongrel somehow be noticing these threads, and refusing to return the original response until they’re done? That actually seems pretty unlikely.
My money is on the ruby thread scheduler being really duncey and not properly letting threads share CPU slices, even when they aren’t competing for resources.
But I’m not actually sure how to verify this in practice. Debugging concurrency is a pain. I guess my next step is to really pare down my app’s logic into a really simple demonstration case, and see whether, even if I take all of the ActiveRecord and even Net::HTTP and Hpricot code out, it still demonstrates the problem. If it does, that simply provides more evidence that it’s a duncey ruby thread scheduler.
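The pared-down case I have in mind would be something like this: no ActiveRecord, no Net::HTTP, no Hpricot, just CPU-busy background threads, timing how long a trivial ‘response’ takes while they run.

```ruby
require 'benchmark'

# Busy background threads, standing in for the background waves.
background = 5.times.map do
  Thread.new { 100_000.times { Math.sqrt(rand) } } # pure CPU busywork
end

# Time the 'response' while the background threads are running.
elapsed = Benchmark.realtime do
  1_000.times { Math.sqrt(rand) } # stand-in for rendering the response
end

background.each(&:join)
```

If `elapsed` balloons far beyond what the tiny ‘response’ loop should take on its own, that points at the scheduler rather than at ActiveRecord, mysql, or mongrel.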
Or try to figure out how to actually debug this in such a way that I can observe which threads are getting stopped and started when. But that would still require paring down a demonstration case to figure out why.
Assuming it is the duncey ruby thread scheduler, that means finding a non-thread way of handling this problem case.
There are a BUNCH of methods listed on this Rails wiki page, demonstrating that no solution has garnered a consensus of actually working and being non-painful to implement. You can’t really trust everything on that wiki page; some of it is just plain wrong (or out of date). It’s just kind of the collective notes of a bunch of people trying to figure this out.
None of those solutions are revolutionary. There are fewer solutions than it looks like, because some of the listed solutions actually use others of the listed solutions. Having spent way too much time going through them all and reading all the available docs (and sometimes source), I find it really breaks down to three main methods.
1) Use threads (see above)
2) Fork new processes at an OS level
3) Offload tasks to some external app, through some kind of inter-process communication.
Tom Anderson’s Spawn plugin continues to look like the best (simplest, most flexible, most robust) package to me for forking new OS-level processes.
Advantage of forking with spawn: Pretty much the same semantics I’m using now for threading.
Disadvantages of forking: Makes debugging harder, can’t easily (or at all?) use ruby-debug anymore, which I’ve grown to love. (But this is going to be a problem with any non-thread solution to concurrency). Potential memory hit to having all those forks in memory — people say it’s not as bad as you’d think, but my problem case has a lot of concurrent tasks (forked processes) at once.
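The fork-based equivalent of the thread approach is roughly the semantics the spawn plugin wraps, shown here with plain Process.fork rather than the plugin itself:

```ruby
# One forked child per task, then wait on them all -- the fork
# equivalent of Thread.new plus #join.
pids = 2.times.map do
  Process.fork do
    # The child would do its task and write results to the db, since a
    # forked process can't share in-memory state with its parent.
    exit!(0) # exit! skips at_exit handlers inherited from the parent
  end
end
pids.each { |pid| Process.waitpid(pid) } # the 'join' equivalent
```

Each child gets a copy-on-write copy of the parent’s memory, which is why the memory hit is usually smaller than a naive accounting would suggest, but still real with many concurrent forks.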
BackgroundRB looks like the best bet to me out of the “offload concurrent processes to some other external non-Rails web app process” options, as far as combining ease of use with robustness. And that’s saying something, because I do not think BackgroundRB looks particularly easy to set up/use/debug. But at least it seems to be fairly mature, and its web page no longer warns you that it’s beta software you shouldn’t use.
Some people seem to like Workling/Starling as an ‘easier’ version of the ‘offload to external process’ technique than BackgroundRB. But looking at the docs, I can’t figure out what about it is supposed to be any easier to set up or maintain or develop for. It looks just as complicated, if not more complicated, to me. But maybe I’m missing something? Or maybe it’s people who are already familiar with Starling who like it. At any rate, it seems to be less mature, less widely tested, and less used than BackgroundRB at the moment.
BackgroundRB also is going to make debugging harder, and make me give up ruby-debug, but that seems to be inevitable at this point.
So that was a really long post. But I thought it would be welcome, because there is so little information on concurrency in Rails on the net (and most of the tiny amount that is there is about concurrent request handling — I can’t be the only one that needs concurrency for dealing with a large number of requests to external resources, can I? It’s WEB 2.0, man!).
So this is what I’ve painstakingly figured out (mostly figured out what I don’t know, rather than what I do). Any additional clarifications or information are much appreciated.