Struggling Towards Reliable Capybara Javascript Testing

You may have reached this blog post because you’re having terribly frustrating problems with Capybara Javascript-driver feature tests that are unreliable in intermittent and hard to reproduce ways.

You may think this is a rare problem, since Capybara/JS is such a popular tool suite, and you didn’t find too much on it on Google, and what you did find mostly suggested you’d only have problems like this if you were “doing it wrong”, so maybe you just weren’t an experienced enough coder to figure it out.

After researching these problems, my belief is that intermittent test failures are actually fairly endemic to those using Capybara JS drivers, at least/especially with a complex JS front-end environment (Angular, React/Flux, Ember, etc).  It’s not just you, I believe, these problems plague many experienced developers.

This blog post summarizes what I learned trying to make my own JS feature tests reliable — I think there is no magic bullet, but to begin with you can understand the basic architecture and nature of race condition problems in this context; there are a bucket of configuration issues you can double-check to reduce your chances of problems somewhat; turning off rspec random ordering may be surprisingly effective at decreasing intermittent failures; but ultimately, building reliable JS feature tests with Capybara is a huge challenge.

My situation: I had no previous experience with new-generation front-end JS frameworks. Relatedly, I had previously avoided Capybara JS features, being scared of them (in retrospect my intuition was somewhat justified), and mostly had not been testing my (limited) JS.  But at the new gig, I was confronted with supporting a project which: had relatively intensive front-end JS, for ‘legacy’ reasons using a combination of React and Angular; was somewhat under-tested, with JS feature tests somewhat over-represented in the test suite (things that maybe could have been tested with functional/controller or unit tests were instead being tested with UI feature tests); and had such intermittent unreliability in the test suite that it made the test suite difficult to use for its intended purpose.

I have not ultimately solved the intermittent failures, but I have significantly decreased their frequency, making the test suite more usable.

I also learned a whole lot in the process. If you are a “tldr” type, this post might not be for you, it has become large. My goal is to provide the post I wish I had found before embarking on many, many hours of research and debugging; it may take you a while to read and assimilate, but if you’re as frustrated as I was, hopefully it will save you many more hours of independent research and experimentation to put it all together.

The relevant testing stack in the app I was investigating is: Rails 4.1.x, Rspec 2.x (with rspec-rails), Capybara, DatabaseCleaner, Poltergeist. So that’s what I focused on. Changing any of these components (say, MiniTest for Rspec) could make things come out different, although the general picture is probably the same with any Capybara JS driver.

No blame

To get it out of the way, I’m not blaming Capybara as ‘bad software’ either.

The inherent concurrency involved in the way JS feature tests are done makes things very challenging.

Making things more challenging is that the ‘platform’ that gives us JS feature testing is composed of a variety of components with separate maintainers, all of which are intended to work “mix and match” with different choices of platform components: Rails itself in multiple versions; Capybara, with rspec, or even minitest; with DatabaseCleaner probably, but not necessarily; with your choice of JS browser simulator driver.

All of these components need to work together to try and avoid race conditions, all of these components keep changing and releasing new versions relatively independently and un-synchronized; and all of these components are maintained by people who are deeply committed to making sure their part does its job or contract adequately, but there’s not necessarily anyone with the big-picture understanding, authority, and self-assigned responsibility to make the whole integration work.

Such is often the ruby/rails open source environment. It can make it confusing to figure out what’s really going on.

Of Concurrency and Race Conditions in Capybara JS Feature tests

“Concurrency” means a situation where two or more threads or processes are operating “at once”. A “race condition” is when a different outcome can happen each time the same code involving concurrency is run, depending on exactly the order or timing of each concurrent actor (depending on how the OS ends up scheduling the threads/processes, which will not be exactly the same each time).

You necessarily have concurrency in a Capybara Javascript feature test — there are in fact three different concurrent actors (two threads and a separate process) going on, even before you’ve tried to do something fancy with parallel rspec execution (and I would not recommend trying, especially if you are already experiencing intermittent failure problems; there’s enough concurrency in the test stack already), and even if you weren’t using any concurrency in your app itself.

  1. The main thread in the main process that is executing your tests in order.
  2. Another thread in that process is started to run your Rails app in it, to run the simulated browser actions against — this behavior depends on your Capybara driver, but I think all the drivers that support Javascript do this, all drivers but the default :rack_test.
  3. The actual or simulated browser (for Poltergeist, a headless Webkit process; for selenium-webdriver, an actual Firefox) that Capybara (via a driver) is controlling, loading pages from the Rails app (2 above) and interacting with them.

There are two main categories of race condition that arise in the Capybara JS feature test stack. Your unreliable tests are probably because of one or both of these. To understand why your tests are failing unreliably and what you can do about it, you need to understand the concurrent architecture of a Capybara JS feature test as above, and these areas of potential race conditions.

The first category — race conditions internal to a single feature test caused by insufficient waiting for AJAX response — is relatively more discussed in “the literature” (what you find googling, blogs, docs, issues, SO’s, etc)  The second category — race conditions in the entire test suite caused by app activity that persists after the test example completes — is actually much more difficult to diagnose and deal with, and is under-treated in ‘the literature’.

1. Race condition WITHIN a single test example: Waiting on Javascript

An acceptance/feature/integration test (I will use those terms interchangeably, we’re talking about tests of the UI) for a web app consists of: Simulate a click (or other interaction) on a page, see if what results is what you expect. Likely a series of those.

Without Javascript, a click (almost always) results in an HTTP request to the server. The test framework waits for the ensuing response, and sees if it contains what was expected.

With Javascript, the result of an interaction likewise isn’t absolutely instantaneous, but there’s no clear signal (like HTTP request made and then response returned) to see if you’ve waited ‘long enough’ for the expected consequence to happen.

So Capybara ends up waiting just some amount of time while periodically checking the expectation to see if it’s met, up to a maximum amount of time.

The amount of time the Javascript takes to produce the expected change on the page can slightly (or significantly) differ each time you run the tests, with no code changes. If sometimes Capybara is waiting long enough, but other times it isn’t — you get a race condition where sometimes the test passes but others it doesn’t.

This will exhibit straightforwardly as a specific test that sometimes passes and other times doesn’t.  To fix it, you need to make sure Capybara is waiting for results, and willing to wait long enough.

Use the right Capybara API to ensure waits

In older Capybara, developers would often explicitly tell Capybara exactly when to wait and what to wait for with the `wait_until` method.

In Capybara 2.0, author @jnicklas removed the `wait_until` method, explaining that Capybara has sophisticated waiting built in to many of its methods, and `wait_until` was not necessary if you use the Capybara API properly: “For the most part, this behaviour is completely transparent, and you don’t even really have to think about it, because Capybara just does it for you.”

In practice, I think this can end up less transparent than @jnicklas would like, and it can be easier than he hopes to do it wrong.  In addition to the post linked above, additional discussions of using Capybara ‘correctly’ to ensure its auto-waiting is in play are here, here and here, plus the Capybara docs.

If you are using the Capybara API incorrectly so it’s not waiting at all, that could result in tests that always fail, but can also result in unreliable intermittently failing race conditions. After all, even if the Javascript involves no AJAX, it does not happen instantaneously. And similarly, in the main thread, moving from one rspec instruction (do a click) to another (see if page body has content) does not happen instantaneously.  The two small amounts of time will vary from run to run, and sometimes the JS may finish before the next ruby statement in your example happens, other times not. Welcome to concurrency.
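To make the difference between a one-shot check and a waiting check concrete, here is a minimal pure-Ruby sketch. It is not Capybara’s implementation: `FakePage` and `eventually` are hypothetical names I made up for illustration, and the “page” simply becomes ready after a fixed number of reads so the outcome is repeatable.

```ruby
# FakePage stands in for a browser page whose JavaScript finishes "later":
# the expected text only appears after a few reads. (Hypothetical class,
# not part of Capybara.)
class FakePage
  def initialize(ready_after_reads)
    @reads = 0
    @ready_after_reads = ready_after_reads
  end

  def text
    @reads += 1
    @reads > @ready_after_reads ? "Saved!" : "Saving..."
  end
end

# Poll the block until it returns true or the timeout expires.
def eventually(timeout: 2.0, interval: 0.01)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# One-shot check: looks at the page once, right away.
snapshot_result = FakePage.new(3).text.include?("Saved!")

# Waiting check: keeps re-reading the page until it matches or times out.
page = FakePage.new(3)
waiting_result = eventually { page.text.include?("Saved!") }

puts "snapshot: #{snapshot_result}, waiting: #{waiting_result}"
```

In Capybara terms, `expect(page).to have_content("Saved!")` behaves like the waiting check, while something like `expect(page.text).to include("Saved!")` grabs the text once and behaves like the snapshot, so it can pass or fail depending on timing.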

I ran into just a few feature examples in the app I was working on that had obvious problems in this area.

Making Capybara wait long enough

When Capybara is waiting, how long is it willing to wait before giving up? `Capybara.default_wait_time`, by default 2 seconds. If there are actions that sometimes or always take longer than this, you can increase the `Capybara.default_wait_time` — but do it in a suite-wide `before(:each)` hook, because I think Capybara may reset this value on every run, in at least some versions.

You can also run specific examples or sections of code with a longer wait value by wrapping in a `using_wait_time N do` block. 
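As a rough sketch of both options, assuming Capybara 2.x with RSpec (the wait values, button label and page text are made up for illustration):

```ruby
# spec/spec_helper.rb (fragment; assumes Capybara 2.x and RSpec are loaded)
RSpec.configure do |config|
  config.before(:each) do
    # Re-apply the suite-wide maximum wait before every example.
    # Capybara 2.5 and later call this setting default_max_wait_time.
    Capybara.default_wait_time = 5
  end
end

# Inside one slow example (the button label and text are hypothetical):
#
#   using_wait_time 10 do
#     click_button "Generate report"
#     expect(page).to have_content("Report ready")
#   end
```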

At first I spent quite a bit of time playing with this, because it’s fairly understandable and seemed like it could be causing problems. But I don’t think I ended up finding any examples in my app-at-hand that actually needed a longer wait time, that was not the problem.

I do not recommend trying to patch in `wait_until` again, or to patch in various “wait_for_jquery_ajax”, “wait_for_angular”, etc., methods you can find googling.  You introduce another component that could have bugs (or could become buggy with a future version of JQuery/Ajax/Capybara/poltergeist/whatever), you’re fighting against the intention of Capybara, you’re making things even more complicated and harder to debug, and even if it works you’re tying yourself even further to your existing implementation, as there is no reliable way to wait on an AJAX request with the underlying actual browser API. My app-in-hand had some attempts in these directions, but even figuring out if they were working (especially for Angular) was non-trivial. Better just fix your test to wait properly on the expected UI, if you at all can.

In fact, while this stuff is confusing at first, it’s a lot less confusing — and has a lot more written about it on the web — than the other category of Capybara race condition…

2. Race condition BETWEEN test examples: Feature test leaving unfinished business, Controller actions still not done processing when test ends.

So your JS feature test is simulating interactions with a web page, making all sorts of Javascript happen.

Some of this Javascript is AJAX that triggers requests against Rails controllers — running in the Rails app launched in another thread by the Capybara driver.

At some point the test example gets to the end, and has tested everything it’s going to test.

What if, at this point, there is still code running in the Rails app? Maybe an AJAX request was made and the Capybara test didn’t bother waiting for the response.

Anatomy of a Race Condition

RSpec will go on to the next test, but the Rails code is still running. The main thread running the tests will now run DatabaseCleaner.clean, and clear out the database — and the Rails code that was still in progress (in another thread) finds the database cleaned out from under it. Depending on the Rails config, maybe the Rails app now even tries to reload all the classes for dev-mode class reloading, and the code that was still in progress finds class constants undefined and redefined from under it. These things are all likely to cause exceptions to be raised by the in-progress-in-background-thread code.

Or maybe the code unintentionally in progress in the background isn’t interrupted, but it continues to make changes to the database that mess with the new test example that rspec has moved on to, causing that example to fail.

It’s a mess. Rspec assumes each test is run in isolation, when there’s something else running and potentially making changes to the test database concurrently, all bets are off. The presence and the nature of the problem caused depends on exactly how long the unintentional ‘background’ processing takes to complete, and how it lines up on the timeline against the new test, which will vary from run to run, which is what makes this a race condition.
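Here is a toy model of that timeline in plain Ruby, just to make the sequence of events visible. None of this is Rails, RSpec or DatabaseCleaner code: the array stands in for the database, one thread stands in for the Rails app, and a `Queue` forces the unlucky ordering that in real life only happens some of the time.

```ruby
# A toy model of the timeline described above. Everything here is
# hypothetical illustration code, not Rails or DatabaseCleaner.
require "thread"

database = []                   # stands in for the shared test database
db_lock = Mutex.new
request_may_finish = Queue.new  # lets us control when the "request" ends

# "Rails app thread": an AJAX-triggered request that is still in flight
# when the first example finishes.
app_thread = Thread.new do
  request_may_finish.pop # the request is slow...
  db_lock.synchronize { database << "row written by example 1's request" }
end

# "Main test thread": example 1 ends without waiting for the request,
# and the after hook cleans the database.
db_lock.synchronize { database.clear }

# Example 2 starts and creates its own fixture data.
db_lock.synchronize { database << "fixture for example 2" }

# Only now does the leftover request from example 1 complete.
request_may_finish << :go
app_thread.join

# Example 2 expected exactly one row, but finds two.
rows_seen_by_example_2 = db_lock.synchronize { database.dup }
puts rows_seen_by_example_2.inspect
```

In a real suite nothing like the `Queue` exists; the OS scheduler decides whether the leftover write lands before the clean, after it, or in the middle of the next example, which is why the failure comes and goes.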

This does happen. I’m pretty sure it’s what was happening to the app I was working on — and still is, I wasn’t able to fully resolve it, although I ameliorated the symptoms with the config I’ll describe below.

The presence and nature of the problem also can depend on which test is ‘next’, which will be different from run to run under random rspec ordering. But I found that even when re-running the suite with the same seed, the presence and nature of the failures would vary.

What does it look like?

Tests that fail only when run as part of the entire test suite, but not when run individually. Which sure makes them hard to debug.

One thing you’ll see when this is happening is different tests failing each time. The test that shows up as failing or erroring isn’t actually the one that has the problematic implementation; it’s the previously run JS feature test (or maybe even a JS feature test before that?) that sinned by ending while stuff was still going on in the Rails app.  Which test was the previously run test will vary every run with a different seed, using rspec random ordering.  RSpec default output doesn’t tell you which was the previous test on a given run, and the RSpec ‘text’ formatter doesn’t really give us the info in the format we want either (you have to translate from human-readable label to test file and line number yourself, which is kind of infeasible sometimes).  I’ve thought about writing an RSpec custom formatter that just prints out file/line information for each example as it goes, to give me some hope of figuring out which test is really leaving its business unfinished, but haven’t done so.
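Short of a full custom formatter, a global hook might get you most of the way. This is an untested sketch that assumes RSpec 2.99 or 3.x, where hooks receive the running example as a block argument; the `START` prefix is just something I made up to make the lines easy to grep.

```ruby
# spec/spec_helper.rb (fragment). Assumes RSpec 2.99 or 3.x, where hooks
# receive the running example as a block argument.
RSpec.configure do |config|
  config.before(:each) do |example|
    # example.location looks like "./path/to/file_spec.rb:LINE"
    $stderr.puts "START #{example.location}"
  end
end
```

With that in place, the line printed just before a mysterious failure points at the file and line of the example that ran immediately before it.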

It can be very hard to recognize when you are suffering from this problem, although when you can’t figure out what the heck else could possibly be going on, that’s a clue. It took me a bunch of hours to realize this was a possible thing, and the thing that was happening to me. Hopefully this very long blog post will save you more time than it costs you to read.

Different tests, especially but not exclusively feature tests, failing/erroring each time you run the very same codebase is a clue.

Another clue is when you see errors reported by RSpec as

     Failure/Error: Unable to find matching line from backtrace

I think that one is always an exception raised as a result of `Capybara.raise_server_errors = true` (the default) in the context of a feature test that left unfinished business. You might make those go away with `Capybara.raise_server_errors = false`, but I really didn’t want to go there, the last thing I want is even less information about what’s going on.

With Postgres, I also believe that `PG::TRDeadlockDetected: ERROR: deadlock detected` exceptions are symptomatic of this problem, although I can’t completely explain it and they may be unrelated (may be DatabaseCleaner-related, more on that later).

And I also still get my phantom_js processes sometimes dying unexpectedly; related? I dunno.

But I think it can also show up as an ordinary unreliable test failure, especially in feature tests.

So just don’t do that?

If I understand right, current Capybara maintainer Thomas Walpole understands this risk, and thinks the answer is: Just don’t do that. You need to understand what your app is doing under-the-hood, and make sure the Capybara test waits for everything to really be done before completing. Fair enough, it’s true that there’s no way to have reliable tests when the ‘unfinished business’ is going on. But it’s easier said than done, especially with complicated front-end JS (Angular, React/Flux, etc), which often actually try to abstract away whether/when an AJAX request is happening, whereas following this advice means we need to know exactly whether, when, and what AJAX requests are happening in an integration test, and deal with them accordingly.

I couldn’t completely get rid of problems that I now strongly suspect are caused by this kind of race condition between test examples, couldn’t completely get rid of the “unfinished business”.

But I managed to make the test suite a lot more reliable — and almost completely reliable once I turned off rspec random test order (doh), by dotting all my i’s in configuration…

Get your configuration right

There are a lot of interacting components in a Capybara JS Feature test, including: Rails itself, rspec, Capybara, DatabaseCleaner, Poltergeist. (Or equivalents or swap-outs for many of these).

They each need to be set up and configured right to avoid edge case concurrency bugs. You’d think this would maybe just happen by installing the gems, but you’d be wrong. There are a number of mis-configurations that can hypothetically result in concurrency race conditions in edge cases (even with all your tests being perfect).

They probably aren’t affecting you, they’re edge cases. But when faced with terribly confusing hard to reproduce race condition unreliable tests, don’t you want to eliminate any known issues?  And when I did all of these things, I did improve my test reliability, even in the presence of presumed continued feature tests that don’t wait on everything (race condition category #2 above).

Update your dependencies

When googling, I found many concurrency-related issues filed for the various dependencies.  I’m afraid I didn’t keep a record of them. But Rspec, Capybara, DatabaseCleaner, and Poltergeist have all had at least some known concurrency issues (generally with how they all relate to each other) in the past.

Update to the latest versions of all of them, to at least not be using a version with a known concurrency-related bug that’s been fixed.

I’m still on Rspec 2.x, but at least I updated to the last Rspec 2.x series (2.99). And updated DatabaseCleaner, Capybara, and Poltergeist to the latest I could.

Be careful configuring DatabaseCleaner — do not use the shared connection monkey-patch

DatabaseCleaner is used to give all your tests a fresh-clean database to reduce unintentional dependencies.

For non-JS-feature tests, you probably have DatabaseCleaner configured with the :transaction method — this is pretty cool, it makes each test example happen in an uncommitted transaction, and then just rolls back the transaction after every example. Very fast, very isolated!

But this doesn’t work with feature tests, because of the concurrency. Since JS feature tests boot a Rails app in another thread from your actual tests, using a different database connection, the running app wouldn’t be able to see any of the fixture/factory setup done in your main test thread in an uncommitted transaction.

So you probably have some config in spec/spec_helper.rb or spec/rails_helper.rb to try and do your JS feature tests using a different DatabaseCleaner mode.

Go back and look at the DatabaseCleaner docs and see if you are set up as currently recommended. Recently DatabaseCleaner README made a couple improvements to the recommended setup, making it more complicated but more reliable. Do what it says. 

My previous setup wasn’t always properly identifying the right tests that really needed the non-:transaction method, the improved suggestion does it with a `Capybara.current_driver == :rack_test` test, which should always work. 

Do make sure to set `config.use_transactional_fixtures = false`, as the current suggestion will warn you about if you don’t. 

Do use `append_after` instead of `after` to add your `DatabaseCleaner.clean` hook, to make sure database cleaning happens after Capybara is fully finished with its own cleanup. (It probably doesn’t matter, but why take the risk).
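Putting those pieces together, the setup looks roughly like the following. This is my sketch of the shape of the README’s recommendation as I understand it, not a verbatim copy, so check the current DatabaseCleaner README before relying on it. It assumes rspec-rails tags feature specs with `type: :feature`.

```ruby
# spec/spec_helper.rb or spec/rails_helper.rb (fragment; assumes RSpec,
# Capybara and DatabaseCleaner are already loaded)
RSpec.configure do |config|
  # Let DatabaseCleaner, not rspec-rails, manage transactions.
  config.use_transactional_fixtures = false

  config.before(:suite) do
    DatabaseCleaner.clean_with(:truncation)
  end

  # Fast default for tests that share the spec's DB connection.
  config.before(:each) do
    DatabaseCleaner.strategy = :transaction
  end

  # JS drivers run the app in another thread with its own connection,
  # so uncommitted transactions would be invisible to it.
  config.before(:each, type: :feature) do
    driver_shares_db_connection_with_specs = Capybara.current_driver == :rack_test
    DatabaseCleaner.strategy = :truncation unless driver_shares_db_connection_with_specs
  end

  config.before(:each) do
    DatabaseCleaner.start
  end

  # append_after so this runs after Capybara's own after hook.
  config.append_after(:each) do
    DatabaseCleaner.clean
  end
end
```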

It shouldn’t matter if you use :truncation or :deletion strategy; everyone uses “:truncation” because “it’s faster”, but the DatabaseCleaner documentation actually says: “So what is fastest out of :deletion and :truncation? Well, it depends on your table structure and what percentage of tables you populate in an average test.” I don’t believe the choice matters for the concurrency-related problems we’re talking about.

Googling, you’ll find various places on the web advising (or copying advice from other places) monkey-patching Rails ConnectionPool with a “shared_connection” implementation originated by José Valim to make the :transaction strategy work even with Capybara JS feature tests. Do not do this. ActiveRecord has had a difficult enough time with concurrency without intentionally breaking it or violating its contract: ActiveRecord ConnectionPool intends to give each thread its own database connection, and this hack intentionally breaks that. IF you have any tests that are exhibiting “race conditions between examples” (a spec ending while activity is still going on in the Rails app), this hack WILL make it a lot WORSE.  Hacking the tricky concurrency-related parts of ActiveRecord ConnectionPool is not the answer. Not even if lots of blog posts from years ago tell you to do it, not even if the README or wiki page for one of the components tells you to (I know one does, but now I can’t find it to cite it in a hall of shame); they are wrong. (This guy agrees with me, so do others if you google).  It was a clever idea José had, but it did not work out, and should not still be passed around the web.

Configure Rails under test to reduce concurrency and reduce concurrency-related problems

In a newly generated Rails 4.x app, if you look in `./config/environments/test.rb`, you’ll find this little hint, which you probably haven’t noticed before:

# Do not eager load code on boot. This avoids loading your whole application
# just for the purpose of running a single test. If you are using a tool that
# preloads Rails for running tests, you may have to set it to true.
config.eager_load = false

If that sounds suggestive, it’s because by saying “a tool that preloads Rails for running tests”, this comment is indeed trying to talk about Capybara with a JS driver, which loads a Rails app in an extra thread. It’s telling you to set eager_load to true if you’re doing that.

Except in at least some (maybe all?) versions of Rails 4.x, setting `config.eager_load = true` will change the default value of `config.allow_concurrency` from false to true. So by changing that, you may now have `config.allow_concurrency` turned on without having asked for it.

You don’t want that, at least not if you’re dealing with a horrible horrible race condition test suite already. Why not, you may ask, our whole problem is concurrency, shouldn’t we be better off telling Rails to allow it?  Well, what this config actually does (in Rails 4.x, in 5.x I dunno) is control whether the Rails app itself will force every request to wait in line and be served one at a time (allow_concurrency false), or create multiple threads (more threads, even more concurrency!) to handle multiple overlapping requests.

This configuration might make your JS feature tests even slower, but when I’m already dealing with a nightmare of unreliable race condition feature tests, the last thing I want is even more concurrency.

I’d set:

config.allow_concurrency = false
config.eager_load = true
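To picture what “wait in line and be served one at a time” means, here is a small pure-Ruby sketch of a locking middleware. `SerializingMiddleware` is a hypothetical, simplified stand-in for the idea behind `Rack::Lock`, not its actual code, and the fake app below just counts how many requests overlap.

```ruby
# SerializingMiddleware is a hypothetical, simplified stand-in for what
# Rack::Lock does: only one request may be inside the app at a time.
class SerializingMiddleware
  def initialize(app)
    @app = app
    @mutex = Mutex.new
  end

  def call(env)
    @mutex.synchronize { @app.call(env) }
  end
end

# A fake Rack app that records how many requests overlap.
in_flight = 0
max_in_flight = 0
counter_lock = Mutex.new

slow_app = lambda do |_env|
  counter_lock.synchronize do
    in_flight += 1
    max_in_flight = [max_in_flight, in_flight].max
  end
  sleep 0.01 # pretend to do some work
  counter_lock.synchronize { in_flight -= 1 }
  [200, { "Content-Type" => "text/plain" }, ["ok"]]
end

stack = SerializingMiddleware.new(slow_app)

# Five "browser" threads hit the app at once.
5.times.map { Thread.new { stack.call({}) } }.each(&:join)

puts "max overlapping requests: #{max_in_flight}"
```

With the lock in place the overlap never exceeds one request. As far as I understand, the real `Rack::Lock` is more careful than this sketch (for instance about when the lock is released relative to the response body), so treat this only as a picture of the idea.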

Here in this Rails issue you can find a very confusing back and forth about whether `config.allow_concurrency = false` is really necessary for Capybara-style JS feature tests, or if maybe only the allow_concurrency setting is necessary and you don’t really need to change `eager_load` at all, or if the reason you need to set one or another is actually a bug in Rails, which was fixed in a Rails patch release, so what you need to do may depend on what version you are using… at the end of it I still wasn’t sure what the Rails experts were recommending or what was going on. I just set them both. Slower tests are better than terribly terribly unreliable tests, and I’m positive this is the safest configuration.

All this stuff has been seriously refactored in Rails 5.0.  In the best case, it will make it all just work, they’re doing some very clever stuff in Rails 5 to try and allow class-reloading even in the presence of concurrency. In the worst case, it’ll just be a new set of weirdness, bugs, and mis-documentation for us to figure out. I haven’t looked at it seriously yet. (As I write this, 5.0.0.beta.2 is just released).

Why not make sure Warden test helpers are set up right

It’s quite unlikely to be related to these sorts of problems, but if you’re using Warden test helpers for devise, as recommended on the devise wiki for use with Capybara, you may not have noticed the part about cleaning up with `Warden.test_reset!` in an `after` hook.

This app had the Warden test helpers in it, but wasn’t doing the clean-up properly. When scouring the web for anything related to Capybara, I found this, and fixed it up to do as recommended. It’s really probably not related to the failures you’re having, but you might as well set things up as documented while you’re at it.
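For reference, the setup I mean looks something like this. It is a sketch based on my reading of the devise wiki, assuming RSpec; where exactly you put it in your own helper files is up to you.

```ruby
# spec/spec_helper.rb (fragment; assumes Devise/Warden and RSpec are loaded)
RSpec.configure do |config|
  config.include Warden::Test::Helpers

  config.before(:suite) do
    Warden.test_mode!
  end

  # The part that is easy to miss: reset Warden after every example.
  config.after(:each) do
    Warden.test_reset!
  end
end
```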

I wouldn’t bother with custom Capybara cleanup

While trying to get things working, I tried various custom `after` hooks with Capybara cleanup, various of `Capybara.reset_session!`, `driver.reset!` and others. I went down a rabbit hole trying to figure out exactly what these methods do, which varies from driver to driver, and what they should do, is there a bug in a driver’s implementation?

None of it helped ultimately. Capybara does its own cleanup for itself, it’s probably good enough (especially if `DatabaseCleaner.clean` is properly set up with `append_after` to run after Capybara’s cleanup as it should).  Spending a bunch of hours trying to debug or customize this didn’t get me much enlightenment or test reliability improvements.

The Nuclear Option: Rack Request Blocker

Joel Turkel noticed the “unfinished business race condition” problem (his blog post helped me realize I was on the right track), and came up with some fairly tricky rack middleware attempting to deal with it by preventing the Rails app from accepting more requests if an outstanding thing is still going on from a feature test that didn’t wait on it.

Dan Dorman turned it into a gem. 

I experimented with this, and it seemed to both make my tests much slower (not unexpected), and also not cure my problem, I was still getting race condition failures for some reason. So I abandoned it.

But you could try it, I include it for completeness — it is theoretically the only path to actually guaranteeing against feature test “unfinished business”.

At first I thought it was really doing nothing different from what `config.allow_concurrency = false`, already built into Rails, was doing (allow_concurrency false puts in the Rack::Lock middleware already included with Rails).

But it actually is a bit more powerful — it will allow a unit test (or any test including those not using Capybara JS driver) to wait on the absolute completion of any unfinished business left by a feature test, and at the beginning of the example. Theoretically. I’m not sure why it didn’t work for me, it’s something you could try.

Sadly, maybe disable RSpec config.order = “random”

I did all of these things.  Things did get better. (I think? The trick with non-reproducible failures is you never know if you are just having a run of luck, but I’m pretty sure I improved it).  But they weren’t fixed. I still had unreliable tests.

Somewhere towards the end of this after many hours, I realized my problem was really about the feature tests not waiting on ‘unfinished business’ (I didn’t discover these things in the same order this post is written!), and it would obviously be best to fix that. But I had some pretty complex semi-‘legacy’ front-end JS using a combination of Angular and React (neither of which I had experience with), it just wasn’t feasible, I just wanted it to be over.

You know what did it?

Commenting out `config.order = “random”` from rspec configuration.

At first I had no idea why this would matter. Sure, some random orders might be more likely to trigger race conditions than others, but it’s not just a magic seed, it’s turning off random test ordering altogether.

Aha. Because when a JS feature test follows another JS feature test, `config.allow_concurrency = false` is decent (although far from perfect) at holding up the second feature test until the ‘unfinished business’ is complete — it won’t eliminate overlap, but it’ll reduce it.

But when one (or several or a dozen) ordinary tests follow the JS feature test with ‘unfinished business’, they don’t have `allow_concurrency = false` to protect them, since they aren’t using the full Rails stack with middleware affected by this.

If you turn off random test ordering, all your feature tests end up running in sequence together, and all your other tests end up running in sequence together, without intermingling.

That was the magic that got me to, if not 100% reliable without race conditions, pretty darn reliable, enough that I only occasionally see race condition failures now.

I don’t feel great about turning off test order randomization, but I also remember when we all wrote tests before rspec even invented the feature, and we did fine. There’s probably also a way to get Rspec to randomize order _within_ types/directories, but still run all feature tests in a block, which should be just as good.
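I haven’t tried it on this app (which is on Rspec 2.x), but here is a sketch of what that might look like. It assumes RSpec 3.x, since `register_ordering` does not exist in 2.x, and it assumes your feature specs carry `type: :feature` metadata. `features_last` is a helper name I made up; the fake groups at the bottom exist only so the ordering logic can be run on its own.

```ruby
# Shuffle items reproducibly, then move everything tagged as a feature
# spec to the end. Items only need to respond to #metadata.
def features_last(items, seed)
  shuffled = items.shuffle(random: Random.new(seed))
  features, others = shuffled.partition { |item| item.metadata[:type] == :feature }
  others + features
end

# Wiring into RSpec. Assumes RSpec 3.x (register_ordering does not exist
# in RSpec 2.x) and that feature specs carry `type: :feature` metadata.
if defined?(RSpec) && RSpec.respond_to?(:configure)
  RSpec.configure do |config|
    config.register_ordering(:global) do |items|
      features_last(items, config.seed)
    end
  end
end

# Stand-alone demonstration with fake spec groups.
FakeGroup = Struct.new(:name, :metadata)
groups = [
  FakeGroup.new("login feature", { type: :feature }),
  FakeGroup.new("User model", { type: :model }),
  FakeGroup.new("cart feature", { type: :feature }),
  FakeGroup.new("OrdersController", { type: :controller })
]
ordered = features_last(groups, 1234)
puts ordered.map(&:name).inspect
```

The non-feature groups and the feature groups are each shuffled among themselves, but every feature group still runs after all the others, so no ordinary test ends up sandwiched between two JS feature tests.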

Postscript: Aggressively Minimize JS Feature Tests

I have come to the conclusion that it is extremely challenging and time-consuming to get Capybara JS feature tests to work reliably, and that this is a necessary consequence of the architecture involved. As a result, your best bet is to avoid or ruthlessly minimize the number of feature tests you write.

The problem is that what is necessary to avoid feature test “unfinished business” is counter to the very reason I write tests.

I want and need my tests to test interface (in the case of a feature test this really is user interface, in other cases API), independent of implementation.  If I refactor or rewrite the internals of an implementation, but intend the interface remains the same — I need to count on my tests passing if and only if the interface indeed remains the same. That’s one of the main reasons I have tests. That’s the assumption behind the “red-green-refactor” cycle of TDD (not that I do TDD really myself strictly, but I think that workflow does capture the point of tests).

@twalpole, the current maintainer of Capybara, is aware of the “unfinished business” problem, and says that you basically just need to write your tests to make sure they wait:

So you either need to have an expectation in your flaky tests that checks for a change that occurs when all ajax requests are completed or with enough knowledge about your app it may be possible to write code to make sure there are no ajax requests ongoing (if ALL ajax requests are made via jquery then you could have code that keeps checking until the request count is 0 for instance) and run that in an after hook that you need to define so it runs before the capybara added after hook that resets sessions….

….you still need to understand exactly what your app is doing on a given page you’re testing.
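The second option in that quote might look something like the sketch below. It assumes every AJAX request in the app goes through jQuery (so the undocumented `jQuery.active` counter is meaningful), and `wait_until` is a hypothetical helper of my own, not a Capybara method. Because RSpec runs plain `after` hooks in reverse order of definition, defining this hook after `capybara/rspec` has been required should make it run before Capybara's own session-resetting hook.

```ruby
# Hypothetical helper: poll a block until it returns a truthy value or the
# timeout expires. Returns true on success, false on timeout.
def wait_until(timeout: 5, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Only wire up the hook when RSpec and Capybara are loaded.
if defined?(RSpec) && defined?(Capybara)
  RSpec.configure do |config|
    config.after(:each, js: true) do
      # Assumption: all AJAX goes through jQuery; pages without jQuery
      # are treated as having no requests in flight.
      wait_until do
        page.evaluate_script('window.jQuery ? jQuery.active : 0') == 0
      end
    end
  end
end
```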

The problem with this advice is it means the way a test is written is tightly coupled to the implementation, and may need to be changed every time the implementation (especially JS code) is changed. Which kind of ruins the purpose of tests for me.

It’s also very challenging to do if you have a complex JS front-end (Angular, React, Ember, etc.), which often intentionally abstracts away exactly when AJAX requests are occurring. You’ve got to go spelunking through abstraction layers to do it, both to write the test right in the first place and again every time there’s an implementation change that might affect things.

Maybe even worse, fancy new JS front-end techniques often result in AJAX requests which result in no visible UI change (to transparently sync state on the back end, maybe only producing UI change in error cases, “optimistic update” style), which means to write a test that properly waits for “unfinished business” you’d need to violate another piece of Capybara advice, as original Capybara author @jnicklas wrote, “I am firmly convinced that asserting on the state of the interface is in every way superior to asserting on the state of your model objects in a full-stack test” — Capybara is written to best support use cases where you only test the UI, not the back-end.

It’s unfortunate, because there are lots of things that make UI-level integration/feature tests attractive:

  • You’re testing what ultimately matters, what the user actually experiences. Lower-level tests can pass with the actual user-facing app still broken if based on wrong assumptions, but feature tests can’t.
  • You haven’t figured out a great way to test your JS front-end in pure JS, and integrate into your CI, but you already know how to write ruby/rails feature tests.
  • You are confronting an under-tested “legacy” app, whose internals you don’t fully understand, and you need better testing to be confident in your refactoring — it makes a lot of sense to start with UI feature tests, and is sometimes even recommended for approaching an under-tested legacy codebase.

There are two big reasons to try and avoid feature tests with a JS-heavy front-end though: 1) They’re slow (inconvenient), and 2) They are nearly infeasible to make work reliably (damning, especially on a legacy codebase).

Until/unless there’s a robust, well-maintained (ideally by Capybara itself, to avoid yet another has-to-coordinate component) lower-level solution along the lines of rack_request_blocker, I think all we can do is avoid Capybara JS feature tests as much as possible: stick to the bare minimum of ‘happy path’ scenarios you can get away with (also common feature-test advice). It’s less painful than the alternative.


If you’re looking for consulting or product development with Rails or iOS, I work at Friends of the Web, a small company that does that.


16 thoughts on “Struggling Towards Reliable Capybara Javascript Testing”

  1. Thank you for putting this post together, it’s most informative.

    One thing I did on a previous project was run all my JS tests separately from my non-JS tests. This is pretty easy to do as all the JS tests are tagged (I was using Cucumber, I assume the same can be done in RSpec). I only say this as you haven’t mentioned it, and it’s a simple thing that may help.

  2. Thanks for sharing all of this. To make the server logs more informative about the test being run, you can put `` into an RSpec `config.before_each` block.

  3. Thank you for this post. I have struggled many times, trying to fix randomly breaking tests. I can’t stress enough the importance of learning your testing stack well. There’s so many layers, just like you mentioned, which increase the chances of something going wrong. One tip I would like to add, is to also check the test.log file. Sometimes, you can identify how the previous test affects the current one that is randomly failing. Finally, another source of problems is the asset pipeline. If your test environment compiles assets on the fly and you have a large number of CSS and JS/Coffee files, the first Capybara test might fail, because the server process takes a long time to compile your assets.

  4. Just wanted to point out that since Capybara 2.7.x it has had code to make sure all open requests have fully finished at the end of test before moving on to next one, so the leftover info from previous tests shouldn’t be an issue anymore.

  5. Woah, that’s huge news Thomas! Can you point me to a PR, commit, and/or docs so I can see how it does this and make sure it’s doing what I need and any gotchas in it doing so? Looks like 2.7 came out about a year ago, I didn’t realize Capybara has been doing this since then. That’s HUGE news! Thanks!

    This hooks into the rails app to make sure _all_ requests have finished, including ones triggered by ajax in the browser etc?

  6. The PR was which also added the reuse_server setting. It works through the middleware that Capybara adds to the server it runs, keeps track of the number of currently active requests and then in ‘reset!’ after the browser has been changed to “about:blank” (so no more requests will be initiated) it waits for the number of active requests to become 0 before moving on. So, yes it makes sure all requests (to the app Capybara is running) have completed.
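    To illustrate the mechanism described in this comment, here is a minimal sketch of a request-counting Rack middleware. It is not Capybara's actual code: `ActiveRequestCounter` and `wait_for_idle` are hypothetical names, and the sketch treats a request as finished when `call` returns, which ignores streamed response bodies.

```ruby
# A sketch of a request-counting Rack middleware (not Capybara's own code).
class ActiveRequestCounter
  def initialize(app)
    @app = app
    @mutex = Mutex.new
    @active = 0
  end

  # Number of requests currently inside the wrapped app.
  def active_requests
    @mutex.synchronize { @active }
  end

  # Standard Rack entry point: count the request in, always count it out.
  def call(env)
    @mutex.synchronize { @active += 1 }
    @app.call(env)
  ensure
    @mutex.synchronize { @active -= 1 }
  end

  # Block until no requests are in flight; false if the timeout expires.
  def wait_for_idle(timeout: 5, interval: 0.01)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    until active_requests.zero?
      return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep interval
    end
    true
  end
end
```

    A test teardown would call something like `wait_for_idle` only after the browser has been pointed at a blank page, so that no new requests can start while it waits.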

  7. Awesome, so exciting! That should def make it easier to write reliable Capybara tests. Very similar to the custom rack middleware discussed above, but baked into and maintained by Capybara, hooray! Have you gotten any sense of whether it is succeeding in making things less touchy in the wild?

  8. As long as it’s paired with the use of `append_after` for DatabaseCleaner it has worked well. Any reports of it not working were tracked back to people using `after` with DatabaseCleaner, which (depending on include order) can lead to DatabaseCleaner cleaning before `reset!` is called. `append_after` ensures that DatabaseCleaner cleans after everything else.
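    As a configuration fragment, the pairing described in this comment might look like the sketch below. It assumes rspec, capybara/rspec and database_cleaner are already required and that a DatabaseCleaner strategy is configured elsewhere. The guard is only there so the snippet is inert outside a test suite.

```ruby
# spec helper fragment; assumes RSpec, Capybara and DatabaseCleaner are
# loaded and a DatabaseCleaner strategy has been set elsewhere.
if defined?(RSpec) && defined?(DatabaseCleaner)
  RSpec.configure do |config|
    config.before(:each) do
      DatabaseCleaner.start
    end

    # append_after hooks run after all plain `after` hooks, so the cleanup
    # happens after Capybara has reset its session and waited for requests.
    config.append_after(:each) do
      DatabaseCleaner.clean
    end
  end
end
```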

  9. Great — I wish I knew what tutorials beginners are following and could get them changed, the number of people who are still using `after` is depressing. On a related note Rails 5.1 has baked Capybara into the new SystemTests support, and, I believe, has moved to a database connection pool sharing setup that probably removes the need for DatabaseCleaner in a lot of cases. It’s on my list to figure out how to configure that with RSpec.

  10. Yeah, I haven’t messed with 5.1 yet, but I’m very nervous about the shared db connection thing, since my research/experience suggested that _didn’t_ work well for people trying to do it before Rails made it standard, as discussed above.

    I’ve done some work with ActiveRecord and concurrency, and its connections simply weren’t designed to be shared between threads. Doing so is a violation of its contract, and it’s no surprise it causes all kinds of problems (such as if both threads happen to try to _do_ something with the connection at once, which maybe _shouldn’t_ happen under Capybara testing, but if my experience says anything it’s that all sorts of things that _shouldn’t_ happen will, and cause very difficult to fix race conditions).

    AR’s concurrency design doesn’t seem to have changed in 5.1; they’re just sharing a connection between threads anyway in tests. I am not optimistic, this seems like a poor choice likely to result in difficult to diagnose/fix edge case race conditions. But we’ll see.

    And I’m afraid “how are people figuring out how to do this and how do we update them” — is in part, I think, a consequence of the fact that getting this stuff to work right involves putting together so many different independent gems that have to work right together. It’s hard for people to figure it out, there _isn’t_ any primary documentation they go to, they have to hunt and gather for hints of varying levels of currency and quality. I guess I appreciate that Rails is trying to improve this by making Capybara testing an officially supported feature. I am still awfully worried about how they chose to do it.

  11. It didn’t work previously because each had a reference to the same connection and there was no control over which thread used it when. I believe now everything goes through the pool checking the connection in and out as needed, and the pool will allow another thread to wait up to an amount of time for the connection to be available. If the pool size is set to 1 and any uses of the connection are short enough to not timeout the requests for connections from other threads I think it should work. Guess we’ll find out.
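    The behaviour this comment speculates about (a pool of size 1, where a second thread waits up to some timeout for the connection to be checked back in) can be illustrated with a small sketch. This is not ActiveRecord's implementation; `TinyPool`, `with_connection` and `checkout_timeout` here are hypothetical names for illustration only.

```ruby
# Illustration of a size-1 pool with a timed checkout (not ActiveRecord code).
class TinyPool
  class TimeoutError < StandardError; end

  def initialize(connection, checkout_timeout: 5)
    @connection = connection
    @available = true
    @timeout = checkout_timeout
    @mutex = Mutex.new
    @released = ConditionVariable.new
  end

  # Check the single connection out, yield it, and always check it back in.
  def with_connection
    checkout
    begin
      yield @connection
    ensure
      checkin
    end
  end

  private

  # Wait until the connection is free, or raise once the timeout has passed.
  def checkout
    deadline = now + @timeout
    @mutex.synchronize do
      until @available
        remaining = deadline - now
        raise TimeoutError, 'could not obtain connection' if remaining <= 0
        @released.wait(@mutex, remaining)
      end
      @available = false
    end
  end

  def checkin
    @mutex.synchronize do
      @available = true
      @released.signal
    end
  end

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

    As long as each use of the connection is shorter than the timeout, other threads simply wait their turn; a long-held connection makes the waiting thread fail with a timeout instead.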
