Notes on retrying all jobs with ActiveJob retry_on

I would like to configure all my ActiveJobs to retry on failure, and I’d like to do so with the ActiveJob retry_on method.

So I’m going to configure it in my ApplicationJob class, in order to retry on any error, maybe something like:

class ApplicationJob < ActiveJob::Base
  retry_on StandardError # other args to be discussed
end

Why use ActiveJob retry_on for this? Why StandardError?

Many people use backend-specific logic for retries, especially with Sidekiq. That’s fine!

I like the idea of using the ActiveJob functionality:

  • I currently use resque (more on challenges with retry here later), but plan to switch to something else at some point medium-term. Maybe sideqkiq, but maybe delayed_job or good_job. (Just using the DB and not having a redis is attractive to me, as is open source). I like the idea of not having to redo this setup when I switch back-ends, or am trying out different ones.
  • In general, I like the promise of ActiveJob as swappable commoditized backends
  • I like what I see as good_job’s philosophy here, why have every back-end reinvent the wheel when a feature can be done at the ActiveJob level? That can help keep the individual back-end smaller, and less “expensive” to maintain. good_job encourages you to use ActiveJob retries I think.

Note, dhh is on record from 2018 saying he thinks setting up retries for all StandardError is a bad idea. But I don’t really understand why! He says “You should know why you’d want to retry, and the code should document that knowledge.” — but the fact that so many ActiveJob back-ends provide “retry all jobs” functionality makes it seem to me an established common need and best practice, and why shouldn’t you be able to do it with ActiveJob alone?

dhh thinks ActiveJob retry is for specific targetted retries maybe, and the backend retry should be used for generic universal ones? Honestly I don’t see myself doing much specific targetted retries, making all your jobs idempotent (important! Best practice for ActiveJob always!), and just having them all retry on any error seems to me to be the way to go, a more efficient use of developer time and sufficient for at least a relatively simple app.

One situation I have where a retry is crucial, is when I have a fairly long-running job (say it takes more than 60 seconds to run; I have some unavoidably!), and the machine running the jobs needs to restart. It might interrupt the job. It is convenient if it is just automatically retried — put back in the queue to be run again by restarted or other job worker hosts! Otherwise it’s just sitting there failed, never to run again, requiring manual action. An automatic retry will take care of it almost invisibly.

Resque and Resque Scheduler

Resque by default doens’t supprot future-scheduled jobs. You can add them with the resque-scheduler plugin. But I had a perhaps irrational desire to avoid this — resque and it’s ecosystem have at different times had different amounts of maintenance/abandonment, and I’m (perhaps irrationally) reluctant to complexify my resque stack.

And do I need future scheduling for retries? For my most important use cases, it’s totally fine if I retry just once, immediately, with a wait: 0. Sure, that won’t take care of all potential use cases, but it’s a good start.

I thought even without resque supporting future-scheduling, i could get away with:

retry_on StandardError, wait: 0

Alas, this won’t actually work, it still ends up being converted to a future-schedule call, which gets rejected by the resque_adapter bundled with Rails unless you have resque-scheduler installed.

But of course, resque can handle wait:0 semantically, if the code was willing to do it by queing an ordinary resque job…. I don’t know if it’s a good idea, but this simple patch to Rails-bundled resque_adapter will make it willing to accept “scheduled” jobs when the time to be scheduled is actually “now”, just scheduling them normally, while still raising on attempts to future schedule. For me, it makes retry_on.... wait: 0 work with just plain resque.

Note: retry_on attempts count includes first run

So wanting to retry just once, I tried something like this:

# Will never actually retry
retry_on StandardError, attempts: 1

My job was never actually retried this way! It looks like the attempts count includes the first non-error run, the total number of times job will be run, including the very first one before any “retries”! So attempts 1 means “never retry” and does nothing. Oops. If you actually want to retry only once, in my Rails 6.1 app this is what did it for me:

# will actually retry once
retry_on StandardError, attempts: 2

(I think this means the default, attempts: 5 actually means your job can be run a total of 5 times– one original time and 4 retries. I guess that’s what was intended?)

Note: job_id stays the same through retries, hooray

By the way, I checked, and at least in Rails 6.1, the ActiveJob#job_id stays the same on retries. If the job runs once and is retried twice more, it’ll have the same job_id each time, you’ll see three Performing lines in your logs, with the same job_id.

Phew! I think that’s the right thing to do, so we can easily correlate these as retries of the same jobs in our logs. And if we’re keeping the job_id somewhere to check back and see if it succeeded or failed or whatever, it stays consistent on retry.

Glad this is what ActiveJob is doing!

Logging isn’t great, but can be customized

Rails will automatically log retries with a line that looks like this:

Retrying TestFailureJob in 0 seconds, due to a RuntimeError.
# logged at `info` level

Eventually when it decides it’s attempts are exhausted, it’ll say something like:

Stopped retrying TestFailureJob due to a RuntimeError, which reoccurred on 2 attempts.
# logged at `error` level

This does not include the job-id though, which makes it harder than it should be to correlate with other log lines about this job, and follow the job’s whole course through your log file.

It’s also inconsistent with other default ActiveJob log lines, which include:

  • the Job ID in text
  • tags (Rails tagged logging system) with the job id and the string "[ActiveJob]". Because of the way the Rails code applies these only around perform/enqueue, retry/discard related log lines apparently end up not included.
  • The Exception message not just the class when there’s a class.

You can see all the built-in ActiveJob logging in the nicely compact ActiveJob::LogSubscriber class. And you can see how the log line for retry is kind of inconsistent with eg perform.

Maybe this inconsistency has persisted so long in part because few people actually use ActiveJob retry, they’re all still using their backends backend-specific functionality? I did try a PR to Rails for at least consistent formatting (my PR doesn’t do tagging), not sure if it will go anywhere, I think blind PR’s to Rails usually do not.

In the meantime, after trying a bunch of different things, I think I figured out the reasonable way to use the ActiveSupport::Notifications/LogSubscriber API to customize logging for the retry-related events while leaving it untouched from Rails for the others? See my solution here.

(Thanks to BigBinary blog for showing up in google and giving me a head start into figuring out how ActiveJob retry logging was working.)

(note: There’s also this: https://github.com/armandmgt/lograge_active_job But I’m not sure how working/maintained it is. It seems to only customize activejob exception reports, not retry and other events. It would be an interesting project to make an up-to-date activejob-lograge that applied to ALL ActiveJob logging, expressing every event as key/values and using lograge formatter settings to output. I think we see exactly how we’d do that, with a custom log subscriber as we’ve done above!)

Warning: ApplicationJob configuration won’t work for emails

You might think since we configured retry_on on ApplicationJob, all our bg jobs are now set up for retrying.

Oops! Not deliver_later emails.

Good_job README explains that ActiveJob mailers don’t descend from ApplicationMailer. (I am curious if there’s any good reason for this, it seems like it would be nice if they did!)

The good_job README provides one way to configure the built-in Rails mailer superclass for retries.

You could maybe also try setting delivery_job on that mailer superclass to use a custom delivery job (thanks again BigBinary for the pointer)… maybe one that subclasses the default class to deliver emails as normal, but let you set some custom options like retry_on? Not sure if this would be preferable in any way.

Leave a comment