Blacklight: Automatic retries of failed Solr requests

Sometimes my Blacklight app makes a request to Solr and it fails in a temporary/intermittent way.

  • Maybe there was a temporary network interruption, resulting in a failed connection or timeout
  • Maybe Solr was overloaded and being slow, and timed out
    • (Warning: By default Blacklight sets no timeouts and is willing to wait forever for Solr, which you probably don’t want. How to set a timeout is under-documented: set a read_timeout: key in your blacklight.yml to a number of seconds or, if you have RSolr 2.4.0+, set the key timeout instead. Both do the same thing: pass the value as timeout to the underlying faraday client.)
  • Maybe someone restarted the Solr being used live, which is not a good idea if you’re going for zero downtime, but maybe you aren’t that ambitious, or if you’re me maybe your SaaS solr provider did it without telling you, to resolve the Log4Shell bug.
    • And btw, if this happens, it can appear as a series of connection refused errors, 503 responses, and 404 responses, for maybe a second or three.
  • (By the way, also note well: your Blacklight app may be encountering these without you knowing, even if you think you are monitoring errors. By default Blacklight will take pretty much all Solr errors, including timeouts, and rescue them, responding with an HTTP 200 status page with the message “Sorry, I don’t understand your search.” And HoneyBadger or whatever other error monitoring you may be using will probably never know. I think that is broken and would like to fix it, but have been having trouble getting consensus and PR reviews to do so. You can fix it with some code locally, but that’s a separate topic, ANYWAY…)

So I said to myself, self, is there any way we could get Blacklight to automatically retry these sorts of temporary/intermittent failures, maybe once or twice, maybe after a delay? So there would be fewer errors presented to users (and fewer errors alerting me, after I fixed Blacklight to alert on em), in exchange for some users in those temporary error conditions waiting a bit longer for a page?

Blacklight talks to Solr via RSolr (it can use 1.x or 2.x), and RSolr, if you’re using 2.x, uses faraday for its solr http connections. So one nice way might be to configure the Blacklight/RSolr faraday connection with the faraday retry middleware (see the 1.x rubydoc; the middleware moved into its own gem in the recently released faraday 2.0).

Can you configure custom faraday middleware for the Blacklight faraday client? Yeesss…. but it requires making and configuring a custom Blacklight::Solr::Repository class, most conveniently by sub-classing the Blacklight class and overriding a private method. :( But it seems to work out quite well after you jump through some slightly kludgey hoops! Details below.

Questions for the Blacklight/Rsolr community:

  • Is this actually safe/forwards-compatible/supported, to be sub-classing Blacklight::Solr::Repository and over-riding build_connection with a call to super? Is this a bad idea?
  • Should Blacklight have its own supported and more targeted API for supplying custom faraday middleware generally (there are lots of ways this might be useful), or setting automatic retries specifically? I’d PR it, if there was some agreement about what it should look like and some chance of it getting reviewed/merged.
  • Is there anyone, anyone at all, who is interested in giving me emotional/political/sounding-board/code-review support for improving Blacklight’s error handling so it doesn’t swallow all connection/timeout/permanent configuration errors by returning an http 200 and telling the user “Sorry, I don’t understand your search”?

Oops, this may break in Faraday 2?

I haven’t actually tested this on the just-released Faraday 2.0, that was released right after I finished working on this. :( If faraday changes something that makes this approach infeasible, that might be added motivation to make Blacklight just have an API for customizing faraday middleware without having to hack into it like this.
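One possible wrinkle, stated as an untested assumption: in Faraday 1.x the retry middleware class is Faraday::Request::Retry, while the separate faraday-retry gem used with Faraday 2.x names it Faraday::Retry::Middleware. If that holds, a small lookup could pick whichever class is loaded before it gets passed to builder.delete. The helper name below is hypothetical, and the sketch runs even with no Faraday loaded at all (it then returns nil).

```ruby
# Hypothetical helper, not part of Blacklight, RSolr or Faraday.
# Assumes Faraday 1.x defines Faraday::Request::Retry and that the
# faraday-retry gem (Faraday 2.x) defines Faraday::Retry::Middleware.
# Returns nil when neither constant is defined.
def faraday_retry_middleware_class
  if defined?(Faraday::Retry::Middleware)
    Faraday::Retry::Middleware
  elsif defined?(Faraday::Request::Retry)
    Faraday::Request::Retry
  end
end

# Possible use inside a build_connection override:
#   klass = faraday_retry_middleware_class
#   faraday_connection.builder.delete(klass) if klass
```

Checking with defined? avoids a NameError on whichever Faraday version happens to be installed.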

The code for automatic retries in Blacklight 7

(and probably many other versions, but tested in Blacklight 7).

Here’s my whole local pull request if you find that more convenient, but I’ll also walk you through it a bit below and paste in frozen code.

There were some tricks to figuring out how to access and change the middleware on the existing faraday client returned by the super call; and how to remove the already-configured Blacklight middleware that would otherwise interfere with what we wanted to do (including an existing use of the retry middleware that I think is configured in a way that isn’t very useful or as intended). But overall it works out pretty well.
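Why the ordering of middleware matters can be shown with a toy stack in plain Ruby. This is not Faraday, and every name in it is hypothetical; it only assumes the general shape where a middleware added earlier wraps the ones added later, so a retry layer can rescue only the errors raised inside what it wraps.

```ruby
# Toy middleware stack for illustration only, NOT Faraday.
# Each "middleware" is a lambda that wraps the next app in the chain.

# Turns any status >= 400 into an exception, like a raise_error middleware.
def toy_raise_error(app)
  lambda do |env|
    status = app.call(env)
    raise "HTTP #{status}" if status >= 400
    status
  end
end

# Retries the wrapped app up to `max` extra times when it raises.
def toy_retry(app, max:)
  lambda do |env|
    attempts = 0
    begin
      attempts += 1
      app.call(env)
    rescue RuntimeError
      retry if attempts <= max
      raise
    end
  end
end

# Fake backend: answers 503 for the first `failures` calls, then 200.
# Every call is recorded in `log` so we can count requests.
def flaky_backend(failures:, log:)
  lambda do |env|
    log << env
    log.size <= failures ? 503 : 200
  end
end

# Retry wraps raise_error: errors reach the retry layer and get retried.
good_log = []
good = toy_retry(toy_raise_error(flaky_backend(failures: 2, log: good_log)), max: 2)
good.call(:search)

# raise_error wraps retry: the retry layer never sees an exception, so
# calling this stack raises after a single backend request.
bad_log = []
bad = toy_raise_error(toy_retry(flaky_backend(failures: 2, log: bad_log), max: 2))
```

In the first stack the error is raised inside the retry layer, so the backend is tried again; in the second the status only becomes an exception after the retry layer has already returned.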

I’m having it retry timeouts, connection failures, 404 responses, and any 5xx response. Nothing else. (For instance it won’t retry on a 400 which generally indicates an actual request error of some kind that isn’t going to have any different result on retry).
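That policy can be written down as a pair of plain-Ruby predicates. This is only a sketch of the decision rule, using stdlib exception classes as stand-ins; the method names are hypothetical and the real code further down expresses the same rule through Faraday exception classes instead.

```ruby
require "timeout"

# Hypothetical helper: should a response with this HTTP status be retried?
# 404 and any 5xx are treated as possibly temporary; everything else is not.
def retriable_status?(status)
  status == 404 || (500..599).cover?(status)
end

# Hypothetical helper: should this exception be retried?
# Stdlib stand-ins for "timeout" and "connection failed" conditions.
def retriable_error?(error)
  [Timeout::Error, Errno::ETIMEDOUT, Errno::ECONNREFUSED].any? do |klass|
    error.is_a?(klass)
  end
end

retriable_status?(503) # a Solr restart in progress: worth another try
retriable_status?(400) # a bad request: retrying won't change the answer
```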

I’m at least for now having it retry twice, waiting a fairly generous 300ms before the first retry, then another 600ms before a second retry if needed. Hey, my app can be slow, so it goes.
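With exponential backoff, each wait is the base interval multiplied by the backoff factor raised to the retry index. A minimal sketch of that arithmetic in plain Ruby follows; backoff_delays is a hypothetical helper, not part of Faraday, and it ignores any random jitter a retry middleware may add. The usage line passes the interval, backoff_factor and max values from the configuration in the code below.

```ruby
# Illustrative only: seconds to wait before each retry, without jitter.
# retry_index 0 is the first retry, 1 is the second, and so on.
def backoff_delays(interval:, backoff_factor:, max:)
  (0...max).map { |retry_index| interval * (backoff_factor**retry_index) }
end

# Two retries, starting at 0.3 seconds and doubling each time.
delays = backoff_delays(interval: 0.3, backoff_factor: 2, max: 2)
total_wait = delays.sum # worst-case extra time spent sleeping between attempts
```

The sum of the delays is the most extra time a request can spend sleeping before the final error is raised, not counting the time each attempt itself takes.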

Extensively annotated:

# ./lib/scihist/blacklight_solr_repository.rb
module Scihist
  # Custom sub-class of stock blacklight, to override build_connection
  # to provide custom faraday middleware for HTTP retries
  #
  # This may not be a totally safe forwards-compat Blacklight API
  # thing to do, but the only/best way we could find to add-in
  # Solr retries.
  class BlacklightSolrRepository < Blacklight::Solr::Repository
    # this is really only here for use in testing, skip the wait in tests
    class_attribute :zero_interval_retry, default: false

    # call super, but then mutate the faraday_connection on
    # the returned RSolr 2.x+ client, to customize the middleware
    # and add retry.
    def build_connection(*_args, **_kwargs)
      super.tap do |rsolr_client|
        faraday_connection = rsolr_client.connection

        # remove if already present, so we can add our own
        faraday_connection.builder.delete(Faraday::Request::Retry)

        # remove so we can make sure it's there AND added AFTER our
        # retry, so our retry can successfully catch its exceptions
        faraday_connection.builder.delete(Faraday::Response::RaiseError)

        # add retry middleware with our own configuration
        # https://github.com/lostisland/faraday/blob/main/docs/middleware/request/retry.md
        #
        # Retry at most twice, once after 300ms, then if needed after
        # another 600 ms (backoff_factor set to result in that)
        # Slow, but the idea is slow is better than an error, and our
        # app is already kinda slow.
        #
        # Retry not only the default Faraday exception classes (including timeouts),
        # but also Solr returning a 404 or any 5xx. Which gets converted to a
        # Faraday error because RSolr includes raise_error middleware already.
        #
        # Log retries. I wonder if there's a way to have us alerted if
        # there are more than X in some time window Y…
        faraday_connection.request :retry, {
          interval: (zero_interval_retry ? 0 : 0.300),
          # exponential backoff 2 means: 1) 0.300; 2) .600; 3) 1.2; 4) 2.4
          backoff_factor: 2,
          # But we only allow the first two before giving up.
          max: 2,
          exceptions: [
            # default faraday retry exceptions
            Errno::ETIMEDOUT,
            Timeout::Error,
            Faraday::TimeoutError,
            Faraday::RetriableResponse, # important to include when overriding!
            # we add some that could be Solr/jetty restarts, based
            # on our observations:
            Faraday::ConnectionFailed, # nothing listening there at all
            Faraday::ResourceNotFound, # HTTP 404
            Faraday::ServerError # any HTTP 5xx
          ],
          retry_block: -> (env, options, retries_remaining, exc) do
            # options.max - retries_remaining is the number of the retry being attempted
            Rails.logger.warn("Retrying Solr request: HTTP #{env["status"]}: #{exc.class}: retry #{options.max - retries_remaining}")
            # other things we could log include `env.url` and `env.response.body`
          end
        }

        # important to add this AFTER retry, to make sure retry can
        # rescue and retry its errors
        faraday_connection.response :raise_error
      end
    end
  end
end

Then in my local CatalogController config block, nothing more than:

config.repository_class = Scihist::BlacklightSolrRepository

I had some challenges figuring out how to test this. I ended up testing against a live running Solr instance, which my app’s test suite does sometimes (via solr_wrapper, for better or worse).

One test is just a simple smoke test that this thing still seems to function properly as a Blacklight::Solr::Repository without raising. And one tests the retry behavior against sample error responses.

require "rails_helper"

describe Scihist::BlacklightSolrRepository do
  # a way to get a configured repository class…
  let(:repository) do
    Scihist::BlacklightSolrRepository.new(CatalogController.blacklight_config).tap do |repo|
      # if we are testing retries, don't actually wait between them
      repo.zero_interval_retry = true
    end
  end

  # A simple smoke test against live solr hoping to be a basic test that the
  # thing works like a Blacklight::Solr::Repository, our customization attempt
  # hopefully didn't break it.
  describe "ordinary behavior smoke test", solr: true do
    before do
      create(:public_work).update_index
    end

    it "can return results" do
      response = repository.search
      expect(response).to be_kind_of(Blacklight::Solr::Response)
      expect(response.documents).to be_present
    end
  end

  # We're actually going to use webmock to try to mock some error conditions
  # to actually test retry behavior, not going to use live solr.
  describe "retry behavior", solr: true do
    let(:solr_select_url_regex) { /^#{Regexp.escape(ScihistDigicoll::Env.lookup!(:solr_url) + "/select")}/ }

    describe "with solr 400 response" do
      before do
        stub_request(:any, solr_select_url_regex).to_return(status: 400, body: "error")
      end

      it "does not retry" do
        expect {
          response = repository.search
        }.to raise_error(Blacklight::Exceptions::InvalidRequest)

        expect(WebMock).to have_requested(:any, solr_select_url_regex).once
      end
    end

    describe "with solr 404 response" do
      before do
        stub_request(:any, solr_select_url_regex).to_return(status: 404, body: "error")
      end

      it "retries twice" do
        expect {
          response = repository.search
        }.to raise_error(Blacklight::Exceptions::InvalidRequest)

        expect(WebMock).to have_requested(:any, solr_select_url_regex).times(3)
      end
    end
  end
end