Sometimes my Blacklight app makes a request to Solr and it fails in a temporary/intermittent way.
- Maybe there was a temporary network interruption, resulting in a failed connection or timeout
- Maybe Solr was overloaded and being slow, and timed out
- (Warning: Blacklight by default sets no timeouts, and is willing to wait forever for Solr, which you probably don’t want. How to set a timeout is under-documented, but set a `read_timeout:` key in your blacklight.yml to a number of seconds; or if you have RSolr 2.4.0+, set key `timeout`. Both will do the same thing, pass the value `timeout` to an underlying faraday client).
- Maybe someone restarted the Solr being used live, which is not a good idea if you’re going for zero downtime, but maybe you aren’t that ambitious, or if you’re me maybe your SaaS solr provider did it without telling you, to resolve the Log4Shell bug.
- And btw, if this happens, it can appear as a series of connection refused errors, 503 responses, and 404 responses, for maybe a second or three.
- (By the way also note well: Your blacklight app may be encountering these without you knowing, even if you think you are monitoring errors. Blacklight by default will take pretty much all Solr errors, including timeouts, and rescue them, responding with an HTTP 200 status page with a message “Sorry, I don’t understand your search.” And HoneyBadger or other error monitoring you may be using will probably never know. Which I think is broken and would like to fix, but I have been having trouble getting consensus and PR reviews to do so. You can fix it with some code locally, but that’s a separate topic, ANYWAY…)
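To make the timeout warning above concrete, here is a minimal sketch of what the relevant part of blacklight.yml might look like, parsed with Ruby's standard YAML library so you can see the resulting settings. The Solr URL and the 5-second value are placeholder assumptions, not recommendations; only the `read_timeout` and `timeout` key names come from the text above.

```ruby
require "yaml"

# Hypothetical blacklight.yml content; the URL and the 5-second value
# are placeholders chosen for illustration.
BLACKLIGHT_YML = <<~YAML
  production:
    adapter: solr
    url: http://localhost:8983/solr/blacklight-core
    read_timeout: 5   # seconds; with RSolr 2.4.0+ the key `timeout` does the same job
YAML

# Parse the YAML and pick out one environment's settings
solr_config = YAML.safe_load(BLACKLIGHT_YML).fetch("production")

puts solr_config["read_timeout"]
```

Whatever number you pick, the point is that some finite value is set, so a hung Solr turns into an error you can see (and retry) instead of a request that waits forever.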
So I said to myself, self, is there any way we could get Blacklight to automatically retry these sorts of temporary/intermittent failures, maybe once or twice, maybe after a delay? So there would be fewer errors presented to users (and fewer errors alerting me, after I fixed Blacklight to alert on em), in exchange for some users in those temporary error conditions waiting a bit longer for a page?
Blacklight talks to Solr via RSolr (can use 1.x or 2.x), and RSolr, if you’re using 2.x, uses faraday for its solr http connections. So one nice way might be to configure the Blacklight/RSolr faraday connection with the faraday retry middleware (1.x rubydoc; moved into its own gem in the recently released faraday 2.0).
Can you configure custom faraday middleware for the Blacklight faraday client? Yeesss…. but it requires making and configuring a custom `Blacklight::Solr::Repository` class, most conveniently by sub-classing the Blacklight class and overriding a private method. :( But it seems to work out quite well after you jump through some slightly kludgey hoops! Details below.
Questions for the Blacklight/Rsolr community:
- Is this actually safe/forwards-compatible/supported, to be sub-classing `Blacklight::Solr::Repository` and over-riding `build_connection` with a call to super? Is this a bad idea?
- Should Blacklight have its own supported and more targeted API for supplying custom faraday middleware generally (there are lots of ways this might be useful), or setting automatic retries specifically? I’d PR it, if there was some agreement about what it should look like and some chance of it getting reviewed/merged.
- Is there anyone, anyone at all, who is interested in giving me emotional/political/sounding-board/code-review support for improving Blacklight’s error handling so it doesn’t swallow all connection/timeout/permanent configuration errors by returning an http 200 and telling the user “Sorry, I don’t understand your search”?
Oops, this may break in Faraday 2?
I haven’t actually tested this on the just-released Faraday 2.0, which was released right after I finished working on this. :( If faraday changes something that makes this approach infeasible, that might be added motivation to make Blacklight just have an API for customizing faraday middleware without having to hack into it like this.
The code for automatic retries in Blacklight 7
(and probably many other versions, but tested in Blacklight 7).
Here’s my whole local pull request if you find that more convenient, but I’ll also walk you through it a bit below and paste in frozen code.
There were some tricks to figuring out how to access and change the middleware on the existing faraday client returned by the `super` call; and how to remove the already-configured Blacklight middleware that would otherwise interfere with what we wanted to do (including an existing use of the `retry` middleware that I think is configured in a way that isn’t very useful or as intended). But overall it works out pretty well.
I’m having it retry timeouts, connection failures, 404 responses, and any 5xx response. Nothing else. (For instance it won’t retry on a 400 which generally indicates an actual request error of some kind that isn’t going to have any different result on retry).
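The retry policy just described can be summarized as a small predicate. This is a plain-Ruby sketch for illustration only, not the actual faraday configuration; the method name `retryable_failure?` and the symbols for network-level failures are hypothetical.

```ruby
# Hypothetical helper: decides whether a failed Solr request is worth retrying.
# `failure` is either a symbol for a network-level problem or an HTTP status code.
def retryable_failure?(failure)
  case failure
  when :timeout, :connection_failed then true   # network-level problems
  when 404 then true                            # can show up briefly while Solr restarts
  when 500..599 then true                       # any 5xx response
  else false                                    # e.g. 400: a retry won't change the result
  end
end

puts retryable_failure?(503)
puts retryable_failure?(400)
```

The deliberately short list matters: anything not on it is treated as a real error and surfaces immediately rather than after a pointless delay.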
I’m at least for now having it retry twice, waiting a fairly generous 200ms before first retry, then another 400ms before a second retry if needed. Hey, my app can be slow, so it goes.
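One way to get that schedule from the faraday retry middleware is through its `max`, `interval` and `backoff_factor` options: the wait before each retry is the base interval multiplied by the backoff factor raised to the number of retries already made (ignoring the optional random jitter and the `max_interval` cap). The option values below are assumptions inferred from the numbers in the paragraph above, and the helper name is hypothetical; this plain-Ruby sketch only does the arithmetic.

```ruby
# Assumed option values, chosen to match "200ms, then 400ms, at most two retries"
RETRY_OPTIONS = { max: 2, interval: 0.2, backoff_factor: 2 }.freeze

# Hypothetical helper: the delay in seconds before each retry attempt.
def retry_delays(max:, interval:, backoff_factor:)
  (0...max).map { |retries_so_far| interval * backoff_factor**retries_so_far }
end

delays = retry_delays(**RETRY_OPTIONS)
puts delays.inspect
```

With these values, a request that fails all the way through spends about 600ms waiting between attempts, on top of however long the failed requests themselves took.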
Extensively annotated:
Then in my local `CatalogController` config block, nothing more than:
config.repository_class = Scihist::BlacklightSolrRepository
I had some challenges figuring out how to test this. I ended up testing against a live running Solr instance, which my app’s test suite does sometimes (via solr_wrapper, for better or worse).
One test that’s just a simple smoke test that this thing seems to still function properly as a `Blacklight::Solr::Repository` without raising. And one that of a sample error