Setting and handling solr timeouts in Blacklight

When using the Blacklight gem for a Solr search front-end (most used by institutions in the library/cultural heritage sector), you may wish to set a timeout on how long to wait for Solr connection/response.

It turns out, if you are using Rsolr 2.x, you can do this by setting a read_timeout key in your blacklight.yml file. (This under-documented key is a general timeout, despite the name; I have not investigated with Rsolr 1.x).

But the way it turns into an exception and the way that exception is handled is probably not what you would consider useful for your app. You can then change this by over-riding the handle_request_error method in your CatalogController.

I am planning on submitting some PR’s to RSolr and Blacklight to improve some of these things.

Read on for details.

Why set a timeout?

It’s generally considered important to always set a timeout value on an external network request. If you don’t do this, your application may wait indefinitely for the remote server to respond, if the remote server is being slow or hung; or it may depend on underlying library default timeouts that may not be what you want.

What can happen to a Blackligh that does not a set a Solr timeout? We could have a Solr server that takes a really long time — or is entirely hung — on returning a response for one request, or many, or all of them.

Your web workers (eg puma or passenger) will be waiting a while for Solr. Either indefinitely, or maybe there’s a default timeout in the HTTP client (I’m actually not sure, but maybe 60s for net-http?). During this time, the web workers are busy, and unable to handle other requests. This will reduce the traffic capacity of your app, for a very slow/repeatedly misbehaving Solr possibly catastrophically leading to an app that appears unresponsive.

There may be some other part of the stack that will timeout waiting for the web worker to return a response (while the web worker is waiting for Solr). For instance, heroku is willing to wait a maximum of 30 seconds, and I think Passenger also has timeouts (although may default to as long as 10 minutes??). But this may be much longer than you really want your app to wait on Solr for reasons above, and when it does get triggered you’ll get a generic “Timed out waiting for app response” in your logs/monitoring, it won’t be clear the web worker was waiting on solr, making operational debugging harder.

How to set a Blacklight Solr timeout

A network connection to Solr in the Blacklight stack first goes through RSolr, which then (in Rsolr 2.x) uses the Faraday ruby gem, which can use multiple http drivers but default uses net-http from the stdlib.

For historical reasons, how to handle timeouts has been pretty under-documented (and sometimes changing) at all these levels! They’re not making it easy to figure out how to effectively set timeouts! It took the ruby community a bit of time to really internalize the importance of timeouts on HTTP calls.

So I did some research, in code and in manual tests.

Faraday timeouts?

If we start in the middle at Faraday, it’s not clearly documented… and may be http-adapter-specific? Faraday really doesn’t make this easy for us!

But from googling, it looks like Faraday generally means to support keys open_timeout (waiting for a network connection to open), and timeout (often waiting for a response to be returned, but really… everything else, and sometimes includes open_timeout too).

If you want some details….

For instance, if we look at the faraday adapter for http-rb, we can see that the faraday timeout option is passed to http-rb for each of connect, read, and write.

  • (Which really means if you set it to 5 seconds… it could wait 5 seconds for connect then another 5 seconds for write and another 5 seconds for read 😦. http-rb actually provided a general/global timeout at one point, but faraday doens’t take advatnage of it. 😦😦).

And then http-rb adapter uses the open_timeout value just for connect and write. That is, setting both faraday options timeout and open_timeout to the same value would be redundant for the the http-rb adapter at present. the http-rb adapter doesn’t seem to do anything with any other faraday timeout options.

If we look at the default net-http adapter… It’s really confusing! We have to look at this method in faraday generic too. But I confirmed by manual testing that net-http actually supports faraday read_timeout, write_timeout, and open_timeout (different values than http-rb), but will also use timeout as a default for any of them. (Again your actual end-to-end timeout can be sum of open/read/write. 😦).

It looks like different Faraday adapters can use different timeout values, but Faraday tries to make the basic timeout value at least do something useful/general for each adapter?

Most blacklight users are probably using the default net-http adapter (Curious to hear about anyone who isn’t?)

What will Blacklight actually pass to Faraday?

This gets confusing too!

Blacklight seems to take whatever keys you have set in your blacklight.yml for the given environment, and pass them to RSolr.connect. With one exception, you have to say http_adapter in blacklight.config to translate to adapter passed to Rsolr.

  • (I didn’t find the code that makes that blacklight_config invocation be your environment-specific hash from blackight.yml, but I confirmed that’s what it is!)

What does Rsolr 2.x do? It does not pass on anything to Faraday, but only certain allow-listed items, after translating. Confusingly, it’s only wiling to pass on open_timeout, and also translate a read_timeout value from blacklight.yml to Faraday timeout.

Phew! So Blacklight/Rsolr only supports two timeout values to be passed to faraday:

  • open_timeout to Faraday open_timeout
  • read_timeout to Faraday timeout.

PR to Rsolr on timeout arguments?

I think ideally RSolr would pass on any of the values Faraday seems to recognize, at least with some adapters, for timeouts: read_timeout, open_timeout, write_timeout, as well as just timeout.

But to get from what it does now to there in a backwards compatible way… kind of impossible because of how it’s currently translating read_timeout to timeout. :(

I think I may PR one that just recognizes timeout too, while leaving read_timeout as a synonym with a deprecation warning telling you to use timeout? Still thinking this through.

What happens when a Timeout is triggered?

Here we have another complexity. Just as the timeout configuration values are translated on the way down the stack, the exceptions raised when a timeout happens are translated again on the way up, HTTP library => Faraday => RSolr => Blacklight.

Faraday basically has two exception classes it tries to normalize all underlying HTTP library timeouts to: Faraday::ConnectionFailed < Faraday::Error (for timeouts opening the connection) and Faraday::TimeoutError < Faraday::ServerError < Faraday::Error for other timeouts, such as read timeouts.

What happens with a connection timeout?

  1. Faraday raises a Faraday::ConnectionFailed error. (For instance from the default Net::HTTP Adapter)
  2. RSolr rescues it, and re-raises as an RSolr::Error::ConnectionRefused, which sub-classes the ruby stdlib Errno::ECONNREFUSED
  3. Blacklight rescues that Errno::ECONNREFUSED, and translates it to a Blacklight::Exceptions::ECONNREFUSED, (which is still a sub-class of stdlib Errno::ECONNREFUSED)

That just rises up to your application, to give the end-user probably a generic error message, be logged, be caught by any error-monitoring services you have, etc. Or you can configure your application to handle these Blacklight::Exceptions::ECONNREFUSED errors in some custom way using standard Rails rescue_from functionality, etc.

This is all great, just what we expect from exception handling.

The one weirdness is that the exception suggests connection refused, when really it was a timeout, which is somewhat different… but Faraday itself doesn’t distinguish between those two situations, which some people would like to improve for a while now, but there isn’t much a client of Faraday can do about in the meantime.

What happens with other timeouts?

Say, the network connection opened fine, but Solr is just being really slow returning a response (it totally happens) and exceeding a Faraday timeout value set.

The picture here is a bit less good.

  1. Faraday will raise a Faraday::TimeoutError (eg from the net-http adapter).
  2. RSolr does not treat this specially, but just rescues and re-raises it just like any other Faraday::Error as a generic RSolr::Error::Http
  3. Blacklight will take it, just as any other RSolr::Error::Http, and rescues and re-raise as a generic Blacklight::Exceptions::InvalidRequest
  4. Blacklight does not allow this to just rise up through the app, but instead uses Rails rescue_from to register it’s own handler for it, a handle_request_error method.
  5. The handle_request_error method will log the error, and then just display the current Blacklight controller “index” page (ie search form), with a message “Sorry, I don’t understand your search.”

This is… not great.

  • From a UX point of view, this is not good, we’re telling the user “sorry I don’t understand your search” when the probelm was a Solr timeout… it makes it seem like there’s something the user did wrong or could do differently, but that’s not what’s going on.
    • In fact that’s true for a lot of errors Blacklight catches this way. Solr is down? Solr collection doesn’t exist? Solr configuration has a mismatch with Blacklight configuration? All of these will result in this behavior, none of them are something the end-user can do anything about.
  • If you have an error monitoring service like Honeybadger, it won’t record this error, since the app handled it instead of letting it rise unhandled. So you may not even know this is going on.
  • If you have an uptime monitoring service, it might not catch this either, since the app is returning a 200. You could have an app pretty much entirely down and erroring for any attempt to do a search… but returning all HTTP 200 responses.
  • While Blacklight does log the error, it does it in a DIFFERENT way than Rails ordinarily does… you aren’t going to get a stack trace, or any other contextual information, it’s not really going to be clear what’s going on at all, if you mention it at all.

Not great. One option is to override the handle_request_error method in your own CatalogController to: 1) Disable this functionality entirely, don’t swallow the error with a “Sorry, I don’t understand your search” message, just re-raise it; and 2) unwrap the underlying Faraday::TimeoutError before re-raising, so that gets specifically reported instead of a generic “Blacklight::Exceptions::InvalidRequest”, so we can distinguish this specific situation more easily in our logs and error monitoring.

Here’s an implementation that does both, to be put in your catalog_controller.rb:

  # OVERRIDE of Blacklight method. Blacklight by default takes ANY kind
  # of Solr error (could be Solr misconfiguraiton or down etc), and just swallows
  # it, redirecting to Blacklight search page with a message "Sorry, I don't understand your search."
  #
  # This is wrong.  It's misleading feedback for user for something that is usually not
  # something they can do something about, and it suppresses our error monitoring
  # and potentially misleads our uptime checking.
  #
  # We just want to actually raise the error!
  #
  # Additionally, Blacklight/Rsolr wraps some errors that we don't want wrapped, mainly
  # the Faraday timeout error -- we want to be able to distinguish it, so will unwrap it.
  private def handle_request_error(exception)
    exception_causes = []
    e = exception
    until e.cause.nil?
      e = e.cause
      exception_causes << e
    end

    # Raies the more specific original Faraday::TimeoutError instead of
    # the generic wrapping Blacklight::Exceptions::InvalidRequest!
    if faraday_timeout = exception_causes.find { |e| e.kind_of?(Faraday::TimeoutError) }
      raise faraday_timeout
    end

    raise exception
  end

PRs to RSolr and Blacklight for more specific exception?

RSolr and Blacklight both have a special error class for the connection failed/timed out condition. But just lump Faraday::Timeout in with any other kind of error.

I think this logic is probably many years old, and pre-dates Faraday’s current timeout handling.

I think they should both have a new exception class which can be treated differently. Say RSolr::Error::Timeout and Blacklight::Exceptions::RepositoryTimeout?

I plan to make these PRs.

PR to Blacklight to disable that custom handle_request_error behavior

I think the original idea here was that something in the user’s query entry would trigger an exception. That’s what makes rescueing it and re-displaying it with the message “Sorry, I don’t understand your search” make some sense.

At the moment, I have no idea how to reproduce that, figure out a user-entered query that actually results in a Blacklight::Exceptions::InvalidRequest. Maybe it used to be possible to do in an older version of Solr but isn’t anymore? Or maybe it still is, but I just don’t know how?

But I can reproduce ALL SORTS of errors that were not about the user’s entry and which the end-user can do nothing about, but which still result in this misleading error message, and the error getting swallowed by Blacklight and avoiding your error- and uptime-monitoring services. Solr down entirely; Solr collection/core not present or typo’d. Mis-match between Solr configuration and Blacklight configuration, like Blacklight mentioning an Solr field that doens’t actually exist.

All of these result in Blacklight swallowing the exception, and returning an HTTP 200 response with the message “Sorry, I don’t understand your search”. This is not right!

I think this behavior should be removed in a future Blacklight version.

I would like to PR such a thing, but I’m not sure if I can get it reviewed/merged?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s