Github Action setup-ruby needs to quote ‘3.0’ or will end up with ruby 3.1

You may be running builds in Github Actions using the setup-ruby action to install a chosen version of ruby, looking something like this:

    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: 3.0

A week ago, that would have installed the latest ruby 3.0.x. But as of the Christmas release of ruby 3.1, it will install the latest ruby 3.1.x.

The workaround and/or correction is to quote the ruby version number. If you actually want to get latest ruby 3.0.x, say:

        ruby-version: '3.0'

This is reported here, with reference to this issue on the Github Actions runner itself. It is not clear to me that this is any kind of a bug in the github actions runner, rather than just an unanticipated consequence of using a numeric value in YAML here. 3.0 is of course the same number as 3, it’s not obvious to me it’s a bug that the YAML parser treats them as such.
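To make the YAML typing concrete, here is a minimal sketch using Ruby's own YAML parser. This is only an illustration: the Github Actions runner is not written in Ruby, and I'm assuming its parser types the scalar in a similar way. It shows that the unquoted value is a number, while the quoted one stays the string you typed.

```ruby
require "yaml"

# Unquoted: the parser sees a number. Quoted: it sees a string.
unquoted = YAML.safe_load("ruby-version: 3.0")["ruby-version"]
quoted   = YAML.safe_load("ruby-version: '3.0'")["ruby-version"]

puts unquoted.class  # Float
puts quoted.class    # String

# As a number, 3.0 and 3 are indistinguishable, so a consumer that turns
# the value back into text is free to produce "3" instead of "3.0".
puts unquoted == 3   # true
```

Once the value is a number, the ".0" you typed is no longer recoverable, which is why quoting is the safe choice.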

Perhaps it’s a bug or mis-design in the setup-ruby action. But in lieu of any developers deciding it’s a bug… quote your 3.0 version number, or perhaps just quote all ruby version numbers with the setup-ruby task?

If your 3.0 builds started failing and you have no idea why, this could be it. It can be a bit confusing to diagnose, because I’m not sure anything in the Github Actions output will normally echo the ruby version in use? I guess there’s a clue in the “Installing Bundler” sub-head of the “Setup Ruby” task.

Of course it’s possible your build will succeed anyway on ruby 3.1 even if you meant to run it on ruby 3.0! Mine failed with LoadError: cannot load such file -- net/smtp, so if yours happened to do the same, maybe you got here from google. :) (Clearly net/smtp has been moved to a different status of standard gem in ruby 3.1, I’m not dealing with this further because I wasn’t intentionally supporting ruby 3.1 yet).

Note that if you are building with a Github actions matrix for ruby version, the same issue applies. Maybe something like:

    strategy:
      matrix:
        include:
          - ruby: '3.0'

    steps:
    - uses: actions/checkout@v2

    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: ${{ matrix.ruby }}


Setting and handling solr timeouts in Blacklight

When using the Blacklight gem for a Solr search front-end (most used by institutions in the library/cultural heritage sector), you may wish to set a timeout on how long to wait for Solr connection/response.

It turns out, if you are using Rsolr 2.x, you can do this by setting a read_timeout key in your blacklight.yml file. (This under-documented key is a general timeout, despite the name; I have not investigated with Rsolr 1.x).

But the way it turns into an exception and the way that exception is handled is probably not what you would consider useful for your app. You can then change this by over-riding the handle_request_error method in your CatalogController.

I am planning on submitting some PR’s to RSolr and Blacklight to improve some of these things.

Read on for details.

Why set a timeout?

It’s generally considered important to always set a timeout value on an external network request. If you don’t do this, your application may wait indefinitely for the remote server to respond, if the remote server is being slow or hung; or it may depend on underlying library default timeouts that may not be what you want.

What can happen to a Blacklight app that does not set a Solr timeout? We could have a Solr server that takes a really long time (or is entirely hung) on returning a response for one request, or many, or all of them.

Your web workers (eg puma or passenger) will be waiting a while for Solr. Either indefinitely, or maybe there’s a default timeout in the HTTP client (I’m actually not sure, but maybe 60s for net-http?). During this time, the web workers are busy, and unable to handle other requests. This will reduce the traffic capacity of your app; with a very slow or repeatedly misbehaving Solr, possibly catastrophically, leading to an app that appears unresponsive.

There may be some other part of the stack that will timeout waiting for the web worker to return a response (while the web worker is waiting for Solr). For instance, heroku is willing to wait a maximum of 30 seconds, and I think Passenger also has timeouts (although may default to as long as 10 minutes??). But this may be much longer than you really want your app to wait on Solr for reasons above, and when it does get triggered you’ll get a generic “Timed out waiting for app response” in your logs/monitoring, it won’t be clear the web worker was waiting on solr, making operational debugging harder.
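For reference, here is a minimal sketch of what explicit timeouts look like at the very bottom of the stack, on Ruby's stdlib Net::HTTP. The host, port, and the specific values are placeholders I picked for illustration, not recommendations.

```ruby
require "net/http"

# Building the object does not open a connection, so this runs without Solr.
# "localhost" and 8983 are placeholder values.
http = Net::HTTP.new("localhost", 8983)

http.open_timeout  = 2 # seconds to wait for the connection to open
http.read_timeout  = 5 # seconds to wait for each read of the response
http.write_timeout = 5 # seconds to wait for each write of the request (Ruby 2.6+)
```

Note these are separate timers for separate phases of the request, not one overall deadline.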

How to set a Blacklight Solr timeout

A network connection to Solr in the Blacklight stack first goes through RSolr, which then (in RSolr 2.x) uses the Faraday ruby gem, which can use multiple http drivers but by default uses net-http from the stdlib.

For historical reasons, how to handle timeouts has been pretty under-documented (and sometimes changing) at all these levels! They’re not making it easy to figure out how to effectively set timeouts! It took the ruby community a bit of time to really internalize the importance of timeouts on HTTP calls.

So I did some research, in code and in manual tests.

Faraday timeouts?

If we start in the middle at Faraday, it’s not clearly documented… and may be http-adapter-specific? Faraday really doesn’t make this easy for us!

But from googling, it looks like Faraday generally means to support keys open_timeout (waiting for a network connection to open), and timeout (often waiting for a response to be returned, but really… everything else, and sometimes includes open_timeout too).

If you want some details….

For instance, if we look at the faraday adapter for http-rb, we can see that the faraday timeout option is passed to http-rb for each of connect, read, and write.

  • (Which really means if you set it to 5 seconds… it could wait 5 seconds for connect then another 5 seconds for write and another 5 seconds for read 😦. http-rb actually provided a general/global timeout at one point, but faraday doesn’t take advantage of it. 😦😦).

And then the http-rb adapter uses the open_timeout value just for connect and write. That is, setting both faraday options timeout and open_timeout to the same value would be redundant for the http-rb adapter at present. The http-rb adapter doesn’t seem to do anything with any other faraday timeout options.

If we look at the default net-http adapter… It’s really confusing! We have to look at this method in faraday generic too. But I confirmed by manual testing that net-http actually supports faraday read_timeout, write_timeout, and open_timeout (different values than http-rb), but will also use timeout as a default for any of them. (Again your actual end-to-end timeout can be sum of open/read/write. 😦).
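The fallback behavior I observed with the net-http adapter can be modeled in a few lines. This is a sketch of the rule as described above, not Faraday's actual code; effective_timeouts and worst_case_wait are hypothetical helper names.

```ruby
# Hypothetical helpers, not Faraday API: they model the fallback rule where
# the generic :timeout is used for any phase without a more specific value.
def effective_timeouts(options)
  fallback = options[:timeout]
  {
    open:  options.fetch(:open_timeout, fallback),
    read:  options.fetch(:read_timeout, fallback),
    write: options.fetch(:write_timeout, fallback)
  }
end

# Each phase has its own timer, so the end-to-end wait can be their sum.
def worst_case_wait(options)
  effective_timeouts(options).values.compact.sum
end

puts worst_case_wait(timeout: 5, open_timeout: 2) # 2 + 5 + 5 = 12
```

So a "5 second timeout" can, in the worst case, mean waiting noticeably longer than 5 seconds.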

It looks like different Faraday adapters can use different timeout values, but Faraday tries to make the basic timeout value at least do something useful/general for each adapter?

Most blacklight users are probably using the default net-http adapter (Curious to hear about anyone who isn’t?)

What will Blacklight actually pass to Faraday?

This gets confusing too!

Blacklight seems to take whatever keys you have set in your blacklight.yml for the given environment, and pass them to RSolr.connect. With one exception: you have to say http_adapter in blacklight.config for it to be translated to the adapter key passed to RSolr.

  • (I didn’t find the code that makes that blacklight_config invocation be your environment-specific hash from blacklight.yml, but I confirmed that’s what it is!)

What does RSolr 2.x do? It does not pass everything on to Faraday, only certain allow-listed items, after translating. Confusingly, it’s only willing to pass on open_timeout, and also translate a read_timeout value from blacklight.yml to Faraday timeout.

Phew! So Blacklight/Rsolr only supports two timeout values to be passed to faraday:

  • open_timeout to Faraday open_timeout
  • read_timeout to Faraday timeout.
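As a sketch of that translation (my own model of the behavior, not RSolr's source; the constant and method names are hypothetical, and the Solr URL is a placeholder):

```ruby
# Hypothetical model of the allow-list: blacklight.yml key => Faraday option.
TIMEOUT_KEY_TRANSLATION = {
  open_timeout: :open_timeout,
  read_timeout: :timeout
}.freeze

# Keep only the allow-listed timeout keys, renamed to what Faraday receives.
def faraday_timeout_options(blacklight_config)
  TIMEOUT_KEY_TRANSLATION.each_with_object({}) do |(config_key, faraday_key), result|
    result[faraday_key] = blacklight_config[config_key] if blacklight_config.key?(config_key)
  end
end

config = {
  url: "http://localhost:8983/solr/example", # placeholder URL
  read_timeout: 5,
  open_timeout: 2,
  write_timeout: 3 # not on the allow-list, so it never reaches Faraday
}

puts faraday_timeout_options(config).inspect
```

The practical upshot: a write_timeout in blacklight.yml does nothing, and read_timeout ends up as Faraday's general timeout.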

PR to Rsolr on timeout arguments?

I think ideally RSolr would pass on any of the values Faraday seems to recognize, at least with some adapters, for timeouts: read_timeout, open_timeout, write_timeout, as well as just timeout.

But to get from what it does now to there in a backwards compatible way… kind of impossible because of how it’s currently translating read_timeout to timeout. :(

I think I may PR one that just recognizes timeout too, while leaving read_timeout as a synonym with a deprecation warning telling you to use timeout? Still thinking this through.

What happens when a Timeout is triggered?

Here we have another complexity. Just as the timeout configuration values are translated on the way down the stack, the exceptions raised when a timeout happens are translated again on the way up, HTTP library => Faraday => RSolr => Blacklight.

Faraday basically has two exception classes it tries to normalize all underlying HTTP library timeouts to: Faraday::ConnectionFailed < Faraday::Error (for timeouts opening the connection) and Faraday::TimeoutError < Faraday::ServerError < Faraday::Error for other timeouts, such as read timeouts.

What happens with a connection timeout?

  1. Faraday raises a Faraday::ConnectionFailed error. (For instance from the default Net::HTTP Adapter)
  2. RSolr rescues it, and re-raises as an RSolr::Error::ConnectionRefused, which sub-classes the ruby stdlib Errno::ECONNREFUSED
  3. Blacklight rescues that Errno::ECONNREFUSED, and translates it to a Blacklight::Exceptions::ECONNREFUSED, (which is still a sub-class of stdlib Errno::ECONNREFUSED)

That just rises up to your application, to give the end-user probably a generic error message, be logged, be caught by any error-monitoring services you have, etc. Or you can configure your application to handle these Blacklight::Exceptions::ECONNREFUSED errors in some custom way using standard Rails rescue_from functionality, etc.

This is all great, just what we expect from exception handling.

The one weirdness is that the exception suggests connection refused, when really it was a timeout, which is somewhat different… but Faraday itself doesn’t distinguish between those two situations. Some people have wanted to improve that for a while now, but there isn’t much a client of Faraday can do about it in the meantime.

What happens with other timeouts?

Say, the network connection opened fine, but Solr is just being really slow returning a response (it totally happens) and exceeding a Faraday timeout value set.

The picture here is a bit less good.

  1. Faraday will raise a Faraday::TimeoutError (eg from the net-http adapter).
  2. RSolr does not treat this specially, but just rescues and re-raises it just like any other Faraday::Error as a generic RSolr::Error::Http
  3. Blacklight will take it, just as any other RSolr::Error::Http, and rescue and re-raise it as a generic Blacklight::Exceptions::InvalidRequest
  4. Blacklight does not allow this to just rise up through the app, but instead uses Rails rescue_from to register its own handler for it, a handle_request_error method.
  5. The handle_request_error method will log the error, and then just display the current Blacklight controller “index” page (ie search form), with a message “Sorry, I don’t understand your search.”

This is… not great.

  • From a UX point of view, this is not good, we’re telling the user “sorry I don’t understand your search” when the problem was a Solr timeout… it makes it seem like there’s something the user did wrong or could do differently, but that’s not what’s going on.
    • In fact that’s true for a lot of errors Blacklight catches this way. Solr is down? Solr collection doesn’t exist? Solr configuration has a mismatch with Blacklight configuration? All of these will result in this behavior, none of them are something the end-user can do anything about.
  • If you have an error monitoring service like Honeybadger, it won’t record this error, since the app handled it instead of letting it rise unhandled. So you may not even know this is going on.
  • If you have an uptime monitoring service, it might not catch this either, since the app is returning a 200. You could have an app pretty much entirely down and erroring for any attempt to do a search… but returning all HTTP 200 responses.
  • While Blacklight does log the error, it does it in a DIFFERENT way than Rails ordinarily does… you aren’t going to get a stack trace, or any other contextual information, it’s not really going to be clear what’s going on at all, if you notice it at all.

Not great. One option is to override the handle_request_error method in your own CatalogController to: 1) Disable this functionality entirely, don’t swallow the error with a “Sorry, I don’t understand your search” message, just re-raise it; and 2) unwrap the underlying Faraday::TimeoutError before re-raising, so that gets specifically reported instead of a generic “Blacklight::Exceptions::InvalidRequest”, so we can distinguish this specific situation more easily in our logs and error monitoring.

Here’s an implementation that does both, to be put in your catalog_controller.rb:

  # OVERRIDE of Blacklight method. Blacklight by default takes ANY kind
  # of Solr error (could be Solr misconfiguration or down etc), and just swallows
  # it, redirecting to Blacklight search page with a message "Sorry, I don't understand your search."
  # This is wrong.  It's misleading feedback for user for something that is usually not
  # something they can do something about, and it suppresses our error monitoring
  # and potentially misleads our uptime checking.
  # We just want to actually raise the error!
  # Additionally, Blacklight/Rsolr wraps some errors that we don't want wrapped, mainly
  # the Faraday timeout error -- we want to be able to distinguish it, so will unwrap it.
  private def handle_request_error(exception)
    # Walk the chain of wrapped exceptions via Exception#cause.
    exception_causes = []
    e = exception
    until e.cause.nil?
      e = e.cause
      exception_causes << e
    end

    # Raise the more specific original Faraday::TimeoutError instead of
    # the generic wrapping Blacklight::Exceptions::InvalidRequest!
    if (faraday_timeout = exception_causes.find { |cause| cause.kind_of?(Faraday::TimeoutError) })
      raise faraday_timeout
    end

    raise exception
  end
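If you want to convince yourself the unwrapping idea works without booting a Rails app, here is a minimal sketch using stand-in exception classes. The Stub* classes are hypothetical substitutes for Faraday::TimeoutError, RSolr::Error::Http, and Blacklight::Exceptions::InvalidRequest; the sketch assumes, as the override above does, that each layer re-raises from inside a rescue block so that ruby sets Exception#cause automatically.

```ruby
# Stand-ins so this runs without Faraday, RSolr, or Blacklight installed.
class StubTimeoutError < StandardError; end   # plays Faraday::TimeoutError
class StubHttpError < StandardError; end      # plays RSolr::Error::Http
class StubInvalidRequest < StandardError; end # plays Blacklight::Exceptions::InvalidRequest

# Simulates the rescue-and-re-raise translation on the way up the stack.
def simulated_search
  begin
    begin
      raise StubTimeoutError, "read timed out"
    rescue StubTimeoutError
      raise StubHttpError, "solr request failed" # cause is set by ruby
    end
  rescue StubHttpError
    raise StubInvalidRequest, "invalid request"
  end
end

# Walks the cause chain and returns the wrapped timeout error, if any.
def root_timeout(exception)
  e = exception
  while (e = e.cause)
    return e if e.is_a?(StubTimeoutError)
  end
  nil
end

begin
  simulated_search
rescue StubInvalidRequest => wrapped
  original = root_timeout(wrapped)
  puts "#{wrapped.class} wraps #{original.class}: #{original.message}"
end
```

The original timeout error survives two layers of wrapping, which is what lets the controller override find it and re-raise it.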

PRs to RSolr and Blacklight for more specific exception?

RSolr and Blacklight both have a special error class for the connection failed/timed out condition, but they just lump Faraday::TimeoutError in with any other kind of error.

I think this logic is probably many years old, and pre-dates Faraday’s current timeout handling.

I think they should both have a new exception class which can be treated differently. Say RSolr::Error::Timeout and Blacklight::Exceptions::RepositoryTimeout?

I plan to make these PRs.

PR to Blacklight to disable that custom handle_request_error behavior

I think the original idea here was that something in the user’s query entry would trigger an exception. That’s what makes rescuing it and re-displaying it with the message “Sorry, I don’t understand your search” make some sense.

At the moment, I have no idea how to reproduce that, figure out a user-entered query that actually results in a Blacklight::Exceptions::InvalidRequest. Maybe it used to be possible to do in an older version of Solr but isn’t anymore? Or maybe it still is, but I just don’t know how?

But I can reproduce ALL SORTS of errors that were not about the user’s entry and which the end-user can do nothing about, but which still result in this misleading error message, and the error getting swallowed by Blacklight and avoiding your error- and uptime-monitoring services. Solr down entirely; Solr collection/core not present or typo’d; mis-match between Solr configuration and Blacklight configuration, like Blacklight mentioning a Solr field that doesn’t actually exist.

All of these result in Blacklight swallowing the exception, and returning an HTTP 200 response with the message “Sorry, I don’t understand your search”. This is not right!

I think this behavior should be removed in a future Blacklight version.

I would like to PR such a thing, but I’m not sure if I can get it reviewed/merged?

Blacklight 7.x, deprecation of view overrides, paths forward

This post will only be of interest to those who use the blacklight ruby gem/platform, mostly my colleagues in library/cultural heritage sector.

When I recently investigated updating our Rails app from Blacklight to the latest 7.19.2, I encountered a lot of deprecation notices. They were related to code both in my local app and a plugin trying to override parts of Blacklight views — specifically the “constraints” (ie search limits/query “breadcrumbs” display) area in the code I encountered, I’m not sure if it applies to more areas of view customization.

Looking into this more to see if I could get a start on changing logic to avoid deprecation warnings, I had trouble figuring out any non-deprecated way to achieve the overrides. After more research, I think it’s not totally worked out how to make these overrides keep working at all in future Blacklight 8, and that this affects plugins including blacklight_range_limit, blacklight_advanced_search, geoblacklight, and possibly spotlight. Some solutions need to be found if these plugins are to be updated to keep working in future Blacklight 8.

I have documented what I found/understood, and some ideas for moving forward, hoping it will help start the community process of figuring out solutions to keep all this stuff working. I may not have gotten everything right or thought of everything, this is meant to help start the discussion, suggestions and corrections welcome.

This does get wordy, I hope you can find it useful to skip around or skim if it’s not all of interest. I believe the deprecations start around Blacklight 7.12 (released October 2020). I believe Blacklight 7.14 is the first version to support ruby 3.0, so anyone wanting to upgrade to ruby 3 will encounter these issues.


Over blacklight’s 10+ year existence, it has been a common use-case to customize specific parts of Blacklight, including customizing what shows up on one portion of a page while leaving other portions ‘stock’. An individual local application can do this with its own custom code; it is also common in many shared blacklight “plug-in”/“extension” engine gems.

Blacklight had traditionally implemented its “view” layer in a typical Rails way, involving “helper” methods and view templates. Customizations and overrides, by local apps or plugins, were implemented by over-riding these helper methods and partials. This traditional method of helper and partial overrides is still described in the Blacklight project wiki (it possibly could use updating for recent deprecations/new approaches).

This view/helper/override approach has some advantages: It just uses standard ruby and Rails, not custom Blacklight abstractions; multiple different plugins can override the same method, so long as they all call “super”, to cooperatively add functionality; it is very flexible and allows overrides that “just work”.

It also has some serious disadvantages. Rails helpers and views are known in general for leading to “spaghetti” or “ball of mud” code, where everything ends up depending on everything/anything else, and it’s hard to make changes without breaking things.

In the context of shared gem code like Blacklight and its ecosystem, it can get even messier to not know what is meant to be public API for an override. Blacklight’s long history has different maintainers with different ideas, and varying documentation or institutional memory of intents can make it even more confusing. Several generations of ideas can be present in the current codebase for both backwards-compatibility and “lack of resources to remove it” reasons. It can make it hard to make any changes at all without breaking existing code, a problem we were experiencing with Blacklight.

One solution that has appeared for Rails is the ViewComponent gem (written by github, actually), which facilitates better encapsulation, separation of concerns, and clear boundaries between different pieces of view code. The current active Blacklight maintainers (largely from Stanford I think?) put in some significant work, in Blacklight 7.x, to rewrite some significant parts of Blacklight’s view architecture based on the ViewComponent gem. This is a welcome contribution to solving real problems! Additionally, they did some frankly rather heroic things to get this replacement with ViewComponent to be, as a temporary transition step, very backwards compatible, even to existing code doing extensive helper/partial overrides, which was tricky to accomplish and shows their concern for current users.

Normally, when we see deprecation warnings, we like to fix them, to get them out of our logs, and prepare our apps for the future version where deprecated behavior stops working entirely. To do otherwise is considered leaving “technical debt” for the future, since a deprecation warning is telling you that code will have to be changed eventually.

The current challenge here is that it’s not clear (at least to me) how to change the code to still work in current Blacklight 7.x and upcoming Blacklight 8.x. Which is a challenge both for running in current BL 7 without deprecation, and for the prospects of code continuing to work in future BL 8. I’ll explain more with examples.

Blacklight_range_limit (and geoblacklight): Add a custom “constraint”

blacklight_range_limit introduces new query parameters for range limit filters, not previously recognized by Blacklight, that look eg like &range[year_facet][begin]=1910. In addition to having these affect the actual Solr search, it also needs to display this limit (that Blacklight core is ignoring) in the “constraints” area above the search results.

To do this it overrides the render_constraints_filters helper method from Blacklight, through some fancy code effectively calling super to render the ordinary Blacklight constraints filters but then adding on its rendering of the constraints only blacklight_range_limit knows about. One advantage of this “override, call super, but add on” approach is that multiple add-ons can do it, and they don’t interfere with each other, so long as they all call super, and only want to add additional content, not replace pre-existing content.
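The “override, call super, but add on” pattern is plain ruby, and can be sketched without Rails. Everything below is a stand-in I made up for illustration: the module names, the params shape, and the bracketed output are not the real Blacklight, blacklight_range_limit, or geoblacklight code.

```ruby
# Stand-in for Blacklight's own helper.
module CoreHelpers
  def render_constraints_filters(params)
    Array(params[:f]).map { |field, value| "[#{field}: #{value}]" }.join
  end
end

# Stand-in for a plugin override: render the stock output, then add on.
module RangeLimitOverride
  def render_constraints_filters(params)
    extra = Array(params[:range]).map do |field, range|
      "[#{field}: #{range[:begin]}-#{range[:end]}]"
    end.join
    super(params) + extra
  end
end

# A second, independent plugin doing the same thing.
module GeoOverride
  def render_constraints_filters(params)
    base = super(params)
    params[:bbox] ? base + "[bbox: #{params[:bbox]}]" : base
  end
end

# Later includes sit earlier in the method lookup chain, so each
# override's super reaches the one included before it.
class ViewContext
  include CoreHelpers
  include RangeLimitOverride
  include GeoOverride
end

params = {
  f: { format: "Book" },
  range: { year_facet: { begin: 1910, end: 1920 } },
  bbox: "1 2 3 4"
}
puts ViewContext.new.render_constraints_filters(params)
# prints: [format: Book][year_facet: 1910-1920][bbox: 1 2 3 4]
```

Neither override knows about the other, yet both contributions show up, which is the cooperative property that is hard to reproduce with partial overrides.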

But overriding this helper method is deprecated in recent Blacklight 7.x. If Blacklight detects any override to this method (among other constraints-related methods), it will issue a deprecation notice, and also switch into a “legacy” mode of view rendering, so the override will still work.

OK, what if we wanted to change how blacklight_range_limit does this, to avoid triggering the deprecation warnings, and to have blacklight continue to use the “new” (rather than “legacy”) logic, that will be the logic it insists on using in Blacklight 8?

The new logic is to render with the new view_component, which is rendered in the catalog/_constraints.html.erb partial. I guess if we want the rendering to behave differently in that new system, we need to introduce a new view component that is like Blacklight::ConstraintsComponent but behaves differently (perhaps a sub-class, or a class using delegation). Or, hey, that component takes some dependent view_components as args, maybe we just need to get the ConstraintsComponent to be given an arg for a different version of one of the _component args, not sure if that will do it.

It’s easy enough to write a new version of one of these components… but how would we get Blacklight to use it?

I guess we would have to override catalog/_constraints.html.erb. But this is unsatisfactory:

  • I thought we were trying to get out of overriding partials, but even if it’s okay in this situation…
  • It’s difficult and error-prone for an engine gem to override partials, you need to make sure it ends up in the right order in Rails “lookup paths” for templates, but even if you do this…
  • What if multiple things want to add on a section to the “constraints” area? Only one can override this partial, there is no way for a partial to call super.

So perhaps we need to ask the local app to override catalog/_constraints.html.erb (or generate code into it), and that code calls our alternate component, or calls the stock component with alternate dependency args.

  • This is already seeming a bit more complex and fragile than the simpler one-method override we did before, we have to copy-and-paste the currently non-trivial implementation in _constraints.html.erb, but even if we aren’t worried about that….
  • Again, what happens if multiple different things want to add on to what’s in the “constraints” area?
  • What if there are multiple places that need to render constraints, including other custom code? (More on this below). They all need to be identically customized with this getting-somewhat-complex code?

That multiple things might want to add on isn’t just theoretical, geoblacklight also wants to add some things to the ‘constraints’ area and also does it by overriding the render_constraints_filters method.

Actually, if we’re just adding on to existing content… I guess the local app could override catalog/_constraints.html.erb, copy the existing blacklight implementation, then just add at the END a call to both, say, <%= render(BlacklightRangeLimit::RangeConstraintsComponent) %> and then also <%= render(GeoBlacklight::GeoConstraintsComponent) %>… it actually could work… but it seems fragile, especially when we start dealing with “generators” to automatically create these in a local app for CI in the plugins, as blacklight plugins do?

My local app (and blacklight_advanced_search): Change the way the “query” constraint looks

If you just enter the query ‘cats’, “generic” out of the box Blacklight shows you your search with this as a sort of ‘breadcrumb’ constraint in a simple box at the top of the search:

My local app (in addition to changing the styling) changes that to an editable form to change your query (while keeping other facet etc filters exactly the same). Is this a great UX? Not sure! But it’s what happens right now:

It does this by overriding `render_constraints_query` and not calling super, replacing the standard implementation with my own.

How do we do this in the new non-deprecated way?

I guess again we have to either replace Blacklight::ConstraintsComponent with a new custom version… or perhaps pass in a custom component for query_constraint_component… this time we can’t just render and add on, we really do need to replace something.

What options do we have? Maybe, again, customizing _constraints.html.erb to call that custom component and/or custom-arg. And make sure any customization is consistent with any customization done by say blacklight_range_limit or geoblacklight, make sure they aren’t all trying to provide mutually incompatible custom components.

I still don’t like:

  • having to override a view partial (when before I only overrode a helper method), in local app instead of plugin it’s more feasible, but we still have to copy-and-paste some non-trivial code from Blacklight to our local override, and hope it doesn’t change
  • Pretty sensitive to implementation of Blacklight::ConstraintsComponent if we’re sub-classing it or delegating it. I’m not sure what parts of it are considered public API, or how frequently they are to change… if we’re not careful, we’re not going to have any more stable/reliable/forwards-compatible code than we did under the old way.
  • This solution doesn’t provide a way for custom code to render a constraints area with all customizations added by any add-ons, which is a current use case, see next section.

It turns out blacklight_advanced_search also customizes the “query constraint” (in order to handle the multi-field queries that the plugin can do), also by overriding render_constraints_query, so this exact use case affects that plug-in too, with a bit more challenge in a plugin instead of a local app.

I don’t think any of these solutions we’ve brainstormed are suitable and reliable.

But calling out to Blacklight function blocks too, as in spotlight….

In addition to overriding a helper method to customize what appears on the screen, traditionally custom logic in a local app or plug-in can call a helper method to render some piece of Blacklight functionality on screen.

For instance, the spotlight plug-in calls the render_constraints method in one of its own views, to include that whole “constraints” area on one of its own custom pages.

Using the legacy helper method architecture, spotlight can render the constraints including any customizations the local app or other plug-ins have made via their overriding of helper methods. For instance, when spotlight calls render_constraints, it will get the additional constraints that were added by blacklight_range_limit or geoblacklight too.

How would spotlight render constraints using the new architecture? I guess it would call the Blacklight view_component directly, with render(…). But how does it manage to use any customizations added by plug-ins like blacklight_range_limit? Not sure. None of the solutions we brainstormed above seem to get us there.

I suppose (Eg) spotlight could actually render the constraints.html.erb partial, that becomes the one canonical standardized “API” for constraints rendering, to be customized in the local app and re-used every time constraints view is needed? That might work, but seems a step backwards to go toward view partial as API to me, I feel like we were trying to get away from that for good reasons, it just feels messy.

This makes me think new API might be required in Blacklight, if we are not to have reduction in “view extension” functionality for Blacklight 8 (which is another option: say, well, you just can’t do those things anymore, significantly trimming the scope of what is possible with plugins, possibly abandoning some plugins).

There are other cases where blacklight_range_limit for example calls helper methods to re-use functionality. I haven’t totally analyzed them. It’s possible that in some cases, the plug-in just should copy-and-paste hard-coded HTML or logic, without allowing for other actors to customize them. Examples of what blacklight_range_limit calls here include

New API? Dependency Injection?

Might there be some new API that Blacklight could implement that would make this all work smoother and more consistently?

If we want a way to tell Blacklight “use my own custom component instead of Blacklight::ConstraintsComponent”, ideally without having to override a view template, at first that made me think “Inversion of Control with Dependency Injection”? I’m not thrilled with this generic solution, but thinking it through….

What if there was some way the local app or plugin could do Blacklight::ViewComponentRegistration.constraints_component_class = MyConstraintsComponent, and then when blacklight wants to call it, instead of doing, like it does now, <%= render( stuff) %>, it’d do something like: `<%= stuff) %>.

That lets us “inject” a custom class without having to override the view component and every other single place it might be used, including new places from plugins etc. The specific arguments the component takes would have to be considered/treated as public API somehow.

It still doesn’t let multiple add-ons cooperate to each add a new constraint item though. I guess to do that, the registry could have an array for each thing….

Blacklight::ViewComponentRegistration.constraints_component_classes = [

# And then I guess we really need a convenience method for calling
# ALL of them in a row and concatenating their results....

Blacklight::ViewComponentRegistration.render(:constraints_component_class, search_state: stuff)

On the plus side, now something like spotlight can call that too to render a “constraints area” including customizations from BlacklightRangeLimit, GeoBlacklight, etc.
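To make the registry idea a bit more tangible, here is a rough plain-ruby sketch. To be clear, nothing like this exists in Blacklight; the module, its register and render methods, the slot name, and the stub components are all hypothetical. Real ViewComponents are rendered through the Rails view context, which I'm approximating here with plain objects that respond to call.

```ruby
# Hypothetical registry; nothing like this exists in Blacklight today.
module ViewComponentRegistration
  @component_classes = Hash.new { |hash, slot| hash[slot] = [] }

  # Several add-ons can register a component class on the same slot.
  def self.register(slot, component_class)
    @component_classes[slot] << component_class
  end

  # Instantiate every component registered for the slot, in order,
  # and concatenate what they render.
  def self.render(slot, **args)
    @component_classes[slot].map { |klass| klass.new(**args).call }.join
  end
end

# Stand-ins for view components: plain objects whose #call returns markup.
class StubQueryConstraint
  def initialize(search_state:)
    @search_state = search_state
  end

  def call
    "<span>#{@search_state[:q]}</span>"
  end
end

class StubRangeConstraint
  def initialize(search_state:)
    @search_state = search_state
  end

  def call
    "<span>#{@search_state[:range]}</span>"
  end
end

ViewComponentRegistration.register(:constraints, StubQueryConstraint)
ViewComponentRegistration.register(:constraints, StubRangeConstraint)

puts ViewComponentRegistration.render(:constraints, search_state: { q: "cats", range: "1910-1920" })
# prints: <span>cats</span><span>1910-1920</span>
```

Any caller, whether Blacklight core or a plugin view, gets the same combined output from the one render call, as long as every registered component accepts the same initializer arguments.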

But I have mixed feelings about this, it seems like the kind of generic-universal yet-more-custom-abstraction thing that sometimes gets us in trouble and over-complexified. Not sure.
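
To make the registry idea concrete, here is a minimal plain-Ruby sketch. Everything in it is hypothetical: `ViewComponentRegistry`, the slot name, and the two stand-in component classes are made up for illustration and are not Blacklight API. Real view components would be rendered by Rails rather than returning strings from `call`.

```ruby
# Hypothetical sketch: none of these names exist in Blacklight today.
module ViewComponentRegistry
  @slots = Hash.new { |hash, key| hash[key] = [] }

  class << self
    # Register a component class for a named slot; several add-ons can
    # each register their own.
    def register(slot, component_class)
      @slots[slot] << component_class
    end

    # Instantiate every component registered for the slot with the same
    # keyword arguments and concatenate what each one renders.
    def render(slot, **args)
      @slots[slot].map { |klass| klass.new(**args).call }.join
    end
  end
end

# Stand-ins for real view components; `call` returns the rendered markup.
class QueryConstraint
  def initialize(search_state:)
    @search_state = search_state
  end

  def call
    "<span>q=#{@search_state[:q]}</span>"
  end
end

class RangeConstraint
  def initialize(search_state:)
    @search_state = search_state
  end

  def call
    "<span>range=#{@search_state[:range]}</span>"
  end
end

ViewComponentRegistry.register(:constraints, QueryConstraint)
ViewComponentRegistry.register(:constraints, RangeConstraint)

html = ViewComponentRegistry.render(:constraints, search_state: { q: "maps", range: "1900-1950" })
puts html
```

The point of the sketch is only that the caller asks for a slot by name and never needs to know which add-ons contributed to it. The keyword arguments each slot passes to `new` are the part that would have to become public API.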

API just for constraints view customization?

OK, instead of trying to make a universal API for customizing “any view component”, what if we just focus on the actual use cases in front of us here? All the ones I’ve encountered so far are about the “constraints” area? Can we add custom API just for that?

It might look almost exactly the same as the generic “IoC” solution above, but on the Blacklight::ConstraintsComponent class…. Like, we want to customize the component Blacklight::ConstraintsComponent uses to render the ‘query constraint’ (for my local app and advanced search use cases), right now we have to change the call site for every place it exists, to have a different argument… What if instead we can just:

Blacklight::ConstraintsComponent.query_constraint_component =

And ok, for these “additional constraint items” we want to add… in “legacy” architecture we overrode “render_constraints_filters” (normally used for facet constraints) and called super… but that’s just cause that’s what we had, really this is a different semantic thing, let’s just call it what it is:

Blacklight::ConstraintsComponent.additional_render_components <<

Blacklight::ConstraintsComponent.additional_render_components <<

All those component “slots” would still need to have their initializer arguments be established as “public API” somehow, so you can register one knowing what args its initializer is going to get.

Note this solves the spotlight case too, spotlight can just simply call render Blacklight::ConstraintsComponent(..., and it now does get customizations added by other add-ons, because they were registered with the Blacklight::ConstraintsComponent.
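
Here is a rough plain-Ruby sketch of what that narrower API could look like. It is purely hypothetical: `ConstraintsArea` stands in for Blacklight::ConstraintsComponent, and the class-level settings are the ones brainstormed above, not anything Blacklight provides.

```ruby
# Hypothetical sketch; ConstraintsArea stands in for
# Blacklight::ConstraintsComponent, and none of this is existing API.
class DefaultQueryConstraint
  def initialize(search_state:)
    @search_state = search_state
  end

  def call
    "[query: #{@search_state[:q]}]"
  end
end

class ConstraintsArea
  class << self
    # Slot for swapping out the component that renders the query constraint.
    attr_writer :query_constraint_component

    def query_constraint_component
      @query_constraint_component || DefaultQueryConstraint
    end

    # Slot that add-ons append their extra constraint components to.
    def additional_render_components
      @additional_render_components ||= []
    end
  end

  def initialize(search_state:)
    @search_state = search_state
  end

  # Render the query constraint first, then every registered extra.
  def call
    components = [self.class.query_constraint_component] + self.class.additional_render_components
    components.map { |klass| klass.new(search_state: @search_state).call }.join(" ")
  end
end

# An add-on registers one extra constraint item.
class RangeLimitConstraint
  def initialize(search_state:)
    @search_state = search_state
  end

  def call
    "[range: #{@search_state[:range]}]"
  end
end

ConstraintsArea.additional_render_components << RangeLimitConstraint

output = ConstraintsArea.new(search_state: { q: "maps", range: "1900-1950" }).call
puts output
```

Any caller that renders `ConstraintsArea` picks up the registered extras without changing its own call site, which is the property the spotlight case needs.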

I think this API may meet all the use cases I’ve identified? Which doesn’t mean there aren’t some I haven’t identified. I’m not really sure what architecture is best here, I’ve just tried to brainstorm possibilities. It would be good to choose carefully, as we’d ideally find something that can work through many future Blacklight versions without having to be deprecated again.

Need for Coordinated Switchover to non-deprecated techniques

The way Blacklight implements backwards-compatible support for the constraints render, is if it detects anything in the app is overriding a relevant method or partial, it continues rendering the “legacy” way with helpers and partials.

So if I were to try upgrading my app to do something using a new non-deprecated method, while my app is still using blacklight_range_limit doing things the old way… it would be hard to keep them both working. If you have more than one Blacklight plug-in overriding relevant view helpers, it of course gets even more complicated.

It pretty much has to be all-or-nothing. Which also makes it hard for say blacklight_range_limit to do a release that uses a new way (if we figured one out) — it’s probably only going to work in apps that have changed ALL their parts over to the new way. I guess all the plug-ins could do releases that offered you a choice of configuration/installation instructions, where the host app could choose new way or old way.

I think the complexity of this makes it more realistic, especially based on actual blacklight community maintenance resources, that a lot of apps are just going to keep running in deprecated mode, and a lot of plugins will only be available in versions that trigger deprecation warnings, until Blacklight 8.0 comes out and the deprecated behavior simply breaks, and then we’ll need Blacklight 8-only versions of all the plugins, with apps switching everything over all at once.

If different plugins approach this in an uncoordinated fashion, each trying to invent a way to do it, they really risk stepping on each other’s toes and being incompatible with each other. I think really something has to be worked out as the Blacklight-recommended consensus/best practice approach to view overrides, so everyone can just use it in a consistent and compatible way. Whether that requires new API not yet in Blacklight, or a clear pattern with what’s in current Blacklight 7 releases.

Ideally all worked out by currently active Blacklight maintainers and/or community before Blacklight 8 comes out, so people at least know what needs to be done to update code. Many Blacklight users may not be using Blacklight 7.x at all yet (7.0 released Dec 2018) — for instance hyrax still uses Blacklight 6 — so I’m not sure what portion of the community is already aware this is coming up on the horizon.

I hope the time I’ve spent investigating and considering and documenting in this piece can be helpful to the community as one initial step, to understanding the lay of the land.

For now, silence deprecations?

OK, so I really want to upgrade to latest Blacklight 7.19.2, from my current 7.7.0. To just stay up to date, and to be ready for ruby 3.0. (My app def can’t pass tests on ruby 3 with BL 7.7; it looks like BL added ruby 3.0 support in BL 7.14.0? Which does already have the deprecations).

It’s not feasible right now to eliminate all the deprecated calls. But my app does seem to work fine, just with deprecation calls.

I don’t really want to leave all those “just ignore them for now” deprecation messages in my CI and production logs though. They just clutter things up and make it hard to pay attention to the things I want to be noticing.

Can we silence them? Blacklight uses the deprecation gem for its deprecation messages; the gem is by cbeer, with logic taken out of ActiveSupport.

We could wrap all calls to deprecated methods in Deprecation.silence do…. including making a PR to blacklight_range_limit to do that? I’m not sure I like the idea of making blacklight_range_limit silent on this problem, it needs more attention at this point! Also I’m not sure how to use Deprecation.silence to affect that clever conditional check in the _constraints.html.erb template.

We could entirely silence everything from the deprecation gem with Deprecation.default_deprecation_behavior — I don’t love this, we might be missing deprecations we want?

The Deprecation gem API made me think there might be a way to silence deprecation warnings from individual classes with things like Blacklight::RenderConstraintsHelperBehavior.deprecation_behavior = :silence, but I think I was misinterpreting the API, there didn’t actually seem to be methods like that available in Blacklight to silence what I wanted in a targeted way.

Looking/brainstorming more in Deprecation gem API… I *could* change its behavior to its “notify” strategy that sends ActiveSupport::Notification events instead of writing to stdout/log… and then write a custom ActiveSupport::Notification subscriber which ignored the ones I wanted to ignore… ideally still somehow keeping the undocumented-but-noticed-and-welcome default behavior in test/rspec environment where it somehow reports out a summary of deprecations at the end…

This seemed too much work. I realized that the only things that use the Deprecation gem in my project are Blacklight itself and the qa gem (I don’t think it has caught on outside blacklight/samvera communities), and I guess I am willing to just silence deprecations from all of them, although I don’t love it.
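
A minimal sketch of what that global silencing might look like in an initializer. The file name is just a suggestion, and I’m assuming the gem’s module-level default_deprecation_behavior setter accepts :silence, so check it against the version of the deprecation gem you have installed.

```ruby
# config/initializers/silence_deprecation_gem.rb (file name is only a suggestion)
#
# Assumption: the deprecation gem exposes a module-level default behavior
# setting that accepts :silence. This silences everything that reports
# through the gem (for me, Blacklight and qa), not just the constraints
# warnings. The guard keeps the file harmless if the gem is not loaded.
Deprecation.default_deprecation_behavior = :silence if defined?(Deprecation)
```

This only touches messages sent through the deprecation gem; Rails’ own ActiveSupport::Deprecation warnings are a separate mechanism and are not affected.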

Notes on retrying all jobs with ActiveJob retry_on

I would like to configure all my ActiveJobs to retry on failure, and I’d like to do so with the ActiveJob retry_on method.

So I’m going to configure it in my ApplicationJob class, in order to retry on any error, maybe something like:

class ApplicationJob < ActiveJob::Base
  retry_on StandardError # other args to be discussed
end

Why use ActiveJob retry_on for this? Why StandardError?

Many people use backend-specific logic for retries, especially with Sidekiq. That’s fine!

I like the idea of using the ActiveJob functionality:

  • I currently use resque (more on challenges with retry here later), but plan to switch to something else at some point medium-term. Maybe sidekiq, but maybe delayed_job or good_job. (Just using the DB and not having a redis is attractive to me, as is open source). I like the idea of not having to redo this setup when I switch back-ends, or am trying out different ones.
  • In general, I like the promise of ActiveJob as swappable commoditized backends
  • I like what I see as good_job’s philosophy here, why have every back-end reinvent the wheel when a feature can be done at the ActiveJob level? That can help keep the individual back-end smaller, and less “expensive” to maintain. good_job encourages you to use ActiveJob retries I think.

Note, dhh is on record from 2018 saying he thinks setting up retries for all StandardError is a bad idea. But I don’t really understand why! He says “You should know why you’d want to retry, and the code should document that knowledge.” — but the fact that so many ActiveJob back-ends provide “retry all jobs” functionality makes it seem to me an established common need and best practice, and why shouldn’t you be able to do it with ActiveJob alone?

dhh thinks ActiveJob retry is for specific targeted retries maybe, and the backend retry should be used for generic universal ones? Honestly I don’t see myself doing many specific targeted retries. Making all your jobs idempotent (important! Best practice for ActiveJob always!), and just having them all retry on any error, seems to me to be the way to go: a more efficient use of developer time and sufficient for at least a relatively simple app.

One situation I have where a retry is crucial, is when I have a fairly long-running job (say it takes more than 60 seconds to run; I have some unavoidably!), and the machine running the jobs needs to restart. It might interrupt the job. It is convenient if it is just automatically retried — put back in the queue to be run again by restarted or other job worker hosts! Otherwise it’s just sitting there failed, never to run again, requiring manual action. An automatic retry will take care of it almost invisibly.

Resque and Resque Scheduler

Resque by default doesn’t support future-scheduled jobs. You can add them with the resque-scheduler plugin. But I had a perhaps irrational desire to avoid this: resque and its ecosystem have at different times had different amounts of maintenance/abandonment, and I’m (perhaps irrationally) reluctant to complexify my resque stack.

And do I need future scheduling for retries? For my most important use cases, it’s totally fine if I retry just once, immediately, with a wait: 0. Sure, that won’t take care of all potential use cases, but it’s a good start.

I thought even without resque supporting future-scheduling, I could get away with:

retry_on StandardError, wait: 0

Alas, this won’t actually work, it still ends up being converted to a future-schedule call, which gets rejected by the resque_adapter bundled with Rails unless you have resque-scheduler installed.

But of course, resque can handle wait: 0 semantically, if the code was willing to do it by queueing an ordinary resque job…. I don’t know if it’s a good idea, but this simple patch to Rails-bundled resque_adapter will make it willing to accept “scheduled” jobs when the time to be scheduled is actually “now”, just scheduling them normally, while still raising on attempts to future schedule. For me, it makes retry_on.... wait: 0 work with just plain resque.
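
The shape of that idea, as a toy in plain Ruby. This is not the Rails resque_adapter source or my actual patch: `ImmediateOnlyAdapter` and its `now:` parameter are made up for illustration, and only the enqueue/enqueue_at split mirrors how ActiveJob adapters are organized.

```ruby
# Toy stand-in for a queue adapter without future scheduling. This is not
# the Rails resque_adapter source, just the shape of the idea.
class ImmediateOnlyAdapter
  attr_reader :queued

  def initialize
    @queued = []
  end

  # Ordinary "run as soon as possible" enqueue.
  def enqueue(job)
    @queued << job
  end

  # Accept a "scheduled" job only when the scheduled time is not in the
  # future; otherwise refuse, as a backend without a scheduler must.
  def enqueue_at(job, timestamp, now: Time.now.to_f)
    if timestamp <= now
      enqueue(job)
    else
      raise NotImplementedError, "future scheduling needs a scheduler plugin"
    end
  end
end

adapter = ImmediateOnlyAdapter.new
adapter.enqueue_at(:retry_me, Time.now.to_f) # wait: 0 is effectively "now"

begin
  adapter.enqueue_at(:later, Time.now.to_f + 60) # a real future schedule
rescue NotImplementedError => e
  puts "rejected: #{e.message}"
end
```

The only real decision is the comparison in enqueue_at: anything due now or in the past goes through the ordinary enqueue path, and anything genuinely in the future still fails loudly.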

Note: retry_on attempts count includes first run

So wanting to retry just once, I tried something like this:

# Will never actually retry
retry_on StandardError, attempts: 1

My job was never actually retried this way! It looks like the attempts count is the total number of times the job will be run, including the very first run before any “retries”! So attempts: 1 means “never retry” and does nothing. Oops. If you actually want to retry only once, in my Rails 6.1 app this is what did it for me:

# will actually retry once
retry_on StandardError, attempts: 2

(I think this means the default, attempts: 5, actually means your job can be run a total of 5 times: one original time and 4 retries. I guess that’s what was intended?)
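
To make the counting concrete, here is a toy model in plain Ruby. It is not ActiveJob’s implementation; `run_with_attempts` is a made-up helper that just treats `attempts` the way I observed it behaving, as the total number of executions allowed.

```ruby
# Toy model of the counting described above; not ActiveJob's implementation.
# `attempts` is the total number of executions allowed, first run included.
def run_with_attempts(attempts:)
  executions = 0
  begin
    executions += 1
    yield
  rescue StandardError
    retry if executions < attempts
  end
  executions
end

always_fails = -> { raise "boom" }

puts run_with_attempts(attempts: 1, &always_fails) # 1 execution, so zero retries
puts run_with_attempts(attempts: 2, &always_fails) # 2 executions, so one retry
puts run_with_attempts(attempts: 5, &always_fails) # 5 executions, so four retries
```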

Note: job_id stays the same through retries, hooray

By the way, I checked, and at least in Rails 6.1, the ActiveJob#job_id stays the same on retries. If the job runs once and is retried twice more, it’ll have the same job_id each time, you’ll see three Performing lines in your logs, with the same job_id.

Phew! I think that’s the right thing to do, so we can easily correlate these as retries of the same jobs in our logs. And if we’re keeping the job_id somewhere to check back and see if it succeeded or failed or whatever, it stays consistent on retry.

Glad this is what ActiveJob is doing!

Logging isn’t great, but can be customized

Rails will automatically log retries with a line that looks like this:

Retrying TestFailureJob in 0 seconds, due to a RuntimeError.
# logged at `info` level

Eventually when it decides its attempts are exhausted, it’ll say something like:

Stopped retrying TestFailureJob due to a RuntimeError, which reoccurred on 2 attempts.
# logged at `error` level

This does not include the job-id though, which makes it harder than it should be to correlate with other log lines about this job, and follow the job’s whole course through your log file.

It’s also inconsistent with other default ActiveJob log lines, which include:

  • the Job ID in text
  • tags (Rails tagged logging system) with the job id and the string "[ActiveJob]". Because of the way the Rails code applies these only around perform/enqueue, retry/discard related log lines apparently end up not included.
  • The Exception message, not just the class, when there’s an exception.

You can see all the built-in ActiveJob logging in the nicely compact ActiveJob::LogSubscriber class. And you can see how the log line for retry is kind of inconsistent with eg perform.

Maybe this inconsistency has persisted so long in part because few people actually use ActiveJob retry, they’re all still using their backend’s own retry functionality? I did try a PR to Rails for at least consistent formatting (my PR doesn’t do tagging), not sure if it will go anywhere, I think blind PRs to Rails usually do not.

In the meantime, after trying a bunch of different things, I think I figured out the reasonable way to use the ActiveSupport::Notifications/LogSubscriber API to customize logging for the retry-related events while leaving it untouched from Rails for the others? See my solution here.

(Thanks to BigBinary blog for showing up in google and giving me a head start into figuring out how ActiveJob retry logging was working.)

(note: There’s also this: But I’m not sure how working/maintained it is. It seems to only customize activejob exception reports, not retry and other events. It would be an interesting project to make an up-to-date activejob-lograge that applied to ALL ActiveJob logging, expressing every event as key/values and using lograge formatter settings to output. I think we see exactly how we’d do that, with a custom log subscriber as we’ve done above!)

Warning: ApplicationJob configuration won’t work for emails

You might think since we configured retry_on on ApplicationJob, all our bg jobs are now set up for retrying.

Oops! Not deliver_later emails.

The good_job README explains that the jobs ActionMailer uses for deliver_later don’t descend from ApplicationJob. (I am curious if there’s any good reason for this, it seems like it would be nice if they did!)

The good_job README provides one way to configure the built-in Rails mailer superclass for retries.

You could maybe also try setting delivery_job on that mailer superclass to use a custom delivery job (thanks again BigBinary for the pointer)… maybe one that subclasses the default class to deliver emails as normal, but let you set some custom options like retry_on? Not sure if this would be preferable in any way.
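
A sketch of the first option, configuring the built-in delivery job directly in an initializer. I’m assuming a Rails version where deliver_later uses ActionMailer::MailDeliveryJob (older versions use a different class), and the file name and retry arguments are just placeholders for whatever you settled on in ApplicationJob.

```ruby
# config/initializers/mailer_retries.rb (hypothetical file name)
#
# Assumption: deliver_later enqueues ActionMailer::MailDeliveryJob in your
# Rails version. The guard skips the setup where that class does not exist.
if defined?(ActionMailer::MailDeliveryJob)
  # attempts counts total executions, so 2 means "retry once".
  ActionMailer::MailDeliveryJob.retry_on StandardError, attempts: 2
end
```

Since this reopens a class Rails owns, it is worth re-checking after Rails upgrades that the class name and the default delivery job haven’t changed.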

logging URI query params with lograge

The lograge gem for taming Rails logs by default will log the path component of the URI, but leave out the query string/query params.

For instance, perhaps you have a URL to your app /search?q=libraries.

lograge will log something like:

method=GET path=/search format=html

The q=libraries part is completely left out of the log. I kinda want that part, it’s important.

The lograge README provides instructions for “logging request parameters”, by way of the params hash.

I’m going to modify them slightly to:

  • use the more recent custom_payload config instead of custom_options. (I’m not certain why there are both, but I think it’s mostly for legacy reasons, and the newer custom_payload is what you should reach for?)
  • If we just put params in there, then a bunch of ugly <ActionController::Parameters show up in the log if you have nested hash params. We could fix that with params.to_unsafe_h, but…
  • We should really use request.filtered_parameters instead to make sure we’re not logging anything that’s been filtered out with Rails 6 config.filter_parameters. (Thanks /u/ezekg on reddit). This also converts to an ordinary hash that isn’t ActionController::Parameters, taking care of previous bullet point.
  • (It kind of seems like lograge README could use a PR updating it?)

  config.lograge.custom_payload do |controller|
    exceptions = %w(controller action format id)
    {
      params: controller.request.filtered_parameters.except(*exceptions)
    }
  end

That gets us a log line that might look something like this:

method=GET path=/search format=html controller=SearchController action=index status=200 duration=107.66 view=87.32 db=29.00 params={"q"=>"foo"}

OK. The params hash isn’t exactly the same as the query string, it can include things not in the URL query string (like controller and action, that we have to strip above, among others), and it can in some cases omit things that are in the query string. It just depends on your routing and other configuration and logic.

The params hash itself is what default rails logs… but what if we just log the actual URL query string instead? Benefits:

  • it’s easier to search the logs for actually an exact specific known URL (which can get more complicated like /search?q=foo&range%5Byear_facet_isim%5D%5Bbegin%5D=4&source=foo or something). Which is something I sometimes want to do, say I got a URL reported from an error tracking service and now I want to find that exact line in the log.
  • I actually like having the exact actual URL (well, starting from path) in the logs.
  • It’s a lot simpler, we don’t need to filter out controller/action/format/id etc.
  • It’s actually a bit more concise? And part of what I’m dealing with in general using lograge is trying to reduce my bytes of logfile for papertrail!

Disadvantages?


  • if you had some kind of structured log search (I don’t at present, but I guess could with papertrail features by switching to json format?), it might be easier to do something like “find a /search with q=foo and source=ef” without worrying about other params
  • To the extent that params hash can include things not in the actual url, is that important to log like that?
  • ….?

Curious what other people think… am I crazy for wanting the actual URL in there, not the params hash?

At any rate, it’s pretty easy to do. Note we use filtered_path rather than fullpath to again take account of Rails 6 parameter filtering, and thanks again /u/ezekg:

  config.lograge.custom_payload do |controller|
    { path: controller.request.filtered_path }
  end

This is actually overwriting the default path to be one that has the query string too:

method=GET path=/search?q=libraries format=html ...

You could of course add a different key fullpath instead, if you wanted to keep path as it is, perhaps for easier collation in some kind of log analyzing system that wants to group things by the same path regardless of query string.
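
A sketch of that fullpath variant, written so the block body can be tried without Rails. `StubRequest` and `StubController` are hypothetical stand-ins for the real controller and request; in an app, the lambda body would simply be the body of the config.lograge.custom_payload block.

```ruby
# The payload logic as a lambda, so it can be exercised without Rails.
# In a Rails app this would be the body of:
#   config.lograge.custom_payload { |controller| ... }
FULLPATH_PAYLOAD = lambda do |controller|
  # Adds a separate key and leaves lograge's default `path` untouched.
  { fullpath: controller.request.filtered_path }
end

# Hypothetical stubs standing in for a controller and its request.
StubRequest = Struct.new(:filtered_path)
StubController = Struct.new(:request)

controller = StubController.new(StubRequest.new("/search?q=libraries"))
payload = FULLPATH_PAYLOAD.call(controller)
puts payload[:fullpath]
```

With this version the log line would carry both path=/search and a fullpath= value with the query string, at the cost of a few more bytes per line.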

I’m gonna try this out!

Meanwhile, on lograge…

As long as we’re talking about lograge…. based on commit history, history of Issues and Pull Requests… the fact that CI isn’t currently running (grr) and doesn’t even try to test on Rails 6.0+ (although lograge seems to work fine)… one might worry that lograge is currently un/under-maintained…. No comment on a GH issue filed in May asking about project status.

It still seems to be one of the more popular solutions to trying to tame Rails kind of out of control logs. It’s mentioned for instance in docs from papertrail and honeybadger, and many many other blog posts.

What will its future be?

Looking around for other possibilities, I found semantic_logger (rails_semantic_logger). It’s got similar features. It seems to be much more maintained. It’s got a respectable number of github stars, although not nearly as many as lograge, and it’s not featured in blogs and third-party platform docs nearly as much.

It’s also a bit more sophisticated and featureful. For better or worse. For instance mainly I’m thinking of how it tries to improve app performance by moving logging to a background thread. This is neat… and also can lead to a whole new class of bug, mysterious warning, or configuration burden.

For now I’m sticking to the more popular lograge, but I wish it had CI up that was testing with Rails 6.1, at least!

Incidentally, trying to get Rails to log more compactly like both lograge and rails_semantic_logger do… is somewhat more complicated than you might expect, as demonstrated by the code in both projects that does it! Especially semantic_logger is hundreds of lines of somewhat baroque code split across several files. A refactor of logging around Rails 5 (I think?) to use ActiveSupport::LogSubscriber made it possible to customize Rails logging like this (although I think both lograge and rails_semantic_logger still do some monkey-patching too!), but in the end didn’t make it all that easy or obvious or future-proof. This may discourage too many other alternatives for the initial primary use case of both lograge and rails_semantic_logger: turn a rails action into one log line, with a structured format.

Notes on Cloudfront in front of Rails Assets on Heroku, with CORS

Heroku really recommends using a CDN in front of your Rails app static assets. Unlike in non-heroku circumstances where a web server like nginx might be taking care of it, on heroku static assets will otherwise be served directly by your Rails app, consuming limited/expensive dyno resources.

After evaluating a variety of options (including some heroku add-ons), I decided AWS Cloudfront made the most sense for us — simple enough, cheap, and we are already using other direct AWS services (including S3 and SES).

While heroku has an article on using Cloudfront, which even covers Rails specifically, and even CORS issues specifically, I found it a bit too vague to get me all the way there. And while there are lots of blog posts you can find on this topic, I found many of them outdated (Rails has introduced new API; Cloudfront has also changed its configuration options!), or otherwise spotty/thin.

So while I’m not an expert on this stuff, I’m going to tell you what I was able to discover, and what I did to set up Cloudfront as a CDN in front of Rails static assets running on heroku. There’s really nothing at all specific to heroku here, though, so it should apply to any other context where Rails is directly serving assets in production.

First how I set up Rails, then Cloudfront, then some notes and concerns. Btw, you might not need to care about CORS here, but one reason you might is if you are serving any fonts (including font-awesome or other icon fonts!) from Rails static assets.

Rails setup

In config/environments/production.rb

# set heroku config var RAILS_ASSET_HOST to your cloudfront
# hostname, will look like ``
config.asset_host = ENV['RAILS_ASSET_HOST']

config.public_file_server.headers = {
  # CORS:
  'Access-Control-Allow-Origin' => "*",
  # tell Cloudfront to cache a long time:
  'Cache-Control' => 'public, max-age=31536000'
}

Cloudfront Setup

I changed some things from default. The only one that seemed absolutely necessary, if you want CORS to work, was changing Allowed HTTP Methods to include OPTIONS.

Click on “Create Distribution”. All defaults except:

  • Origin Domain Name: your heroku app host like
  • Origin protocol policy: Switch to “HTTPS Only”. Seems like a good idea to ensure secure traffic between cloudfront and origin, no?
  • Allowed HTTP Methods: Switch to GET, HEAD, OPTIONS. In my experimentation, necessary for CORS from a browser to work — which AWS docs also suggest.
  • Cached HTTP Methods: Click “OPTIONS” too now that we’re allowing it, I don’t see any reason not to?
  • Compress objects automatically: yes
    • Sprockets is creating .gz versions of all your assets, but they’re going to be completely ignored in a Cloudfront setup either way. ☹️ (Is there a way to tell Sprockets to stop doing it? WHO KNOWS not me, it’s so hard to figure out how to reliably talk to Sprockets). But we can get what it was trying to do by having Cloudfront compress stuff for us, seems like a good idea, Google PageSpeed will like it, etc?
    • I noticed by experimentation that Cloudfront will compress CSS and JS (sometimes with brotli sometimes gz, even with the same browser, don’t know how it decides, don’t care), but is smart enough not to bother trying to compress a .jpg or .png (which already has internal compression).
  • Comment field: If there’s a way to edit it after you create the distribution, I haven’t found it, so pick a good one!

Notes on CORS

AWS docs here and here suggest for CORS support you also need to configure the Cloudfront distribution to forward additional headers: Origin, and possibly Access-Control-Request-Headers and Access-Control-Request-Method. Which you can do by setting up a custom “cache policy”. Or maybe instead by setting the “Origin Request Policy”. Or maybe instead by setting custom cache header settings differently using the Use legacy cache settings option. It got confusing, and none of these settings seemed to be necessary to me for CORS to be working fine, nor could I see any of these settings making any difference in CloudFront behavior or what headers were included in responses.

Maybe they would matter more if I were trying to use a more specific Access-Control-Allow-Origin than just setting it to *? But about that….

If you set Access-Control-Allow-Origin to a single host, MDN docs say you have to also return a Vary: Origin header. Easy enough to add that to your Rails config.public_file_server.headers. But I couldn’t get Cloudfront to forward/return this Vary header with its responses. Trying all manner of cache policy settings, referring to AWS’s quite confusing documentation on the Vary header in Cloudfront and trying to do what it said, I couldn’t get it to happen.

And what if you actually need more than one allowed origin? Per the spec for Access-Control-Allow-Origin, as again explained by MDN, you can’t just include more than one in the header, the header is only allowed one: “If the server supports clients from multiple origins, it must return the origin for the specific client making the request.” And you can’t do that with Rails static/global config.public_file_server.headers, we’d need to use and set up rack-cors instead, or something else.
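
For what it’s worth, the rule MDN describes is simple to state in code. This is a hand-rolled plain-Ruby sketch of the logic only, not rack-cors’s API and not something I actually deployed; the allowlist hosts and the `cors_headers_for` helper are made up for illustration.

```ruby
# Hand-rolled sketch of the "echo back the one matching origin" rule.
# rack-cors handles this (and much more) for you; the hosts are made up.
ALLOWED_ORIGINS = ["https://app.example.org", "https://staging.example.org"].freeze

def cors_headers_for(request_origin, allowed: ALLOWED_ORIGINS)
  # Caches must key on the Origin request header either way.
  headers = { "Vary" => "Origin" }
  # Only ever return the single origin that made this request.
  headers["Access-Control-Allow-Origin"] = request_origin if allowed.include?(request_origin)
  headers
end

puts cors_headers_for("https://app.example.org")["Access-Control-Allow-Origin"]
```

Because the response now depends on a request header, this has to run per request, which is exactly why a static headers hash can’t express it.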

So I just said, eh, * is probably just fine. I don’t think it actually involves any security issues for rails static assets to do this? I think it’s probably what everyone else is doing?

The only setup I needed for this to work was setting Cloudfront to allow OPTIONS HTTP method, and setting Rails config.public_file_server.headers to include 'Access-Control-Allow-Origin' => "*".

Notes on Cache-Control max-age

A lot of the existing guides don’t have you setting config.public_file_server.headers to include 'Cache-Control' => 'public, max-age=31536000'.

But without this, will Cloudfront actually be caching at all? If with every single request to cloudfront, cloudfront makes a request to the Rails app for the asset and just proxies it — we’re not really getting much of the point of using Cloudfront in the first place, to avoid the traffic to our app!

Well, it turns out yes, Cloudfront will cache anyway. Maybe because of the Cloudfront Default TTL setting? My Default TTL was left at the Cloudfront default, 86400 seconds (one day). So I’d think that maybe Cloudfront would be caching resources for a day when I’m not supplying any Cache-Control or Expires headers?

In my observation, it was actually caching for less than this though. Maybe an hour? (Want to know if it’s caching or not? Look at headers returned by Cloudfront. One easy way to do this? curl -IXGET, you’ll see a header either x-cache: Miss from cloudfront or x-cache: Hit from cloudfront).
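
If you’d rather check from Ruby than curl, something like this sketch should do it with only the standard library. The helper names are made up, and you’d pass in the URL of one of your own assets on your distribution’s hostname.

```ruby
require "net/http"
require "uri"

# Fetches the URL with a GET and returns the x-cache response header,
# e.g. "Hit from cloudfront" or "Miss from cloudfront" (nil if absent).
def x_cache_header(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request_get(uri.request_uri)["x-cache"]
  end
end

# True when the header value reports that Cloudfront served from its cache.
def cloudfront_hit?(x_cache_value)
  x_cache_value.to_s.match?(/hit from cloudfront/i)
end

# Usage, with your own asset URL:
#   cloudfront_hit?(x_cache_header(your_asset_url))
```

Requesting the same asset twice in a row is the easy test: the first request may be a miss, and the second should be a hit if Cloudfront is caching at all.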

Of course, Cloudfront doesn’t promise to cache for as long as it’s allowed to, it can evict things for its own reasons/policies before then, so maybe that’s all that’s going on.

Still, Rails assets are fingerprinted, so they are cacheable forever, so why not tell Cloudfront that? Maybe more importantly, if Rails isn’t returning a Cache-Control header, then Cloudfront isn’t either to actual user-agents, which means they won’t know they can cache the response in their own caches, and they’ll keep requesting/checking it on every reload too, which is not great for your far too large CSS and JS application files!

So, I think it’s probably a great idea to set the far-future Cache-Control header with config.public_file_server.headers as I’ve done above. We tell Cloudfront it can cache for the max-allowed-by-spec one year, and this also (I checked) gets Cloudfront to forward the header on to user-agents who will also know they can cache.

Note on limiting Cloudfront Distribution to just static assets?

The CloudFront distribution created above will actually proxy/cache our entire Rails app, you could access dynamic actions through it too. That’s not what we intend it for, our app won’t generate any URLs to it that way, but someone could.

Is that a problem?

I don’t know?

Some blog posts try to suggest limiting it only being willing to proxy/cache static assets instead, but this is actually a pain to do for a couple reasons:

  1. Cloudfront has changed their configuration for “path patterns” since many blog posts were written (unless you are using “legacy cache settings” options), such that I’m confused about how to do it at all, if there’s a way to get a distribution to stop caching/proxying/serving anything but a given path pattern anymore?
  2. Modern Rails with webpacker has static assets at both /assets and /packs, so you’d need two path patterns, making it even more confusing. (Why Rails why? Why aren’t packs just at public/assets/packs so all static assets are still under /assets?)

I just gave up on figuring this out and figured it isn’t really a problem that Cloudfront is willing to proxy/cache/serve things I am not intending for it? Is it? I hope?

Note on Rails asset_path helper and asset_host

You may have realized that Rails has both asset_path and asset_url helpers for linking to an asset. (And similar helpers with dashes instead of underscores in sass, and probably different implementations, via sass-rails)

Normally asset_path returns a relative URL without a host, and asset_url returns a URL with a hostname in it. Since using an external asset_host requires we include the host with all URLs for assets to properly target CDN… you might think you have to stop using asset_path anywhere and just use asset_url. You would be wrong.

It turns out if config.asset_host is set, asset_path starts including the host too. So everything is fine using asset_path. Not sure if at that point it’s a synonym for asset_url? I think not entirely, because I think in fact once I set config.asset_host, some of my uses of asset_url actually started erroring and failing tests? And I had to actually only use asset_path? In ways I don’t really understand what’s going on and can’t explain it?

Ah, Rails.

ActiveSupport::Cache via ActiveRecord (note to self)

There are a variety of things written to use flexible back-end key/value datastores via the ActiveSupport::Cache API.

For instance, say, activejob-status.

I have sometimes in the past wanted to be able to use such things storing the data in an rdbms, say via ActiveRecord. Make a table for it. Sure, this won’t be nearly as fast or “scalable” as, say, redis, but for so many applications it’s just fine. And I often avoid using a feature at all if it is going to require me to add another service (like another redis instance).

So I’ve considered writing an ActiveSupport::Cache adapter for ActiveRecord, but never really gotten around to it, so I keep avoiding using things I’d be trying out if I had it….

Well, today I discovered the ruby gem that’s a key/value store swiss army knife, moneta. Look, it has an ActiveSupport::Cache adapter so you can use any moneta-supported store as an ActiveSupport::Cache API. AND then if you want to use an rdbms as your moneta-supported store, you can do it through ActiveRecord or Sequel.

Great, I don’t have to write the adapter after all, it’s already been done! Assuming it works out okay, which I haven’t actually checked in practice yet.

Writing this in part as a note-to-self so next time I have an itch that can be scratched this way, I remember moneta is there — to at least explore further.

Not sure where to find the docs, but here’s the source for ActiveRecord moneta adapter. It looks like I can create different caches that use different tables, which is the first thing I thought to ensure.

The second thing I thought to look for — can it handle expiration, and purging expired keys? Unclear, I can’t find it. Maybe I could PR it if needed.

And hey, if for some reason you want an ActiveSupport::Cache backed by PStore or BerkeleyDB (don’t do it!), or Cassandra (you got me, no idea?), moneta has you covered too.

Heroku release phase, rails db:migrate, and command failure

If you use capistrano to deploy a Rails app, it will typically run a rails db:migrate with every deploy, to apply any database schema changes.

If you are deploying to heroku you might want to do the same thing. The heroku “release phase” feature makes this possible. (Introduced in 2017, the release phase feature is one of heroku’s more recent major features, as heroku dev has seemed to really stabilize and/or stagnate).

The release phase docs mention “running database schema migrations” as a use case, and there are a few ((1), (2), (3)) blog posts on the web suggesting doing exactly that with Rails. Basically as simple as adding release: bundle exec rake db:migrate to your Procfile.

While some of the blog posts do remind you that “If the Release Phase fails the app will not be deployed”, I have found the implications of this to be more confusing in practice than one would originally assume. Particularly because on heroku changing a config var triggers a release; and it can be confusing to notice when such a release has failed.

It pays to consider the details a bit so you understand what’s going on, and possibly consider somewhat more complicated release logic than simply calling out to rake db:migrate.

1) What if a config var change makes your Rails app unable to boot?

I don’t know how unusual this is, but I actually had a real-world bug like this when in the process of setting up our heroku app. Without confusing things with the details, we can simulate such a bug simply by putting this in, say, config/application.rb:

  raise "I am refusing to boot" if ENV['FAIL_TO_BOOT'] == "true"

Obviously my real bug was weirder, but the result was the same — with some settings of one or more heroku configuration variables, the app would raise an exception during boot. And we hadn’t noticed this in testing, before deploying to heroku.

Now, on heroku, using CLI or web dashboard, set the config var FAIL_TO_BOOT to “true”.

Without a release phase, what happens?

  • The release is successful! If you look at the release in the dashboard (“Activity” tab) or heroku releases, it shows up as successful. Which means heroku brings up new dynos and shuts down the previous ones, that’s what a release is.
  • The app crashes when heroku tries to start it in the new dynos.
  • The dynos will be in “crashed” state when looked at in heroku ps or dashboard.
  • If a user tries to access the web app, they will get the generic heroku-level “could not start app” error screen (unless you’ve customized your heroku error screens, as usual).
  • You can look in your heroku logs to see the error and stack trace that prevented app boot.

Downside: your app is down.

Upside: It is pretty obvious that your app is down, and (relatively) why.

With a db:migrate release phase, what happens?

The Rails db:migrate rake task has a dependency on the rails :environment task, meaning it boots the Rails app before executing. You just changed your config variable FAIL_TO_BOOT: true such that the Rails app can’t boot. Changing the config variable triggered a release.

As part of the release, the db:migrate release phase is run… which fails.

  • The release is not successful, it failed.
  • You don’t get any immediate feedback to that effect in response to your heroku config:add command or on the dashboard GUI in the “settings” tab. You may go about your business assuming it succeeded.
  • If you look at the release in heroku releases or dashboard “activity” tab you will see it failed.
  • You do get an email that it failed. Maybe you notice it right away, or maybe you notice it later, and have to figure out “wait, which release failed? And what were the effects of that? Should I be worried?”
  • The effects are:
    • The config variable appears changed in heroku’s dashboard or in response to heroku config:get etc.
    • The old dynos without the config variable change are still running. They don’t have the change. If you open a one-off dyno, it will be using the old release, and have the old (eg) ENV['FAIL_TO_BOOT'] value.
    • ANY subsequent attempts at a release will keep failing, so long as the app is in a state (based on the current config variables) that it can’t boot.

Again, this really happened to me! It is a fairly confusing situation.

Upside: Your app is actually still up, even though you broke it, the old release that is running is still running, that’s good?

Downside: It’s really confusing what happened. You might not notice at first. Things remain in a messed up inconsistent and confusing state until you notice, figure out what’s going on, what release caused it, and how to fix it.

It’s a bit terrifying that any config variable change could do this. But I guess most people don’t run into it like I did, since I haven’t seen it mentioned?

2) A heroku pg:promote is a config variable change, that will create a release in which db:migrate release phase fails.

heroku pg:promote is a command that will change which of multiple attached heroku postgreses are attached as the “primary” database, pointed to by the DATABASE_URL config variable.

For a typical app with only one database, you still might use pg:promote for a database upgrade process; for setting up or changing a postgres high-availability leader/follower; or, for what I was experimenting with it for, using heroku’s postgres-log-based rollback feature.

I had assumed that pg:promote was a zero-downtime operation. But, in debugging its interaction with my release phase, I noticed that pg:promote actually creates TWO heroku releases.

  1. First it creates a release labelled Detach DATABASE , in which there is no DATABASE_URL configuration variable at all.
  2. Then it creates another release labelled Attach DATABASE in which the DATABASE_URL configuration variable is defined to its new value.

Why does it do this instead of one release that just changes the DATABASE_URL? I don’t know. My app (like most Rails and probably other apps) can’t actually function without DATABASE_URL set, so if that first release ever actually runs, it will just error out. Does this mean there’s an instant with a “bad” release deployed, that pg:promote isn’t actually zero-downtime? I am not sure, it doesn’t seem right (I did file a heroku support ticket asking….).

But under normal circumstances, either it’s not a problem, or most people(?) don’t notice.

But what if you have a db:migrate release phase?

When it tries to do release (1) above, that release will fail. Because it tries to run db:migrate, and it can’t do that without a DATABASE_URL set, so it raises, the release phase exits in an error condition, the release fails.

Actually what happens is without DATABASE_URL set, the Rails app will assume a postgres URL in a “default” location, try to connect to it, and fail, with an error message (hello googlers?), like:

ActiveRecord::ConnectionNotEstablished: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

Now, release (2) is coming down the pike seconds later, this is actually fine, and will be zero outage. We had a release that failed (so never was deployed), and seconds later the next correct release succeeds. Great!

The only problem is that we got an email notifying us that release 1 failed, and it’s also visible as failing in the heroku release list, etc.

A “background” (not in response to a git push or other code push to heroku) release failing is already a confusing situation; a “false positive” that actually means “nothing unexpected or problematic happened, just ignore this and carry on” is… really not something I want. (I call this the “error notification crying wolf”, right? I try to make sure my error notifications never do it, because it takes your time away from flow unnecessarily, and/or makes it much harder to stay vigilant to real errors).

Now, there is a fairly simple solution to this particular problem. Here’s what I did. I changed my heroku release phase from rake db:migrate to a custom rake task, say release: bundle exec rake my_custom_heroku_release_phase, defined like so:

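task :my_custom_heroku_release_phase do
  # Migrate only when a database is attached; otherwise warn and let the release succeed.
  next Rake::Task["db:migrate"].invoke if ENV['DATABASE_URL']

  $stderr.puts "\n!!! WARNING, no ENV['DATABASE_URL'], not running rake db:migrate as part of heroku release !!!\n\n"
end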

Now that release (1) above at least won’t fail, it has the same behavior as a “traditional” heroku app without a release phase.

Swallow-and-report all errors?

When a release fails because a release phase has failed as result of a git push to heroku, that’s quite clear and fine!

But the confusion of the “background” release failure, triggered by a config var change, is high enough that part of me wants to just rescue StandardError in there, and prevent a failed release phase from ever exiting with a failure code, so heroku will never use a db:migrate release phase to abort a release.

Just return the behavior to the pre-release-phase heroku behavior: you can put your app in a situation where it will be crashed and not work, but maybe that’s better than a mysterious inconsistent heroku app state that happens in the background and you find out about only through asynchronous email notifications from heroku that are difficult to understand/diagnose. It’s all much more obvious.

On the other hand, if a db:migrate has failed not because of some unrelated boot process problem that is going to keep the app from launching too even if it were released, but simply because the db:migrate itself actually failed… you kind of want the release to fail? That’s good? Keep the old release running, not a new release with code that expects a db migration that didn’t happen?

So I’m not really sure.

If you did want to rescue-swallow-and-notify, the custom rake task for your heroku release logic — instead of just telling heroku to run a standard thing like db:migrate on release — is certainly convenient.
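A minimal sketch of what that rescue-swallow-and-notify wrapper could look like, in plain Ruby. The helper name run_release_step and its reporter argument are made up for illustration; they are not part of rake, Rails, or heroku, and a real version would probably notify an error tracker rather than only printing.

```ruby
# Hypothetical helper (not part of Rails, rake, or heroku): run one release
# step, report any StandardError, and never let it abort the release.
def run_release_step(name, reporter: $stderr)
  yield
  true
rescue StandardError => e
  # A real app might also notify an error-tracking service here.
  reporter.puts "!!! #{name} failed during heroku release: #{e.class}: #{e.message}"
  false
end

# Inside the custom rake task this might be used as:
#   run_release_step("db:migrate") { Rake::Task["db:migrate"].invoke }
```

Because the helper returns false instead of raising, the rake task exits successfully and heroku carries on with the release, which is exactly the trade-off weighed above.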

Also, do you really always want to db:migrate anyway? What about db:schema:load?

Another alternative… if you are deploying an app with an empty database, standard Rails convention is to run rails db:schema:load instead of db:migrate. The db:migrate will probably work anyway, but will be slower, and somewhat more error-prone.

I guess this could come up on heroku with an initial deploy or (for some reason) a database that’s been nuked and restarted, or perhaps a Heroku “Review app”? (I don’t use those yet)

stevenharman has a solution that actually checks the database, and runs the appropriate rails task depending on state, here in this gist.

I’d probably do it as a rake task instead of a bash file if I were going to do that. I’m not doing it at all yet.

Note that stevenharman’s solution will actually catch a non-existing or non-connectable database and not try to run migrations… but it will print an error message and exit 1 in that case, failing the release — meaning that you will get a failed release in the pg:promote case mentioned above!
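If I did do it in Ruby, the decision itself is small. Here is a rough sketch, assuming you have some way to tell whether the database is empty; the method and argument names are hypothetical, and this is not stevenharman’s code. Unlike the gist, it treats a missing DATABASE_URL as “skip quietly” rather than a failure, so the pg:promote case would not produce a failed release.

```ruby
# Hypothetical decision helper; the argument names are made up for illustration.
# How to detect an empty database (for example, checking whether any tables
# exist yet) is left to the caller.
def release_db_task(database_url:, database_empty:)
  # No database attached (e.g. the "Detach DATABASE" release): run nothing.
  return nil if database_url.nil? || database_url.empty?

  database_empty ? "db:schema:load" : "db:migrate"
end
```

A rake task could then invoke whichever task name comes back, and simply print a warning and exit successfully when it gets nil.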

Code that Lasts: Sustainable And Usable Open Source Code

A presentation I gave at online conference Code4Lib 2021, on Monday March 21.

I have realized that the open source projects I am most proud of are a few that have existed for years now, increasing in popularity, with very little maintenance required. Including traject and bento_search. While community aspects matter for open source sustainability, the task gets so much easier when the code requires less effort to keep alive, for maintainers and utilizers. Using these projects as examples, can we as developers identify what makes code “inexpensive” to use and maintain over the long haul with little “churn”, and how to do that?

Slides on Google Docs

Rough transcript (really the script I wrote for myself)

Hi, I’m Jonathan Rochkind, and this is “Code that Lasts: Sustainable and Usable Open Source Code”

So, who am I? I have been developing open source library software since 2006, mainly in ruby and Rails. 

Over that time, I have participated in a variety of open source projects meant to be used by multiple institutions, and I’ve often seen us having challenges with long-term maintenance sustainability and usability of our software. This includes in projects I have been instrumental in creating myself, we’ve all been there!

We’re used to thinking of this problem in terms of needing more maintainers.

But let’s first think more about what the situation looks like, before we assume what causes it. In addition to features  or changes people want not getting done, it also can look like, for instance:

Being stuck using out-of-date dependencies like old, even end-of-lifed, versions of Rails or ruby.

A reduction in software “polish” over time. 

What do I mean by “polish”?

Engineer Richard Schneeman writes: [quote] “When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent.” 

I have noticed that software can start out very well polished, but over time lose that polish. 

This usually goes along with decreasing “cohesion” in software over time, a feeling that different parts of the software start to no longer tell the developer a consistent story together.

While there can be an element of truth in needing more maintainers in some cases – zero maintainers is obviously too few — there are also ways that increasing the number of committers or maintainers can result in diminishing returns and additional challenges.

One of the theses of Fred Brooks’ famous 1975 book “The Mythical Man-Month” is sometimes called “Brooks’ Law”: “under certain conditions, an incremental person when added to a project makes the project take more, not less time.”

Why? One of the main reasons Brooks discusses is the additional time taken for communication and coordination between more people – with every person you add, the number of connections between people goes up combinatorially.

That may explain the phenomenon we sometimes see with so-called “Design  by committee” where “too many cooks in the kitchen” can produce inconsistency or excessive complexity.

Cohesion and polish require a unified design vision. That’s not incompatible with increasing numbers of maintainers, but it does make it more challenging because it takes more time to get everyone on the same page, and iterate while maintaining a unifying vision. (There’s also more to be said here about the difference between just a bunch of committers committing PRs, and the maintainer’s role of maintaining historical context and design vision for how all the parts fit together.)

Instead of assuming adding more committers or maintainers is the solution, can there instead be ways to reduce the amount of maintenance required?

I started thinking about this when I noticed a couple projects of mine which had become more widely successful than I had any right to expect, considering how little maintenance was being put into them.

Bento_search is a toolkit for searching different external search engines in a consistent way. It’s especially but not exclusively for displaying multiple search results in “bento box” style, which is what Tito Sierra from NCSU first called these little side by side search results. 

I wrote bento_search  for use at a former job in 2012.  55% of all commits to the project were made in 2012.  95% of all commits in 2016 or earlier. (I gave it a bit of attention for a contracting project in 2016).

But bento_search has never gotten a lot of maintenance, I don’t use it anymore myself. It’s not in wide use, but I found it kind of amazing, when I saw people giving me credit in conference presentations for the gem (thanks!), when I didn’t even know they were using it and I hadn’t been paying it any attention at all! It’s still used by a handful of institutions for whom it just works with little attention from maintainers. (The screenshot is from Cornell University Libraries.)

Traject is a Marc-to-Solr indexing tool written in ruby  (or, more generally, can be a general purpose extract-transform-load tool), that I wrote with Bill Dueber from the University of Michigan in 2013. 

We hoped it would catch on in the Blacklight community, but for the first couple years, its uptake was slow.

However, since then, it has come to be pretty popular in Blacklight and Samvera communities, and a few other library technologist uses.  You can see the spikes of commit activity in the graph for a 2.0 release in 2015 and a 3.0 release in 2018 – but for the most part at other times, nobody has really been spending much time on maintaining traject.   Every once in a while a community member submits a minor Pull Request, and it’s usually me who reviews it. Me and Bill remain the only maintainers. 

And yet traject just keeps plugging along, picking up adoption and working well for adopters.  

So, this made me start thinking, based on what I’ve seen in my career, what are some of the things that might make open source projects both low-maintenance and successful in their adoption and ease-of-use for developers?

One thing both of these projects did was take backwards compatibility very seriously. 

The first step there is following “semantic versioning”, a set of rules whose main point is that releases can’t include backwards incompatible changes unless they are a new major version, like going from 1.x to 2.0.

This is important, but it’s not alone enough to minimize backwards incompatible changes that add maintenance burden to the ecosystem. If the real goal is preventing the pain of backwards incompatibility, we also need to limit the number of major version releases, and limit the number and scope of backwards breaking changes in each major release!
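In Ruby terms, semantic versioning is what makes a pessimistic version constraint in a Gemfile trustworthy. A small sketch using RubyGems’ own Gem::Requirement and Gem::Version classes; the helper name allowed? and the version numbers are just illustrative:

```ruby
require "rubygems" # usually loaded already; listed so the snippet stands alone

# True when `version` is allowed by a Gemfile-style constraint string.
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

allowed?("~> 1.0", "1.7.2") # => true, still a 1.x release
allowed?("~> 1.0", "2.0.0") # => false, a new major version may break things
```

The fewer major versions a library releases, the longer a constraint like this keeps picking up new releases without anyone having to touch it.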

The Bento_search gem has only had one major release, it’s never had a 2.0 release, and it’s still backwards compatible to its initial release.

Traject is on a 3.X release after 8 years, but the major releases of traject have had extremely few backwards breaking changes, most people could upgrade through major versions changing very little or most often nothing in their projects. 

So OK, sure, everyone wants to minimize backwards incompatibility, but that’s easy to say, how do you DO it? Well, it helps to have less code overall, that changes less often overall – ok, again, great, but how do you do THAT?

Parsimony is a word in general English that means “The quality of economy or frugality in the use of resources.”

In terms of software architecture, it means having as few as possible moving parts inside your code: fewer classes, types, components, entities, whatever: Or most fundamentally, I like to think of it in terms of minimizing the concepts in the mental model a programmer needs to grasp how the code works and what parts do what.

The goal of architecture design is, what is the smallest possible architecture we can create to make [quote] “simple things simple and complex things possible”, as computer scientist Alan Kay described the goal of software design. 

We can see this in bento_search, which has very few internal architectural concepts.

The main thing bento_search does is provide a standard API for querying a search engine and representing results of a search. These are consistent across different search engines, with common metadata vocabulary for what results look like. This makes search engines interchangeable to calling code. And then it includes half a dozen or so search engine implementations for services I needed or wanted to evaluate when I wrote it.

This search engine API at the ruby level can be used all by itself even without the next part, the actual “bento style”, which is built-in support for displaying search engine results in boxes on a page of your choice in a Rails app, writing very little boilerplate code.
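To make the “interchangeable to calling code” idea concrete, here is a toy sketch in plain Ruby. The class and method names are invented for illustration and are not bento_search’s actual API; the point is only that every engine answers the same call and returns the same kind of result object.

```ruby
# Toy sketch only: these class and method names are invented for illustration
# and are not bento_search's actual API.
ResultItem = Struct.new(:title, :link, keyword_init: true)

class CatalogEngine
  def search(query)
    [ResultItem.new(title: "Catalog hit for #{query}", link: "https://catalog.example.org/1")]
  end
end

class ArticlesEngine
  def search(query)
    [ResultItem.new(title: "Article hit for #{query}", link: "https://articles.example.org/1")]
  end
end

# Calling code depends only on #search and ResultItem, so engines are interchangeable.
def titles_for(engine, query)
  engine.search(query).map(&:title)
end
```

Swapping CatalogEngine for ArticlesEngine changes nothing in titles_for, which is the property that lets display code be written once.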

Traject has an architecture which basically has just three parts at the top.

There is a reader which sends objects into the pipeline. 

There are some indexing rules which are transformation steps from source object to build an output Hash object. 

And then a writer which translates the Hash object to write to some store, such as Solr.

The reader, transformation steps, and writer are all independent and uncaring about each other, and can be mixed and matched.  
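That three-part shape can be sketched in a few lines of plain Ruby. This is a toy under my own invented names, not traject’s real classes or configuration syntax; it only shows that the reader, the steps, and the writer never need to know about each other.

```ruby
# Toy sketch only: names are invented for illustration, not traject's real API.

# Reader: anything that yields source records one at a time.
class ArrayReader
  def initialize(records)
    @records = records
  end

  def each(&block)
    @records.each(&block)
  end
end

# Indexing rules: each step gets the source record and the output Hash being built.
INDEXING_STEPS = [
  ->(record, output) { output["title_t"] = [record[:title].strip] },
  ->(record, output) { output["author_s"] = [record[:author]] }
].freeze

# Writer: receives each finished Hash; this one just collects them in memory,
# where a real one might send them to Solr.
class MemoryWriter
  attr_reader :written

  def initialize
    @written = []
  end

  def put(output)
    @written << output
  end
end

# The pipeline knows nothing about which reader, steps, or writer it was given.
def run_pipeline(reader, steps, writer)
  reader.each do |record|
    output = {}
    steps.each { |step| step.call(record, output) }
    writer.put(output)
  end
  writer
end
```

Any object with an each method can stand in as the reader and any object with a put method as the writer, which is the mix-and-match property described above.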

That’s MOST of traject right there. It seems simple and obvious once you have it, but it can take a lot of work to end up with what’s simple and obvious in retrospect! 

When designing code I’m often reminded of the apocryphal quote: “I would have written a shorter letter, but I did not have the time”

And, to be fair, there’s a lot of complexity within that “indexing rules” step in traject, but its design was approached the same way. We have use cases about supporting configuration settings in a file or on command line; or about allowing re-usable custom transformation logic – what’s the simplest possible architecture we can come up with to support those cases?

OK, again, that sounds nice, but how do you do it? I don’t have a paint by numbers, but I can say that for both these projects I took some time – a few weeks even – at the beginning to work out these architectures, lots of diagramming, some prototyping I was prepared to throw out, and in some cases “Documentation-driven design” where I wrote some docs for code I hadn’t written yet. For traject it was invaluable to have Bill Dueber at University of Michigan also interested in spending some design time up front, bouncing ideas back and forth with – to actually intentionally go through an architectural design phase before the implementation.

Figuring out a good parsimonious architecture takes domain knowledge: What things your “industry” – other potential institutions — are going to want to do in this area, and specifically what developers are going to want to do with your tool. 

We’re maybe used to thinking of “use cases” in terms of end-users, but it can be useful at the architectural design stage, to formalize this in terms of developer use cases. What is a developer going to want to do, how can I come up with a small number of software pieces she can use to assemble together to do those things.

When we said “make simple things simple and complex things possible”, we can say domain analysis and use cases is identifying what things we’re going to put in either or neither of those categories. 

The “simple thing” for bento_search , for instance is just “do a simple keyword search in a search engine, and display results, without having the calling code need to know anything about the specifics of that search engine.”

Another way to get a head-start on solid domain knowledge is to start with another tool you have experience with, that you want to create a replacement for. Before Traject, I and other users used a tool written in Java called SolrMarc —  I knew how we had used it, and where we had had roadblocks or things that we found harder or more complicated than we’d like, so I knew my goals were to make those things simpler.

We’re used to hearing arguments about avoiding rewrites, but like most things in software engineering, there can be pitfalls on either extreme.

I was amused to notice, Fred Brooks in the previously mentioned Mythical Man Month makes some arguments in both directions. 

Brooks famously warns about a “second-system effect”, the [quote] “tendency of small, elegant, and successful systems to be succeeded by over-engineered, bloated systems, due to inflated expectations and overconfidence” – one reason to be cautious of a rewrite. 

But Brooks in the very same book ALSO writes [quote] “In most projects, the first system built is barely usable….Hence plan to throw one away; you will, anyhow.”

It’s up to us to figure out when we’re in which case. I personally think an application is more likely to be bitten by the “second-system effect” danger of a rewrite, while a shared re-usable library is more likely to benefit from a rewrite (in part because a reusable library is harder to change in place without disruption!).

We could sum up a lot of different principles as variations of “Keep it small”.

Both traject and bento_search are tools that developers can use to build something. Bento_search just puts search results in a box on a page; the developer is responsible for the page and an overall app. 

Yes, this means that you have to be a ruby developer to use it. Does this limit its audience? While we might aspire to make tools that even not-really-developers can just use out of the box, my experience has been that our open source attempts at shrinkwrapped “solutions” often end up still needing development expertise to successfully deploy. Keeping our tools simple and small and not trying to supply a complete app can actually leave more time for these developers to focus on meeting local needs, instead of fighting with a complicated framework that doesn’t do quite what they need.

It also means we can limit interactions with any external dependencies. Traject was developed for use with a Blacklight project, but traject code does not refer to Blacklight or even Rails at all, which means new releases of Blacklight or Rails can’t possibly break traject. 

Bento_search, by doing one thing and not caring about the details of its host application, has kept working from Rails 3.2 all the way up to current Rails 6.1 with pretty much no changes needed except to the test suite setup.

Sometimes when people try to have lots of small tools working together, it can turn into a nightmare where you get a pile of cascading software breakages every time one piece changes. Keeping assumptions and couplings down is what lets us avoid this maintenance nightmare. 

And another way of keeping it small is don’t be afraid to say “no” to features when you can’t figure out how to fit them in without serious harm to the parsimony of your architecture. Your domain knowledge is what lets you take an educated guess as to what features are core to your audience and need to be accommodated, and which are edge cases and can be fulfilled by extension points, or sometimes not at all.

By extension points we mean we prefer opportunities for developer-users to write their own code which works with your tools, rather than trying to build less commonly needed features in as configurable features. 

As an example, Traject does include some built-in logic, but one of its extension point use cases is making sure it’s simple to add whatever transformation logic a developer-user wants, and have it look just as “built-in” as what came with traject. And since traject makes it easy to write your own reader or writer, its built-in readers and writers don’t need to include every possible feature – we plan for developers writing their own if they need something else.

Looking at bento_search, it makes it easy to write your own search engine_adapter — that will be useable interchangeably with the built-in ones. Also, bento_search provides a standard way to add custom search arguments specific to a particular adapter – these won’t be directly interchangeable with other adapters, but they are provided for in the architecture, and won’t break in future bento_search releases – it’s another form of extension point. 

These extension points are the second half of “simple things simple, complex things possible.” – the complex things possible. Planning for them is part of understanding your developer use-cases, and designing an architecture that can easily handle them. Ideally, it takes no extra layers of abstraction to handle them, you are using the exact  architectural join points the out-of-the-box code is using, just supplying custom components. 

So here’s an example of how these things worked out in practice with traject, pretty well I think.

Stanford ended up writing a package of extensions to traject called TrajectPlus, to take care of some features they needed that traject didn’t provide. Commit history suggests it was written in 2017, which was Traject 2.0 days.  

I can’t recall, but I’d guess they approached me with change requests to traject at that time and I put them off because I couldn’t figure out how to fit them in parsimoniously, or didn’t have time to figure it out. 

But the fact that they were *able* to extend traject in this way I consider a validation of traject’s architecture, that they could make it do what they needed, without much coordination with me, and use it in many projects (I think beyond just Stanford). 

Much of the 3.0 release of traject was “back-port”ing some features that TrajectPlus had implemented, including out-of-the-box support for XML sources. But I didn’t always do them with the same implementation or API as TrajectPlus – this is another example of being able to use a second go at it to figure out how to do something even more parsimoniously, sometimes figuring out small changes to traject’s architecture to support flexibility in the right dimensions. 

When Traject 3.0 came out – the TrajectPlus users didn’t necessarily want to retrofit all their code to the new traject way of doing it. But TrajectPlus could still be used with traject 3.0 with few or possibly no changes, doing things the old way, they weren’t forced to upgrade to the new way. This is a huge win for traject’s backwards compat – everyone was able to do what they needed to do, even taking separate paths, with relatively minimized maintenance work. 

As I think about these things philosophically, one of my takeaways is that software engineering is still a craft – and software design is a serious thing to be studied and engaged in. Especially for shared libraries rather than local apps, it’s not always to be dismissed as so-called “bike-shedding”.

It’s worth it to take time to think about design, self-reflectively and with your peers, instead of just rushing to put out fires or deliver features; it will reduce maintenance costs and increase value over the long-term.

And I want to just briefly plug “kithe”, a project of mine which tries to be guided by these design goals to create a small focused toolkit for building Digital Collections applications in Rails. 

I could easily talk about all of this for another twenty minutes, but that’s our time! I’m always happy to talk more, find me on slack or IRC or email. 

This last slide has some sources mentioned in the talk. Thanks for your time! 

Product management

In my career working in the academic sector, I have realized that one thing that is often missing from in-house software development is “product management.”

But what does that mean exactly? You don’t know it’s missing if you don’t even realize it’s a thing, and people can use different terms to mean different roles/responsibilities.

Basically, deciding what the software should do. This is not about colors on screen or margins (what our stakeholders often enjoy micro-managing); I’d consider those still the how of doing it, rather than the what to do. The what is often at a much higher level, about what features or components to develop at all.

When done right, it is going to be based on both knowledge of the end-user’s needs and preferences (user research); but also knowledge of internal stakeholders’ desires and preferences (overall organizational strategy, but also just practically what is going to make the right people happy to keep us resourced). Also knowledge of the local capacity, what pieces do we need to put in place to get these things developed. When done seriously, it will necessarily involve prioritization: there are many things we could possibly do, some subset of them we very well may do eventually, but which ones should we do now?

My experience tells me it is a very big mistake to try to have a developer doing this kind of product management. Not because a developer can’t have the right skillset to do it. But because having the same person leading development and product management is a mistake. The developer is too close to the development lens, and there’s just a clarification that happens when these roles are separate.

My experience also tells me that it’s a mistake to have a committee doing these things, much as that is popular in the academic sector. Because, well, just of course it is.

But okay this is all still pretty abstract. Things might become more clear if we get more specific about the actual tasks and work of this kind of product management role.

I found Damilola Ajiboye’s blog post on “Product Manager vs Product Marketing Manager vs Product Owner” very clear and helpful here. While it is written so as to distinguish between three different product management related roles, Ajiboye also acknowledges that in a smaller organization “a product manager is often tasked with the duty of these 3 roles.”

Regardless of whether the responsibilities are to be done by one or two or three people, Ajiboye’s post serves as a concise listing of the work to be done in managing a product: deciding the what of the product, in an ongoing iterative and collaborative manner, so that developers and designers can get to the how and to implementation.

I recommend reading the whole article, and I’ll excerpt much of it here, slightly rearranged.

The Product Manager

These individuals are often referred to as mini CEOs of a product. They conduct customer surveys to figure out the customer’s pain and build solutions to address it. The PM also prioritizes what features are to be built next and prepares and manages a cohesive and digital product roadmap and strategy.

The Product Manager will interface with the users through user interviews/feedback surveys or other means to hear directly from the users. They will come up with hypotheses alongside the team and validate them through prototyping and user testing. They will then create a strategy on the feature and align the team and stakeholders around it. The PM who is also the chief custodian of the entire product roadmap will, therefore, be tasked with the duty of prioritization. Before going ahead to carry out research and strategy, they will have to convince the stakeholders if it is a good choice to build the feature in context at that particular time or wait a bit longer based on the content of the roadmap.

The Product Marketing Manager
The PMM communicates vital product value — the “why”, “what” and “when” of a product to intending buyers. He manages the go-to-market strategy/roadmap and also oversees the pricing model of the product. The primary goal of a PMM is to create demand for the products through effective messaging and marketing programs so that the product has a shorter sales cycle and higher revenue.

The product marketing manager is tasked with market feasibility and discovering if the features being built align with the company’s sales and revenue plan for the period. They also make research on how sought-after the feature is being anticipated and how it will impact the budget. They communicate the values of the feature; the why, what, and when to potential buyers — In this case users in countries with poor internet connection.

[While expressed in terms of a for-profit enterprise selling something, I think it’s not hard to translate this to a non-profit or academic environment. You still have an audience whose uptake you need to be successful, whether internal or external. – jrochkind ]

The Product Owner
A product owner (PO) maximizes the value of a product through the creation and management of the product backlog, creation of user stories for the development team. The product owner is the customer’s representative to the development team. He addresses customer’s pain points by managing and prioritizing a visible product backlog. The PO is the first point of call when the development team needs clarity about interpreting a product feature to be implemented.

The product owner will first have to prioritize the backlog to see if there are no important tasks to be executed and if this new feature is worth leaving whatever is being built currently. They will also consider the development effort required to build the feature i.e the time, tools, and skill set that will be required. They will be the one to tell if the expertise of the current developers is enough or if more engineers or designers are needed to be able to deliver at the scheduled time. The product owner is also armed with the task of interpreting the product/feature requirements for the development team. They serve as the interface between the stakeholders and the development team.

When you have someone(s) doing these roles well, it ensures that the development team is actually spending time on things that meet user and business needs. I have found that it makes things so much less stressful and more rewarding for everyone involved.

When you have nobody doing these roles, or someone doing them in a cursory or unintentional way not recognized as part of their core job responsibilities, or have a lead developer trying to do them on top of development, I find it leads to feelings of: spinning wheels, everything-is-an-emergency, lack of appreciation, miscommunication and lack of shared understanding between stakeholders and developers, general burnout and dissatisfaction. At the root is a product that is not meeting user or business needs well, which leads to these interpersonal and personal problems.