Notes on Cloudfront in front of Rails Assets on Heroku, with CORS

Heroku strongly recommends using a CDN in front of your Rails app static assets. In non-heroku setups a web server like nginx might take care of serving static files, but on heroku static assets will otherwise be served directly by your Rails app, consuming limited/expensive dyno resources.

After evaluating a variety of options (including some heroku add-ons), I decided AWS Cloudfront made the most sense for us — simple enough, cheap, and we are already using other direct AWS services (including S3 and SES).

While heroku has an article on using Cloudfront, which even covers Rails specifically, and even CORS issues specifically, I found it a bit too vague to get me all the way there. And while there are lots of blog posts you can find on this topic, I found many of them outdated (Rails has introduced new API; Cloudfront has also changed its configuration options!), or otherwise spotty/thin.

So while I’m not an expert on this stuff, I’m going to tell you what I was able to discover, and what I did to set up Cloudfront as a CDN in front of Rails static assets running on heroku — although there’s really nothing at all specific to heroku here; it applies to any context where Rails is directly serving assets in production.

First how I set up Rails, then Cloudfront, then some notes and concerns. Btw, you might not need to care about CORS here, but one reason you might is if you are serving any fonts (including font-awesome or other icon fonts!) from Rails static assets.

Rails setup

In config/environments/production.rb

# set heroku config var RAILS_ASSET_HOST to your cloudfront
# hostname, will look like `xxxxxxxx.cloudfront.net`
config.asset_host = ENV['RAILS_ASSET_HOST']

config.public_file_server.headers = {
  # CORS:
  'Access-Control-Allow-Origin' => "*", 
  # tell Cloudfront to cache a long time:
  'Cache-Control' => 'public, max-age=31536000' 
}
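
Before involving Cloudfront at all, you can sanity-check that Rails itself is serving these headers, by running the app in production mode and inspecting a compiled asset. (The port and the asset fingerprint below are placeholders — substitute a path from your own compiled assets):

```shell
# Ask the Rails app directly for a compiled asset, and show only the
# headers we configured above. Hostname/port and the fingerprinted
# filename are placeholders for your own setup.
curl -sI http://localhost:3000/assets/application-0123abcdef.css \
  | grep -iE 'access-control-allow-origin|cache-control'
```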

Cloudfront Setup

I changed some things from default. The only one that was absolutely necessary — if you want CORS to work — seemed to be changing Allowed HTTP Methods to include OPTIONS.

Click on “Create Distribution”. All defaults except:

  • Origin Domain Name: your heroku app host like app-name.herokuapp.com
  • Origin protocol policy: Switch to “HTTPS Only”. Seems like a good idea to ensure secure traffic between cloudfront and origin, no?
  • Allowed HTTP Methods: Switch to GET, HEAD, OPTIONS. In my experimentation, necessary for CORS from a browser to work — which AWS docs also suggest.
  • Cached HTTP Methods: Click “OPTIONS” too now that we’re allowing it, I don’t see any reason not to?
  • Compress objects automatically: yes
    • Sprockets is creating .gz versions of all your assets, but they’re going to be completely ignored in a Cloudfront setup either way. ☹️ (Is there a way to tell Sprockets to stop doing it? WHO KNOWS not me, it’s so hard to figure out how to reliably talk to Sprockets). But we can get what it was trying to do by having Cloudfront compress stuff for us, seems like a good idea, Google PageSpeed will like it, etc?
    • I noticed by experimentation that Cloudfront will compress CSS and JS (sometimes with brotli sometimes gz, even with the same browser, don’t know how it decides, don’t care), but is smart enough not to bother trying to compress a .jpg or .png (which already has internal compression).
  • Comment field: If there’s a way to edit it after you create the distribution, I haven’t found it, so pick a good one!
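
One way to check the compression behavior described above: ask Cloudfront for an asset while advertising compression support, and see what comes back. (The hostname and asset path here are placeholders for your own distribution and asset):

```shell
# Request an asset through Cloudfront while advertising gzip/brotli
# support; content-encoding shows what compression (if any) was applied,
# and x-cache shows whether it was a cache hit.
curl -sI -H "Accept-Encoding: gzip, br" \
  https://xxxxxxxx.cloudfront.net/assets/application-0123abcdef.css \
  | grep -iE 'content-encoding|x-cache'
```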

Notes on CORS

AWS docs here and here suggest that for CORS support you also need to configure the Cloudfront distribution to forward additional headers — Origin, and possibly Access-Control-Request-Headers and Access-Control-Request-Method. Which you can do by setting up a custom “cache policy”. Or maybe instead by setting the “Origin Request Policy”. Or maybe instead by setting custom cache header settings differently using the Use legacy cache settings option. It got confusing — and none of these settings seemed to be necessary for CORS to work for me, nor could I see any of these settings making any difference in CloudFront behavior or what headers were included in responses.

Maybe they would matter more if I were trying to use a more specific Access-Control-Allow-Origin than just setting it to *? But about that….

If you set Access-Control-Allow-Origin to a single host, MDN docs say you have to also return a Vary: Origin header. Easy enough to add that to your Rails config.public_file_server.headers. But I couldn’t get Cloudfront to forward/return this Vary header with its responses. Trying all manner of cache policy settings, referring to AWS’s quite confusing documentation on the Vary header in Cloudfront and trying to do what it said — I couldn’t get it to happen.

And what if you actually need more than one allowed origin? Per the Access-Control-Allow-Origin spec, as again explained by MDN, you can’t just include more than one in the header; the header is only allowed one: “If the server supports clients from multiple origins, it must return the origin for the specific client making the request.” And you can’t do that with Rails’ static/global config.public_file_server.headers; we’d need to set up rack-cors instead, or something else.
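
For the record, here’s roughly what that rack-cors setup might look like — a sketch only, since I didn’t end up doing this. The origin hostnames are placeholders, and you’d want to verify the details against the rack-cors README (which I believe also handles returning the matching specific origin, and the Vary header, for you):

```ruby
# config/initializers/cors.rb — sketch, assuming the rack-cors gem.
# Origin hostnames are placeholders for your own app domains.
Rails.application.config.middleware.insert_before 0, Rack::Cors do
  allow do
    origins "app.example.com", "other-app.example.com"
    # Only static asset paths need CORS headers here — sprockets
    # assets live under /assets, webpacker packs under /packs:
    resource "/assets/*", headers: :any, methods: [:get, :head, :options]
    resource "/packs/*",  headers: :any, methods: [:get, :head, :options]
  end
end
```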

So I just said, eh, * is probably just fine. I don’t think it actually involves any security issues for rails static assets to do this? I think it’s probably what everyone else is doing?

The only setup I needed for CORS to work was setting Cloudfront to allow the OPTIONS HTTP method, and setting Rails config.public_file_server.headers to include 'Access-Control-Allow-Origin' => '*'.

Notes on Cache-Control max-age

A lot of the existing guides don’t have you setting config.public_file_server.headers to include 'Cache-Control' => 'public, max-age=31536000'.

But without this, will Cloudfront actually be caching at all? If with every single request to cloudfront, cloudfront makes a request to the Rails app for the asset and just proxies it — we’re not really getting much of the point of using Cloudfront in the first place, to avoid the traffic to our app!

Well, it turns out yes, Cloudfront will cache anyway. Maybe because of the Cloudfront Default TTL setting? My Default TTL was left at the Cloudfront default, 86400 seconds (one day). So I’d think that maybe Cloudfront would be caching resources for a day when I’m not supplying any Cache-Control or Expires headers?

In my observation, it was actually caching for less than this though. Maybe an hour? (Want to know if it’s caching or not? Look at headers returned by Cloudfront. One easy way to do this? curl -IXGET https://whatever.cloudfront.net/my/asset.jpg, you’ll see a header either x-cache: Miss from cloudfront or x-cache: Hit from cloudfront).

Of course, Cloudfront doesn’t promise to cache for as long as it’s allowed to, it can evict things for its own reasons/policies before then, so maybe that’s all that’s going on.

Still, Rails assets are fingerprinted, so they are cacheable forever, so why not tell Cloudfront that? Maybe more importantly, if Rails isn’t returning a Cache-Control header, then Cloudfront isn’t returning one to actual user-agents either, which means they won’t know they can cache the response in their own caches, and they’ll keep requesting/checking it on every reload too, which is not great for your far too large CSS and JS application files!

So, I think it’s probably a great idea to set the far-future Cache-Control header with config.public_file_server.headers as I’ve done above. We tell Cloudfront it can cache for the max-allowed-by-spec one year, and this also (I checked) gets Cloudfront to forward the header on to user-agents who will also know they can cache.

Note on limiting Cloudfront Distribution to just static assets?

The CloudFront distribution created above will actually proxy/cache our entire Rails app, you could access dynamic actions through it too. That’s not what we intend it for, our app won’t generate any URLs to it that way, but someone could.

Is that a problem?

I don’t know?

Some blog posts suggest limiting it to only being willing to proxy/cache static assets instead, but this is actually a pain to do, for a couple reasons:

  1. Cloudfront has changed their configuration for “path patterns” since many blog posts were written (unless you are using “legacy cache settings” options), such that I’m confused about how to do it at all — is there even still a way to get a distribution to stop caching/proxying/serving anything but a given path pattern?
  2. Modern Rails with webpacker has static assets at both /assets and /packs, so you’d need two path patterns, making it even more confusing. (Why Rails why? Why aren’t packs just at public/assets/packs so all static assets are still under /assets?)

I just gave up on figuring this out and figured it isn’t really a problem that Cloudfront is willing to proxy/cache/serve things I am not intending for it? Is it? I hope?

Note on Rails asset_path helper and asset_host

You may have realized that Rails has both asset_path and asset_url helpers for linking to an asset. (And similar helpers with dashes instead of underscores in sass, and probably different implementations, via sass-rails)

Normally asset_path returns a relative URL without a host, and asset_url returns a URL with a hostname in it. Since using an external asset_host requires that all asset URLs include the host in order to properly target the CDN… you might think you have to stop using asset_path anywhere and just use asset_url. You would be wrong.

It turns out if config.asset_host is set, asset_path starts including the host too. So everything is fine using asset_path. Not sure if at that point it’s a synonym for asset_url? I think not entirely, because I think in fact once I set config.asset_host, some of my uses of asset_url actually started erroring and failing tests? And I had to actually only use asset_path? In ways I don’t really understand what’s going on and can’t explain it?

Ah, Rails.

ActiveSupport::Cache via ActiveRecord (note to self)

There are a variety of things written to use flexible back-end key/value datastores via the ActiveSupport::Cache API.

For instance, say, activejob-status.

I have sometimes in the past wanted to be able to use such things storing the data in an rdbms, say via ActiveRecord. Make a table for it. Sure, this won’t be nearly as fast or “scalable” as, say, redis, but for so many applications it’s just fine. And I often avoid using a feature at all if it is going to require me to add another service (like another redis instance).

So I’ve considered writing an ActiveSupport::Cache adapter for ActiveRecord, but never really gotten around to it, so I keep avoiding using things I’d be trying out if I had it….

Well, today I discovered the ruby gem that’s a key/value store swiss army knife, moneta. Look, it has an ActiveSupport::Cache adapter so you can use any moneta-supported store as an ActiveSupport::Cache API. AND then if you want to use an rdbms as your moneta-supported store, you can do it through ActiveRecord or Sequel.

Great, I don’t have to write the adapter after all, it’s already been done! Assuming it works out okay, which I haven’t actually checked in practice yet.
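
Going by moneta’s README, an untested sketch of what this might look like (the table name is my own invention, and whether moneta creates the table for you is one of the things I’d want to verify):

```ruby
require "moneta"
require "active_support"
require "active_support/cache/moneta_store"

# An ActiveSupport::Cache backed by an ActiveRecord table, via moneta.
# Table name "moneta_cache" is arbitrary.
cache = ActiveSupport::Cache::MonetaStore.new(
  store: Moneta.new(:ActiveRecord, table: "moneta_cache")
)

# Then e.g. in config/application.rb you could set:
#   config.cache_store = cache
cache.write("answer", 42)
cache.read("answer") # => 42
```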

Writing this in part as a note-to-self so next time I have an itch that can be scratched this way, I remember moneta is there — to at least explore further.

Not sure where to find the docs, but here’s the source for ActiveRecord moneta adapter. It looks like I can create different caches that use different tables, which is the first thing I thought to ensure.

The second thing I thought to look for — can it handle expiration, and purging expired keys? Unclear, I can’t find it. Maybe I could PR it if needed.

And hey, if for some reason you want an ActiveSupport::Cache backed by PStore or BerkeleyDB (don’t do it!), or Cassandra (you got me, no idea?), moneta has you covered too.

Heroku release phase, rails db:migrate, and command failure

If you use capistrano to deploy a Rails app, it will typically run a rails db:migrate with every deploy, to apply any database schema changes.

If you are deploying to heroku you might want to do the same thing. The heroku “release phase” feature makes this possible. (Introduced in 2017, the release phase feature is one of heroku’s more recent major features, as heroku dev has seemed to really stabilize and/or stagnate).

The release phase docs mention “running database schema migrations” as a use case, and there are a few ((1), (2), (3)) blog posts on the web suggesting doing exactly that with Rails. Basically as simple as adding release: bundle exec rake db:migrate to your Procfile.

While some of the blog posts do remind you that “If the Release Phase fails the app will not be deployed”, I have found the implications of this to be more confusing in practice than one would originally assume. Particularly because on heroku changing a config var triggers a release; and it can be confusing to notice when such a release has failed.

It pays to consider the details a bit so you understand what’s going on, and possibly consider somewhat more complicated release logic than simply calling out to rake db:migrate.

1) What if a config var change makes your Rails app unable to boot?

I don’t know how unusual this is, but I actually had a real-world bug like this when in the process of setting up our heroku app. Without confusing things with the details, we can simulate such a bug simply by putting this in, say, config/application.rb:

if ENV['FAIL_TO_BOOT']
  raise "I am refusing to boot"
end

Obviously my real bug was weirder, but the result was the same — with some settings of one or more heroku configuration variables, the app would raise an exception during boot. And we hadn’t noticed this in testing, before deploying to heroku.

Now, on heroku, using CLI or web dashboard, set the config var FAIL_TO_BOOT to “true”.

Without a release phase, what happens?

  • The release is successful! If you look at the release in the dashboard (“Activity” tab) or heroku releases, it shows up as successful. Which means heroku brings up new dynos and shuts down the previous ones, that’s what a release is.
  • The app crashes when heroku tries to start it in the new dynos.
  • The dynos will be in “crashed” state when looked at in heroku ps or dashboard.
  • If a user tries to access the web app, they will get the generic heroku-level “could not start app” error screen (unless you’ve customized your heroku error screens, as usual).
  • You can look in your heroku logs to see the error and stack trace that prevented app boot.

Downside: your app is down.

Upside: It is pretty obvious that your app is down, and (relatively) why.

With a db:migrate release phase, what happens?

The Rails db:migrate rake task has a dependency on the Rails :environment task, meaning it boots the Rails app before executing. You just changed your config variable FAIL_TO_BOOT to “true”, such that the Rails app can’t boot. Changing the config variable triggered a release.

As part of the release, the db:migrate release phase is run… which fails.

  • The release is not successful, it failed.
  • You don’t get any immediate feedback to that effect in response to your heroku config:add command or on the dashboard GUI in the “settings” tab. You may go about your business assuming it succeeded.
  • If you look at the release in heroku releases or dashboard “activity” tab you will see it failed.
  • You do get an email that it failed. Maybe you notice it right away, or maybe you notice it later, and have to figure out “wait, which release failed? And what were the effects of that? Should I be worried?”
  • The effects are:
    • The config variable appears changed in heroku’s dashboard or in response to heroku config:get etc.
    • The old dynos without the config variable change are still running. They don’t have the change. If you open a one-off dyno, it will be using the old release, and have the old (eg) ENV[‘FAIL_TO_BOOT’] value.
    • ANY subsequent attempts at a release will keep failing, so long as the app is in a state (based on the current config variables) where it can’t boot.

Again, this really happened to me! It is a fairly confusing situation.

Upside: Your app is actually still up, even though you broke it, the old release that is running is still running, that’s good?

Downside: It’s really confusing what happened. You might not notice at first. Things remain in a messed up inconsistent and confusing state until you notice, figure out what’s going on, what release caused it, and how to fix it.

It’s a bit terrifying that any config variable change could do this. But I guess most people don’t run into it like I did, since I haven’t seen it mentioned?

2) A heroku pg:promote is a config variable change, that will create a release in which db:migrate release phase fails.

heroku pg:promote is a command that will change which of multiple attached heroku postgreses are attached as the “primary” database, pointed to by the DATABASE_URL config variable.

For a typical app with only one database, you still might use pg:promote for a database upgrade process; for setting up or changing a postgres high-availability leader/follower; or, for what I was experimenting with it for, using heroku’s postgres-log-based rollback feature.

I had assumed that pg:promote was a zero-downtime operation. But, in debugging its interaction with my release phase, I noticed that pg:promote actually creates TWO heroku releases.

  1. First it creates a release labelled Detach DATABASE, in which there is no DATABASE_URL configuration variable at all.
  2. Then it creates another release labelled Attach DATABASE, in which the DATABASE_URL configuration variable is defined to its new value.

Why does it do this instead of one release that just changes the DATABASE_URL? I don’t know. My app (like most Rails and probably other apps) can’t actually function without DATABASE_URL set, so if that first release ever actually runs, it will just error out. Does this mean there’s an instant with a “bad” release deployed, that pg:promote isn’t actually zero-downtime? I am not sure, it doesn’t seem right (I did file a heroku support ticket asking….).

But under normal circumstances, either it’s not a problem, or most people(?) don’t notice.

But what if you have a db:migrate release phase?

When it tries to do release (1) above, that release will fail. Because it tries to run db:migrate, and it can’t do that without a DATABASE_URL set, so it raises, the release phase exits in an error condition, the release fails.

Actually what happens is that without DATABASE_URL set, the Rails app will assume a postgres URL in a “default” location, try to connect to it, and fail, with an error message (hello googlers?), like:

ActiveRecord::ConnectionNotEstablished: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

Now, release (2) is coming down the pike seconds later, this is actually fine, and will be zero outage. We had a release that failed (so never was deployed), and seconds later the next correct release succeeds. Great!

The only problem is that we got an email notifying us that release 1 failed, and it’s also visible as failing in the heroku release list, etc.

A “background” release failure (one not in response to a git push or other code push to heroku) is already a confusing situation — and a “false positive” that actually means “nothing unexpected or problematic happened, just ignore this and carry on” is… really not something I want. (I call this the “error notification crying wolf” — I try to make sure my error notifications never do it, because it takes your time away from flow unnecessarily, and/or makes it much harder to stay vigilant to real errors).

Now, there is a fairly simple solution to this particular problem. Here’s what I did. I changed my heroku release phase from rake db:migrate to a custom rake task, say release: bundle exec rake my_custom_heroku_release_phase, defined like so:

task :my_custom_heroku_release_phase do
  if ENV['DATABASE_URL']
    Rake::Task["db:migrate"].invoke
  else
    $stderr.puts "\n!!! WARNING, no ENV['DATABASE_URL'], not running rake db:migrate as part of heroku release !!!\n\n"
  end
end

Now that release (1) above at least won’t fail, it has the same behavior as a “traditional” heroku app without a release phase.

Swallow-and-report all errors?

When a release fails because a release phase has failed as result of a git push to heroku, that’s quite clear and fine!

But the confusion of the “background” release failure, triggered by a config var change, is high enough that part of me wants to just rescue StandardError in there, and prevent a failed release phase from ever exiting with a failure code, so heroku will never use a db:migrate release phase to abort a release.

Just return the behavior to the pre-release-phase heroku behavior — you can put your app in a situation where it will be crashed and not work, but maybe that’s better than a mysterious, inconsistent heroku app state that happens in the background and that you find out about only through asynchronous email notifications from heroku that are difficult to understand/diagnose. It’s all much more obvious.

On the other hand, if a db:migrate has failed not because of some unrelated boot process problem (one that is going to keep the app from launching even if it were released), but simply because the db:migrate itself actually failed… you kind of want the release to fail? That’s good? Keep the old release running, not a new release with code that expects a db migration that didn’t happen?

So I’m not really sure.

If you did want to rescue-swallow-and-notify, the custom rake task for your heroku release logic — instead of just telling heroku to run a standard thing like db:migrate on release — is certainly convenient.
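
If I did go that route, the task might grow into something like this. A sketch, not something I’m running — the error-reporting call is a placeholder for whatever service you use:

```ruby
task :my_custom_heroku_release_phase do
  if ENV["DATABASE_URL"]
    begin
      Rake::Task["db:migrate"].invoke
    rescue StandardError => e
      # Swallow the error so the release still succeeds, but shout
      # about it on stderr and to an error service (placeholder call):
      $stderr.puts "!!! db:migrate failed in release phase: #{e.class}: #{e.message}"
      # ErrorService.notify(e)  # placeholder for your error reporter
    end
  else
    $stderr.puts "!!! WARNING, no ENV['DATABASE_URL'], not running rake db:migrate !!!"
  end
end
```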

Also, do you really always want to db:migrate anyway? What about db:schema:load?

Another alternative… if you are deploying an app with an empty database, standard Rails convention is to run rails db:schema:load instead of db:migrate. The db:migrate will probably work anyway, but will be slower, and somewhat more error-prone.

I guess this could come up on heroku with an initial deploy or (for some reason) a database that’s been nuked and restarted, or perhaps a Heroku “Review app”? (I don’t use those yet)

stevenharman has a solution that actually checks the database, and runs the appropriate rails task depending on state, here in this gist.

I’d probably do it as a rake task instead of a bash file if I were going to do that. I’m not doing it at all yet.
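
A rake-flavored sketch of the same idea — untested; the task name `smart_release` is my own, and checking for a populated schema_migrations table is only an approximation of stevenharman’s more careful checks:

```ruby
task :smart_release => :environment do
  if !ENV["DATABASE_URL"]
    # Avoid failing the release during e.g. the pg:promote detach step:
    $stderr.puts "No DATABASE_URL, skipping db setup on this release"
  elsif ActiveRecord::Base.connection.table_exists?("schema_migrations") &&
        ActiveRecord::SchemaMigration.count > 0
    # Existing database: apply any pending migrations.
    Rake::Task["db:migrate"].invoke
  else
    # Empty database: load the schema directly instead.
    Rake::Task["db:schema:load"].invoke
  end
end
```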

Note that stevenharman’s solution will actually catch a non-existing or non-connectable database and not try to run migrations… but it will print an error message and exit 1 in that case, failing the release — meaning that you will get a failed release in the pg:promote case mentioned above!

Code that Lasts: Sustainable And Usable Open Source Code

A presentation I gave at online conference Code4Lib 2021, on Monday March 21.

I have realized that the open source projects I am most proud of are a few that have existed for years now, increasing in popularity, with very little maintenance required. Including traject and bento_search. While community aspects matter for open source sustainability, the task gets so much easier when the code requires less effort to keep alive, for maintainers and utilizers. Using these projects as examples, can we as developers identify what makes code “inexpensive” to use and maintain over the long haul with little “churn”, and how to do that?

Slides on Google Docs

Rough transcript (really the script I wrote for myself)

Hi, I’m Jonathan Rochkind, and this is “Code that Lasts: Sustainable and Usable Open Source Code”

So, who am I? I have been developing open source library software since 2006, mainly in ruby and Rails. 

Over that time, I have participated in a variety of open source projects meant to be used by multiple institutions, and I’ve often seen us having challenges with long-term maintenance sustainability and usability of our software. This includes in projects I have been instrumental in creating myself, we’ve all been there!

We’re used to thinking of this problem in terms of needing more maintainers.

But let’s first think more about what the situation looks like, before we assume what causes it. In addition to features or changes people want not getting done, it also can look like, for instance:

Being stuck using out-of-date dependencies like old, even end-of-lifed, versions of Rails or ruby.

A reduction in software “polish” over time. 

What do I mean by “polish”?

Engineer Richard Schneeman writes: [quote] “When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent.” 

I have noticed that software can start out very well polished, but over time lose that polish. 

This usually goes along with decreasing “cohesion” in software over time, a feeling like that different parts of the software start to no longer tell the developer a consistent story together. 

While there can be an element of truth in needing more maintainers in some cases – zero maintainers is obviously too few — there are also ways that increasing the number of committers or maintainers can result in diminishing returns and additional challenges.

One of the theses of Fred Brooks famous 1975 book “The Mythical Man-Month” is sometimes called ”Brooks Law”:  “under certain conditions, an incremental person when added to a project makes the project take more, not less time.”

Why? One of the main reasons Brooks discusses is the additional time taken for communication and coordination between more people — with every person you add, the number of connections between people goes up combinatorially.

That may explain the phenomenon we sometimes see with so-called “Design  by committee” where “too many cooks in the kitchen” can produce inconsistency or excessive complexity.

Cohesion and polish require a unified design vision— that’s  not incompatible with increasing numbers of maintainers, but it does make it more challenging because it takes more time to get everyone on the same page, and iterate while maintaining a unifying vision.  (There’s also more to be said here about the difference between just a bunch of committers committing PR’s, and the maintainers role of maintaining historical context and design vision for how all the parts fit together.)

Instead of assuming adding more committers or maintainers is the solution, can there instead be ways to reduce the amount of maintenance required?

I started thinking about this when I noticed a couple projects of mine which had become more widely successful than I had any right to expect, considering how little maintenance was being put into them.

Bento_search is a toolkit for searching different external search engines in a consistent way. It’s especially but not exclusively for displaying multiple search results in “bento box” style, which is what Tito Sierra from NCSU first called these little side by side search results. 

I wrote bento_search for use at a former job in 2012. 55% of all commits to the project were made in 2012; 95% of all commits in 2016 or earlier. (I gave it a bit of attention for a contracting project in 2016).

But bento_search has never gotten a lot of maintenance, and I don’t use it anymore myself. It’s not in wide use, but I found it kind of amazing when I saw people giving me credit in conference presentations for the gem (thanks!), when I didn’t even know they were using it and I hadn’t been paying it any attention at all! It’s still used by a handful of institutions for whom it just works with little attention from maintainers. (The screenshot is from Cornell University Libraries.)

Traject is a Marc-to-Solr indexing tool written in ruby  (or, more generally, can be a general purpose extract-transform-load tool), that I wrote with Bill Dueber from the University of Michigan in 2013. 

We hoped it would catch on in the Blacklight community, but for the first couple years, its uptake was slow.

However, since then, it has come to be pretty popular in the Blacklight and Samvera communities, and with a few other library technologists. You can see the spikes of commit activity in the graph for a 2.0 release in 2015 and a 3.0 release in 2018 — but for the most part, at other times, nobody has really been spending much time on maintaining traject. Every once in a while a community member submits a minor Pull Request, and it’s usually me who reviews it. Bill and I remain the only maintainers.

And yet traject just keeps plugging along, picking up adoption and working well for adopters.  

So, this made me start thinking, based on what I’ve seen in my career, what are some of the things that might make open source projects both low-maintenance and successful in their adoption and ease-of-use for developers?

One thing both of these projects did was take backwards compatibility very seriously. 

The first step there is following “semantic versioning,” a set of rules whose main point is that releases can’t include backwards incompatible changes unless they are a new major version, like going from 1.x to 2.0.

This is important, but it’s not alone enough to minimize backwards incompatible changes that add maintenance burden to the ecosystem. If the real goal is preventing the pain of backwards incompatibility, we also need to limit the number of major version releases, and limit the number and scope of backwards breaking changes in each major release!

The Bento_search gem has only had one major release — it’s never had a 2.0 release, and it’s still backwards compatible to its initial release.

Traject is on a 3.X release after 8 years, but the major releases of traject have had extremely few backwards breaking changes, most people could upgrade through major versions changing very little or most often nothing in their projects. 

So OK, sure, everyone wants to minimize backwards incompatibility, but that’s easy to say, how do you DO it? Well, it helps to have less code overall, that changes less often overall — ok, again, great, but how do you do THAT?

Parsimony is a word in general English that means “The quality of economy or frugality in the use of resources.”

In terms of software architecture, it means having as few as possible moving parts inside your code: fewer classes, types, components, entities, whatever: Or most fundamentally, I like to think of it in terms of minimizing the concepts in the mental model a programmer needs to grasp how the code works and what parts do what.

The goal of architecture design is, what is the smallest possible architecture we can create to make [quote] “simple things simple and complex things possible”, as computer scientist Alan Kay described the goal of software design. 

We can see this in bento_search, which has very few internal architectural concepts.

The main thing bento_search does is provide a standard API for querying a search engine and representing the results of a search. These are consistent across different search engines, with a common metadata vocabulary for what results look like. This makes search engines interchangeable to calling code. And then it includes half a dozen or so search engine implementations for services I needed or wanted to evaluate when I wrote it.

This search engine API at the ruby level can be used all by itself, even without the next part: the actual “bento style” part, which is built-in support for displaying search engine results in boxes on a page of your choice in a Rails app, writing very little boilerplate code.

Traject has an architecture which basically has just three parts at the top.

There is a reader which sends objects into the pipeline. 

There are some indexing rules, which are transformation steps that build an output Hash object from each source object.

And then a writer, which translates the Hash object to write it to some store, such as Solr.

The reader, transformation steps, and writer are all independent and uncaring about each other, and can be mixed and matched.  
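A toy sketch of that three-part shape in plain Ruby (a hypothetical simplified version, not traject's real classes) might look like:

```ruby
# Reader: anything that can yield source objects.
reader = [{ "title" => "Hamlet" }, { "title" => "Macbeth" }]

# Indexing rules: transformation steps from source object to output Hash.
rules = [
  ->(source, output) { output["title_display"] = source["title"] },
  ->(source, output) { output["title_sort"]    = source["title"].downcase }
]

# Writer: translates each output Hash to some store (here, just an
# array standing in for Solr).
written = []
writer = ->(output) { written << output }

# The pipeline itself. Each part only touches the Hash interface
# between them, so readers, rules, and writers can be mixed and matched.
reader.each do |source|
  output = {}
  rules.each { |rule| rule.call(source, output) }
  writer.call(output)
end
```

Because nothing here knows anything about its neighbors beyond "source object in, Hash out", swapping the array reader for a MARC file reader, or the array writer for a Solr writer, changes nothing else.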

That’s MOST of traject right there. It seems simple and obvious once you have it, but it can take a lot of work to end up with what’s simple and obvious in retrospect! 

When designing code I’m often reminded of the apocryphal quote: “I would have written a shorter letter, but I did not have the time”

And, to be fair, there's a lot of complexity within that "indexing rules" step in traject, but its design was approached the same way. We have use cases about supporting configuration settings in a file or on the command line, or about allowing re-usable custom transformation logic: what's the simplest possible architecture we can come up with to support those cases?

OK, again, that sounds nice, but how do you do it? I don't have a paint-by-numbers, but I can say that for both these projects I took some time – a few weeks even – at the beginning to work out these architectures: lots of diagramming, some prototyping I was prepared to throw out, and in some cases "documentation-driven design", where I wrote some docs for code I hadn't written yet. For traject, it was invaluable to have Bill Dueber at the University of Michigan also interested in spending some design time up front, bouncing ideas back and forth – actually intentionally going through an architectural design phase before the implementation.

Figuring out a good parsimonious architecture takes domain knowledge: What things your “industry” – other potential institutions — are going to want to do in this area, and specifically what developers are going to want to do with your tool. 


We're maybe used to thinking of "use cases" in terms of end-users, but it can be useful at the architectural design stage to formalize this in terms of developer use cases. What is a developer going to want to do, and how can I come up with a small number of software pieces she can assemble to do those things?

When we said "make simple things simple and complex things possible", we can say that domain analysis and use cases are how we identify which things we're going to put in each of those categories, or neither.

The "simple thing" for bento_search, for instance, is just "do a simple keyword search in a search engine, and display results, without having the calling code need to know anything about the specifics of that search engine."

Another way to get a head-start on solid domain knowledge is to start with another tool you have experience with, that you want to create a replacement for. Before Traject, I and other users used a tool written in Java called SolrMarc —  I knew how we had used it, and where we had had roadblocks or things that we found harder or more complicated than we’d like, so I knew my goals were to make those things simpler.

We're used to hearing arguments about avoiding rewrites, but like most things in software engineering, there can be pitfalls at either extreme.

I was amused to notice, Fred Brooks in the previously mentioned Mythical Man Month makes some arguments in both directions. 

Brooks famously warns about a “second-system effect”, the [quote] “tendency of small, elegant, and successful systems to be succeeded by over-engineered, bloated systems, due to inflated expectations and overconfidence” – one reason to be cautious of a rewrite. 

But Brooks in the very same book ALSO writes [quote] “In most projects, the first system built is barely usable….Hence plan to throw one away; you will, anyhow.”

It's up to us to figure out when we're in which case. I personally think an application is more likely to be bitten by the "second-system effect" danger of a rewrite, while a shared re-usable library is more likely to benefit from a rewrite (in part because a reusable library is harder to change in place without disruption!).

We could sum up a lot of different principles as variations of "keep it small".

Both traject and bento_search are tools that developers can use to build something. Bento_search just puts search results in a box on a page; the developer is responsible for the page and an overall app. 

Yes, this means that you have to be a ruby developer to use it. Does this limit its audience? While we might aspire to make tools that even not-really-developers can just use out of the box, my experience has been that our open source attempts at shrinkwrapped "solutions" often end up still needing development expertise to successfully deploy. Keeping our tools simple and small, and not trying to supply a complete app, can actually leave more time for these developers to focus on meeting local needs, instead of fighting with a complicated framework that doesn't do quite what they need.

It also means we can limit interactions with any external dependencies. Traject was developed for use with a Blacklight project, but traject code does not refer to Blacklight or even Rails at all, which means new releases of Blacklight or Rails can’t possibly break traject. 

Bento_search, by doing one thing and not caring about the details of its host application, has kept working from Rails 3.2 all the way up to the current Rails 6.1, with pretty much no changes needed except to the test suite setup.

Sometimes when people try to have lots of small tools working together, it can turn into a nightmare where you get a pile of cascading software breakages every time one piece changes. Keeping assumptions and couplings down is what lets us avoid this maintenance nightmare. 

And another way of keeping it small is: don't be afraid to say "no" to features when you can't figure out how to fit them in without serious harm to the parsimony of your architecture. Your domain knowledge is what lets you take an educated guess as to which features are core to your audience and need to be accommodated, and which are edge cases that can be fulfilled by extension points, or sometimes not at all.

By extension points, we mean that we prefer giving developer-users opportunities to write their own code that works with your tools, rather than trying to build less commonly needed features in as configurable features.

As an example, Traject does include some built-in logic, but one of its extension-point use cases is making sure it's simple to add whatever transformation logic a developer-user wants, and have it look just as "built-in" as what came with traject. And since traject makes it easy to write your own reader or writer, its built-in readers and writers don't need to include every possible feature – we plan for developers writing their own if they need something else.
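To illustrate how custom logic can look just as built-in: in a DSL like traject's, a built-in macro is really just a method that returns a lambda, and a user-supplied block is stored the same way. Here's a toy sketch of that pattern (hypothetical names, not traject's actual implementation):

```ruby
class Indexer
  def initialize
    @rules = {}
  end

  # A "built-in" macro is just a method returning a lambda...
  def copy_field(key)
    ->(source) { source[key] }
  end

  # ...and to_field stores a user-supplied block exactly the same
  # way it stores a built-in macro's lambda.
  def to_field(name, rule = nil, &block)
    @rules[name] = rule || block
  end

  def map_record(source)
    @rules.transform_values { |rule| rule.call(source) }
  end
end

indexer = Indexer.new
indexer.to_field "title", indexer.copy_field("title")           # built-in logic
indexer.to_field("title_upcase") { |src| src["title"].upcase }  # custom logic
indexer.map_record("title" => "Hamlet")
# => {"title"=>"Hamlet", "title_upcase"=>"HAMLET"}
```

Since both paths end up as the same kind of object in the same registry, the custom rule is indistinguishable from the built-in one at runtime.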

Looking at bento_search, it makes it easy to write your own search engine adapter, which will be usable interchangeably with the built-in ones. Also, bento_search provides a standard way to add custom search arguments specific to a particular adapter – these won't be directly interchangeable with other adapters, but they are provided for in the architecture, and won't break in future bento_search releases. It's another form of extension point.

These extension points are the second half of "simple things simple, complex things possible": the complex things possible. Planning for them is part of understanding your developer use cases, and designing an architecture that can easily handle them. Ideally, it takes no extra layers of abstraction to handle them; you are using the exact architectural join points the out-of-the-box code is using, just supplying custom components.

So here’s an example of how these things worked out in practice with traject, pretty well I think.

Stanford ended up writing a package of extensions to traject called TrajectPlus, to take care of some features they needed that traject didn’t provide. Commit history suggests it was written in 2017, which was Traject 2.0 days.  

I can’t recall, but I’d guess they approached me with change requests to traject at that time and I put them off because I couldn’t figure out how to fit them in parsimoniously, or didn’t have time to figure it out. 

But the fact that they were *able* to extend traject in this way I consider a validation of traject’s architecture, that they could make it do what they needed, without much coordination with me, and use it in many projects (I think beyond just Stanford). 

Much of the 3.0 release of traject was “back-port”ing some features that TrajectPlus had implemented, including out-of-the-box support for XML sources. But I didn’t always do them with the same implementation or API as TrajectPlus – this is another example of being able to use a second go at it to figure out how to do something even more parsimoniously, sometimes figuring out small changes to traject’s architecture to support flexibility in the right dimensions. 

When Traject 3.0 came out – the TrajectPlus users didn’t necessarily want to retrofit all their code to the new traject way of doing it. But TrajectPlus could still be used with traject 3.0 with few or possibly no changes, doing things the old way, they weren’t forced to upgrade to the new way. This is a huge win for traject’s backwards compat – everyone was able to do what they needed to do, even taking separate paths, with relatively minimized maintenance work. 

As I think about these things philosophically, one of my takeaways is that software engineering is still a craft, and software design is a serious thing to be studied and engaged in. Especially for shared libraries rather than local apps, it's not always to be dismissed as so-called "bike-shedding".

It's worth it to take time to think about design, self-reflectively and with your peers, instead of just rushing to put out fires or deliver features; it will reduce maintenance costs and increase value over the long term.

And I want to just briefly plug “kithe”, a project of mine which tries to be guided by these design goals to create a small focused toolkit for building Digital Collections applications in Rails. 

I could easily talk about all of this for another twenty minutes, but that's our time! I'm always happy to talk more; find me on slack or IRC or email.

This last slide has some sources mentioned in the talk. Thanks for your time! 

Product management

In my career working in the academic sector, I have realized that one thing that is often missing from in-house software development is “product management.”

But what does that mean exactly? You don’t know it’s missing if you don’t even realize it’s a thing and people can use different terms to mean different roles/responsibilities.

Basically, it means deciding what the software should do. This is not about colors on screen or margins (which our stakeholders often enjoy micro-managing); I'd consider those still the how of doing it, rather than the what to do. The what is often at a much higher level: what features or components to develop at all.

When done right, it is going to be based on both knowledge of end-users' needs and preferences (user research), and knowledge of internal stakeholders' desires and preferences (overall organizational strategy, but also just practically what is going to make the right people happy to keep us resourced). Also knowledge of local capacity: what pieces do we need to put in place to get these things developed? When done seriously, it will necessarily involve prioritization: there are many things we could possibly do, some subset of which we very well may do eventually, but which ones should we do now?

My experience tells me it is a very big mistake to try to have a developer doing this kind of product management. Not because a developer can't have the right skillset to do it, but because having the same person leading development and product management is a mistake. The developer is too close to the development lens, and there's just a clarification that happens when these roles are separate.

My experience also tells me that it’s a mistake to have a committee doing these things, much as that is popular in the academic sector. Because, well, just of course it is.

But okay this is all still pretty abstract. Things might become more clear if we get more specific about the actual tasks and work of this kind of product management role.

I found Damilola Ajiboye's blog post on "Product Manager vs Product Marketing Manager vs Product Owner" very clear and helpful here. While it is written to distinguish between three different product-management-related roles, Ajiboye also acknowledges that in a smaller organization "a product manager is often tasked with the duty of these 3 roles."

Regardless of whether the responsibilities are to be done by one, two, or three people, Ajiboye's post serves as a concise listing of the work to be done in managing a product: deciding the what of the product, in an ongoing iterative and collaborative manner, so that developers and designers can get to the how and to implementation.

I recommend reading the whole article, and I’ll excerpt much of it here, slightly rearranged.

The Product Manager

These individuals are often referred to as mini CEOs of a product. They conduct customer surveys to figure out the customer’s pain and build solutions to address it. The PM also prioritizes what features are to be built next and prepares and manages a cohesive and digital product roadmap and strategy.

The Product Manager will interface with the users through user interviews/feedback surveys or other means to hear directly from the users. They will come up with hypotheses alongside the team and validate them through prototyping and user testing. They will then create a strategy on the feature and align the team and stakeholders around it. The PM who is also the chief custodian of the entire product roadmap will, therefore, be tasked with the duty of prioritization. Before going ahead to carry out research and strategy, they will have to convince the stakeholders if it is a good choice to build the feature in context at that particular time or wait a bit longer based on the content of the roadmap.

The Product Marketing Manager
The PMM communicates vital product value — the “why”, “what” and “when” of a product to intending buyers. He manages the go-to-market strategy/roadmap and also oversees the pricing model of the product. The primary goal of a PMM is to create demand for the products through effective messaging and marketing programs so that the product has a shorter sales cycle and higher revenue.

The product marketing manager is tasked with market feasibility and discovering if the features being built align with the company’s sales and revenue plan for the period. They also make research on how sought-after the feature is being anticipated and how it will impact the budget. They communicate the values of the feature; the why, what, and when to potential buyers — In this case users in countries with poor internet connection.

[While expressed in terms of a for-profit enterprise selling something, I think it's not hard to translate this to a non-profit or academic environment. You still have an audience whose uptake you need to be successful, whether internal or external. — jrochkind ]

The Product Owner
A product owner (PO) maximizes the value of a product through the creation and management of the product backlog, creation of user stories for the development team. The product owner is the customer’s representative to the development team. He addresses customer’s pain points by managing and prioritizing a visible product backlog. The PO is the first point of call when the development team needs clarity about interpreting a product feature to be implemented.

The product owner will first have to prioritize the backlog to see if there are no important tasks to be executed and if this new feature is worth leaving whatever is being built currently. They will also consider the development effort required to build the feature i.e the time, tools, and skill set that will be required. They will be the one to tell if the expertise of the current developers is enough or if more engineers or designers are needed to be able to deliver at the scheduled time. The product owner is also armed with the task of interpreting the product/feature requirements for the development team. They serve as the interface between the stakeholders and the development team.

When you have someone(s) doing these roles well, it ensures that the development team is actually spending time on things that meet user and business needs. I have found that it makes things so much less stressful and more rewarding for everyone involved.

When you have nobody doing these roles, or someone doing it in a cursory or un-intentional way not recognized as part of their core job responsibilities, or a lead developer trying to do it on top of development, I find it leads to feelings of: spinning wheels, everything-is-an-emergency, lack of appreciation, miscommunication and lack of shared understanding between stakeholders and developers, and general burnout and dissatisfaction — and at the root, a product that is not meeting user or business needs well, leading to these inter-personal and personal problems.

Rails auto-scaling on Heroku

We are investigating moving our medium-small-ish Rails app to heroku.

We looked at both the Rails Autoscale add-on available on the heroku marketplace, and the hirefire.io service, which is not listed on the heroku marketplace; I almost didn't realize it existed.

I guess hirefire.io doesn't have any kind of partnership with heroku, but it still uses the heroku API to provide an autoscale service. hirefire.io ended up looking more fully-featured and lower priced than Rails Autoscale; so the main service of this post is just trying to increase visibility of hirefire.io, and therefore competition in the field, which benefits us consumers.

Background: Interest in auto-scaling Rails background jobs

At first I didn’t realize there was such a thing as “auto-scaling” on heroku, but once I did, I realized it could indeed save us lots of money.

I am more interested in scaling Rails background workers than I am web workers, though — our background workers are busiest when we are doing "ingests" into our digital collections/digital asset management system, so the work is highly variable. Auto-scaling up when there is ingest work piling up can give us really nice ingest throughput while keeping costs low.

On the other hand, our web traffic is fairly low and probably isn't going to go up by an order of magnitude (non-profit cultural institution here). And after discovering that a "standard" dyno is just too slow, we will likely be running a performance-m or performance-l anyway — which likely can handle all anticipated traffic on its own. If we have an auto-scaling solution, we might configure it for web dynos, but we are especially interested in good features for background scaling.

There is a heroku built-in autoscale feature, but it only works for performance dynos, and won’t do anything for Rails background job dynos, so that was right out.

The Rails Autoscale add-on on the heroku marketplace could work for Rails bg jobs; and then we found hirefire.io.

Pricing: Pretty different

hirefire

As of now January 2021, hirefire.io has pretty simple and affordable pricing. $15/month/heroku application. Auto-scaling as many dynos and process types as you like.

hirefire.io by default only checks your app's metrics once per minute to decide if a scaling event should occur. If you want checks more frequent than that (up to once every 15 seconds), you have to pay an additional $10/month, for $25/month/heroku application.

Even though it is not a heroku add-on, hirefire does advertise that they bill pro-rated to the second, just like heroku and heroku add-ons.

Rails autoscale

Rails autoscale has a more tiered approach to pricing that is based on number and type of dynos you are scaling. Starting at $9/month for 1-3 standard dynos, the next tier up is $39 for up to 9 standard dynos, all the way up to $279 (!) for 1 to 99 dynos. If you have performance dynos involved, from $39/month for 1-3 performance dynos, up to $599/month for up to 99 performance dynos.

For our anticipated uses… if we only scale bg dynos, I might want to scale from (low) 1 or 2 to (high) 5 or 6 standard dynos, so we’d be at $39/month. Our web dynos are likely to be performance and I wouldn’t want/need to scale more than probably 2, but that puts us into performance dyno tier, so we’re looking at $99/month.

This is of course significantly more expensive than hirefire.io’s flat rate.

Metric Resolution

Since Hirefire has an additional charge for finer than 1-minute resolution on its autoscaling checks, we'll discuss resolution here in this section too. Rails Autoscale has the same resolution for all tiers, and I think it's generally 10 seconds, so approximately the same as hirefire if you pay the extra $10 for increased resolution.

Configuration

Let’s look at configuration screens to get a sense of feature-sets.

Rails Autoscale

web dynos

To configure web dynos, here’s what you get, with default values:

The metric Rails Autoscale uses for scaling web dynos is time in heroku routing queue, which seems right to me — when things are spending longer in heroku routing queue before getting to a dyno, it means scale up.

worker dynos

For scaling worker dynos, Rails Autoscale can scale dyno type named “worker” — it can understand ruby queuing libraries Sidekiq, Resque, Delayed Job, or Que. I’m not certain if there are options for writing custom adapter code for other backends.

Here’s what the configuration options are — sorry these aren’t the defaults, I’ve already customized them and lost track of what defaults are.

You can see that worker dynos are scaled based on the metric “number of jobs queued”, and you can tell it to only pay attention to certain queues if you want.

Hirefire

Hirefire has far more options for customization than Rails Autoscale, which can make it a bit overwhelming, but also potentially more powerful.

web dynos

You can actually configure as many Heroku process types as you have for autoscale, not just ones named “web” and “worker”. And for each, you have your choice of several metrics to be used as scaling triggers.

For web, I think Queue Time (percentile, average) matches what Rails Autoscale does, configured to percentile, 95, and is probably the best to use unless you have a reason to use another. (“Rails Autoscale tracks the 95th percentile queue time, which for most applications will hover well below the default threshold of 100ms.“)

Here’s what configuration Hirefire makes available if you are scaling on “queue time” like Rails Autoscale, configuration may vary for other metrics.

I think if you fill in the right numbers, you can configure to work equivalently to Rails Autoscale.

worker dynos

If you have more than one heroku process type for workers — say, working on different queues — Hirefire can scale them independently, with entirely separate configuration. This is pretty handy, and I don't think Rails Autoscale offers this. (Update: I may be wrong; Rails Autoscale says they do support this, so check on it yourself if it matters to you.)

For worker dynos, you could choose to scale based on actual “dyno load”, but I think this is probably mostly for types of processes where there isn’t the ability to look at “number of jobs”. A “number of jobs in queue” like Rails Autoscale does makes a lot more sense to me as an effective metric for scaling queue-based bg workers.

Hirefire's metric is slightly different than Rails Autoscale's "jobs in queue". For recognized ruby queue systems (a larger list than Rails Autoscale's; and you can write your own custom adapter for whatever you like), it actually measures jobs in queue plus workers currently busy. So queued-plus-in-progress, rather than Rails Autoscale's just-queued. I have a bit of trouble wrapping my head around the implications of this, but basically it means that Hirefire's "jobs in queue" metric strategy is intended to scale all the way to emptying your queue, or reaching your max scale limit, whichever comes first. I think this may make sense and work out at least as well as, or perhaps better than, Rails Autoscale's approach?

Here’s what configuration Hirefire makes available for worker dynos scaling on “job queue” metric.

Since the metric isn't the same as Rails Autoscale's, we can't configure this to work identically. But there are a whole bunch of configuration options, some similar to Rails Autoscale's.

The most important thing here is that “Ratio” configuration. It may not be obvious, but with the way the hirefire metric works, you are basically meant to configure this to equal the number of workers/threads you have on each dyno. I have it configured to 3 because my heroku worker processes use resque, with resque_pool, configured to run 3 resque workers on each dyno. If you use sidekiq, set ratio to your configured concurrency — or if you are running more than one sidekiq process, processes*concurrency. Basically how many jobs your dyno can be concurrently working is what you should normally set for ‘ratio’.
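In other words, with the queued-plus-in-progress metric and ratio set to per-dyno concurrency, the arithmetic works out to roughly this (a hypothetical sketch of the logic, not hirefire's actual code):

```ruby
# Hypothetical sketch of the scaling arithmetic, not hirefire's code.
# ratio = number of jobs one dyno can work concurrently.
def target_dynos(queued:, in_progress:, ratio:, min:, max:)
  metric = queued + in_progress          # hirefire-style "job queue" metric
  (metric.to_f / ratio).ceil.clamp(min, max)
end

# e.g. 3 resque workers per dyno (ratio 3), 7 jobs queued, 3 in progress:
target_dynos(queued: 7, in_progress: 3, ratio: 3, min: 1, max: 6)
# => 4  (10 jobs / 3 per dyno, rounded up)
```

You can see why ratio should equal per-dyno concurrency: the formula is aiming for exactly enough dynos to have every outstanding job being worked at once, i.e. scaling toward an empty queue.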

Hirefire not a heroku plugin

Hirefire isn’t actually a heroku plugin. In addition to that meaning separate invoicing, there can be some other inconveniences.

Since hirefire can only interact with the heroku API, for some metrics (including the "queue time" metric that is probably optimal for web dyno scaling) you have to configure your app to log regular statistics to heroku's "Logplex" system. This can add a lot of noise to your log, and for heroku logging add-ons that are tiered based on number of log lines or bytes, it can push you up to higher pricing tiers.

If you use papertrail, I think you should be able to use its log filtering feature to solve this, keeping that noise out of your logs and avoiding impact on log data transfer limits. However, if you ever have cause to look at heroku's raw logs, that noise will still be there.

Support and Docs

I asked a couple questions of both Hirefire and Rails Autoscale as part of my evaluation, and got back well-informed and easy-to-understand answers quickly from both. Support for both seems to be great.

I would say the documentation is decent-but-not-exhaustive for both products. Hirefire may have slightly more complete documentation.

Other Features?

There are other things you might want to compare, various kinds of observability (bar chart or graph of dynos or observed metrics) and notification. I don’t have time to get into the details (and didn’t actually spend much time exploring them to evaluate), but they seem to offer roughly similar features.

Conclusion

Rails Autoscale is quite a bit more expensive than hirefire.io’s flat rate, once you get past Rails Autoscale’s most basic tier (scaling no more than 3 standard dynos).

It's true that autoscaling saves you money over not autoscaling, so even an expensive price could be considered a 'cut' of that, and possibly for many ecommerce sites even $99 a month might be a drop in the bucket (!)…. but this price difference is so significant with hirefire (which has a flat rate regardless of dynos) that it seems to me it would take a lot of additional features/value to justify.

And it’s not clear that Rails Autoscale has any feature advantage. In general, hirefire.io seems to have more features and flexibility.

Until 2021, hirefire.io could only analyze metrics with 1-minute resolution, so perhaps finer resolution was a "killer feature" for Rails Autoscale back then?

Honestly I wonder if this price difference is sustained by Rails Autoscale only because most customers aren’t aware of hirefire.io, it not being listed on the heroku marketplace? Single-invoice billing is handy, but probably not worth $80+ a month. I guess hirefire’s logplex noise is a bit inconvenient?

Or is there something else I’m missing? Pricing competition is good for the consumer.

And are there any other heroku autoscale solutions, that can handle Rails bg job dynos, that I still don’t know about?

Update, a day after writing: djcp in a reddit thread writes:

I used to be a principal engineer for the heroku add-ons program.

One issue with hirefire is they request account level oauth tokens that essentially give them ability to do anything with your apps, where Rails Autoscaling worked with us to create a partnership and integrate with our “official” add-on APIs that limits security concerns and are scoped to the application that’s being scaled.

Part of the reason for hirefire working the way it does is historical, but we’ve supported the endpoints they need to scale for “official” partners for years now.

A lot of heroku customers use hirefire so please don’t think I’m spreading FUD, but you should be aware you’re giving a third party very broad rights to do things to your apps. They probably won’t, of course, but what if there’s a compromise?

“Official” add-on providers are given limited scoped tokens to (mostly) only the actions / endpoints they need, minimizing blast radius if they do get compromised.

You can read some more discussion at that thread.

Managed Solr SaaS Options

I was recently looking for managed Solr "software-as-a-service" (SaaS) options, and had trouble figuring out what was out there, so I figured I'd share what I learned. My knowledge here is far from exhaustive, and I have only looked seriously at one of the ones I found.

The only managed Solr options I found were: WebSolr; SearchStax; and OpenSolr.

Of these, I think WebSolr and SearchStax are the more well-known; I couldn't find anyone with experience with OpenSolr, which perhaps is newer.

Of them all, SearchStax is the only one I actually took for a test drive, so I will have the most to say about it.

Why we were looking

We run a fairly small-scale app, whose infrastructure is currently 4 self-managed AWS EC2 instances, running respectively: 1) a rails web app, 2) bg workers for the rails web app, 3) Postgres, and 4) Solr.

Oh yeah, there's also a redis running on one of those servers, on #3 with pg or #4 with solr, I forget which.

Currently we manage this all ourselves, right on the EC2. But we’re looking to move as much as we can into “managed” servers. Perhaps we’ll move to Heroku. Perhaps we’ll use hatchbox. Or if we do stay on AWS resources we manage directly, we’d look at things like using an AWS RDS Postgres instead of installing it on an EC2 ourselves, an AWS ElastiCache for Redis, maybe look into Elastic Beanstalk, etc.

But no matter what we do, we need a Solr, and we'd like to get it managed. Hatchbox has no special Solr support, and AWS doesn't have a Solr service. Heroku does have a solr add-on, but you can also use any Solr with it; we'll get to that later.

Our current Solr use is pretty small scale. We don’t run “SolrCloud mode“, just legacy ordinary Solr. We only have around 10,000 documents in there (tiny for Solr), our index size is only 70MB. Our traffic is pretty low — when I tried to figure out how low, it doesn’t seem we have sufficient logging turned on to answer that specifically but using proxy metrics to guess I’d say 20K-40K requests a day, query as well as add.

This is a pretty small Solr installation, although it is used centrally for the primary functions of the (fairly low-traffic) app. It currently runs on an EC2 t3a.small, which is a "burstable" EC2 type with only 2G of RAM. It does have two vCPUs (that is, one core with 'hyperthreading'). The t3a.small EC2 instance only costs $14/month at on-demand pricing! We know we'll be paying more for managed Solr, but we want to get out of the business of managing servers; we no longer really have the staff for it.

WebSolr (didn’t actually try out)

WebSolr is the only managed Solr currently listed as a Heroku add-on. It is also available as a managed Solr independent of heroku.

The pricing in the heroku plans vs the independent plans seems about the same. As a heroku add-on there is a $20 “staging” plan that doesn’t exist in the independent plans. (Unlike some other heroku add-ons, no time-limited free plan is available for WebSolr). But once we go up from there, the plans seem to line up.

Starting at: $59/month for:

  • 1 million document limit
  • 40K requests/day
  • 1 index
  • 954MB storage
  • 5 concurrent requests limit (this limit is not mentioned on the independent pricing page?)

Next level up is $189/month for:

  • 5 million document limit
  • 150K requests/day
  • 4.6GB storage
  • 10 concurrent request limit (again concurrent request limits aren’t mentioned on independent pricing page)

As you can see, WebSolr has their plans metered by usage.

$59/month is around the price range we were hoping for (we’ll need two: one for staging, one for production). Our small solr is well under 1 million documents and ~1GB storage, and we only use one index at present. But I’m not sure about the 40K requests/day limit; even if we currently fit under it, we might be pushing up against it.

And the “concurrent request” limit simply isn’t one I’m even used to thinking about; on a self-managed Solr it hasn’t really come up. What does “concurrent” mean exactly in this case, and how is it measured? With 10 puma web workers and sometimes a possibly multi-threaded batch index going on, could we exceed a limit of 5? Seems plausible. What happens when the limit is exceeded? Your Solr request results in an HTTP 429 error!

Do I need to now write the app to rescue those gracefully, or use connection pooling to try to avoid them, or something? Having to rewrite the way our app functions for a particular managed solr is the last thing we want to do. (Although it’s not entirely clear if those connection limits exist on the non-heroku-plugin plans, I suspect they do?).
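If we did end up having to handle those, one generic way to cope (a sketch of my own, not anything WebSolr provides or anything our app actually does today) would be a small retry-with-backoff wrapper around solr calls — the error class and client call in the comment are hypothetical:

```ruby
# Hypothetical error class our solr client layer would raise when it
# sees an HTTP 429 ("too many concurrent requests") from managed Solr.
class SolrRateLimited < StandardError; end

# Retry a block a few times with exponential backoff when the managed
# Solr reports rate limiting; re-raise once we run out of attempts.
def with_429_retry(max_attempts: 3, base_sleep: 0.5)
  attempts = 0
  begin
    yield
  rescue SolrRateLimited
    attempts += 1
    raise if attempts >= max_attempts
    sleep(base_sleep * (2**attempts)) # back off: 1s, then 2s, then 4s...
    retry
  end
end

# with_429_retry { rsolr_client.get("select", params: { q: "title:foo" }) }
```

Whether that’s actually acceptable for interactive web requests (adding seconds of latency) is another question — which is part of why I’d rather not have the limit at all.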

And in general, I’m not thrilled with the way the pricing works here, or with the price points. I am positive that for a lot of (eg) heroku customers an additional $189*2=$378/month is peanuts not even worth accounting for, but for us, a small non-profit whose app’s traffic does not scale with revenue, that starts to be real money.

It is not clear to me if WebSolr installations (at “standard” plans) are set up in “SolrCloud mode” or not; I’m not sure what API’s exist for uploading your custom schema.xml (which we’d need to do), or if they expect you to do this only manually through a web UI (that would not be good); I’m not sure if you can upload custom solrconfig.xml settings (this may be running on a shared solr instance with standard solrconfig.xml?).

Basically, all of this made WebSolr not the first one we looked at.

Does it matter if we’re on heroku using a managed Solr that’s not a Heroku plugin?

I don’t think so.

In some cases, you can get a better price from a Heroku plug-in than you could get from that same vendor off heroku, or from competitors. But that doesn’t seem to be the case here, and other than that, does it matter?

Well, all heroku plug-ins are required to bill you by-the-minute, which is nice but not really crucial; other forms of billing could also be okay at the right price.

With a heroku add-on, your billing is combined into one heroku invoice, no need to give a credit card to anyone else, and it can be tracked using heroku tools. Which is certainly convenient and a plus, but not essential if the best tool for the job is not a heroku add-on.

And as a heroku add-on, WebSolr provides a WEBSOLR_URL heroku config/env variable automatically to code running on heroku. OK, that’s kind of nice, but it’s not a big deal to set a SOLR_URL heroku config manually referencing the appropriate address. I suppose as a heroku add-on, WebSolr also takes care of securing and authenticating connections between the heroku dynos and the solr, so we need to make sure we have a reasonable way to do this from any alternative.

SearchStax (did take it for a spin)

SearchStax’s pricing tiers are not based on metering usage. There are no limits based on requests/day or concurrent connections. SearchStax runs on individual Solr instances dedicated to you (I would guess running on dedicated-to-you individual (eg) EC2 instances, but I’m not sure). Instead, the pricing is based on the size of the host running Solr.

You can choose to run on instances deployed to AWS, Google Cloud, or Azure. We’ll be sticking to AWS (the others, I think, have a slight price premium).

While SearchStax gives you a pricing page that looks like “new-way-of-doing-things” transparent pricing, in fact there isn’t really enough info on the public pages to see all the price points and understand what you’re getting; there is still a kind of “talk to a salesperson who has a price sheet” thing going on.

What I think I have figured out from talking to a salesperson and support is that the “Silver” plans (“Starting at $19 a month”, although we’ll say more about that in a bit) are basically: we give you a Solr, we don’t provide any technical support for Solr itself.

While the “Gold” plans “from $549/month” are actually about paying for Solr consultants to set up and tune your schema/index etc. That is not something we need, and $549+/month is way more than the price range we are looking for.

While the SearchStax pricing/plan pages kind of imply the “Silver” plan is not suitable for production, in fact I think there is no real reason not to use it for production, and the salesperson I talked to confirmed that — just reaffirming that you’re on your own managing the Solr configuration/setup. That’s fine, that’s what we want; we just don’t want to manage the OS or set up the Solr or upgrade it etc. The Silver plans have no SLA, but as far as I can tell their uptime is just fine. The Silver plans only guarantee 72-hour support response time — but for the couple of support tickets I filed asking questions while under a free 14-day trial (oh yeah, that’s available), I got prompt same-day responses, and knowledgeable responses that answered my questions.

So a “silver” plan is what we are interested in, but the pricing is not actually transparent.

$19/month is for the smallest instance available, and IF you prepay/contract for a year. They call that small instance an NDN1 and it has 1GB of RAM and 8GB of storage. If you pay-as-you-go instead of contracting for a year, that already jumps to $40/month. (That price is available on the trial page).

When you are paying-as-you-go, you are actually billed per-day, which might not be as nice as heroku’s per-minute, but it’s pretty okay, and useful if you need to bring up a temporary solr instance as part of a migration/upgrade or something like that.

The next step up is an “NDN2”, which has 2G of RAM and 16GB of storage, at ~$80/month pay-as-you-go; you can find that price if you sign up for a free trial. The annual-contract price is a discount similar to the NDN1’s 50%: $40/month. That price I got only from a salesperson, so I don’t know if it’s always stable.

It only occurs to me now that they don’t tell you how many CPUs are available.

I’m not sure if I can fit our Solr in the 1G NDN1, but I am sure I can fit it in the 2G NDN2 with some headroom, so I didn’t look at plans above that — but they are available, still under “silver”, with prices going up accordingly.

All SearchStax solr instances run in “SolrCloud” mode — these NDN1 and NDN2 ones we’re looking at just run one node with one zookeeper, but still in cloud mode. There are also “silver” plans available with more than one node in a “high availability” configuration, but the prices start going up steeply, and we weren’t really interested in that.

Because it’s SolrCloud mode though, you can use the standard Solr API for uploading your configuration. It’s just Solr! So no arbitrary usage limits, no features disabled.

The SearchStax web console seems competently implemented; it lets you create and delete individual Solr “deployments”, manage accounts that can log in to the console (on the “silver” plan you only get two, or can pay $10/month/account for more, nah), and set up auth for a solr deployment. They support IP-based authentication or HTTP Basic Auth to the Solr (with no limit to how many Solr Basic Auth accounts you can create). HTTP Basic Auth is great for us, because trying to do IP-based auth from somewhere like heroku isn’t going to work. All Solrs are available over HTTPS/SSL — great!
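In ruby, the simplest way to consume that Basic Auth is usually to embed the credentials in the solr URL you hand to your client (most ruby HTTP/solr clients accept userinfo in the URL). A small sketch — the helper name and env var names are my own inventions, not anything SearchStax or heroku defines:

```ruby
require "uri"
require "erb"

# Build a solr URL with HTTP Basic Auth credentials embedded in it,
# percent-encoding the credentials so special characters are safe
# in the userinfo part of the URL.
def solr_url_with_auth(base_url:, user:, password:)
  uri = URI.parse(base_url)
  uri.user = ERB::Util.url_encode(user)
  uri.password = ERB::Util.url_encode(password)
  uri.to_s
end

# e.g. on heroku, with hypothetical config vars:
# solr_url_with_auth(
#   base_url: ENV["SOLR_URL"],  # e.g. https://xxx.searchstax.com/solr/collection
#   user:     ENV["SOLR_USER"],
#   password: ENV["SOLR_PASSWORD"]
# )
```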

SearchStax also has their own proprietary HTTP API that lets you do most anything, including creating/destroying deployments, managing Solr basic auth users, basically everything. There is some API that duplicates the SolrCloud API for adding configsets; I don’t think there’s a good reason to use it instead of the standard SolrCloud API, although their docs try to point you to it. There are even webhooks of some kind for alerts! (Which I haven’t really explored.)

Basically, SearchStax just seems to be a sane and rational managed Solr option; it has all the features you’d expect/need/want for dealing with such. The prices seem reasonable-ish, generally more affordable than WebSolr, especially if you stay in “silver” and “one node”.

At present, we plan to move forward with it.

OpenSolr (didn’t look at it much)

I have the least to say about this, have spent the least time with it, after spending time with SearchStax and seeing it met our needs. But I wanted to make sure to mention it, because it’s the only other managed Solr I am even aware of. Definitely curious to hear from any users.

Here is the pricing page.

The prices seem pretty decent, perhaps even cheaper than SearchStax, although it’s unclear to me what you get. Does “0 Solr Clusters” mean that it’s not SolrCloud mode? After seeing how useful SolrCloud API’s are for management (and having this confirmed by many of my peers in other libraries/museums/archives who choose to run SolrCloud), I wouldn’t want to do without it. So I guess that pushes us to the “executive” tier? Which at $50/month (billed yearly!) is still just fine, around the same as SearchStax.

But they do limit you to one solr index; I prefer SearchStax’s model of just giving you certain host resources to do what you want with. It does say “shared infrastructure”.

Might be worth investigating, curious to hear more from anyone who did.

Now, what about ElasticSearch?

We’re using Solr mostly because that’s what various collaborative and open source projects in the library/museum/archive world have been doing for years, since before ElasticSearch even existed. So there are various open source libraries and toolsets available that we’re using.

But for whatever reason, there seem to be SO MANY MORE managed ElasticSearch SaaS available. At possibly much cheaper pricepoints. Is this because the ElasticSearch market is just bigger? Or is ElasticSearch easier/cheaper to run in a SaaS environment? Or what? I don’t know.

But there’s the controversial AWS ElasticSearch Service; there’s Elastic Cloud “from the creators of ElasticSearch”. And on Heroku, which lists just one Solr add-on, there are THREE ElasticSearch add-ons listed: ElasticCloud, Bonsai ElasticSearch, and SearchBox ElasticSearch.

If you just google “managed ElasticSearch” you immediately see 3 or 4 other names.

I don’t know enough about ElasticSearch to evaluate them. At first glance the pricing pages seem more affordable, but I may not know what I’m comparing, and may be looking at tiers that aren’t actually usable for anything or that have hidden fees.

But I know there are definitely many more managed ElasticSearch SaaS than Solr.

I think ElasticSearch probably does everything our app needs. If I were to start from scratch, I would definitely consider ElasticSearch over Solr just based on how many more SaaS options there are. While it would require some knowledge-building (I have developed a lot of knowledge of Solr and zero of ElasticSearch) and rewriting some parts of our stack, I might still consider switching to ES in the future; we don’t do anything too too complicated with Solr that would be too too hard to switch to ES, probably.

Gem authors, check your release sizes

Most gems should probably be a couple hundred kb at most. I’m talking about the package actually stored in and downloaded from rubygems by an app using the gem.

After all, source code is just text, and it doesn’t take up much space. OK, maybe some gems have a couple images in there.

But if you are looking at your gem in rubygems and realize that it’s 10MB or bigger… and that it seems to be getting bigger with every release… something is probably wrong and worth looking into.

One way to look into it is to look at the actual gem package. If you use the handy bundler rake task to release your gem (and I recommend it), you have a ./pkg directory in the source tree you last released from. Inside it are .gem files for each release you’ve made from there, unless you’ve cleaned it up recently.

.gem files are just tar files, it turns out, with more tar and gz files inside them. We can extract the contents and use the handy unix utility du -sh to see what is taking up all the space.

How I found the bytes

jrochkind-chf kithe (master ?) $ cd pkg

jrochkind-chf pkg (master ?) $ ls
kithe-2.0.0.beta1.gem        kithe-2.0.0.pre.rc1.gem
kithe-2.0.0.gem            kithe-2.0.1.gem
kithe-2.0.0.pre.beta1.gem    kithe-2.0.2.gem

jrochkind-chf pkg (master ?) $ mkdir exploded

jrochkind-chf pkg (master ?) $ cp kithe-2.0.0.gem exploded/kithe-2.0.0.tar

jrochkind-chf pkg (master ?) $ cd exploded

jrochkind-chf exploded (master ?) $ tar -xvf kithe-2.0.0.tar
 x metadata.gz
 x data.tar.gz
 x checksums.yaml.gz

jrochkind-chf exploded (master ?) $  mkdir unpacked_data_tar

jrochkind-chf exploded (master ?) $ tar -xvf data.tar.gz -C unpacked_data_tar/

jrochkind-chf exploded (master ?) $ cd unpacked_data_tar/
/Users/jrochkind/code/kithe/pkg/exploded/unpacked_data_tar

jrochkind-chf unpacked_data_tar (master ?) $ du -sh *
 4.0K    MIT-LICENSE
  12K    README.md
 4.0K    Rakefile
 160K    app
 8.0K    config
  32K    db
 100K    lib
 300M    spec

jrochkind-chf unpacked_data_tar (master ?) $ cd spec

jrochkind-chf spec (master ?) $ du -sh *
 8.0K    derivative_transformers
 300M    dummy
  12K    factories
  24K    indexing
  72K    models
 4.0K    rails_helper.rb
  44K    shrine
  12K    simple_form_enhancements
 8.0K    spec_helper.rb
 188K    test_support
 4.0K    validators

jrochkind-chf spec (master ?) $ cd dummy/

jrochkind-chf dummy (master ?) $ du -sh *
 4.0K    Rakefile
  56K    app
  24K    bin
 124K    config
 4.0K    config.ru
 8.0K    db
 300M    log
 4.0K    package.json
  12K    public
 4.0K    tmp

Doh! In this particular gem, I have a dummy rails app, and it has 300MB of logs, because I haven’t bothered trimming them in a while, that are winding up included in the gem release package distributed to rubygems and downloaded by all consumers! Even if they were small, I don’t want these in the released gem package at all!

That’s not good! The package only turns into 12MB instead of 300MB, because log files are so compressible and there is compression involved in assembling the rubygems package. But I have no idea how much space it’s actually taking up on consuming applications’ machines. This is very irresponsible!

What controls what files are included in the gem package?

Your .gemspec file, of course. The line s.files = takes an array of every file to include in the gem package. Well, plus s.test_files is another array of more files that aren’t supposed to be necessary to run the gem, but are there to test it.

(Rubygems was set up to allow automated *testing* of gems after download, which is why test files are included in the release package. I am not sure how useful this is, or who, if anyone, does it; although I believe some linux distro packagers try to make use of it, for better or worse.)

But nobody wants to list every file in your gem individually, manually editing the array every time you add, remove, or move one. Fortunately, gemspec files are executable ruby code, so you can use ruby as a shortcut.

I have seen two main ways of doing this, with different “gem skeleton generators” taking one of two approaches.

Sometimes a shell-out to git is used — the idea is that everything you have checked into your git should be in the gem release package, no more, no less. For instance, one of my gems has this in it; I’m not sure where it came from or who/what generated it.

spec.files = `git ls-files -z`.split("\x0").reject do |f|
 f.match(%r{^(test|spec|features)/})
end

In that case, it wouldn’t have included anything in ./spec at all, so this obviously isn’t the gemspec from the gem we were looking at before.

But with this approach, in addition to letting you use ruby logic to manipulate the results, nothing excluded by your .gitignore file will end up included in your gem package. Great!

In kithe, the gem we were looking at before, those log files were in the .gitignore (they weren’t in my repo!), so if I had been using that git-shellout technique, they wouldn’t have been included in the gem release package.

But… I wasn’t. Instead this gem has a gemspec that looks like:

s.test_files = Dir["spec/*/"]

Just include every single file inside ./spec in the test_files list. Oops. Then I get all those log files!

One way to fix

I don’t really know which is to be preferred, the git-shellout approach or the dir-glob approach. I suspect it is the subject of historical religious wars in rubydom, from when there were still more people around to argue about such things. Any opinions? Or another approach?

Without being in the mood to restructure this gemspec in any way, I just did the simplest thing to keep those log files out…

Dir["spec/*/"].delete_if {|a| a =~ %r{/dummy/log/}}

Build the package without releasing using the handy bundler-supplied rake build task… and my gem release package size goes from 12MB to 64K. (Which actually kind of sounds like a minimum block size or something, right?)

Phew! That’s a big difference! Sorry for anyone using previous versions and winding up downloading all that cruft! (Actually this particular gem is mostly a proof of concept at this point and I don’t think anyone else is using it).

Check your gem sizes!

I’d be willing to bet there are lots of released gems with heavily bloated release packages like this. This isn’t the first one I’ve realized was my fault. Because who pays attention to gem sizes anyway? Apparently not many!

But rubygems does list them, so it’s pretty easy to see. Are your gem release packages multiple megs when there’s no good reason for them to be? Do they get bigger every release by far more than the bytes of code you think were added? At some point in the gem’s history, was there a big jump from hundreds of KB to multiple MB, when nothing in the gem’s actual logic changed to explain it?

All hints that you might be including things you didn’t mean to include, possibly things that grow each release.

You don’t need to have a dummy rails app in your repo to accidentally do this (I accidentally did it once with a gem that had nothing to do with rails). There could be other kinds of log files. Or test coverage or performance metric files, or any other artifacts of your build or your development, especially ones that grow over time — things that aren’t actually meant to be part of the gem release package!

It’s good to sanity check your gem release packages now and then. In most cases, your gem release package should be hundreds of KB at most, not MBs. Help keep your users’ installs and builds faster and slimmer!
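If you’d rather not un-tar things by hand like I did above, rubygems itself can read the package programmatically. Here’s a sketch (the helper is my own, name and all) that lists the files inside a built .gem with their uncompressed sizes, largest first:

```ruby
require "rubygems/package"
require "zlib"

# Return [path, bytes] pairs for every file inside a built .gem,
# largest first. A .gem is a tar archive containing data.tar.gz,
# which in turn holds the actual packaged files.
def gem_file_sizes(gem_path)
  sizes = {}
  File.open(gem_path, "rb") do |file|
    Gem::Package::TarReader.new(file) do |outer|
      outer.each do |entry|
        next unless entry.full_name == "data.tar.gz"
        Zlib::GzipReader.wrap(entry) do |gz|
          Gem::Package::TarReader.new(gz) do |inner|
            inner.each { |f| sizes[f.full_name] = f.header.size if f.file? }
          end
        end
      end
    end
  end
  sizes.sort_by { |_path, bytes| -bytes }
end

# gem_file_sizes("pkg/kithe-2.0.0.gem").first(10).each do |path, bytes|
#   puts format("%10d  %s", bytes, path)
# end
```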

Updating SolrCloud configuration in ruby

We have an app that uses Solr. We currently run a Solr in legacy “not cloud” mode. Our solr configuration directory is on disk on the Solr server, and it’s up to our processes to get our desired solr configuration there, and to update it when it changes.

We are in the process of moving to a Solr in “SolrCloud mode“, probably via the SearchStax managed Solr service. Our Solr “Cloud” might only have one node, but “SolrCloud mode” gives us access to additional API’s for managing our solr configuration, as opposed to writing it directly to disk (which may not be possible at all in SolrCloud mode? And certainly isn’t when using managed SearchStax).

That is, the Solr ConfigSets API, although you might also want to use a few pieces of the Collection Management API for associating a configset with a Solr collection.

Basically, you are taking your desired solr config directory, zipping it up, and uploading it to Solr as a “config set” [or “configset”] with a certain name. Then you can create collections using this config set, or reassign which named configset an existing collection uses.

I wasn’t able to find any existing ruby gems for interacting with these Solr API’s. RSolr is a “ruby client for interacting with solr”, but it was written before most of these administrative API’s existed for Solr, and doesn’t seem to have been updated to deal with them (unless I missed it); RSolr seems to be mostly/only about querying solr, plus some limited indexing.

But no worries, it’s not too hard to wrap the specific API I want to use in some ruby. That seemed far better to me than writing out the specific HTTP requests each time (and making sure you are dealing with errors etc!). (And yes, I will share the code with you.)
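To give a sense of what’s being wrapped: the ConfigSets API is just HTTP with JSON responses. A minimal sketch using only stdlib Net::HTTP — the helper name and error handling here are my own, and I’m glossing over the HTTP Basic Auth you’d need against a real SearchStax deployment:

```ruby
require "net/http"
require "uri"
require "json"

# Issue a Solr ConfigSets API action (LIST, DELETE, CREATE...) against
# a base solr url like "https://example.com/solr", returning parsed JSON.
def configset_action(solr_url, params)
  uri = URI("#{solr_url}/admin/configs")
  uri.query = URI.encode_www_form(params.merge(wt: "json", omitHeader: "true"))
  response = Net::HTTP.get_response(uri)
  body = begin
    JSON.parse(response.body)
  rescue JSON::ParserError
    {}
  end
  unless response.is_a?(Net::HTTPSuccess)
    # Solr reports errors several different ways; surface what we can
    raise(body.dig("error", "msg") || "Solr returned HTTP #{response.code}")
  end
  body
end

# configset_action(solr_url, action: "LIST")["configSets"]
# configset_action(solr_url, action: "DELETE", name: "oldConfigSet")
```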

I decided I wanted an object that was bound to a particular solr collection at a particular solr instance; and was backed by a particular local directory with solr config. That worked well for my use case, and I wound up with an API that looks like this:

updater = SolrConfigsetUpdater.new(
  solr_url: "https://example.com/solr",
  conf_dir: "./solr/conf",
  collection_name: "myCollection"
)

# will zip up ./solr/conf and upload it as a configset named myConfigset:
updater.upload("myConfigset")

updater.list #=> ["oldConfigSet", "myConfigset"]
updater.config_name # what configset name is myCollection currently configured to use?
# => "oldConfigSet"

# what if we try to delete the one it's using?
updater.delete("oldConfigSet")
# => raises SolrConfigsetUpdater::SolrError with message:
# "Can not delete ConfigSet as it is currently being used by collection [myCollection]"

# okay let's change it to use the new one and delete the old one

updater.change_config_name("myConfigset")
# now myCollection uses this new configset, although we possibly
# need to reload the collection to make that so
updater.reload
# now let's delete the one we're not using
updater.delete("oldConfigSet")

OK, great. There were some tricks involved in catching the apparently multiple ways Solr can report different kinds of errors, to make sure Solr-reported errors turn into exceptions, ideally with good error messages.

Now, in addition to uploading a configset initially for a collection you are creating to use, the main use case I have is wanting to UPDATE the configuration to new values in an existing collection. Sure, this often requires a reindex afterwards.

If you have the recently released Solr 8.7, it will let you overwrite an existing configset, so this can be done pretty easily:

updater.upload(updater.config_name, overwrite: true)
updater.reload

But prior to Solr 8.7, you can not overwrite an existing configset. And SearchStax doesn’t yet have Solr 8.7. So one way or another, we need to do a dance where we upload the configset under a new name, then switch the collection to use it.

Having this updater object that wraps the relevant Solr API lets us easily experiment with different logic flows for this. For instance, in a Solr listserv thread, Alex Halovnic suggests a somewhat complicated 8-step workaround, which we can implement like so:

current_name = updater.config_name
temp_name = "#{current_name}_temp"

updater.create(from: current_name, to: temp_name)
updater.change_config_name(temp_name)
updater.reload
updater.delete(current_name)
updater.upload(current_name)
updater.change_config_name(current_name)
updater.reload
updater.delete(temp_name)

That works. But talking to Dann Bohn at Penn State University, he shared a different algorithm, which goes like:

  • Make a cryptographic digest hash of the entire solr directory, which we’re going to use in the configset name.
  • Check if the collection is already using a configset named $name_$digest, which if it already is, you’re done, no change needed.
  • Otherwise, upload the configset with the fingerprint-based name, switch the collection to use it, reload, delete the configset that the collection used to use.

At first this seemed like overkill to me, but after thinking and experimenting with it, I like it! It is really quick to make a digest of a handful of files; that’s not a big deal. (I use the first 7 chars of a hex SHA256.) And even if we had Solr 8.7, I like that we can avoid doing any operation on solr at all if there have been no changes — I really want to use this operation much like a Rails db:migrate, running it on every deploy to make sure the solr schema matches the one in the repo for the deploy.
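The fingerprint step itself is cheap with stdlib Digest. Roughly, a sketch (the helper name is mine; this is the idea, not Dann’s or my exact code):

```ruby
require "digest"

# Fingerprint a solr config directory: digest each file's relative path
# and content, then digest the combined sorted list, keeping just the
# first 7 hex chars for use in a configset name.
def configset_digest(conf_dir)
  per_file = Dir.glob("#{conf_dir}/**/*").sort.filter_map do |path|
    next if File.directory?(path)
    relative = path.delete_prefix("#{conf_dir}/")
    "#{relative}:#{Digest::SHA256.file(path).hexdigest}"
  end
  Digest::SHA256.hexdigest(per_file.join("\n"))[0, 7]
end

# configset_name = "myCollection_#{configset_digest("./solr/conf")}"
# then: only upload/switch if updater.config_name != configset_name
```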

Dann also shared his open source code with me, which was helpful for seeing how to make the digest, how to make a Zip file in ruby, etc. Thanks Dann!

Sharing my code

So I also wrote some methods to implement those variant updating strategies: Dann’s, Alex Halovnic’s from the list, etc.

I thought about wrapping this all up as a gem, but I didn’t really have the time to make it good enough for that. My API is a little bit janky, and I didn’t spend the extra time thinking it out really well to minimize the need for future backwards-incompatible changes like I would if it were a gem. I also couldn’t figure out a great way to write automated tests for this that I would find particularly useful; so in my code base it’s actually not currently test-covered (shhhhh), but in a gem I’d want to solve that somehow.

But I did try to write the code general-purpose/flexible so other people could use it for their use cases; I tried to document it to my highest standards; and I put it all in one file, which actually might not be the best OO abstraction/design, but makes it easier for you to copy and paste the single file for your own use. :)

So you can find my code here; it is apache-licensed; and you are welcome to copy and paste it and do whatever you like with it, including making a gem yourself if you want. Maybe I’ll get around to making it a gem in the future myself, I dunno, curious if there’s interest.

The SearchStax proprietary API’s

SearchStax has its own API’s that I think can be used for updating configsets and setting collections to use certain configsets, etc. When I started exploring them, I found they aren’t the worst vendor API’s I’ve seen, but I did find them a bit cumbersome to work with. The auth system involves a lot of steps (why can’t you just create an API key from the SearchStax web GUI?).

Overall I found them harder to use than the standard SolrCloud API’s, which worked fine in the SearchStax deployment, and have the added bonus of being transferable to any SolrCloud deployment instead of being SearchStax-specific. While the SearchStax docs and support try to steer you to the SearchStax-specific API’s, I don’t think there’s really any good reason for this. (Perhaps the custom SearchStax API’s were written long ago, when Solr’s own API’s weren’t as complete?)

SearchStax support suggested that the SearchStax APIs were somehow more secure; but my SearchStax Solr API’s are protected behind HTTP basic auth, and if I’ve created basic auth credentials (or an IP addr allowlist), those API’s will be available to anyone with auth to access Solr whether I use them or not! Support also suggested that SearchStax API use would be logged, whereas my direct Solr API use would not be, which seems to be true at least in the default setup; I can probably configure solr logging differently, but it just isn’t that important to me for these particular functions.

So after some initial exploration with the SearchStax API, I realized that the SolrCloud API (which I had never used before) could do everything I need and was more straightforward and transferable to use, and I’m happy with my decision to go with that.