Exploring hosting video in a non-profit archival digital collections web app

We are a small independent academic institution that hosts a “digital collections” site of digitized historical materials on the web, using a custom in-house Rails app.

We’re getting into including some historical video content for the first time, and since I didn’t have much experience with video, I had to figure out a few things, which I’m going to document here. (Thanks to many folks from the Code4Lib and Samvera communities who answered my questions, including mbklein, cjcolvar, and david.schober).

Some more things about us and our content:

  • Our content at least initially will be mostly digitized VHS (480p, fairly low-visual-quality content), although we could also eventually have some digitized 16mm film and other content that could be HD.
  • Our app is entirely cloud-hosted, mainly on heroku and AWS S3. We don’t manage any of our own servers at the OS level (say an EC2), and don’t want to except as a last resort (we don’t really have the in-house capacity to).

Our usage patterns are not necessarily typical for a commercial application! We have (or at least eventually will have) a lot of fairly low-res/low-bitrate stuff (old VHSs!); it’s unclear how much will get viewed how often (a library/archives probably stores a lot more content, as a proportion of content viewed, than a commercial operation); and our revenue doesn’t increase as our content accesses do! So, warning to the general reader, our lens on things may or may not match a commercial enterprise’s. But if you are one of our peers in cultural heritage, you’re thinking, “I know, right?”

All of these notes are as of February 2022; if you are reading far in the future, some things may have changed. Also, I may have some things wrong, this is just what I have figured out, corrections welcome.

Standard Video Format: MP4 (h.264) with html5 video

I originally had in my head that maybe you needed to provide multiple video formats to hit all browsers — but it seems you don’t really anymore.

An MP4 using H.264 codec and AAC audio can be displayed in pretty much any browser of note with an html5 video tag.

There are newer/better/alternate video formats. Some terms (I’m not totally sure which of these are containers and which are codecs, or which containers can go with which codecs!) include: WebM, Ogg, Theora, H.265, VP8, and VP9. Some of these are “better” in various ways (whether better licenses or better quality video at the same filesize) — but none are yet supported across all common browsers.

You could choose to provide alternate formats so browsers that can use one of the newer formats will — but I think the days of needing to provide multiple formats to satisfy 99%+ of possible consumers are gone. (Definitely forget about Flash.)

The vast majority of our cultural heritage peers I could find are using MP4. (Except for the minority using an adaptive bitrate streaming format…).

Another thing to know about H.264-MP4 is that even at the same screen size/resolution, the bitrate, and therefore the filesize (per second of footage), can vary quite a bit between files. This is because of the (lossy) compression in H.264. Some original source material may just be more compressible than others — not sure how much this comes into play. What definitely comes into play is that when encoding, you can choose to balance a higher compression level against higher quality (similar to JPEG). Most encoding software lets you either set a maximum bitrate (sometimes using either a variable-bit-rate (VBR) or constant-bit-rate (CBR) algorithm), OR choose a “quality level” on some scale, in which case the encoder will compress differently from second to second of footage, at whatever level will give you that level of “quality”.
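For example, with ffmpeg’s H.264 encoder, the two approaches look roughly like this (filenames and numbers are just placeholders, not a recommendation):

# quality-targeted ("constant rate factor"): quality stays roughly constant, bitrate varies
ffmpeg -i source.mov -c:v libx264 -crf 23 -preset medium -c:a aac -b:a 128k out-crf.mp4

# bitrate-capped: tell the encoder roughly what bitrate to aim for and not exceed
ffmpeg -i source.mov -c:v libx264 -b:v 1500k -maxrate 1500k -bufsize 3000k -c:a aac -b:a 128k out-1500k.mp4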

An Aside: Tools: ffmpeg and ffprobe

ffmpeg (and its accompanying ffprobe analyzer) are amazing open source tools. They do almost anything. They are at the heart of most other video processing software.

ffprobe is invaluable for figuring out “what do I have here” if say given an mp4 file.

One thing I didn’t notice at first that is really neat is that both ffmpeg and ffprobe can take a URL as an input argument just about anywhere they can take a local file path. This can be very convenient if your video source material is, say, on S3. You don’t need to download it before feeding to ffmpeg or ffprobe, you can just give them a URL, leading to faster more efficient operations as ffmpeg/ffprobe take care of downloading only what they need as they need it. (ffprobe characterization can often happen from only the first portion of the file, for instance).
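For example, something like this will characterize a file sitting on S3 without downloading the whole thing first (the URL here is a made-up placeholder):

ffprobe -v error -print_format json -show_format -show_streams \
  "https://my-bucket.s3.amazonaws.com/originals/vhs-tape-01.mp4"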

IF you need an adaptive bitrate streaming format: HLS will do

So then there are “streaming” formats, the two most popular being HLS or MPEG-DASH. While required if you are doing live-streaming, you also can use them for pre-recorded video (which is often called “video-on-demand” (VOD) in internet discussions/documentation, coming out of terminology used in commercial applications).

The main reason to do this is for the “adaptive bitrate” features. You can provide variants at various bitrates, and someone on a slow network (maybe a 3G cellphone) can still watch your video, just at lower resolution/visual quality. An adaptive bitrate streaming format can even change the resolution/quality in mid-view, if network conditions change. Perhaps you’ve noticed this watching commercial video over the internet.

Just like with static video, there seems to be one format which is supported in (nearly) all current browsers (not IE though): HLS. (The video inside an HLS stream is typically still H.264, just segmented and repackaged for streaming delivery.) While the caniuse chart makes it look like only a few browsers support HLS natively, in fact the other browsers can play it via their support of Media Source Extensions (also called the Media Source API). Javascript code can be used to add HLS support to a browser via Media Source Extensions, and if you use a Javascript viewer like video.js or MediaElement.js, this just happens for you transparently. This is one reason people may be using such players instead of raw html5 <video> tags alone. (I plan to use video.js.)

Analogously to MP4, there are other adaptive bitrate streaming formats too, that may be superior in various ways to HLS, but don’t have as wide support. Like MPEG-DASH. At the moment, you can reach pretty much any browser in use with HLS, and of community/industry peers I found using an adaptive bitrate streaming format, HLS was the main one in use, usually presented without alternative sources.

But do we even need it?

At first I thought that one point of HLS was allowing better/more efficient seeking: when the viewer wants to jump to the middle of the video, you don’t want to force the browser to download the whole video. But in fact, after more research, I believe current browsers can seek in a static MP4 just fine using HTTP byte-range requests (it’s important that the file is on a server that supports such requests!), so long as the MP4 was prepared with what ffmpeg calls faststart. (This is important! You can verify whether a video is properly prepared by running mediainfo on it and looking for the IsStreamable line in the output; or by jumping through some hoops with ffmpeg.)
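If you have an MP4 that wasn’t prepared that way, ffmpeg can remux it (no re-encoding) to move the index to the front; a minimal sketch:

ffmpeg -i input.mp4 -c copy -movflags +faststart output-faststart.mp4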

It’s a bit hacky to do seeking with just http-hosted mp4s, but user-agents seem to have gotten pretty good at it. You may want to be sure any HTTP caching layer you are using can appropriately pass through and cache HTTP byte-range requests.

So, as far as I know, that leaves the actual dynamic adaptive bitrate feature as the reason to use HLS. But you need that only if the MP4s you’d be using otherwise are at a higher bitrate than the lowest bitrate you’d be preparing for inclusion in your HLS bundle. Like, if your original MP4 is only 500 kbps, and that’s the lowest bitrate you’d be including in your HLS package anyway… your original MP4 is already viewable on as slow a connection as your HLS preparation would be.

What is the lowest bitrate typically included as an option in HLS? I’ve found it hard to find advice on what variants to prepare for HLS distribution! In Avalon’s default configuration for creating HLS with ffmpeg, the lowest-bitrate variant is 480p at 500 kbps. For more comparisons: if you turn on simulation of a slow connection in Chrome Dev Tools, Chrome says “Slow 3G” is about 400 kbps, and “Fast 3G” is about 1500 kbps. The lowest bitrate offered by AWS MediaConvert’s HLS presets is 400 kbps (for 270p or 360p) or 600 kbps at 480p.

(In some cases these bitrates may be video-only; audio can add another 100-200kbps, depending on what quality you encode it at.)

I think if your original MP4 is around 500 kbps, there’s really no reason at all to use HLS. As it approaches 1500 kbps (1.5 Mbps)… you could consider creating HLS with a 500kbps variant, but also probably get away with serving most users adequately at the original bitrate (sorry to the slower end of cell phone network). As you start approaching 3Mbps, I’d start considering HLS, and if you have HD 720p or 1080p content (let alone 4K!) which can get up to 6 Mbps bitrates and even much higher — I think you’d probably be leaving users on not-the-fastest connections very frustrated without HLS.

This is me doing a lot of guessing, cause I haven’t found a lot of clear guidance on this!

As our originals are digitizations of VHS (and old, sometimes degraded VHS at that), and started out pretty low-quality, our initial bitrates are fairly low. In one of our sample collections, the bitrates were around 1300 kbps — I’d say we probably don’t need HLS? Some of our digitized film was digitized in SD at around 2300 kbps — meh? But we had a couple films digitized at 1440p and 10 Mbps — okay, for those we probably want to either downsample the access MP4, or use HLS.

MOST of our cultural heritage peers do not yet seem to be using HLS. In a sampling of videos found on DPLA, almost all were being delivered as MP4 (and usually fairly low-quality videos at under 2 Mbps, so that’s fine!). However, most of our Samvera peers using Avalon are probably using HLS.

So how do you create and serve HLS?

Once you create it, it’s actually just static files on disk, you can serve with a plain old static HTTP web server. You don’t need any kind of fancy media server to serve HLS! (Another misconception I started out not being sure of, that maybe used to be different, in the days when “RTP” and such were your only adaptive streaming options). An HLS copy on disk is one (or usually several) .m3u8 manifest files, and a lot of .ts files with chunked data referenced by the manifests.

You can (of course) use ffmpeg to create HLS. A lot of people do that happily, but it doesn’t work well for our cloud deployment — creating HLS takes too long for us to want to do it in a Rails background job on a heroku worker dyno, and we don’t want to be in the business of running our own servers otherwise.
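For reference, here is a single-rendition sketch of what that ffmpeg invocation can look like, borrowing the 480p/500 kbps low-end figures mentioned above (a real setup would produce several renditions and a master playlist tying them together):

ffmpeg -i source.mp4 \
  -vf scale=-2:480 -c:v libx264 -b:v 500k -maxrate 500k -bufsize 1000k \
  -c:a aac -b:a 96k \
  -hls_time 6 -hls_playlist_type vod \
  -hls_segment_filename "video_480p_%04d.ts" \
  video_480p.m3u8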

Another thing some of our peers use is the wowza media server. We didn’t really look at that seriously either; I think our peers using it are at universities with enormous campus-wide site licenses for running it on-premises (which we don’t want to do). There might be a cloud-hosted SaaS version… but I just mention this product for completeness in case you are interested; it looked too “enterprisey” for our needs, and we didn’t significantly investigate.

The solutions we found and looked at that fit into the needs we had for our cloud-hosted application were:

  • Use AWS Elemental MediaConvert to create the HLS variants, host on S3 and serve from there (probably with CloudFront, paying AWS data egress fees for video watched). This is sort of like a cloud alternative to ffmpeg: you tell it exactly what HLS variants you want.
  • AWS Elemental MediaPackage ends up working more like a “video server” — you just give it your originals and get back an HLS URL; you leave the characteristics of the HLS variants up to the black box, and it creates them apparently on-the-fly as needed. You don’t pay for storage of any HLS variants, but do pay a non-trivial fee for every minute of video processed (potentially multiple times, if it expires from cache and is watched again) on top of the AWS egress fees.
  • CloudFlare Stream is basically CloudFlare’s version of MediaPackage. They charge by second of footage instead of by byte (for both storage and the equivalent of egress bandwidth), and it’s not cheap… whether it’s cheaper or more expensive than MediaPackage can depend on the bitrate of your material, and the usage patterns of viewing/storing. For our low-bitrate, unpredictable-usage patterns, it looks to me likely to end up more expensive? But not sure.
  • Just upload them all to youtube and/or vimeo, and serve from there? Kind of crazy, but it just might work? The videos could still be playing on our web pages, but they’d actually be pointing at youtube or vimeo hosting the video…. While this has the attraction of being the only SaaS/PaaS solution I know of that doesn’t have metered bandwidth (you don’t pay per minute viewed)… there are some limitations too. I couldn’t really find any peers doing this with historical cultural heritage materials.

I’m going to talk about each of these in somewhat more detail below, especially with regard to costs.

First a word on cost estimating of video streaming

One of our biggest concerns with beginning to include video in digital collections is cost. Especially because, serving out of S3/Cloudfront, or most (all?) other cloud options, we pay data egress costs for every minute of video viewed. As a non-profit educational institution, the more the video gets viewed, the more we’re achieving our mission — but our revenue doesn’t go up with minutes viewed, and it can really add up.

So that’s one of the things we’re most interested in comparing between different HLS options.

But comparing it requires guessing at a bunch of metrics. Some that are easier to guess are: how many hours of video we’ll be storing, and how many hours we’ll be ingesting per month. Some services charge in bytes, some in hours; to convert from hours to bytes requires us to guess our average bitrate, which we can take a reasonable stab at. (Our digitized VHS is going to be fairly low quality, maybe 1500-2500 kbps.)
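For instance, the rough conversion we’re using (the 2,000 kbps figure is just a midpoint of that guess):

# rough hours-to-bytes conversion, assuming a guessed average bitrate
bitrate_kbps = 2_000                                  # midpoint guess for our digitized VHS
gb_per_hour  = bitrate_kbps * 3600 / 8.0 / 1_000_000  # => 0.9
# so 100 hours of content is on the order of 90 GB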

Then there are the numbers that are a lot harder to guess — how many hours of video will be viewed a month? 100 hours? 1000? 10,000? I am not really sure what to expect, it’s the least under our control, it could grow almost unboundedly and cost us a lot of money. Similarly, if we offer full-file downloads, how many GB of video files will be downloaded a month?

Well, I made some guesses, and I made a spreadsheet that tried to estimate costs of various platforms under various scenarios. (There are also probably assumptions I don’t even realize I’m making, not reflected in the spreadsheet, that will affect costs!) Our initial estimates are pretty small for a typical enterprise: just hosting maybe 100 hours of video, maybe 200 hours viewed a month? Low-res VHS digitized material. (Our budget is also fairly sensitive to what would be very small amounts in a commercial enterprise!)

You can see/copy the spreadsheet here, and I’ll put a few words about each below.

Serve MP4 files from S3

The base case! Just serve plain MP4 files from S3 (probably with CloudFront in front). Sufficient if our MP4 bitrates are 500 kbps, maybe up to around 1.5 Mbps.

Our current app architecture actually keeps three copies of all data — production, a backup, and a sync’d staging environment. So that’s some S3 storage charges, initially estimated at just full-cost standard S3. There are essentially no “ingest” costs (some nominal cost to replicate production to our cross-region backup).

Then there are the standard AWS data egress costs — CloudFront is not actually that different from standard S3, until you get into trying to do bulk reserved purchases or something, but we’ll just estimate at the standard rate.

The storage costs will probably be included in any other solution too, since we’ll probably still keep our “canonical” cop(ies) on S3 regardless.

HLS via AWS Elemental MediaConvert

AWS Elemental MediaConvert is basically a transcoding service — think of it like ffmpeg transcoding but AWS-hosted SaaS. Your source needs to be on S3 (well, technically it can be a public URL elsewhere), you decide what variants you want to create, they are written to an S3 bucket.

Bandwidth costs are exactly the same as our base MP4 S3 case, since we’d still be serving from CloudFront — so it essentially scales up with traffic exactly the same as our base case, which is nice. (Hypothetically it could be somewhat less bandwidth, depending on how many users receive lower-bitrate variants via HLS, but we just estimated that everyone would get the high-quality one, as an upper bound.)

We pay a bit more for storage (have to store the HLS derivatives, just standard S3 prices).

Then we pay an ingest cost to create the HLS, which is actually charged per minute (rather than per GB) — for SD, if we avoid “professional tier” features and stay at no more than 30 fps, it’s $0.0075 per minute of produced video (meaning the more HLS variants you create, the more you pay).
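As a back-of-envelope example (the number of renditions is a made-up assumption; the rate is the basic-tier SD figure above):

# one-time HLS creation cost via MediaConvert, basic tier, SD, <= 30 fps
hours_of_source   = 100      # our initial estimate of content to process
variants_per_item = 3        # hypothetical: how many HLS renditions we'd produce
cost_dollars = hours_of_source * 60 * variants_per_item * 0.0075
# => 135.0, paid once per hour of source processed, not per view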

Since we will probably be digitizing a fairly small amount of video per month, and there is no premium over standard S3 for bandwidth (our least predictable cost), this may be a safe option?

It also helps that we know of at least one peer using AWS Elemental MediaConvert (Northwestern University Libraries).

AWS Elemental MediaPackage

MediaPackage is basically an AWS media server offering. You can use it for live streaming, but we’re looking at the “Video on Demand” use case and pricing only. You just give it a source video, and it creates an HLS URL for it. (Among other possible streaming formats; HLS is the one we care about). You don’t (and I think can’t) tell it what bitrates/variants to create in the HLS stream, it just does what it thinks best. On the fly/on-demand, I think?

The pricing model includes a fairly expensive $0.05/GB packaging fee — that is, a fee to create the stream. (Why not per-minute like MediaConvert? I don’t know.) This is charged on-demand: not until someone tries to watch a video. If multiple people are watching the same video at the same time, you’ll only pay the packaging fee once, as it’ll be cached in CloudFront. But I don’t know exactly how long it will remain cached in CloudFront — and I don’t know how to predict my viewers’ usage patterns anyway: how much they’ll end up watching the same videos and taking advantage of the cache, how many views will result in packaging fees vs cached views.

So taking a worst-case estimate of zero cache utilization, MediaPackage basically adds a 50% premium to our bandwidth costs. These being our least predictable and unbounded costs, this is a bit risky — if we have a lot of viewed minutes that don’t cache well, this could end up being much more expensive than the MediaConvert approach.

But, you don’t pay any storage fees for the HLS derivatives at all. If you had a large-stored-volume and relatively small viewed-minutes, MediaPackage could easily end up cheaper than doing it yourself with MediaConvert (as you can see in our spreadsheet). Plus, there’s just less to manage or code, you really just give it an S3 source URL, and get back an HLS URL, the end.

CloudFlare Stream

CloudFlare Stream is basically CloudFlare’s alternative to MediaPackage. Similarly, it can be used for livestreaming, but we’re just looking at it for “video on demand”. Similarly, you basically just give it video and get back an HLS URL (or possibly other streaming formats), without specifying the details.

The big difference is that CloudFlare meters per minute instead of per GB: per minute for storage of a copy of the “originals” in the Stream system, which is required (and we’d end up storing an extra copy in CloudFlare, since we’re still going to want our canonical copy in AWS; I don’t know if we could do everything we’d need/want with a copy only in CloudFlare Stream), and per minute for bandwidth served from Stream too. (There is no ingest/packaging fee, and no cost for storage of derived HLS.)

Since they charge per minute, how competitive it is really depends on your average bitrate: the higher your average bitrate, the better a deal CloudFlare is compared to AWS! At an average bitrate of more than 1500 kbps, the CloudFlare bandwidth cost starts beating AWS — at 10 Mbps HD, it’s going to really beat it. But we’re looking at relatively low-quality SD under 1500 kbps, so.
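To see why the bitrate matters so much when comparing per-GB and per-minute pricing, here is the conversion by itself (just arithmetic; no prices assumed):

# how many GB one minute of footage is at a given average bitrate;
# AWS bills on the GB, CloudFlare Stream bills on the minute itself
def gb_per_minute(kbps)
  kbps * 60 / 8.0 / 1_000_000
end

gb_per_minute(1_500)   # => ~0.011 GB/min for SD content like ours
gb_per_minute(10_000)  # => ~0.075 GB/min for 10 Mbps HD: per-GB charges grow ~7x,
                       #    while a per-minute charge stays the same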

Whether CloudFlare Stream is more or less expensive than one of the AWS approaches is going to depend not only on bandwidth, but on your usage patterns (how much are you storing, how much are you ingesting a month) — from a bit more expensive to, in the right circumstances, a lot less expensive.

CloudFlare Stream has pretty much the convenience of AWS MediaPackage, except that we need to deal with getting a separate copy of originals into CloudFlare, with prepaid storage limits (you need to kind of sign up for what storage limit you want). Which is actually kind of inconvenient.

What if we used YouTube or Vimeo though?

What if we host our videos on YouTube or Vimeo, and deliver them from there? Basically use them as a cloud hosted video server? I haven’t actually found any peers doing this with historical/cultural heritage/archival materials. But the obvious attraction is that these services don’t meter bandwidth, we could get out of paying higher egress/bandwidth as viewing usage goes up — our least predictable and potentially otherwise largest budget component.

The idea is that this would be basically invisible to the end-user, they’d still be looking at our digital collections app and an embedded viewer; ideally the items would not be findable in youtube/vimeo platform search or on google to a youtube/vimeo page. It would also be mostly invisible to content management staff, they’d ingest into our Digital Collections system same as ever, and our software would add to vimeo and get referencing links via vimeo API.

We’d just be using youtube or vimeo as a video server platform, really not too different from how one uses AWS MediaPackage or Cloudflare Stream.

Youtube is completely free, but, well, it’s youtube. It’s probably missing features we might want (I don’t think you can get a direct HLS link), has unclear/undocumented limits or risk of having your account terminated, you get what you pay for as far as support, etc.

Vimeo is more appealing. Features (some possibly only available in the “Pro” account and above) seem to include (I am still at the beginning of my investigation):

  • HLS URLs we could use with whatever viewers we wanted, same viewers we’d be using with any other HLS URL.
    • Also note direct video download links, if we want, so we can avoid bandwidth/egress charges on downloads too!
  • Support for high-res videos, no problem, all the way up to 4K/HDR. (Although we don’t need that for this phase of VHS digitization)
  • “team” accounts where multiple staff accounts can have access to Vimeo management of our content. (Only 3 accounts on $20/month “pro”, 10 on “Business” and higher)
  • Unlisted/private settings that should keep our videos off of any vimeo searches or google. Even a “Hide from Vimeo” setting where the video cannot be viewed on a vimeo page at all, but only as embedded (say via HLS)!
    • One issue, though, is that the HLS and video download links we do have probably won’t be signed/expiring — once someone has it, they have it and can share it (until/unless you delete the content from vimeo). This is probably fine for our public content use cases, but worth keeping in mind.
  • An API that looks reasonable and full-featured.

Vimeo storage/ingest limits

The way Vimeo does price tiers/metering is a bit odd, especially at the $20/month “Pro” level. It’s listed as being limited to ingesting 20GB/week, and 1TB/year. But I guess it can hold as much content as you want, as long as you ingest it at no more than 20GB/week? Do I understand that right? For our relatively low-res SD content, let’s say at a 3 Mbps bitrate — 20GB/week is about 15 hours/week. At our current planned capacity, we wouldn’t be ingesting more than that as a rule, although it’s a bit annoying that if we did, as an unusual spike, our software would have to handle the weekly rate limit.

At higher plans, the Vimeo limits are total storage rather than weekly ingest. The “Business” plan at $50/month has 5TB total storage. At a 3 Mbps bitrate, that’s around 3700 hours of content. At our current optimistic planned ingest capacity, it would take us over 10 years to fill that up. If it were HD content at 10 Mbps, 5TB is around 1100 hours of content, which we might reach in 4 or 5 years at our current planned ingest rate.

The “Premium” plan lets you bump that up to 7TB for $75/month.

It’s certainly conceivable we could reach those limits — and even sooner if we increase our digitization/ingest capacity beyond what we’re starting with. I imagine at that point we’d have to talk to them about a custom “enterprise” plan, and hope they can work out something reasonable for a non-profit academic institution that just needs expanded storage limits.

I imagine we’d write our software so it could serve straight MP4 if the file wasn’t (yet?) in Vimeo, but would just use vimeo HLS (and download link?) URLs if it was.

It’s possible there will be additional unforeseen limitations or barriers once we get into an investigatory implementation, but this seems worth investigating.

So?

Our initial implementation may just go with static MP4 files, for our relatively low-bitrate SD content.

When we are ready to explore streaming (which could be soon after MVP), I think we’d probably explore Vimeo first? If not Vimeo… AWS MediaConvert is more “charted territory,” as we have cooperating peers who have used it… but the possibility of MediaPackage or CloudFlare Stream being cheaper under some usage patterns is interesting (and they are possibly somewhat simpler to implement). However, their risk of being much more expensive under other usage patterns may be too much for us. Predictability of budget is really high-value in the non-profit world, and that is a large part of our budgeting challenge here: the unpredictability of costs when increased usage means increased costs due to metered bandwidth/egress.

Finding source install location of a loaded ruby gem at runtime

Not from the command line, but from within a running ruby program that has gems loaded… how do you determine the source install location of such a loaded gem?

I found it a bit difficult to find docs on this, so documenting for myself now that I’ve figured it out. Eg:

Gem.loaded_specs['device_detector'].full_gem_path
# => "/Users/jrochkind/.gem/ruby/2.7.5/gems/device_detector-1.0.5"
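And since Gem.loaded_specs is just a Hash of gem name to Gem::Specification for everything that has been loaded, you can also dump install locations for all loaded gems, e.g.:

Gem.loaded_specs.each do |name, spec|
  puts "#{name}: #{spec.full_gem_path}"
end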

Blacklight: Automatic retries of failed Solr requests

Sometimes my Blacklight app makes a request to Solr and it fails in a temporary/intermittent way.

  • Maybe there was a temporary network interruption, resulting in a failed connection or timeout
  • Maybe Solr was overloaded and being slow, and timed out
    • (Warning: Blacklight by default sets no timeouts, and is willing to wait forever for Solr, which you probably don’t want. How to set a timeout is under-documented, but set a read_timeout: key in your blacklight.yml to a number of seconds; or if you have RSolr 2.4.0+, you can set the key timeout instead. Both do the same thing: pass the value as timeout to an underlying faraday client.)
  • Maybe someone restarted the Solr being used live, which is not a good idea if you’re going for zero downtime, but maybe you aren’t that ambitious, or if you’re me, maybe your SaaS solr provider did it without telling you, to resolve the Log4Shell bug.
    • And btw, if this happens, can appear as a series of connection refused, 503 responses, and 404 responses, for maybe a second or three.
  • (By the way, also note well: your blacklight app may be encountering these without you knowing, even if you think you are monitoring errors. Blacklight by default will take pretty much all Solr errors, including timeouts, and rescue them, responding with an HTTP 200 status page with a message “Sorry, I don’t understand your search.” And HoneyBadger or other error monitoring you may be using will probably never know. Which I think is broken and would like to fix, but have been having trouble getting consensus and PR reviews to do so. You can fix it with some code locally, but that’s a separate topic, ANYWAY…)

So I said to myself, self, is there any way we could get Blacklight to automatically retry these sorts of temporary/intermittent failures, maybe once or twice, maybe after a delay? So there would be fewer errors presented to users (and fewer errors alerting me, after I fixed Blacklight to alert on ’em), in exchange for some users in those temporary error conditions waiting a bit longer for a page?

Blacklight talks to Solr via RSolr — it can use 1.x or 2.x — and RSolr, if you’re using 2.x, uses faraday for its Solr http connections. So one nice way might be to configure the Blacklight/RSolr faraday connection with the faraday retry middleware. (1.x rubydoc). (Moved into its own gem in the recently released faraday 2.0.)

Can you configure custom faraday middleware for the Blacklight faraday client? Yeesss…. but it requires making and configuring a custom Blacklight::Solr::Repository class, most conveniently by sub-classing the Blacklight class and overriding a private method. :( But it seems to work out quite well after you jump through some slightly kludgey hoops! Details below.

Questions for the Blacklight/Rsolr community:

  • Is this actually safe/forwards-compatible/supported, to be sub-classing Blacklight::Solr::Repository and over-riding build_connection with a call to super? Is this a bad idea?
  • Should Blacklight have its own supported and more targeted API for supplying custom faraday middleware generally (there are lots of ways this might be useful), or setting automatic retries specifically? I’d PR it, if there was some agreement about what it should look like and some chance of it getting reviewed/merged.
  • Is there anyone, anyone at all, who is interested in giving me emotional/political/sounding-board/code-review support for improving Blacklight’s error handling so it doesn’t swallow all connection/timeout/permanent configuration errors by returning an http 200 and telling the user “Sorry, I don’t understand your search”?

Oops, this may break in Faraday 2?

I haven’t actually tested this on the just-released Faraday 2.0, that was released right after I finished working on this. :( If faraday changes something that makes this approach infeasible, that might be added motivation to make Blacklight just have an API for customizing faraday middleware without having to hack into it like this.

The code for automatic retries in Blacklight 7

(and probably many other versions, but tested in Blacklight 7).

Here’s my whole local pull request if you find that more convenient, but I’ll also walk you through it a bit below and paste in frozen code.

There were some tricks to figuring out how to access and change the middleware on the existing faraday client returned by the super call; and how to remove the already-configured Blacklight middleware that would otherwise interfere with what we wanted to do (including an existing use of the retry middleware that I think is configured in a way that isn’t very useful or as intended). But overall it works out pretty well.

I’m having it retry timeouts, connection failures, 404 responses, and any 5xx response. Nothing else. (For instance it won’t retry on a 400 which generally indicates an actual request error of some kind that isn’t going to have any different result on retry).

I’m at least for now having it retry twice, waiting a fairly generous 300ms before the first retry, then another 600ms before a second retry if needed. Hey, my app can be slow, so it goes.

Extensively annotated:

# ./lib/scihist/blacklight_solr_repository.rb
module Scihist
  # Custom sub-class of stock blacklight, to override build_connection
  # to provide custom faraday middleware for HTTP retries.
  #
  # This may not be a totally safe forwards-compat Blacklight API
  # thing to do, but the only/best way we could find to add in
  # Solr retries.
  class BlacklightSolrRepository < Blacklight::Solr::Repository
    # this is really only here for use in testing, skip the wait in tests
    class_attribute :zero_interval_retry, default: false

    # call super, but then mutate the faraday_connection on
    # the returned RSolr 2.x+ client, to customize the middleware
    # and add retry.
    def build_connection(*_args, **_kwargs)
      super.tap do |rsolr_client|
        faraday_connection = rsolr_client.connection

        # remove if already present, so we can add our own
        faraday_connection.builder.delete(Faraday::Request::Retry)

        # remove so we can make sure it's there AND added AFTER our
        # retry, so our retry can successfully catch its exceptions
        faraday_connection.builder.delete(Faraday::Response::RaiseError)

        # add retry middleware with our own configuration
        # https://github.com/lostisland/faraday/blob/main/docs/middleware/request/retry.md
        #
        # Retry at most twice, once after 300ms, then if needed after
        # another 600ms (backoff_factor set to result in that).
        # Slow, but the idea is slow is better than an error, and our
        # app is already kinda slow.
        #
        # Retry not only the default Faraday exception classes (including timeouts),
        # but also Solr returning a 404 or 5xx. Which gets converted to a
        # Faraday error because RSolr includes raise_error middleware already.
        #
        # Log retries. I wonder if there's a way to have us alerted if
        # there are more than X in some time window Y…
        faraday_connection.request :retry, {
          interval: (zero_interval_retry ? 0 : 0.300),
          # exponential backoff 2 means: 1) 0.300; 2) 0.600; 3) 1.2; 4) 2.4
          backoff_factor: 2,
          # But we only allow the first two before giving up.
          max: 2,
          exceptions: [
            # default faraday retry exceptions
            Errno::ETIMEDOUT,
            Timeout::Error,
            Faraday::TimeoutError,
            Faraday::RetriableResponse, # important to include when overriding!
            # we add some that could be Solr/jetty restarts, based
            # on our observations:
            Faraday::ConnectionFailed, # nothing listening there at all,
            Faraday::ResourceNotFound, # HTTP 404
            Faraday::ServerError # any HTTP 5xx
          ],
          retry_block: -> (env, options, retries_remaining, exc) do
            Rails.logger.warn("Retrying Solr request: HTTP #{env["status"]}: #{exc.class}: retry #{options.max - retries_remaining}")
            # other things we could log include `env.url` and `env.response.body`
          end
        }

        # important to add this AFTER retry, to make sure retry can
        # rescue and retry its errors
        faraday_connection.response :raise_error
      end
    end
  end
end

Then in my local CatalogController config block, nothing more than:

config.repository_class = Scihist::BlacklightSolrRepository

I had some challenges figuring out how to test this. I ended up testing against a live running Solr instance, which my app’s test suite does sometimes (via solr_wrapper, for better or worse).

One test is just a simple smoke test that this thing seems to still function properly as a Blacklight::Solr::Repository without raising. And then a couple that stub Solr error responses to check the retry behavior:

require "rails_helper"

describe Scihist::BlacklightSolrRepository do
  # a way to get a configured repository class…
  let(:repository) do
    Scihist::BlacklightSolrRepository.new(CatalogController.blacklight_config).tap do |repo|
      # if we are testing retries, don't actually wait between them
      repo.zero_interval_retry = true
    end
  end

  # A simple smoke test against live solr hoping to be a basic test that the
  # thing works like a Blacklight::Solr::Repository, our customization attempt
  # hopefully didn't break it.
  describe "ordinary behavior smoke test", solr: true do
    before do
      create(:public_work).update_index
    end

    it "can return results" do
      response = repository.search
      expect(response).to be_kind_of(Blacklight::Solr::Response)
      expect(response.documents).to be_present
    end
  end

  # We're actually going to use webmock to try to mock some error conditions
  # to actually test retry behavior, not going to use live solr.
  describe "retry behavior", solr: true do
    let(:solr_select_url_regex) { /^#{Regexp.escape(ScihistDigicoll::Env.lookup!(:solr_url) + "/select")}/ }

    describe "with solr 400 response" do
      before do
        stub_request(:any, solr_select_url_regex).to_return(status: 400, body: "error")
      end

      it "does not retry" do
        expect {
          response = repository.search
        }.to raise_error(Blacklight::Exceptions::InvalidRequest)

        expect(WebMock).to have_requested(:any, solr_select_url_regex).once
      end
    end

    describe "with solr 404 response" do
      before do
        stub_request(:any, solr_select_url_regex).to_return(status: 404, body: "error")
      end

      it "retries twice" do
        expect {
          response = repository.search
        }.to raise_error(Blacklight::Exceptions::InvalidRequest)

        expect(WebMock).to have_requested(:any, solr_select_url_regex).times(3)
      end
    end
  end
end

Github Action setup-ruby needs to quote ‘3.0’ or will end up with ruby 3.1

You may be running builds in Github Actions using the setup-ruby action to install a chosen version of ruby, looking something like this:

    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: 3.0

A week ago, that would have installed the latest ruby 3.0.x. But as of the Christmas release of ruby 3.1, it will install the latest ruby 3.1.x.

The workaround and/or correction is to quote the ruby version number. If you actually want to get latest ruby 3.0.x, say:

      with:
        ruby-version: '3.0'

This is reported here, with reference to this issue on the Github Actions runner itself. It is not clear to me that this is any kind of a bug in the github actions runner, rather than just an unanticipated consequence of using a numeric value in YAML here. 3.0 is of course the same number as 3; it’s not obvious to me it’s a bug that the YAML parser treats them as such.

Perhaps it’s a bug or mis-design in the setup-ruby action. But in lieu of any developers deciding it’s a bug… quote your 3.0 version number, or perhaps just quote all ruby version numbers with the setup-ruby task?

If your 3.0 builds started failing and you have no idea why — this could be it. It can be a bit confusing to diagnose, because I’m not sure anything in the Github Actions output will normally echo the ruby version in use? I guess there’s a clue in the “Installing Bundler” sub-head of the “Setup Ruby” task.

Of course it’s possible your build will succeed anyway on ruby 3.1, even if you meant to run it on ruby 3.0! Mine failed with LoadError: cannot load such file -- net/smtp, so if yours happened to do the same, maybe you got here from google. :) (Clearly net/smtp has been moved to a different status of standard gem in ruby 3.1; I’m not dealing with this further because I wasn’t intentionally supporting ruby 3.1 yet.)

Note that if you are building with a Github actions matrix for ruby version, the same issue applies. Maybe something like:

matrix:
  include:
    - ruby: '3.0'

steps:
  - uses: actions/checkout@v2

  - name: Set up Ruby
    uses: ruby/setup-ruby@v1
    with:
      ruby-version: ${{ matrix.ruby }}

Setting and handling solr timeouts in Blacklight

When using the Blacklight gem for a Solr search front-end (most used by institutions in the library/cultural heritage sector), you may wish to set a timeout on how long to wait for Solr connection/response.

It turns out, if you are using Rsolr 2.x, you can do this by setting a read_timeout key in your blacklight.yml file. (This under-documented key is a general timeout, despite the name; I have not investigated with Rsolr 1.x).

But the way it turns into an exception and the way that exception is handled is probably not what you would consider useful for your app. You can then change this by over-riding the handle_request_error method in your CatalogController.

I am planning on submitting some PR’s to RSolr and Blacklight to improve some of these things.

Read on for details.

Why set a timeout?

It’s generally considered important to always set a timeout value on an external network request. If you don’t do this, your application may wait indefinitely for the remote server to respond, if the remote server is being slow or hung; or it may depend on underlying library default timeouts that may not be what you want.

What can happen to a Blacklight app that does not set a Solr timeout? We could have a Solr server that takes a really long time — or is entirely hung — on returning a response for one request, or many, or all of them.

Your web workers (eg puma or passenger) will be waiting a while for Solr. Either indefinitely, or maybe there’s a default timeout in the HTTP client (I’m actually not sure, but maybe 60s for net-http?). During this time, the web workers are busy, and unable to handle other requests. This will reduce the traffic capacity of your app; for a very slow or repeatedly misbehaving Solr, possibly catastrophically, leading to an app that appears unresponsive.

There may be some other part of the stack that will time out waiting for the web worker to return a response (while the web worker is waiting for Solr). For instance, heroku is willing to wait a maximum of 30 seconds, and I think Passenger also has timeouts (although it may default to as long as 10 minutes??). But this may be much longer than you really want your app to wait on Solr, for the reasons above, and when it does get triggered you’ll get a generic “Timed out waiting for app response” in your logs/monitoring; it won’t be clear the web worker was waiting on solr, making operational debugging harder.

How to set a Blacklight Solr timeout

A network connection to Solr in the Blacklight stack first goes through RSolr, which (in RSolr 2.x) uses the Faraday ruby gem, which can use multiple http drivers but by default uses net-http from the stdlib.

For historical reasons, how to handle timeouts has been pretty under-documented (and sometimes changing) at all these levels! They’re not making it easy to figure out how to effectively set timeouts! It took the ruby community a bit of time to really internalize the importance of timeouts on HTTP calls.

So I did some research, in code and in manual tests.

Faraday timeouts?

If we start in the middle at Faraday, it’s not clearly documented… and may be http-adapter-specific? Faraday really doesn’t make this easy for us!

But from googling, it looks like Faraday generally means to support keys open_timeout (waiting for a network connection to open), and timeout (often waiting for a response to be returned, but really… everything else, and sometimes includes open_timeout too).
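For instance, a minimal sketch of handing those two options directly to a Faraday connection (not Blacklight-specific; the URL is a placeholder):

require "faraday"

conn = Faraday.new(
  url: "http://localhost:8983/solr/my-core",
  request: { open_timeout: 2, timeout: 5 }   # both in seconds
)
response = conn.get("select", { q: "*:*" })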

If you want some details….

For instance, if we look at the faraday adapter for http-rb, we can see that the faraday timeout option is passed to http-rb for each of connect, read, and write.

  • (Which really means if you set it to 5 seconds… it could wait 5 seconds for connect, then another 5 seconds for write, and another 5 seconds for read 😦. http-rb actually provided a general/global timeout at one point, but faraday doesn’t take advantage of it. 😦😦).

And then the http-rb adapter uses the open_timeout value just for connect and write. That is, setting both faraday options timeout and open_timeout to the same value would be redundant for the http-rb adapter at present. The http-rb adapter doesn’t seem to do anything with any other faraday timeout options.

If we look at the default net-http adapter… It’s really confusing! We have to look at this method in faraday generic too. But I confirmed by manual testing that net-http actually supports faraday read_timeout, write_timeout, and open_timeout (different values than http-rb), but will also use timeout as a default for any of them. (Again, your actual end-to-end timeout can be the sum of open/read/write. 😦)

It looks like different Faraday adapters can use different timeout values, but Faraday tries to make the basic timeout value at least do something useful/general for each adapter?

Most Blacklight users are probably using the default net-http adapter. (Curious to hear about anyone who isn’t?)

What will Blacklight actually pass to Faraday?

This gets confusing too!

Blacklight seems to take whatever keys you have set in your blacklight.yml for the given environment, and pass them to RSolr.connect. The one exception: an http_adapter key in your Blacklight config gets translated to the adapter key passed to RSolr.

  • (I didn’t find the code that makes that blacklight_config invocation be your environment-specific hash from blacklight.yml, but I confirmed that’s what it is!)

What does RSolr 2.x do? It does not pass on everything to Faraday, only certain allow-listed items, after translating them. Confusingly, for timeouts it’s only willing to pass on open_timeout, and also to translate a read_timeout value from blacklight.yml into Faraday’s timeout.

Phew! So Blacklight/Rsolr only supports two timeout values to be passed to faraday:

  • open_timeout to Faraday open_timeout
  • read_timeout to Faraday timeout.
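Concretely, in blacklight.yml that looks something like this (a sketch; the url line is whatever you already have):

production:
  adapter: solr
  url: <%= ENV['SOLR_URL'] %>
  open_timeout: 2    # seconds to wait for the connection to open
  read_timeout: 5    # becomes Faraday's generic timeout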

PR to Rsolr on timeout arguments?

I think ideally RSolr would pass on any of the values Faraday seems to recognize, at least with some adapters, for timeouts: read_timeout, open_timeout, write_timeout, as well as just timeout.

But to get from what it does now to there in a backwards compatible way… kind of impossible because of how it’s currently translating read_timeout to timeout. :(

I think I may PR one that just recognizes timeout too, while leaving read_timeout as a synonym with a deprecation warning telling you to use timeout? Still thinking this through.

What happens when a Timeout is triggered?

Here we have another complexity. Just as the timeout configuration values are translated on the way down the stack, the exceptions raised when a timeout happens are translated again on the way up, HTTP library => Faraday => RSolr => Blacklight.

Faraday basically has two exception classes it tries to normalize all underlying HTTP library timeouts to: Faraday::ConnectionFailed < Faraday::Error (for timeouts opening the connection) and Faraday::TimeoutError < Faraday::ServerError < Faraday::Error for other timeouts, such as read timeouts.

What happens with a connection timeout?

  1. Faraday raises a Faraday::ConnectionFailed error. (For instance from the default Net::HTTP Adapter)
  2. RSolr rescues it, and re-raises as an RSolr::Error::ConnectionRefused, which sub-classes the ruby stdlib Errno::ECONNREFUSED
  3. Blacklight rescues that Errno::ECONNREFUSED, and translates it to a Blacklight::Exceptions::ECONNREFUSED, (which is still a sub-class of stdlib Errno::ECONNREFUSED)

That just rises up to your application, to give the end-user probably a generic error message, be logged, be caught by any error-monitoring services you have, etc. Or you can configure your application to handle these Blacklight::Exceptions::ECONNREFUSED errors in some custom way using standard Rails rescue_from functionality, etc.

This is all great, just what we expect from exception handling.

The one weirdness is that the exception suggests connection refused, when really it was a timeout, which is somewhat different… but Faraday itself doesn’t distinguish between those two situations, which some people have wanted to improve for a while now, but there isn’t much a client of Faraday can do about it in the meantime.

What happens with other timeouts?

Say, the network connection opened fine, but Solr is just being really slow returning a response (it totally happens) and exceeding a Faraday timeout value set.

The picture here is a bit less good.

  1. Faraday will raise a Faraday::TimeoutError (eg from the net-http adapter).
  2. RSolr does not treat this specially, but just rescues and re-raises it just like any other Faraday::Error as a generic RSolr::Error::Http
  3. Blacklight will take it, just as any other RSolr::Error::Http, and rescues and re-raise as a generic Blacklight::Exceptions::InvalidRequest
  4. Blacklight does not allow this to just rise up through the app, but instead uses Rails rescue_from to register its own handler for it, a handle_request_error method.
  5. The handle_request_error method will log the error, and then just display the current Blacklight controller “index” page (ie search form), with a message “Sorry, I don’t understand your search.”

This is… not great.

  • From a UX point of view, this is not good: we’re telling the user “sorry I don’t understand your search” when the problem was a Solr timeout… it makes it seem like there’s something the user did wrong or could do differently, but that’s not what’s going on.
    • In fact that’s true for a lot of errors Blacklight catches this way. Solr is down? Solr collection doesn’t exist? Solr configuration has a mismatch with Blacklight configuration? All of these will result in this behavior, none of them are something the end-user can do anything about.
  • If you have an error monitoring service like Honeybadger, it won’t record this error, since the app handled it instead of letting it rise unhandled. So you may not even know this is going on.
  • If you have an uptime monitoring service, it might not catch this either, since the app is returning a 200. You could have an app pretty much entirely down and erroring for any attempt to do a search… but returning all HTTP 200 responses.
  • While Blacklight does log the error, it does it in a DIFFERENT way than Rails ordinarily does… you aren’t going to get a stack trace, or any other contextual information; it’s not really going to be clear what’s going on at all, if you notice it at all.

Not great. One option is to override the handle_request_error method in your own CatalogController to: 1) Disable this functionality entirely, don’t swallow the error with a “Sorry, I don’t understand your search” message, just re-raise it; and 2) unwrap the underlying Faraday::TimeoutError before re-raising, so that gets specifically reported instead of a generic “Blacklight::Exceptions::InvalidRequest”, so we can distinguish this specific situation more easily in our logs and error monitoring.

Here’s an implementation that does both, to be put in your catalog_controller.rb:

  # OVERRIDE of Blacklight method. Blacklight by default takes ANY kind
  # of Solr error (could be Solr misconfiguration or down etc), and just swallows
  # it, redirecting to Blacklight search page with a message "Sorry, I don't understand your search."
  #
  # This is wrong.  It's misleading feedback for the user for something that is usually not
  # something they can do something about, and it suppresses our error monitoring
  # and potentially misleads our uptime checking.
  #
  # We just want to actually raise the error!
  #
  # Additionally, Blacklight/Rsolr wraps some errors that we don't want wrapped, mainly
  # the Faraday timeout error -- we want to be able to distinguish it, so will unwrap it.
  private def handle_request_error(exception)
    exception_causes = []
    e = exception
    until e.cause.nil?
      e = e.cause
      exception_causes << e
    end

    # Raise the more specific original Faraday::TimeoutError instead of
    # the generic wrapping Blacklight::Exceptions::InvalidRequest!
    if faraday_timeout = exception_causes.find { |e| e.kind_of?(Faraday::TimeoutError) }
      raise faraday_timeout
    end

    raise exception
  end

PRs to RSolr and Blacklight for more specific exception?

RSolr and Blacklight both have a special error class for the connection failed/timed out condition, but they just lump Faraday::TimeoutError in with any other kind of error.

I think this logic is probably many years old, and pre-dates Faraday’s current timeout handling.

I think they should both have a new exception class which can be treated differently. Say RSolr::Error::Timeout and Blacklight::Exceptions::RepositoryTimeout?

I plan to make these PRs.

PR to Blacklight to disable that custom handle_request_error behavior

I think the original idea here was that something in the user’s query entry would trigger an exception. That’s what makes rescuing it and re-displaying it with the message “Sorry, I don’t understand your search” make some sense.

At the moment, I have no idea how to reproduce that, figure out a user-entered query that actually results in a Blacklight::Exceptions::InvalidRequest. Maybe it used to be possible to do in an older version of Solr but isn’t anymore? Or maybe it still is, but I just don’t know how?

But I can reproduce ALL SORTS of errors that were not about the user’s entry and which the end-user can do nothing about, but which still result in this misleading error message, and the error getting swallowed by Blacklight and avoiding your error- and uptime-monitoring services. Solr down entirely; Solr collection/core not present or typo’d; a mismatch between Solr configuration and Blacklight configuration, like Blacklight mentioning a Solr field that doesn’t actually exist.

All of these result in Blacklight swallowing the exception, and returning an HTTP 200 response with the message “Sorry, I don’t understand your search”. This is not right!

I think this behavior should be removed in a future Blacklight version.

I would like to PR such a thing, but I’m not sure if I can get it reviewed/merged?

Blacklight 7.x, deprecation of view overrides, paths forward

This post will only be of interest to those who use the blacklight ruby gem/platform, mostly my colleagues in the library/cultural heritage sector.

When I recently investigated updating our Rails app to the latest Blacklight 7.19.2, I encountered a lot of deprecation notices. They were related to code, both in my local app and in a plugin, trying to override parts of Blacklight views — specifically the “constraints” (ie search limits/query “breadcrumbs” display) area in the code I encountered; I’m not sure if it applies to more areas of view customization.

Looking into this more to see if I could get a start on changing logic to avoid deprecation warnings — I had trouble figuring out any non-deprecated way to achieve the overrides. After more research, I think it’s not totally worked out how to make these overrides keep working at all in future Blacklight 8, and that this affects plugins including blacklight_range_limit, blacklight_advanced_search, geoblacklight, and possibly spotlight. Some solutions need to be found if these plugins are to be updated to keep working in future Blacklight 8.

I have documented what I found/understood, and some ideas for moving forward, hoping it will help start the community process of figuring out solutions to keep all this stuff working. I may not have gotten everything right or thought of everything, this is meant to help start the discussion, suggestions and corrections welcome.

This does get wordy; I hope you can find it useful to skip around or skim if it’s not all of interest. I believe the deprecations start around Blacklight 7.12 (released October 2020). I believe Blacklight 7.14 is the first version to support ruby 3.0, so anyone wanting to upgrade to ruby 3 will encounter these issues.

Background

Over blacklight’s 10+ year existence, it has been a common use-case to customize specific parts of Blacklight, including customizing what shows up on one portion of a page while leaving other portions ‘stock’. An individual local application can do this with its own custom code; it is also common in many shared blacklight “plug-in”/“extension” engine gems.

Blacklight had traditionally implemented its “view” layer in a typical Rails way, involving “helper” methods and view templates. Customizations and overrides, by local apps or plugins, were implemented by over-riding these helper methods and partials. This traditional method of helper and partial overrides is still described in the Blacklight project wiki (it possibly could use updating for recent deprecations/new approaches).

This view/helper/override approach has some advantages: it just uses standard ruby and Rails, not custom Blacklight abstractions; multiple different plugins can override the same method, so long as they all call “super”, to cooperatively add functionality; it is very flexible and allows overrides that “just work”.

It also has some serious disadvantages. Rails helpers and views are known in general for leading to “spaghetti” or “ball of mud” code, where everything ends up depending on everything/anything else, and it’s hard to make changes without breaking things.

In the context of shared gem code like Blacklight and its ecosystem, it can get even messier when it’s not clear what is meant to be public API for an override. Blacklight’s long history has different maintainers with different ideas, and varying documentation or institutional memory of intents can make it even more confusing. Several generations of ideas can be present in the current codebase, for both backwards-compatibility and “lack of resources to remove it” reasons. It can make it hard to make any changes at all without breaking existing code, a problem we were experiencing with Blacklight.

One solution that has appeared for Rails is the ViewComponent gem (written by github, actually), which facilitates better encapsulation, separation of concerns, and clear boundaries between different pieces of view code. The current active Blacklight maintainers (largely from Stanford I think?) put in some significant work — in Blacklight 7.x — to rewrite some significant parts of Blacklight’s view architecture based on the ViewComponent gem. This is a welcome contribution to solving real problems! Additionally, they did some frankly rather heroic things to get this replacement with ViewComponent to be, as a temporary transition step, very backwards compatible, even to existing code doing extensive helper/partial overrides, which was tricky to accomplish and shows their concern for current users.

Normally, when we see deprecation warnings, we like to fix them, to get them out of our logs, and prepare our apps for the future version where deprecated behavior stops working entirely. To do otherwise is considered leaving “technical debt” for the future, since a deprecation warning is telling you that code will have to be changed eventually.

The current challenge here is that it’s not clear (at least to me) how to change the code to still work in current Blacklight 7.x and upcoming Blacklight 8x. Which is a challenge both for running in current BL 7 without deprecation, and for the prospects of code continuing to work in future BL 8. I’ll explain more with examples.

Blacklight_range_limit (and geoblacklight): Add a custom “constraint”

blacklight_range_limit introduces new query parameters for range limit filters, not previously recognized by Blacklight, that look e.g. like &range[year_facet][begin]=1910. In addition to having these affect the actual Solr search, it also needs to display this limit (which Blacklight core is ignoring) in the “constraints” area above the search results:

To do this it overrides the render_constraints_filters helper method from Blacklight, through some fancy code effectively calling super to render the ordinary Blacklight constraints filters but then adding on its rendering of the constraints only blacklight_range_limit knows about. One advantage of this “override, call super, but add on” approach is that multiple add-ons can do it, and they don’t interfere with each other — so long as they all call super, and only want to add additional content, not replace pre-existing content.
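To make the pattern concrete, here’s roughly the shape of that kind of override (a simplified sketch, not the gem’s literal code; the range-rendering helper name is made up):

module BlacklightRangeLimit
  module ViewHelperOverride
    # Override the stock Blacklight helper: call super to get the ordinary
    # constraints, then append our own range-limit constraints on the end.
    def render_constraints_filters(localized_params = params)
      super + render_range_limit_constraints(localized_params) # hypothetical private helper
    end
  end
end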

But overriding this helper method is deprecated in recent Blacklight 7.x. If Blacklight detects any override to this method (among other constraints-related methods), it will issue a deprecation notice, and also switch into a “legacy” mode of view rendering, so the override will still work.

OK, what if we wanted to change how blacklight_range_limit does this, to avoid triggering the deprecation warnings, and to have blacklight continue to use the “new” (rather than “legacy”) logic, that will be the logic it insists on using in Blacklight 8?

The new logic is to render with the new view_component, Blacklight::ConstraintsComponent.new, which is rendered in the catalog/_constraints.html.erb partial. I guess if we want the rendering to behave differently in that new system, we need to introduce a new view component that is like Blacklight::ConstraintsComponent but behaves differently (perhaps a sub-class, or a class using delegation). Or, hey, that component takes some dependent view_components as args, so maybe we just need to get the ConstraintsComponent to be given an arg for a different version of one of the _component args, not sure if that will do it.

It’s easy enough to write a new version of one of these components… but how would we get Blacklight to use it?

I guess we would have to override catalog/_constraints.html.erb. But this is unsatisfactory:

  • I thought we were trying to get out of overriding partials, but even if it’s okay in this situation…
  • It’s difficult and error-prone for an engine gem to override partials, you need to make sure it ends up in the right order in Rails “lookup paths” for templates, but even if you do this…
  • What if multiple things want to add on a section to the “constraints” area? Only one can override this partial, there is no way for a partial to call super.

So perhaps we need to ask the local app to override catalog/_constraints.html.erb (or generate code into it), and that code calls our alternate component, or calls the stock component with alternate dependency args.

  • This is already seeming a bit more complex and fragile than the simpler one-method override we did before, we have to copy-and-paste the currently non-trivial implementation in _constraints.html.erb, but even if we aren’t worried about that….
  • Again, what happens if multiple different things want to add on to what’s in the “constraints” area?
  • What if there are multiple places that need to render constraints, including other custom code? (More on this below). They all need to be identically customized with this getting-somewhat-complex code?

That multiple things might want to add on isn’t just theoretical, geoblacklight also wants to add some things to the ‘constraints’ area and also does it by overriding the render_constraints_filters method.

Actually, if we’re just adding on to existing content… I guess the local app could override catalog/_constraints.html.erb, copy the existing blacklight implementation, then just add on at the END a call to both, say, <%= render(BlacklightRangeLimit::RangeConstraintsComponent.new(...)) %> and <%= render(GeoBlacklight::GeoConstraintsComponent.new(...)) %>… it actually could work… but it seems fragile, especially when we start dealing with “generators” to automatically create these in a local app for CI in the plugins, as blacklight plugins do?

My local app (and blacklight_advanced_search): Change the way the “query” constraint looks

If you just enter the query ‘cats’, “generic” out of the box Blacklight shows you your search with this as a sort of ‘breadcrumb’ constraint in a simple box at the top of the search:

My local app (in addition to changing the styling) changes that to an editable form to change your query (while keeping other facet etc filters exactly the same). Is this a great UX? Not sure! But it’s what happens right now:

It does this by overriding `render_constraints_query` and not calling super, replacing the standard implementation with my own.
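Roughly like this, in a helper module the app mixes in (a simplified sketch; the partial name is made up):

module LocalConstraintsOverrides
  # Replace (not add to) the stock query "breadcrumb" -- deliberately no super.
  def render_constraints_query(localized_params = params)
    render "catalog/editable_query_constraint", params: localized_params
  end
end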

How do we do this in the new non-deprecated way?

I guess again we have to either replace Blacklight::ConstraintsComponent with a new custom version… or perhaps pass in a custom component for query_constraint_component… this time we can’t just render and add on, we really do need to replace something.

What options do we have? Maybe, again, customizing _constraints.html.erb to call that custom component and/or pass the custom arg. And we’d have to make sure any customization is consistent with any customization done by, say, blacklight_range_limit or geoblacklight; make sure they aren’t all trying to provide mutually incompatible custom components.

I still don’t like:

  • having to override a view partial (when before I only overrode a helper method), in local app instead of plugin it’s more feasible, but we still have to copy-and-paste some non-trivial code from Blacklight to our local override, and hope it doesn’t change
  • Pretty sensitive to implementation of Blacklight::ConstraintsComponent if we’re sub-classing it or delegating to it. I’m not sure what parts of it are considered public API, or how frequently they are likely to change… if we’re not careful, we’re not going to have any more stable/reliable/forwards-compatible code than we did under the old way.
  • This solution doesn’t provide a way for custom code to render a constraints area with all customizations added by any add-ons, which is a current use case, see next section.

It turns out blacklight_advanced_search also customizes the “query constraint” (in order to handle the multi-field queries that the plugin can do), also by overriding render_constraints_query, so this exact use case affects that plug-in too, with a bit more challenge in a plugin instead of a local app.

I don’t think any of these solutions we’ve brainstormed are suitable and reliable.

But calling out to Blacklight functionality is affected too, as in spotlight….

In addition to overriding a helper method to customize what appears on the screen, traditionally custom logic in a local app or plug-in can call a helper method to render some piece of Blacklight functionality on screen.

For instance, the spotlight plug-in calls the render_constraints method in one of its own views, to include that whole “constraints” area on one of its own custom pages.

Using the legacy helper method architecture, spotlight can render the constraints including any customizations the local app or other plug-ins have made via their overriding of helper methods. For instance, when spotlight calls render_constraints, it will get the additional constraints that were added by blacklight_range_limit or geoblacklight too.

How would spotlight render constraints using the new architecture? I guess it would call the Blacklight view_component directly, render(Blacklight::ConstraintsComponent.new(.... But how does it manage to use any customizations added by plug-ins like blacklight_range_limit? Not sure. None of the solutions we brainstormed above seem to get us there.

I suppose (e.g.) spotlight could actually render the _constraints.html.erb partial, making that the one canonical standardized “API” for constraints rendering, to be customized in the local app and re-used every time the constraints view is needed? That might work, but going toward “view partial as API” seems a step backwards to me, I feel like we were trying to get away from that for good reasons, it just feels messy.

This makes me think new API might be required in Blacklight, if we are not to have a reduction in “view extension” functionality for Blacklight 8 (another option being to say, well, you just can’t do those things anymore, significantly trimming the scope of what is possible with plugins, possibly abandoning some plugins).

There are other cases where blacklight_range_limit for example calls helper methods to re-use functionality. I haven’t totally analyzed them. It’s possible that in some cases, the plug-in just should copy-and-paste hard-coded HTML or logic, without allowing for other actors to customize them. Examples of what blacklight_range_limit calls here include

New API? Dependency Injection?

Might there be some new API that Blacklight could implement that would make this all work smoother and more consistently?

If we want a way to tell Blacklight “use my own custom component instead of Blacklight::ConstraintsComponent“, ideally without having to override a view template, at first that made me think “Inversion of Control with Dependency Injection“? I’m not thrilled with this generic solution, but thinking it through….

What if there was some way the local app or plugin could do Blacklight::ViewComponentRegistration.constraints_component_class = MyConstraintsComponent, and then when blacklight wants to call it, instead of doing, like it does now, <%= render(Blacklight::ConstraintsComponent.new(search_state: stuff)) %>, it’d do something like: <%= render(Blacklight::ViewComponentRegistration.constraints_component_class.new(search_state: stuff)) %>.

That lets us “inject” a custom class without having to override the view component and every other single place it might be used, including new places from plugins etc. The specific arguments the component takes would have to be considered/treated as public API somehow.

It still doesn’t let multiple add-ons cooperate to each add a new constraint item though. I guess to do that, the registry could have an array for each thing….

Blacklight::ViewComponentRegistration.constraints_component_classes = [
  Blacklight::ConstraintsComponent,
  BlacklightRangeLimit::ConstraintsComponent,
  GeoBlacklight::ConstraintsComponent
]

# And then I guess we really need a convenience method for calling
# ALL of them in a row and concatenating their results....

Blacklight::ViewComponentRegistration.render(:constraints_component_class, search_state: stuff)

On the plus side, now something like spotlight can call that too to render a “constraints area” including customizations from BlacklightRangeLimit, GeoBlacklight, etc.

But I have mixed feelings about this, it seems like the kind of generic-universal yet-more-custom-abstraction thing that sometimes gets us in trouble and over-complexified. Not sure.
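For concreteness, here’s a minimal sketch of what such a registry might look like (emphasis on hypothetical: none of this exists in Blacklight, and I’m hand-waving over details like how it gets a view context to render the components with):

# Hypothetical sketch only -- not actual Blacklight API.
module Blacklight
  module ViewComponentRegistration
    # registered component classes, in render order
    mattr_accessor :constraints_component_classes,
      default: [Blacklight::ConstraintsComponent]

    # Render every registered component with the same args and
    # concatenate the results; needs a view context to render with.
    def self.render(registry_key, view_context:, **component_args)
      public_send(registry_key).map { |component_class|
        view_context.render(component_class.new(**component_args))
      }.reduce(:+)
    end
  end
end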

API just for constraints view customization?

OK, instead of trying to make a universal API for customizing “any view component”, what if we just focus on the actual use cases in front of us here? All the ones I’ve encountered so far are about the “constraints” area? Can we add custom API just for that?

It might look almost exactly the same as the generic “IoC” solution above, but on the Blacklight::ConstraintsComponent class…. Like, we want to customize the component Blacklight::ConstraintsComponent uses to render the ‘query constraint’ (for my local app and advanced search use cases), right now we have to change the call site for Blacklight::ConstraintsComponent.new every place it exists, to have a different argument… What if instead we can just:

Blacklight::ConstraintsComponent.query_constraint_component =
   BlacklightAdvancedSearch::QueryConstraintComponent

And ok, for these “additional constraint items” we want to add… in “legacy” architecture we overrode “render_constraints_filters” (normally used for facet constraints) and called super… but that’s just cause that’s what we had, really this is a different semantic thing, let’s just call it what it is:

Blacklight::ConstraintsComponent.additional_render_components <<
  BlacklightRangeLimit::RangeFacetConstraintComponent

Blacklight::ConstraintsComponent.additional_render_components <<
  GeoBlacklight::GeoConstraintsComponent

All those component “slots” would still need to have their initializer arguments be established as “public API” somehow, so you can register one knowing what args its initializer is going to get.

Note this solves the spotlight case too: spotlight can just call render Blacklight::ConstraintsComponent.new(..., and it now does get customizations added by other add-ons, because they were registered with the Blacklight::ConstraintsComponent.

I think this API may meet all the use cases I’ve identified? Which doesn’t mean there aren’t some I haven’t identified. I’m not really sure what architecture is best here, I’ve just tried to brainstorm possibilities. It would be good to choose carefully, as we’d ideally find something that can work through many future Blacklight versions without having to be deprecated again.
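For completeness, the component side of that hypothetical API might look something like this (again a sketch, not current Blacklight API, and the default component name here is made up):

class Blacklight::ConstraintsComponent < ViewComponent::Base
  # Hypothetical class-level registration points, with stock defaults:
  class_attribute :query_constraint_component,
    default: Blacklight::Constraints::QueryConstraintComponent # made-up default name
  class_attribute :additional_render_components, default: []

  # ...and the component's template/call would render query_constraint_component
  # for the query constraint, the stock facet constraints, and then each of
  # additional_render_components at the end.
end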

Need for Coordinated Switchover to non-deprecated techniques

The way Blacklight implements backwards-compatible support for constraints rendering is: if it detects anything in the app overriding a relevant method or partial, it continues rendering the “legacy” way with helpers and partials.

So if I were to try upgrading my app to do something using a new non-deprecated method, while my app is still using blacklight_range_limit doing things the old way… it would be hard to keep them both working. If you have more than one Blacklight plug-in overriding relevant view helpers, it of course gets even more complicated.

It pretty much has to be all-or-nothing. Which also makes it hard for say blacklight_range_limit to do a release that uses a new way (if we figured one out) — it’s probably only going to work in apps that have changed ALL their parts over to the new way. I guess all the plug-ins could do releases that offered you a choice of configuration/installation instructions, where the host app could choose new way or old way.

I think the complexity of this makes it more realistic, especially based on actual blacklight community maintenance resources, that a lot of apps are just going to keep running in deprecated mode, and a lot of plugins will only be available in versions that trigger deprecation warnings, until Blacklight 8.0 comes out and the deprecated behavior simply breaks, and then we’ll need Blacklight 8-only versions of all the plugins, with apps switching everything over all at once.

If different plugins approach this in an uncoordinated fashion, each trying to invent a way to do it, they really risk stepping on each other’s toes and being incompatible with each other. I think really something has to be worked out as the Blacklight-recommended consensus/best practice approach to view overrides, so everyone can just use it in a consistent and compatible way. Whether that requires new API not yet in Blacklight, or a clear pattern with what’s in current Blacklight 7 releases.

Ideally all worked out by currently active Blacklight maintainers and/or community before Blacklight 8 comes out, so people at least know what needs to be done to update code. Many Blacklight users may not be using Blacklight 7.x at all yet (7.0 released Dec 2018) — for instance hyrax still uses Blacklight 6 — so I’m not sure what portion of the community is already aware this is coming up on the horizon.

I hope the time I’ve spent investigating and considering and documenting in this piece can be helpful to the community as one initial step, to understanding the lay of the land.

For now, silence deprecations?

OK, so I really want to upgrade to latest Blacklight 7.19.2, from my current 7.7.0. To just stay up to date, and to be ready for ruby 3.0. (My app def can’t pass tests on ruby 3 with BL 7.7; it looks like BL added ruby 3.0 support in BL 7.14.0? Which does already have the deprecations).

It’s not feasible right now to eliminate all the deprecated calls. But my app does seem to work fine, just with deprecation calls.

I don’t really want to leave all those “just ignore them for now” deprecation messages in my CI and production logs though. They just clutter things up and make it hard to pay attention to the things I want to be noticing.

Can we silence them? Blacklight uses the deprecation gem for its deprecation messages; the gem is by cbeer, with logic taken out of ActiveSupport.

We could wrap all calls to deprecated methods in Deprecation.silence do…. including making a PR to blacklight_range_limit to do that? I’m not sure I like the idea of making blacklight_range_limit silent on this problem, it needs more attention at this point! Also I’m not sure how to use Deprecation.silence to affect that clever conditional check in the _constraints.html.erb template.

We could entirely silence everything from the deprecation gem with Deprecation.default_deprecation_behavior — I don’t love this, we might be missing deprecations we want?

The Deprecation gem API made me think there might be a way to silence deprecation warnings from individual classes with things like Blacklight::RenderConstraintsHelperBehavior.deprecation_behavior = :silence, but I think I was misinterpreting the API, there didn’t seem to actually be methods like that available in Blacklight to silence what I wanted in a targeted way.

Looking/brainstorming more in the Deprecation gem API… I *could* change its behavior to its “notify” strategy that sends ActiveSupport::Notification events instead of writing to stdout/log… and then write a custom ActiveSupport::Notification subscriber which ignored the ones I wanted to ignore… ideally still somehow keeping the undocumented-but-noticed-and-welcome default behavior in the test/rspec environment where it reports out a summary of deprecations at the end…

This seemed too much work. I realized that the only things that use the Deprecation gem in my project are Blacklight itself and the qa gem (I don’t think it has caught on outside blacklight/samvera communities), and I guess I am willing to just silence deprecations from all of them, although I don’t love it.
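So, in an initializer, something like this (assuming the gem accepts :silence as a behavior the way ActiveSupport’s deprecation behaviors do; that’s my reading, not something I’ve traced through exhaustively):

# config/initializers/silence_deprecation_gem.rb
#
# Silence ALL output from the `deprecation` gem (used by Blacklight and qa),
# while leaving Rails/ActiveSupport deprecations alone.
if defined?(Deprecation)
  Deprecation.default_deprecation_behavior = :silence
end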

Notes on retrying all jobs with ActiveJob retry_on

I would like to configure all my ActiveJobs to retry on failure, and I’d like to do so with the ActiveJob retry_on method.

So I’m going to configure it in my ApplicationJob class, in order to retry on any error, maybe something like:

class ApplicationJob < ActiveJob::Base
  retry_on StandardError # other args to be discussed
end

Why use ActiveJob retry_on for this? Why StandardError?

Many people use backend-specific logic for retries, especially with Sidekiq. That’s fine!

I like the idea of using the ActiveJob functionality:

  • I currently use resque (more on challenges with retry here later), but plan to switch to something else at some point medium-term. Maybe sidekiq, but maybe delayed_job or good_job. (Just using the DB and not having a redis is attractive to me, as is open source). I like the idea of not having to redo this setup when I switch back-ends, or am trying out different ones.
  • In general, I like the promise of ActiveJob as swappable commoditized backends
  • I like what I see as good_job’s philosophy here, why have every back-end reinvent the wheel when a feature can be done at the ActiveJob level? That can help keep the individual back-end smaller, and less “expensive” to maintain. good_job encourages you to use ActiveJob retries I think.

Note, dhh is on record from 2018 saying he thinks setting up retries for all StandardError is a bad idea. But I don’t really understand why! He says “You should know why you’d want to retry, and the code should document that knowledge.” — but the fact that so many ActiveJob back-ends provide “retry all jobs” functionality makes it seem to me an established common need and best practice, and why shouldn’t you be able to do it with ActiveJob alone?

dhh thinks ActiveJob retry is for specific targeted retries maybe, and the backend retry should be used for generic universal ones? Honestly I don’t see myself doing many specific targeted retries; making all your jobs idempotent (important! Best practice for ActiveJob always!), and just having them all retry on any error, seems to me to be the way to go, a more efficient use of developer time and sufficient for at least a relatively simple app.

One situation I have where a retry is crucial, is when I have a fairly long-running job (say it takes more than 60 seconds to run; I have some unavoidably!), and the machine running the jobs needs to restart. It might interrupt the job. It is convenient if it is just automatically retried — put back in the queue to be run again by restarted or other job worker hosts! Otherwise it’s just sitting there failed, never to run again, requiring manual action. An automatic retry will take care of it almost invisibly.

Resque and Resque Scheduler

Resque by default doesn’t support future-scheduled jobs. You can add them with the resque-scheduler plugin. But I had a perhaps irrational desire to avoid this — resque and its ecosystem have at different times had different amounts of maintenance/abandonment, and I’m (perhaps irrationally) reluctant to complexify my resque stack.

And do I need future scheduling for retries? For my most important use cases, it’s totally fine if I retry just once, immediately, with a wait: 0. Sure, that won’t take care of all potential use cases, but it’s a good start.

I thought even without resque supporting future-scheduling, I could get away with:

retry_on StandardError, wait: 0

Alas, this won’t actually work, it still ends up being converted to a future-schedule call, which gets rejected by the resque_adapter bundled with Rails unless you have resque-scheduler installed.

But of course, resque can handle wait: 0 semantically, if the code was willing to do it by queuing an ordinary resque job…. I don’t know if it’s a good idea, but this simple patch to the Rails-bundled resque_adapter will make it willing to accept “scheduled” jobs when the time to be scheduled is actually “now”, just scheduling them normally, while still raising on attempts to future schedule. For me, it makes retry_on.... wait: 0 work with just plain resque.
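The idea of the patch is roughly this (a sketch modeled on Rails 6.1’s resque adapter, where the timestamp argument is an epoch float; adapter internals vary a bit by Rails version):

# config/initializers/resque_adapter_wait_zero.rb
#
# Let ActiveJob's resque adapter accept "scheduled" jobs whose scheduled time
# is now (or past) by enqueueing them normally, while still raising if you try
# to actually future-schedule without resque-scheduler.
require "active_job/queue_adapters/resque_adapter"

module ResqueAdapterWaitZeroPatch
  def enqueue_at(job, timestamp)
    if timestamp <= Time.now.to_f
      enqueue(job) # a `wait: 0` retry -- just enqueue it like a normal job
    else
      super        # real future scheduling still requires resque-scheduler
    end
  end
end

ActiveJob::QueueAdapters::ResqueAdapter.prepend(ResqueAdapterWaitZeroPatch)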

Note: retry_on attempts count includes first run

So wanting to retry just once, I tried something like this:

# Will never actually retry
retry_on StandardError, attempts: 1

My job was never actually retried this way! It looks like the attempts count includes the first non-error run, the total number of times job will be run, including the very first one before any “retries”! So attempts 1 means “never retry” and does nothing. Oops. If you actually want to retry only once, in my Rails 6.1 app this is what did it for me:

# will actually retry once
retry_on StandardError, attempts: 2

(I think this means the default, attempts: 5, actually means your job can be run a total of 5 times — one original time and 4 retries. I guess that’s what was intended?)
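So, putting those pieces together, the ApplicationJob configuration I described above ends up looking like:

class ApplicationJob < ActiveJob::Base
  # retry any job once, immediately, on any error
  retry_on StandardError, wait: 0, attempts: 2
end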

Note: job_id stays the same through retries, hooray

By the way, I checked, and at least in Rails 6.1, the ActiveJob#job_id stays the same on retries. If the job runs once and is retried twice more, it’ll have the same job_id each time, you’ll see three Performing lines in your logs, with the same job_id.

Phew! I think that’s the right thing to do, so we can easily correlate these as retries of the same jobs in our logs. And if we’re keeping the job_id somewhere to check back and see if it succeeded or failed or whatever, it stays consistent on retry.

Glad this is what ActiveJob is doing!

Logging isn’t great, but can be customized

Rails will automatically log retries with a line that looks like this:

Retrying TestFailureJob in 0 seconds, due to a RuntimeError.
# logged at `info` level

Eventually when it decides its attempts are exhausted, it’ll say something like:

Stopped retrying TestFailureJob due to a RuntimeError, which reoccurred on 2 attempts.
# logged at `error` level

This does not include the job-id though, which makes it harder than it should be to correlate with other log lines about this job, and follow the job’s whole course through your log file.

It’s also inconsistent with other default ActiveJob log lines, which include:

  • the Job ID in text
  • tags (Rails tagged logging system) with the job id and the string "[ActiveJob]". Because of the way the Rails code applies these only around perform/enqueue, retry/discard related log lines apparently end up not included.
  • The exception message, not just the exception class, when there is one.

You can see all the built-in ActiveJob logging in the nicely compact ActiveJob::LogSubscriber class. And you can see how the log line for retry is kind of inconsistent with eg perform.

Maybe this inconsistency has persisted so long in part because few people actually use ActiveJob retry, and they’re all still using their backend’s backend-specific functionality? I did try a PR to Rails for at least consistent formatting (my PR doesn’t do tagging), not sure if it will go anywhere, I think blind PRs to Rails usually do not.

In the meantime, after trying a bunch of different things, I think I figured out the reasonable way to use the ActiveSupport::Notifications/LogSubscriber API to customize logging for the retry-related events while leaving it untouched from Rails for the others? See my solution here.
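The shape of it is roughly this (a sketch; event and payload key names are per my reading of the Rails 6.1 source, so check your version):

# config/initializers/activejob_retry_logging.rb
#
# Take over just the retry-related ActiveJob log lines, inheriting the
# stock behavior for everything else from ActiveJob::LogSubscriber.
class CustomActiveJobLogSubscriber < ActiveJob::LogSubscriber
  def enqueue_retry(event)
    job = event.payload[:job]
    ex  = event.payload[:error]
    info do
      "Retrying #{job.class} (Job ID: #{job.job_id}) in #{event.payload[:wait].to_i} seconds, " \
        "due to a #{ex.class} (#{ex.message})"
    end
  end

  def retry_stopped(event)
    job = event.payload[:job]
    ex  = event.payload[:error]
    error do
      "Stopped retrying #{job.class} (Job ID: #{job.job_id}) due to a #{ex.class} " \
        "(#{ex.message}), which reoccurred on #{job.executions} attempts."
    end
  end
end

# Detach the stock subscriber and attach ours in its place.
ActiveJob::LogSubscriber.detach_from :active_job
CustomActiveJobLogSubscriber.attach_to :active_job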

(Thanks to BigBinary blog for showing up in google and giving me a head start into figuring out how ActiveJob retry logging was working.)

(note: There’s also this: https://github.com/armandmgt/lograge_active_job But I’m not sure how working/maintained it is. It seems to only customize activejob exception reports, not retry and other events. It would be an interesting project to make an up-to-date activejob-lograge that applied to ALL ActiveJob logging, expressing every event as key/values and using lograge formatter settings to output. I think we see exactly how we’d do that, with a custom log subscriber as we’ve done above!)

Warning: ApplicationJob configuration won’t work for emails

You might think since we configured retry_on on ApplicationJob, all our bg jobs are now set up for retrying.

Oops! Not deliver_later emails.

The good_job README explains that the ActiveJob class used for deliver_later emails doesn’t descend from ApplicationJob. (I am curious if there’s any good reason for this, it seems like it would be nice if it did!)

The good_job README provides one way to configure the built-in Rails mail delivery job class for retries.
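For instance, something along the lines of what the good_job README describes (the class is ActionMailer::MailDeliveryJob in Rails 6.0+; it was ActionMailer::DeliveryJob before that):

# config/initializers/mailer_retries.rb
Rails.application.config.to_prepare do
  # deliver_later emails don't go through ApplicationJob, so configure
  # retries on the mail delivery job class directly.
  ActionMailer::MailDeliveryJob.retry_on StandardError, wait: 0, attempts: 2
end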

You could maybe also try setting delivery_job on your mailer class(es) to use a custom delivery job (thanks again BigBinary for the pointer)… maybe one that subclasses the default class to deliver emails as normal, but lets you set some custom options like retry_on? Not sure if this would be preferable in any way.

logging URI query params with lograge

The lograge gem for taming Rails logs by default will log the path component of the URI, but leave out the query string/query params.

For instance, perhaps you have a URL to your app /search?q=libraries.

lograge will log something like:

method=GET path=/search format=html

The q=libraries part is completely left out of the log. I kinda want that part, it’s important.

The lograge README provides instructions for “logging request parameters”, by way of the params hash.

I’m going to modify them slightly to:

  • use the more recent custom_payload config instead of custom_options. (I’m not certain why there are both, but I think it’s mostly for legacy reasons, and the newer custom_payload is what you should reach for?)
  • If we just put params in there, then a bunch of ugly <ActionController::Parameters show up in the log if you have nested hash params. We could fix that with params.to_unsafe_h, but…
  • We should really use request.filtered_parameters instead to make sure we’re not logging anything that’s been filtered out with Rails 6 config.filter_parameters. (Thanks /u/ezekg on reddit). This also converts to an ordinary hash that isn’t ActionController::Parameters, taking care of previous bullet point.
  • (It kind of seems like lograge README could use a PR updating it?)
  config.lograge.custom_payload do |controller|
    exceptions = %w(controller action format id)
    # the block should return a hash to be merged into the log payload
    { params: controller.request.filtered_parameters.except(*exceptions) }
  end

That gets us a log line that might look something like this:

method=GET path=/search format=html controller=SearchController action=index status=200 duration=107.66 view=87.32 db=29.00 params={"q"=>"foo"}

OK. The params hash isn’t exactly the same as the query string, it can include things not in the URL query string (like controller and action, that we have to strip above, among others), and it can in some cases omit things that are in the query string. It just depends on your routing and other configuration and logic.

The params hash itself is what default rails logs… but what if we just log the actual URL query string instead? Benefits:

  • it’s easier to search the logs for actually an exact specific known URL (which can get more complicated like /search?q=foo&range%5Byear_facet_isim%5D%5Bbegin%5D=4&source=foo or something). Which is something I sometimes want to do, say I got a URL reported from an error tracking service and now I want to find that exact line in the log.
  • I actually like having the exact actual URL (well, starting from path) in the logs.
  • It’s a lot simpler, we don’t need to filter out controller/action/format/id etc.
  • It’s actually a bit more concise? And part of what I’m dealing with in general using lograge is trying to reduce my bytes of logfile for papertrail!

Drawbacks?

  • if you had some kind of structured log search (I don’t at present, but I guess I could with papertrail features by switching to json format?), it might be easier to do something like “find a /search with q=foo and source=ef” without worrying about other params
  • To the extent that params hash can include things not in the actual url, is that important to log like that?
  • ….?

Curious what other people think… am I crazy for wanting the actual URL in there, not the params hash?

At any rate, it’s pretty easy to do. Note we use filtered_path rather than fullpath to again take account of Rails 6 parameter filtering, and thanks again /u/ezekg:

  config.lograge.custom_payload do |controller|
    {
      path: controller.request.filtered_path
    }
  end

This is actually overwriting the default path to be one that has the query string too:

method=GET path=/search?q=libraries format=html ...

You could of course add a different key fullpath instead, if you wanted to keep path as it is, perhaps for easier collation in some kind of log analyzing system that wants to group things by same path invariant of query string.

I’m gonna try this out!

Meanwhile, on lograge…

As long as we’re talking about lograge…. based on commit history, history of Issues and Pull Requests… the fact that CI isn’t currently running (travis.org grr) and doesn’t even try to test on Rails 6.0+ (although lograge seems to work fine)… one might worry that lograge is currently un/under-maintained…. No comment on a GH issue filed in May asking about project status.

It still seems to be one of the more popular solutions to trying to tame Rails kind of out of control logs. It’s mentioned for instance in docs from papertrail and honeybadger, and many many other blog posts.

What will its future be?

Looking around for other possibilities, I found semantic_logger (rails_semantic_logger). It’s got similar features. It seems to be much more maintained. It’s got a respectable number of github stars, although not nearly as many as lograge, and it’s not featured in blogs and third-party platform docs nearly as much.

It’s also a bit more sophisticated and featureful. For better or worse. For instance mainly I’m thinking of how it tries to improve app performance by moving logging to a background thread. This is neat… and also can lead to a whole new class of bug, mysterious warning, or configuration burden.

For now I’m sticking to the more popular lograge, but I wish it had CI up that was testing with Rails 6.1, at least!

Incidentally, trying to get Rails to log more compactly like both lograge and rails_semantic_logger do… is somewhat more complicated than you might expect, as demonstrated by the code in both projects that does it! semantic_logger especially is hundreds of lines of somewhat baroque code split across several files. A refactor of logging around Rails 5 (I think?) to use ActiveSupport::LogSubscriber made it possible to customize Rails logging like this (although I think both lograge and rails_semantic_logger still do some monkey-patching too!), but in the end didn’t make it all that easy or obvious or future-proof. This may discourage too many other alternatives for the initial primary use case of both lograge and rails_semantic_logger — turn a rails action into one log line, with a structured format.

Notes on Cloudfront in front of Rails Assets on Heroku, with CORS

Heroku really recommends using a CDN in front of your Rails app static assets — because on heroku, unlike in non-heroku setups where a web server like nginx might be taking care of it, static assets will otherwise be served directly by your Rails app, consuming limited/expensive dyno resources.

After evaluating a variety of options (including some heroku add-ons), I decided AWS Cloudfront made the most sense for us — simple enough, cheap, and we are already using other direct AWS services (including S3 and SES).

While heroku has an article on using Cloudfront, which even covers Rails specifically, and even CORS issues specifically, I found it a bit too vague to get me all the way there. And while there are lots of blog posts you can find on this topic, I found many of them outdated (Rails has introduced new API; Cloudfront has also changed its configuration options!), or otherwise spotty/thin.

So while I’m not an expert on this stuff, I’m going to tell you what I was able to discover, and what I did to set up Cloudfront as a CDN in front of Rails static assets running on heroku — although there’s really nothing at all specific to heroku here, if you have any other context where Rails is directly serving assets in production.

First how I set up Rails, then Cloudfront, then some notes and concerns. Btw, you might not need to care about CORS here, but one reason you might is if you are serving any fonts (including font-awesome or other icon fonts!) from Rails static assets.

Rails setup

In config/environments/production.rb

# set heroku config var RAILS_ASSET_HOST to your cloudfront
# hostname, will look like `xxxxxxxx.cloudfront.net`
config.asset_host = ENV['RAILS_ASSET_HOST']

config.public_file_server.headers = {
  # CORS:
  'Access-Control-Allow-Origin' => "*", 
  # tell Cloudfront to cache a long time:
  'Cache-Control' => 'public, max-age=31536000' 
}

Cloudfront Setup

I changed some things from default. The only one that’s absolutely necessary — if you want CORS to work — seemed to be changing Allowed HTTP Methods to include OPTIONS.

Click on “Create Distribution”. All defaults except:

  • Origin Domain Name: your heroku app host like app-name.herokuapp.com
  • Origin protocol policy: Switch to “HTTPS Only”. Seems like a good idea to ensure secure traffic between cloudfront and origin, no?
  • Allowed HTTP Methods: Switch to GET, HEAD, OPTIONS. In my experimentation, necessary for CORS from a browser to work — which AWS docs also suggest.
  • Cached HTTP Methods: Click “OPTIONS” too now that we’re allowing it, I don’t see any reason not to?
  • Compress objects automatically: yes
    • Sprockets is creating .gz versions of all your assets, but they’re going to be completely ignored in a Cloudfront setup either way. ☹️ (Is there a way to tell Sprockets to stop doing it? WHO KNOWS not me, it’s so hard to figure out how to reliably talk to Sprockets). But we can get what it was trying to do by having Cloudfront compress stuff for us, seems like a good idea, Google PageSpeed will like it, etc?
    • I noticed by experimentation that Cloudfront will compress CSS and JS (sometimes with brotli sometimes gz, even with the same browser, don’t know how it decides, don’t care), but is smart enough not to bother trying to compress a .jpg or .png (which already has internal compression).
  • Comment field: If there’s a way to edit it after you create the distribution, I haven’t found it, so pick a good one!

Notes on CORS

AWS docs here and here suggest for CORS support you also need to configure the Cloudfront distribution to forward additional headers — Origin, and possibly Access-Control-Request-Headers and Access-Control-Request-Method. Which you can do by setting up a custom “cache policy”. Or maybe instead by setting the “Origin Request Policy”. Or maybe instead by setting custom cache header settings differently using the Use legacy cache settings option. It got confusing — and none of these settings seemed to be necessary to me for CORS to be working fine, nor could I see any of these settings making any difference in CloudFront behavior or what headers were included in responses.

Maybe they would matter more if I were trying to use a more specific Access-Control-Allow-Origin than just setting it to *? But about that….

If you set Access-Control-Allow-Origin to a single host, MDN docs say you have to also return a Vary: Origin header. Easy enough to add that to your Rails config.public_file_server.headers. But I couldn’t get Cloudfront to forward/return this Vary header with its responses. Trying all manner of cache policy settings, referring to AWS’s quite confusing documentation on the Vary header in Cloudfront and trying to do what it said — I couldn’t get it to happen.

And what if you actually need more than one allowed origin? Per the Access-Control-Allow-Origin spec, as again explained by MDN, you can’t just include more than one in the header, the header is only allowed one: “If the server supports clients from multiple origins, it must return the origin for the specific client making the request.” And you can’t do that with Rails’ static/global config.public_file_server.headers; we’d need to use and set up rack-cors instead, or something else.

So I just said, eh, * is probably just fine. I don’t think it actually involves any security issues for rails static assets to do this? I think it’s probably what everyone else is doing?

The only setup I needed for this to work was setting Cloudfront to allow the OPTIONS HTTP method, and setting Rails config.public_file_server.headers to include 'Access-Control-Allow-Origin' => '*'.

Notes on Cache-Control max-age

A lot of the existing guides don’t have you setting config.public_file_server.headers to include 'Cache-Control' => 'public, max-age=31536000'.

But without this, will Cloudfront actually be caching at all? If with every single request to cloudfront, cloudfront makes a request to the Rails app for the asset and just proxies it — we’re not really getting much of the point of using Cloudfront in the first place, to avoid the traffic to our app!

Well, it turns out yes, Cloudfront will cache anyway. Maybe because of the Cloudfront Default TTL setting? My Default TTL was left at the Cloudfront default, 86400 seconds (one day). So I’d think that maybe Cloudfront would be caching resources for a day when I’m not supplying any Cache-Control or Expires headers?

In my observation, it was actually caching for less than this though. Maybe an hour? (Want to know if it’s caching or not? Look at headers returned by Cloudfront. One easy way to do this? curl -IXGET https://whatever.cloudfront.net/my/asset.jpg, you’ll see a header either x-cache: Miss from cloudfront or x-cache: Hit from cloudfront).

Of course, Cloudfront doesn’t promise to cache for as long as it’s allowed to, it can evict things for its own reasons/policies before then, so maybe that’s all that’s going on.

Still, Rails assets are fingerprinted, so they are cacheable forever, so why not tell Cloudfront that? Maybe more importantly, if Rails isn’t returning a Cache-Control header, then Cloudfront isn’t returning one to actual user-agents either, which means they won’t know they can cache the response in their own caches, and they’ll keep requesting/checking it on every reload too, which is not great for your far too large CSS and JS application files!

So, I think it’s probably a great idea to set the far-future Cache-Control header with config.public_file_server.headers as I’ve done above. We tell Cloudfront it can cache for the max-allowed-by-spec one year, and this also (I checked) gets Cloudfront to forward the header on to user-agents who will also know they can cache.

Note on limiting Cloudfront Distribution to just static assets?

The CloudFront distribution created above will actually proxy/cache our entire Rails app, you could access dynamic actions through it too. That’s not what we intend it for, our app won’t generate any URLs to it that way, but someone could.

Is that a problem?

I don’t know?

Some blog posts try to suggest limiting it to only being willing to proxy/cache static assets instead, but this is actually a pain to do for a couple reasons:

  1. Cloudfront has changed their configuration for “path patterns” since many blog posts were written (unless you are using “legacy cache settings” options), such that I’m confused about how to do it at all, if there’s a way to get a distribution to stop caching/proxying/serving anything but a given path pattern anymore?
  2. Modern Rails with webpacker has static assets at both /assets and /packs, so you’d need two path patterns, making it even more confusing. (Why Rails why? Why aren’t packs just at public/assets/packs so all static assets are still under /assets?)

I just gave up on figuring this out and figured it isn’t really a problem that Cloudfront is willing to proxy/cache/serve things I am not intending for it? Is it? I hope?

Note on Rails asset_path helper and asset_host

You may have realized that Rails has both asset_path and asset_url helpers for linking to an asset. (And similar helpers with dashes instead of underscores in sass, and probably different implementations, via sass-rails)

Normally asset_path returns a relative URL without a host, and asset_url returns a URL with a hostname in it. Since using an external asset_host requires we include the host with all URLs for assets to properly target the CDN… you might think you have to stop using asset_path anywhere and just use asset_url. You would be wrong.

It turns out if config.asset_host is set, asset_path starts including the host too. So everything is fine using asset_path. Not sure if at that point it’s a synonym for asset_url? I think not entirely, because I think in fact once I set config.asset_host, some of my uses of asset_url actually started erroring and failing tests? And I had to actually only use asset_path? In ways I don’t really understand what’s going on and can’t explain it?

Ah, Rails.

ActiveSupport::Cache via ActiveRecord (note to self)

There are a variety of things written to use flexible back-end key/value datastores via the ActiveSupport::Cache API.

For instance, say, activejob-status.

I have sometimes in the past wanted to be able to use such things storing the data in an rdbms, say via ActiveRecord. Make a table for it. Sure, this won’t be nearly as fast or “scalable” as, say, redis, but for so many applications it’s just fine. And I often avoid using a feature at all if it is going to require me to add another service (like another redis instance).

So I’ve considered writing an ActiveSupport::Cache adapter for ActiveRecord, but never really gotten around to it, so I keep avoiding using things I’d be trying out if I had it….

Well, today I discovered the ruby gem that’s a key/value store swiss army knife, moneta. Look, it has an ActiveSupport::Cache adapter so you can use any moneta-supported store as an ActiveSupport::Cache API. AND then if you want to use an rdbms as your moneta-supported store, you can do it through ActiveRecord or Sequel.

Great, I don’t have to write the adapter after all, it’s already been done! Assuming it works out okay, which I haven’t actually checked in practice yet.

Writing this in part as a note-to-self so next time I have an itch that can be scratched this way, I remember moneta is there — to at least explore further.

Not sure where to find the docs, but here’s the source for ActiveRecord moneta adapter. It looks like I can create different caches that use different tables, which is the first thing I thought to ensure.
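If it works as advertised, setup might look something like this (an untested sketch; require paths and option names are per my skim of the moneta docs/source):

require "moneta"
require "active_support/cache/moneta_store"

# An ActiveSupport::Cache store backed by an ActiveRecord table via moneta.
# (I believe the ActiveRecord adapter can create its own table, but check.)
kv_cache = ActiveSupport::Cache::MonetaStore.new(
  store: Moneta.new(:ActiveRecord, table: "moneta_cache")
)

kv_cache.write("some_key", "some value")
kv_cache.read("some_key") # => "some value"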

The second thing I thought to look for — can it handle expiration, and purging expired keys? Unclear, I can’t find it. Maybe I could PR it if needed.

And hey, if for some reason you want an ActiveSupport::Cache backed by PStore or BerkeleyDB (don’t do it!), or Cassandra (you got me, no idea?), moneta has you covered too.