Bibliographic Wilderness

Consider a small donation to rubyland.news?

jrochkind — Mon, 27 Nov 2023 18:09:26 +0000

I started rubyland.news a few years ago because it was a thing I wanted to see for the Ruby community. I had been feeling a shrinking of the ruby open source collaborative community, it felt like the room was emptying out.

If you find value in Rubyland News, just a few dollars contribution on my Github Sponsors page would be so appreciated.

I wanted to make people writing about ruby and what they were doing with it visible to each other and to the community, in order to try to (re)build/preserve/strengthen a self-conception as a community, connect people to each other, provide entry to newcomers, and just make it easier to find ruby news.

I develop and run rubyland.news in my spare time, as a hobby project, all by myself, on custom Rails software. I have never accepted money for editorial placement and never will: the feeds included in rubyland.news are chosen exclusively on my own judgement of what will serve readers and the community well.

Why am I asking for money?

The total cost of Rubyland News, including hosting and the hostname itself, is around $180 a year. Current personal regular monthly donations add up to about $100 a year, from five individual sponsors (thank you!!!!)

I pay for this out of my pocket. I’m doing totally fine, no need to worry about me, but I do work for an academic non-profit, and don’t have the commercial market software engineer income some may assume.

Some donations would also help motivate me to keep putting energy into this, showing me that the project really does have value to the community. If I am funded to exceed my costs, I might also add resources necessary for additional features (like a non-limited DB to keep a searchable history around?)

You can donate one-time or monthly on my Github Sponsors page. The suggested levels are $1 and $5 per month. If I get an increase of $5-$10/month in contributions this year, I will consider it a huge success; it really makes a difference!

If you donate $5/month or more, and would like to be publicly listed/thanked, I am very happy to do so, just let me know!

If you don’t want to donate or can’t spare the cash, but do want to send me an email telling me about your use of rubyland news, I would love that too! I really don’t get much feedback! And I would love to know any features you want or need. (With formerly-known-as-twitter being on the downslide, are there similar services you’d like to see rubyland.news published to?) (jonathan at rubyland.news)

Thanks

Thanks to anyone who donates anything at all
also to anyone who sends me a note to tell me that they value Rubyland News (seriously, I get virtually no feedback — telling me things you’d like to be better/different is seriously appreciated too! Or things you like about how it is now. I do this to serve the community, and appreciate feedback and suggestions!)
To anyone who reads Rubyland News at all
To anyone who blogs about ruby, especially if you have an RSS feed, especially if you are doing it as a hobbyist/community-member for purposes other than business leads!
To my current monthly github sponsors, it means a lot!
To anyone contributing in their own way to any part of open source communities for reasons other than profit, sometimes without much recognition, to help create free culture that isn’t just about exploiting each other!

Beware sinatra, rails 7.1, rack 3, resque bundler dependency resolution

jrochkind — Thu, 09 Nov 2023 17:30:54 +0000

tldr practical advice for google: If you use resque 2.6.0 or less, and Rails 7.1, and are getting an error: cannot load such file -- rack/showexceptions, you probably need to add gem "rack", "~> 2.0" to your Gemfile!

The latest version of the ruby gem sinatra, as I write this, is 3.1.0, and it does not yet support the recently released rack 3. It correctly specifies that in its gemspec, with gem "rack", "~> 2.2", ">= 2.2.4"

[And as of this writing, that is true in sinatra github main branch too, no work has been done to allow rack 3.x]

The new Rails 7.1 does work with and allow Rack 3.x, as well as still working with Rack 2.x, it allows any rack >= 2.2.4 (specifying it will be compatible with a future rack 4.x too, which seems dangerous, for reasons, read on)

There is a version of sinatra that (wrongly) specifies working with rack 3.x: Sinatra 1.0 (Released March 2010!) specifies in its gemspec that it will work with any rack >= 1.0. They quickly corrected that in Sinatra 1.1a to say “~> 1.1”, meaning “1.x greater than or equal to 1.1 only”.
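To make those constraint strings concrete, here is a minimal sketch using the Gem::Requirement and Gem::Version classes that ship with RubyGems. The helper name allows? is made up for illustration; the version numbers are the ones discussed in this post.

```ruby
require "rubygems"

# Hypothetical helper: does a list of requirement strings allow a version?
def allows?(requirement_strings, version)
  Gem::Requirement.new(*requirement_strings)
                  .satisfied_by?(Gem::Version.new(version))
end

# sinatra 1.0's open-ended ">= 1.0" happily accepts rack 3.x
OPEN_ENDED_ALLOWS_RACK_3 = allows?([">= 1.0"], "3.0.8")

# "~> 1.1" means ">= 1.1 and < 2.0"
PESSIMISTIC_ALLOWS_1_6 = allows?(["~> 1.1"], "1.6.0")
PESSIMISTIC_ALLOWS_2_0 = allows?(["~> 1.1"], "2.0.0")

# sinatra 3.1.0's constraint accepts rack 2.2.8 but rejects rack 3.0.8
SINATRA_3_ALLOWS_RACK_2 = allows?(["~> 2.2", ">= 2.2.4"], "2.2.8")
SINATRA_3_ALLOWS_RACK_3 = allows?(["~> 2.2", ">= 2.2.4"], "3.0.8")
```

Only the open-ended form lets a rack major version through that did not exist when the constraint was written.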

But sinatra 1.0 is still there in the repo, as a target for bundler dependency resolution, claiming to work fine with rack 3.x. By the way, sinatra 1.0 is wrong about that, it certainly does not work with rack 3.x. One error you might get from it is cannot load such file -- rack/showexceptions on boot, which is a lot better than a subtler error that only shows up at runtime, for sure!

Do you see where this is going?

I am in the process of updating my app to Rails 7.1. I didn’t even know my app had a sinatra dependency… but it turns out it did: my app uses resque latest version 2.6.0, which has a dependency on sinatra >= 0.9.2

So okay, poor bundler has to take this dependency tree and create a resolution for it. Rails 7.1 allows rack 2 or 3; resque 2.6.0 allows any sinatra at all; sinatra 1.0 allows any rack, but sinatra 3.1.0 only allows rack 2.x.

There are two possible resolutions that satisfy those restrictions (really more than two if you can use any old version of a dependency), but the one bundler picked was:

rack 3.0.8
sinatra 1.0

Which then failed CI because sinatra 1.0 doesn’t really work with rack 3.x.

The other possible resolution would have been rack 2.2.8 and sinatra 3.1.0.

That’s the one I actually want.

To help it along I just need to add gem "rack", "~> 2.0" to my Gemfile. This was a bit confusing to debug!
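The situation can be sketched as a toy brute-force check over the handful of versions named above. This is not how bundler's resolver actually works, and the candidate lists are deliberately tiny; the constant and method names are made up for illustration.

```ruby
require "rubygems"

# Constraints as described in this post.
RAILS_ON_RACK     = Gem::Requirement.new(">= 2.2.4")  # actionpack 7.1
RESQUE_ON_SINATRA = Gem::Requirement.new(">= 0.9.2")  # resque 2.6.0
SINATRA_ON_RACK   = {
  "1.0"   => Gem::Requirement.new(">= 1.0"),
  "3.1.0" => Gem::Requirement.new("~> 2.2", ">= 2.2.4"),
}.freeze
RACK_CANDIDATES = %w[2.2.8 3.0.8].freeze

# Returns every [rack, sinatra] pair that satisfies all constraints,
# plus an optional extra rack requirement (what a Gemfile line would add).
def valid_resolutions(extra_rack_requirement = Gem::Requirement.default)
  RACK_CANDIDATES.product(SINATRA_ON_RACK.keys).select do |rack, sinatra|
    rack_version = Gem::Version.new(rack)
    RAILS_ON_RACK.satisfied_by?(rack_version) &&
      extra_rack_requirement.satisfied_by?(rack_version) &&
      RESQUE_ON_SINATRA.satisfied_by?(Gem::Version.new(sinatra)) &&
      SINATRA_ON_RACK.fetch(sinatra).satisfied_by?(rack_version)
  end
end

UNPINNED = valid_resolutions
PINNED   = valid_resolutions(Gem::Requirement.new("~> 2.0"))
```

Without the pin, the pair rack 3.0.8 with sinatra 1.0 counts as valid on paper. With the "~> 2.0" requirement on rack, every remaining pair uses rack 2.2.8, and a resolver that prefers newer versions can land on sinatra 3.1.0.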

What is the problem? The danger of open-ended gem dependencies

So the problem here is sinatra 1.0 (over ten years ago!) claiming it supported any rack version no matter how high! It should have said ~> 1.0, meaning “1.x, but not 2”. How could it possibly predict it would work with rack 2, or 3, or 4, or 9.0?

If sinatra 1.0 had put an upper bound on the version of rack it would work with, bundler would have done the ‘correct’ (to us humans) resolution out of the box, because the ‘wrong’ one it did would not have been available as satisfying all restrictions. Doing an open-ended spec like this leaves a bomb that can get someone more than a decade later, as it did here.

And Rails is still doing that! actionpack 7.1.x says it works with any rack >= 2.2.4 — it ought to add in a < 4 there, it knows it works with rack 2.x and 3.x, but how can it predict it works with rack 5.x or 6.x, which don’t exist at all yet? It’s leaving the same bomb for bundler dependency resolution in the future that sinatra 1.0 did, and there’s no real way to fix it once the versions are out there.
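As a sketch of what a bounded declaration looks like, here is a made-up gemspec (the gem name, version, and author are all hypothetical) that adds the "< 4" upper bound suggested above, built with the standard Gem::Specification API.

```ruby
require "rubygems"

# Hypothetical gem, for illustration only.
EXAMPLE_SPEC = Gem::Specification.new do |s|
  s.name    = "example-rack-thing"
  s.version = "0.1.0"
  s.summary = "Illustrates a bounded rack dependency"
  s.authors = ["Example Author"]

  # Tested against rack 2.x and 3.x, so say exactly that.
  s.add_dependency "rack", ">= 2.2.4", "< 4"
end

RACK_REQUIREMENT =
  EXAMPLE_SPEC.dependencies.find { |dep| dep.name == "rack" }.requirement
```

With that bound, a future rack 4.0 would not satisfy RACK_REQUIREMENT, so a resolver could never pair this release with it.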

Alternately, if sinatra released a version that did support rack 3, and said so, bundler would preferentially choose that version, with rack 3, and we wouldn’t have a problem. (Bundler’s dependency resolution is actually really amazing, it’s amazing how often it makes the “right” choice among many possible versions that would satisfy all dependency restrictions) I’m not sure how much maintenance energy sinatra is getting, but eventually it’s going to have to get there or there’s going to be a conflict with something that has sinatra in its dependency tree and also has something that requires rack 3 in its dependency tree.

And more immediately… resque says it works with any sinatra >= 0.9.2 (released in 2009)…. but does it really? Who knows. Releasing a resque that says it needs, oh, sinatra >= 2.0 (released 2017) might help bundler come to a more satisfying dependency resolution… or could just result in bundler deciding to use an old version of resque so it can use an old version of sinatra which says (incorrectly!) it supports rack 3…. hard to predict. But maybe I’ll PR resque. But resque is also not exactly overflowing with maintenance applied to it these days…

Eventually I just need to switch away from resque. I have my eye on good_job.

S3 CORS headers proxied by CloudFront require HEAD not just GET?

jrochkind — Mon, 09 Oct 2023 17:07:21 +0000

I’m not totally sure what happened, but the tldr is that at the end of last week, our video.js-played HLS videos served from an S3 bucket (via CloudFront) appear to have started requiring us to list “HEAD” in the “AllowedMethods” for CORS configuration, in addition to the pre-existing “GET”.

I’m curious if anyone else has any insight into what’s going on there… I have some vague guesses at the end, but still don’t really have a handle on it.

Our setup: HLS video from S3 buckets

We use the open-source video.js to display some video, in the HLS format. Which involves linking to a .m3u8 manifest file, which is the first file the user-agent will request.

When implementing, we discovered that if the .m3u8 and other HLS files are on a different domain than the web page running the JS, you need the server hosting the HLS files to supply CORS headers. Makes sense, reasonable.

Our HLS files are on a public S3 bucket. We also have a simple CloudFront distribution in front of the public S3 bucket.

We set this CORS policy on the S3 bucket, probably one I just found/copy/pasted at some point. (CORS policies on S3 are now set, I think, only in JSON form; in the past they could be XML and you can find XML examples too). (warning, may not be sufficient)

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 43200
    }
]

And for a long time, that just worked. The S3 bucket responded with proper CORS headers for video.js to work. The CloudFront distribution appropriately cached/forwarded the response with those headers. (note * as allowed origin, so the cache is not origin-specific, which should be fine then!)

Last week it broke? How I investigated

Some time around Wednesday Oct 4-Thursday Oct 5th, our browser video display started breaking. In a very hard to reproduce way.

Some viewers got the error from video.js it gives when it can’t fetch the video source (for instance, a network failure might give you this same error message):

“Media could not be loaded, either because the server or network failed or because the format is not supported.”

(and, by the way, this error could happen on new videos at new urls that didn’t exist 24 hours previous…)

Once a developer managed to reproduce this, looking in Dev Tools console in the browser, we could see a CORS error reported:

Access to XMLHttpRequest at ‘[url]’ has been blocked by CORS policy: No ‘Access-Control-Allow-Origin’ header is present on the requested resource.

It took me a bit to figure out how to investigate whether CORS headers were being returned appropriately or not. It turns out that S3, at least, only returns the CORS headers when an Origin header is present in the request, and it matches the CORS rules (the second condition, in this case, should be universal, as our allowed origin is *). Maybe this is how CORS always works?

So we could investigate like so, using verbose mode to see headers from a GET request:

curl -H "Origin: https://our-example.org" -v "https://some-s3-or-cloudfront/etc"

Doing this, I discovered that for some people a CloudFront request as above would return CORS headers (we’re looking for eg Access-Control-Allow-Origin: * in the response!), and other times it wouldn’t! CloudFront headers include a x-amz-cf-pop header, which reminded me: right, there are different CloudFront POPs different people could be connecting to… okay, so some CloudFront POPs are returning the CORS headers and others not? Which kind of violates my model of CloudFront, I thought POPs would be synchronized to always return the same content, but who knows.
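The same probe can be scripted with Ruby's standard Net::HTTP, which makes the request method explicit. This is a minimal sketch: the method names build_cors_probe and cors_probe are made up, and any URL or origin you pass in would be your own.

```ruby
require "net/http"
require "uri"

# Build a GET or HEAD request carrying an Origin header (does not send it).
def build_cors_probe(url, origin:, http_method: :get)
  request_class = http_method == :head ? Net::HTTP::Head : Net::HTTP::Get
  request = request_class.new(URI(url))
  request["Origin"] = origin
  request
end

# Send the probe and report the headers of interest from the response.
def cors_probe(url, origin:, http_method: :get)
  uri = URI(url)
  request = build_cors_probe(url, origin: origin, http_method: http_method)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    response = http.request(request)
    {
      status: response.code,
      allow_origin: response["access-control-allow-origin"],
      cloudfront_pop: response["x-amz-cf-pop"],
    }
  end
end
```

Calling cors_probe once with http_method: :get and once with :head against the same URL shows side by side whether the CORS header comes back for each method, and which POP answered.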

But okay then, was the S3 original source returning CORS headers?

Well, to make matters more confusing, I made a mistake which ultimately led me to the solution too. Instead of doing curl -v, I had originally been doing curl -I, which I had come to think of as “just show me the response headers not body”, but of course it is actually a synonym for --head and tells curl to do an HTTP HEAD method request.

And I configured S3 to allow only GET method, so, no, when I did a HEAD request to the direct S3 source, no CORS headers were included, duh. If I did it with GET they were.
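That behaviour can be modelled in a few lines. This is a deliberately simplified sketch of CORS rule matching, not S3's actual implementation: it ignores wildcard patterns inside origins, request headers, and preflight handling, and the method name cors_headers_for is made up.

```ruby
require "json"

# The GET-only policy from above, condensed.
GET_ONLY_POLICY = JSON.parse(<<~POLICY)
  [{"AllowedHeaders": ["*"], "AllowedMethods": ["GET"],
    "AllowedOrigins": ["*"], "ExposeHeaders": [], "MaxAgeSeconds": 43200}]
POLICY

# Simplified model: CORS headers are only produced when the request has an
# Origin header and some rule allows both that origin and the HTTP method.
def cors_headers_for(policy, origin:, http_method:)
  return {} if origin.nil?

  rule = policy.find do |candidate|
    candidate["AllowedMethods"].include?(http_method) &&
      candidate["AllowedOrigins"].any? { |allowed| allowed == "*" || allowed == origin }
  end
  return {} unless rule

  allow = rule["AllowedOrigins"].include?("*") ? "*" : origin
  { "Access-Control-Allow-Origin" => allow }
end
```

Under this model a GET with an Origin header gets the header back, while a HEAD with the same Origin gets nothing, which matches what curl -I showed.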

I actually didn’t totally realize what was going on at first (really forgot that -I was a HEAD request to curl, not a GET where it only showed me response headers!)…. but something about this experience, and while googling seeing an occasional S3 CORS example that included HEAD as well as GET in allowed methods…

Led me to try just adding HEAD to my AllowedMethods… So now this is my public S3 bucket’s CORS policy:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET",
            "HEAD"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 43200
    }
]

And… this seemed to fix things? Along with clearing the CloudFront cache though, to make sure it wasn’t serving bad response headers from cache, so that could have played a role too.

At this point I really don’t understand what was going on, why/how I fixed it more or less accidentally, how I got lucky enough to fix it… or honestly if I even really did fix it?

What is going on anyway?

We have had this system in place for over a year, with no changes I know about — no changes to S3 or CloudFront configuration, or to video.js version. What changed?

I feel like the symptoms probably mean that CloudFront is sometimes doing a HEAD request to S3 for these files, and caching the response headers, and then using those cached response headers from a HEAD request on a GET request response… but why would it do that? And again why would it start encountering this situation now after a long time working fine?

At first I wondered, wait… we’ve had this setup for about a year… and we tell CloudFront to cache these responses (with content-unique URLs) for the HTTP max cache age of a year…. has our content just started to exceed its year max-age… so now CloudFront is maybe doing some conditional HEAD requests to S3 to see if its cache is still good (it is, Etag unchanged)… and for some reason it uses the CORS headers it gets back from there to update its cached headers, while still using its original cached body?

That seemed maybe plausible (if unclear whether it was a defensible thing for CloudFront to do), but then I remembered — no, we are seeing this problem too with new content and URLs that have only existed for less than 24 hours, so it can’t be a case of year-old content that CloudFront has been caching for a year.

I’m pretty mystified. Why this started breaking now after working for months, with no known changes. Has something on S3’s end changed with how it executes CORS policies to produce CORS headers? Or something on CloudFront changed with how it forwards/caches them? Or something in browsers or video.js changed with regard to exactly what requests are made? (is the browser now making HEAD requests for this content, and requiring CORS headers on response, in places it didn’t before? But that doesn’t explain why CloudFront POPs were giving me unexpectedly inconsistent results to GET requests, sometimes including CORS headers in response sometimes not!)

AND I don’t really understand why I have to include HEAD in my S3 CORS policy at all — I hadn’t been expecting to need to authorize HEAD requests via CORS, I expected video.js would be doing GET requests, and that’s all I’d need to authorize.

So I seem to have fixed the problem… but I never like it when I don’t understand my own solution. Have I really fixed it, or will it just come back?

Googling I can not find anything that seems relevant to this at all. Should anyone using CloudFront in front of a public S3 bucket, where responses need origin * CORS headers — always include HEAD as well as GET in AllowedMethods? Is this really such a weird situation? Why can’t I find anyone talking about it? Is it for some reason special to HLS video?

So anyway, I blog this. Hoping that someone else running into a mysterious problem will find this post, when I could find nothing! And hoping for the even slimmer chance that someone will see this who thinks they know exactly what was going on and can explain it!

Investigating OCR and Text PDFs from Digital Collections

jrochkind — Tue, 18 Jul 2023 18:02:28 +0000

At the Science History Institute Digital Collections, we have a fairly small collection compared to some peers (~70,000 images) of historical materials. Many of those images are of text: Books, pamphlets, advertisements, memos, etc.

We haven’t previously done any OCR (Optical Character Recognition), but started thinking about doing that. In addition to using captured text for the site-wide search, it made sense to us to look to providing:

- “Search inside the book/work”, with results highlighted on in-browser page images, such as provided by this example from Internet Archive using their own viewer, or this example from NCSU using UniversalViewer (search for say “corn”, note how results are highlighted on the page image)
- Downloadable PDFs with a “text layer”: select-copyable and searchable text in the PDF, as in this corresponding example from Internet Archive.

Both of those use cases require not just text out of OCR, but text with position information so it can be overlaid on the page image.

We decided to investigate starting with the PDF-with-text-layer as the first product, because I (naively, I think!) believed this would be straightforward to do, and because we have user research indications that some portion of our users really love PDFs (which I think would be common among at least any user groups of academics, probably others too).

I had to do a lot of research to understand the technologies, techniques, and tools that are out there in this domain. So I capture my findings here in a giant blog post, partially to capture my own notes, and ideally to help give someone else a head start. (This recorded conference presentation from Merlijn Wajer at Internet Archive is also a good overview of the technical ecosystem! Merlijn and the IA are very central figures in what open source work is going on in this area!)

I was a bit bewildered to notice that few of our peers seemed to offer PDF downloads (especially with a text layer), as I’m pretty confident our collective users would want this. I then discovered the tooling is somewhat limited, and used to be worse. I’m not sure and am curious why our sector/field/industry hasn’t invested more in software development to create better tooling here!

tldr Summary of findings and plan

So after some initial research, I had discovered that several peers used the hOCR format to represent OCR-with-position output, to power their in-browser online “highlight a search result on a scanned page” interfaces.
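To give a feel for what hOCR carries, here is a minimal sketch that pulls words and their bounding boxes out of a small hand-written hOCR fragment. The sample markup imitates the general shape of tesseract's output (an ocrx_word span whose title attribute holds a bbox) but is invented for illustration, and the regex is only good enough for this sample; real code would want an HTML parser.

```ruby
# Invented sample in the general shape of hOCR word markup.
SAMPLE_HOCR = <<~HTML
  <span class='ocr_line' title='bbox 36 92 580 122'>
    <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Science</span>
    <span class='ocrx_word' id='word_1_2' title='bbox 109 92 217 116; x_wconf 93'>History</span>
  </span>
HTML

# Captures x0, y0, x1, y1 and the word text from each ocrx_word span.
WORD_PATTERN =
  /<span class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)[^']*'>([^<]*)<\/span>/

# Returns an array of { text:, bbox: [x0, y0, x1, y1] } hashes.
def hocr_words(hocr)
  hocr.scan(WORD_PATTERN).map do |x0, y0, x1, y1, text|
    { text: text, bbox: [x0, y0, x1, y1].map(&:to_i) }
  end
end

# Returns the bounding boxes of words matching a query, case-insensitively.
def highlight_boxes(hocr, query)
  hocr_words(hocr).select { |word| word[:text].casecmp?(query) }
                  .map { |word| word[:bbox] }
end
```

Those pixel boxes are what a viewer would use to draw a highlight over the page image for a search hit.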

I somewhat naively figured I could use my choice of OCR engines (the open source tesseract seemed most popular; or maybe Amazon Textract?) to produce hOCR, which I’d do on image ingest.

And then I imagined (and thought I saw confirmed based on a bit of googling), I’d have my choice of tools to combine the hOCR with raster images into PDFs. Ideally I could use a fast compiled-C tool that could easily be installed via apt on ubuntu.

It turns out that was over-optimistic. hOCR isn’t quite as widely inter-operably standard as hoped. Tools for rendering a PDF based on hOCR are in fact very limited (and mostly python). There is a field of abandoned and not-super-robust partial solutions.

In fact, I only found one piece of software that could do this job well, the Internet Archive’s archive-pdf-tools. And even it does not currently seem to do as good a job of positioning text in the PDF as tesseract itself does (although it intends to port tesseract’s logic). It’s also a bit difficult to install, and may not be installable on MacOS. It’s not super widely-used software.

Later I also discovered a clever kind of compression meant for this kind of textual PDF, called Mixed Raster Content (MRC). When this works, it can really reduce the file-size of otherwise enormous PDFs of hundreds of pages of page images. And in fact there is only one working open-source implementation of this as far as I can tell, again the Internet Archive’s archive-pdf-tools. (There are implementations in commercial software, I have not evaluated them, more later).

The rest of this post will be some lengthy notes going into my findings about PDF technology; and evaluation of all the software I could find and easily evaluate that would render hOCR to a PDF or be otherwise useful; and tips and tricks and difficulties in using that software.

But in the end, I identified basically only two realistic paths to get to PDFs with text layer from my scanned TIFFs.

1. Have tesseract output a “text-only” PDF (a tiny PDF that includes only an invisible “text layer”), and then use another tool (such as qpdf) to combine it with raster images.
- tesseract just does a better job of laying out text in the PDFs it outputs than any other tool I could find (and there aren’t many). Although archive-pdf-tools intends to match tesseract, with a port of tesseract’s logic even, it’s not currently doing so.
- Optionally, you could take the output of this, and try to run it through archive-pdf-tools’ experimental pdfcomp tool, to apply the MRC compression to a PDF it did not itself create. (I haven’t yet figured out how to access/run this experimental tool)
- If we aren’t doing MRC compression, look into using JP2 instead of JPEG: it turns out PDF supports JP2, and it may be a smaller file for the same quality.
- I’m still going to need the hOCR for future online-search applications, so I’ll be having tesseract output both hOCR and the text-layer PDF, one way or another, and storing them both.
- This does not give us the opportunity to manually correct hOCR: if we wanted to correct PDFs (perhaps for accessibility), it would have to be directly on the PDF, and would not apply corrections to other uses of the hOCR.
- Details on this approach below in the tesseract section.
2. Have tesseract output hOCR (probably at time of ingest), and then use archive-pdf-tools’ recode_pdf tool to assemble the hOCR with raster images into PDFs.
- At present, we won’t get quite as good text positioning as with tesseract.
- It’s a bit harder to get installed (and may not be installable on our dev box). That would also apply if we tried to use pdfcomp for compression in path 1, but in path 1 it’s an optional add-on; here we’d have to get it solved from the go.
- But the text positioning is better than anything else I found but tesseract, and we’ll get good MRC compression out of it too.
- This would give us an opportunity to apply corrections to hOCR (perhaps for accessibility remediation) and (re-)generate PDFs accordingly, if we had a workflow and tooling for that.
- recode_pdf starts with TIFFs, so the PDF-generation process is going to be a bit slower and more resource-intensive.
- Details of this approach below in the archive-pdf-tools section.
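As a rough sketch of the first path, these helpers only build the command lines rather than running anything. They assume tesseract's textonly_pdf config variable with its pdf and hocr output configs, and qpdf's --overlay option. The file names are placeholders, the method names are made up, and the sketch assumes an image-only PDF of the same page size has already been produced some other way.

```ruby
# Builds a tesseract command that writes both OUTPUT_BASE.hocr and a
# text-only OUTPUT_BASE.pdf for one page image.
def tesseract_command(image_path, output_base, languages: ["eng"])
  [
    "tesseract", image_path, output_base,
    "-l", languages.join("+"),   # e.g. "eng+deu"
    "-c", "textonly_pdf=1",      # invisible text only, no image in the PDF
    "hocr", "pdf"
  ]
end

# Builds a qpdf command that lays the text-only PDF over an image-only PDF.
# Assumes both PDFs already have matching page dimensions.
def overlay_command(image_pdf, text_pdf, output_pdf)
  ["qpdf", image_pdf, "--overlay", text_pdf, "--", output_pdf]
end
```

Each array can be handed to system(*command), which avoids shell quoting problems with odd file names.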

With either of those paths, it might be convenient to generate single-image single-page PDFs at file ingestion time, and then combine them into an aggregated multi-page PDF on demand. This makes it somewhat easier to deal with the fact that our app allows staff to add/remove or publish/unpublish individual pages on demand, which would invalidate generated PDFs. This approach would wind up with duplicated copies of an embedded font, but tesseract’s embedded “glyphless” font is only ~600 bytes, less than 1% of the likely sizes of outputs.
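A minimal sketch of that on-demand aggregation, again only building the command: it assumes qpdf's --empty --pages form for concatenating files, and the Page struct and its published flag are hypothetical stand-ins for whatever the app actually stores.

```ruby
# Hypothetical record of one per-page PDF and its publication state.
Page = Struct.new(:pdf_path, :published)

# Builds a qpdf command that concatenates only the published pages, in order.
def merge_command(pages, output_pdf)
  inputs = pages.select(&:published).map(&:pdf_path)
  raise ArgumentError, "no published pages to merge" if inputs.empty?

  ["qpdf", "--empty", "--pages", *inputs, "--", output_pdf]
end
```

Because the per-page PDFs are never modified, unpublishing a page just means rebuilding the aggregate from a different list.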

Anyway, these are pretty much the only options I came up with after much investigation of software that didn’t quite cut the mustard. It turns out going from hOCR to positioned text in a PDF is non-trivial; different tools do it differently, and some not as well as others. Other open source software investigated (there isn’t a lot!):

- ocropus/hocr-tools: a python package including a hocr-pdf tool for rendering PDF from hOCR. It didn’t do a great job positioning the hOCR text, and was unable to position not-quite-horizontal lines diagonally, which tesseract and archive-pdf-tools can.
- eloops/hocr2pdf: a Javascript package that was meant by its original author as just a proof-of-concept experiment and hadn’t been touched in a while; it did not do a good job of positioning.
- Exactimage hocr2pdf: At first appeared to be the compiled C hocr=>pdf tool I imagined existed. But it seems to be old unmaintained software, and I could not get it to work with contemporary tesseract hOCR.
- pdfbeads: Ancient ruby software that can hypothetically do hOCR positioning and a MRC-like compression. I could not get its MRC to work for me; its hOCR positioning was inferior to archive-pdf-tools’; and it’s weird zombie software with unclear mainline source repository.

You’ve now gotten the important bits of this post summarized. In the remainder of this post, we have more musings on state of the field, context of technologies available, and notes on individual software packages reviewed — it’s a LOT of stuff. I am not certain how useful others may find these notes on what I have discovered!

Other options? Commercial options? State of the market?

I just couldn’t find many tools for eg hOCR rendering — although there may be more in the .Net world. There are some relevant commercial offerings here, that deal with OCR and PDF generation. They are often Windows-only, and often GUI software meant for someone to be operating as part of a scanning workflow. I think the market may be “corporate document management”. Some (or maybe just one?) of them claim to do MRC compression. Some of them have cloud “SDKs”. (as far as actual local SDKs, the market seems to be only for .Net).

I got the feeling that there was a lot of collaborative open-source energy on these techniques, for purposes of “ebooks” and “scanned books” 10-15 years ago (around the time of Google Books introduction?), but that it sort of petered out. This does not seem to be something our library and archive institutions have invested in. Thanks to the Internet Archive for being the main player working in this field and releasing open source tools! (Here is a video from the Internet Archive’s Merlijn Wajer that explains their procedures and how in 2020 they moved to an open source stack here. It also serves as a great overview of the technologies and tools discussed in this blog post.)

With few open source options, I would be potentially willing to pay for an appropriate tool at the right price. But the publicly-available documentation and general “developer experience” of commercial tools tends to be even worse than open source, it’s very difficult to even figure out if it’s going to work for you. I have a few notes on commercial tools in the “MRC Compression” section below, but mostly I have not spent the time to understand the market.

OCR: Tesseract is the open source option

Optical Character Recognition, or OCR, is the process of taking an image, and extracting the text from it as text.

As far as I can tell, Tesseract is the only current widely used (or at all?) open source OCR option.

There were other packages at one time popular, but for instance I don’t believe “Cuneiform” is currently being maintained or getting much use. (Wikipedia says last cuneiform release was in 2011, so).

Tesseract is currently at version 5.x (5.0 released Nov 2021) — but Ubuntu 22 apt repo still only has the latest 4.x release. And when I tried asking library field peers, it seems most are currently still using latest 4.x release. Tesseract 4.0 actually introduced “a new neural network-based recognition engine” (although it still supports using models with the old engine too, I think), so earlier than 4.x would be a really different product, but 4.x already has you on the new engine.

Tesseract works with human-language-based models, so you have to tell it which languages you expect in a document (you can tell it more than one). It has official support for a lot of human languages (including some historical early-modern ones). It does not, as far as I know, have official support for handwriting (rather than type-set) recognition.

It is also possible to train your own models for tesseract, and some people may be sharing non-official trained models for certain kinds of materials. I am not certain if I’ve seen any such that use the new “neural network-based recognition engine”, and at any rate I haven’t spent any time investigating this area.

On ubuntu, you can install tesseract with apt-get install tesseract-ocr, and install individual language packs with eg apt-get install tesseract-ocr-deu (you need to look up the appropriate tesseract language code). On MacOS, you can install tesseract with brew install tesseract, and install all supported language packs at once with brew install tesseract-lang.

For officially supported language packs, there are “FAST” and “BEST” model variants available. The distribution packages above will install the “FAST” packages. The “FAST” packages are smaller on disk and intended to result in much faster operation, with only slightly decreased accuracy. If you want to install and use the “BEST” packages instead… I am not sure how, and have not spent time with them or comparing.

Other OCR options? Commercial? AWS Textract?

I looked briefly at AWS Textract. It only handles six major European languages. BUT it claims to be able to recognize hand-writing? We def have hand-written items in the collection, would be big if it worked well.

It has all sorts of fancy tools for recognizing structured text on various types of business documents (invoices, business cards) that are mostly not of concern to us. It does not produce hOCR, but does produce its own JSON format that maybe could be converted to hOCR, although a converter isn’t included in this project I found of other hOCR conversions.

If I understand the pricing properly, at $15/1000 pages it’s quite expensive. We estimated the price of CPU time on heroku using tesseract to be 100x less.
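For a sense of scale, here is the back-of-envelope arithmetic, assuming (purely for illustration) that every one of the ~70,000 images mentioned earlier were OCR'd once at the quoted rate. Not all of them are text, so this is an upper bound, and the method name is made up.

```ruby
# Cost in US dollars of OCR'ing a number of pages at a per-1000-page rate.
def textract_cost_usd(pages, usd_per_thousand: 15)
  pages * usd_per_thousand / 1000.0
end

TEXTRACT_TOTAL_USD  = textract_cost_usd(70_000)     # whole collection
TESSERACT_TOTAL_USD = TEXTRACT_TOTAL_USD / 100      # "100x less" estimate
```

That works out to roughly a thousand dollars versus roughly ten for a single pass over the collection.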

Perhaps we’d investigate in the future to expand our processes to OCR’ing handwriting. But first the lower-hanging fruit.

There are other commercial OCR solutions, including Google Cloud Vision, and lots and lots of Windows-based “document management workflow” solutions, that I haven’t really even looked at.

Note on Accessibility and OCR

Automatically OCR’d text does not necessarily produce an “accessible” copy, say for people with vision impairments. While current OCR results from eg tesseract are surprisingly good, and provide a good product for “searchability”, they still include too many errors to be simply read as a primary text, as you can see if you look at the text alone.

Additionally, in PDF form, I am told that for accessibility for many purposes the text really needs to be “tagged” in a way that simple OCR will not produce.

It is almost certain that we do not have the resources to produce this level of accessibility for the tens of thousands of page images in our corpus. While adding machine-generated OCR may increase accessibility somewhat for some purposes, contexts, and users, it definitely is going to leave a lot of people out, people who have vision impairments among others.

Another possible intervention is that we could provide a clear function for users to request accessible/remediated PDF (or other) copies on a per-work basis. It’s still not totally clear how we’d best provide that, whether we’d do remediation in-house (and using what tools and workflows), or perhaps send PDFs to vendors to produce accessible copies (this is not cheap, it looks like maybe $1/page or more for PDFs with accessible tagging, although I’m not certain).

In my imagination, a well-engineered process for remediating OCR might involve producing hOCR, then correcting the hOCR that is then used to (re-)generate PDFs. This way the corrections would also apply to other uses of the (h)OCR such as indexing for collection-wide search in Solr, or for search-inside-the-page with highlighting features offered via a web browser.

However, the tooling for this seems to be pretty limited, and this kind of workflow does not in fact seem to be common. hocrjs is a possibly still-maintained tool for viewing hOCR in a browser. It could be a building block for a GUI for reviewing/fixing hOCR (which maybe Internet Archive has for their own use, see this video?). Here is a more full-featured proof-of-concept for actually editing/correcting hOCR. Alternately, hocr-proofreader supports some kinds of review and editing of hOCR; while the notes suggest it’s not “finished”, it is a very impressive proof-of-concept. Check out the demo!

Of course, even if you corrected typos in the hOCR, that wouldn’t necessarily give you enough for the accessible “tagging” in the PDF. (Is there even a format that can capture OCR-with-position and all the semantics necessary to produce PDF tagging too? I don’t think it’s hOCR. The state of the ecosystem is underwhelming here).

A more realistic approach for the existing eco-system might be remediating a PDF as PDF (either sending to a vendor, or in-house with tools like Acrobat Pro or Abbyy FineReader) — and then extracting the (corrected) text from it as hOCR, to put into our system for other uses. The Internet Archive archive-hocr-tools project has a script that can extract a text layer from PDF to hocr, although it’s not mentioned in project readme (I might PR this), I’m not sure how I found it!

Some PDF tech details

What is a “PDF with text-layer” anyway?

PDF’s don’t actually have “layers” or “text layers”. But this is shorthand for a PDF that includes actual computer-readable text in addition, in these cases, to a “raster” (pixel-based) image of a photograph of a physical text.

The PDF text isn’t in a “layer”, it’s just individual pieces of text positioned in the PDF. PDF actually has a “rendering mode” (constant 3) for non-displayed text. (See this StackOverflow for some discussion).

In the kind of PDFs we’re talking about there are non-displayed text objects positioned in the same place/size as the words in the picture, so you can select (to copy and paste) text, and it looks like you are selecting the image itself. And you can “search within the document” in a PDF viewer, and it will highlight your results, looking like it’s highlighting the image itself.
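To make the “rendering mode 3” idea concrete, here is a minimal Python sketch of the content-stream operators for one piece of non-displayed text. The font resource name F1 and the coordinates are made up for illustration, and real producers such as tesseract do more than this, so treat it as a sketch of the idea rather than what any particular tool writes.

```python
def pdf_escape(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text: str, x: float, y: float, font_size: float,
                       font_name: str = "F1") -> str:
    """Return content-stream operators that place `text` at (x, y), in PDF
    units, without painting it. `font_name` is a hypothetical resource name."""
    return "\n".join([
        "BT",                               # begin a text object
        "3 Tr",                             # rendering mode 3: neither fill nor stroke
        f"/{font_name} {font_size:g} Tf",   # select the font resource and size
        f"1 0 0 1 {x:g} {y:g} Tm",          # text matrix: move to (x, y)
        f"({pdf_escape(text)}) Tj",         # "show" the string (nothing is drawn)
        "ET",                               # end the text object
    ])


print(invisible_text_ops("Hello (world)", 72, 700, 12))
```

A viewer still treats that string as selectable and searchable text at that position, it just never paints any glyphs for it.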

Even though the text is not displayed, it needs to be associated with a font. There are fonts that are “built-in” to PDF, but they can only display characters in traditional “Latin-1” character sets. Displaying text in this pre-Unicode-ascendance format is a bit tricky if you are trying to do it yourself with raw bytes. Fonts in a PDF can be embedded in the PDF itself (and typically are for this sort of thing) to make sure the text can be displayed (or possibly even interpreted at all?) on a machine without the chosen font installed.

The text that isn’t even going to be displayed can get by with just a bare-bones stub of a font, a “glyphless” font, since the glyphs never actually have to display; the text just needs to be encoded as machine-readable text. Tesseract, for instance, seems to use its own TrueType “glyphless font” that weighs only 572 bytes. It has in the past sometimes had to be tweaked; almost anything you want to do with a PDF ends up being non-trivial to do reliably.

HOCR and Alto: Formats to Represent OCR data with positions

You could do an OCR operation and just get text out. But if you want to overlay the text on top of the scanned image for select-copy-paste or search-result-highlighting, you need position information too.

Are there standard interchangeable formats that encode this information? Yes…. sort of.

The most popular one seems to be hOCR. It literally is an HTML document, with <p> for paragraphs and <div>s and <span>s, that embeds positional and other information in title attributes. (Flashback to “HTML microformats” for anyone else? Nevermind).
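To show what “positional information in title attributes” looks like in practice, here is a small Python sketch that pulls words and their bounding boxes out of an hOCR fragment using only the standard library. The sample markup is hand-written in the style tesseract produces (ocr_line, ocrx_word, bbox, x_wconf); it is not output from the test images, and the parser is deliberately naive.

```python
from html.parser import HTMLParser

# Hand-written sample in the style of tesseract hOCR (not real output).
SAMPLE_HOCR = """
<div class='ocr_page' title='image "page.tiff"; bbox 0 0 3680 5684'>
  <p class='ocr_par' title='bbox 300 400 1200 460'>
    <span class='ocr_line' title='bbox 300 400 1200 460; baseline 0.002 -8'>
      <span class='ocrx_word' title='bbox 300 400 520 460; x_wconf 96'>Hello</span>
      <span class='ocrx_word' title='bbox 560 402 800 460; x_wconf 91'>world</span>
    </span>
  </p>
</div>
"""


def parse_title(title):
    """Split an hOCR title attribute into {property_name: [values...]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


class WordCollector(HTMLParser):
    """Collect the text and bbox (x0, y0, x1, y1 in pixels) of each ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            bbox = parse_title(attrs.get("title") or "").get("bbox", [])
            self._current = {"bbox": tuple(int(v) for v in bbox), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


collector = WordCollector()
collector.feed(SAMPLE_HOCR)
for word in collector.words:
    print(word["text"], word["bbox"])
```

Everything a renderer needs to place a word over the image is in that bbox, expressed in pixels of the image that was OCR’d.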

When I asked around among colleagues to see what they were using to power online on-scanned-page search-highlighting, the answer was hOCR. tesseract can output hocr. There were several tools I found that could take hOCR as input.

The thing is… it’s unclear how well hOCR actually serves as a mutually-intelligible interchange format. Going back to 2016, there has been some concern that hOCR allows too much variation and hOCR from different producers may not truly be mutually intelligible. I think some of the tools I found that take “hOCR” as input may really only work with tesseract hOCR, and maybe even only certain versions of tesseract.

At the moment, there seem to be very few pieces of currently-maintained software that produce hOCR directly. (tesseract and… maybe there’s another open source package called Kraken? And a couple other barely- or non-maintained little-used open source packages).

As far as I can tell, most proprietary/commercial solutions cannot read or write hOCR; they mostly use their own proprietary XML formats, if anything. Hypothetically you could translate from and to hOCR, and for some formats there are tools that claim to do so. Github cneud/ocr-conversion is a repository of scripts to convert between various OCR-position formats; it contains scripts to convert FROM several vendor formats (including Abbyy) to hOCR, but not usually the reverse.

There is another similar format, endorsed by the Library of Congress, called ALTO, which some think is technically superior, but it doesn’t seem to be supported by very many (any?) tools. (Tesseract can output ALTO, although it isn’t very well documented).

The end result is that this field isn’t quite as standards-based inter-operable as I had hoped/assumed.

MRC compression

So, raster (pixel) images are big, especially when you have hundreds of them. In our current application, we’re making PDFs out of 1200 pixel JPGs (made at default JPG compression level). The PDF for one particular 700-page book is 325MB. That’s a big file.

You could reduce the resolution or image quality. But 1200 pixels is already only ~150dpi for an 8.5″/11″ page, and increasing JPG compression may introduce noticeable artifacts in some images, although we could experiment with this more. (If you do want to reduce byte size, do you get better perceived quality for the reduced size with less resolution or more JPG compression? I suspect keeping the resolution but increasing compression is the way to go, but I’m not sure).
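The “~150dpi” figure is just pixels divided by inches, and the per-page weight is just the file size divided by the page count. A quick sketch of that arithmetic in Python, assuming the 1200 pixels are the page width and using the 325MB/700-page book mentioned above:

```python
def effective_dpi(pixels_wide: int, inches_wide: float) -> float:
    """Dots per inch if an image `pixels_wide` across covers `inches_wide`."""
    return pixels_wide / inches_wide


def megabytes_per_page(total_mb: float, pages: int) -> float:
    """Average size of one page in a multi-page PDF."""
    return total_mb / pages


print(round(effective_dpi(1200, 8.5)))        # 141, i.e. a bit under 150dpi
print(round(megabytes_per_page(325, 700), 2))  # 0.46 MB per page
```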

However, it turns out someone (maybe these guys in 1998?) invented a very clever way to apply higher compression with less loss of perceived quality, specifically for the kinds of images likely in scanned books or scanned text. Called “Mixed Raster Content” (MRC), or “hyper-compression”, it involves separating the page “background” (which can be highly compressed) from any embedded graphics and text (which can’t be compressed as much without noticeable problems, especially the text), storing them as separate images with separate levels of compression and/or resolutions, then combining them back together with a “mask”, in a way that PDF technology supports.

More information on MRC can be found on Wikipedia, this vendor’s marketing page, this other vendor’s marketing page, or the Internet Archive’s archive-pdf-tools README. Merlijn from Internet Archive also explains it in a conference presentation.

My sense is that MRC compression is more of a technique than an exact algorithm. Different implementations may do it differently, and have output that can be more or less successful. There can be bugs or areas for improvement, that can differ between tools. The different layers can be split purely by automated image processing, but also can use the (eg) hOCR file to identify regions with text that need higher fidelity than backgrounds.
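As a toy illustration of the mask/foreground/background idea only, here is a pure-Python sketch on a tiny grayscale “page” stored as nested lists. It is not how archive-pdf-tools or any vendor does it: real implementations pick the mask far more cleverly (possibly using hOCR regions), keep the foreground as an image rather than a single value, and compress each layer with a suitable codec. The threshold and downsample factor here are arbitrary assumptions.

```python
def split_layers(image, threshold=128, factor=2):
    """Split a grayscale image (list of rows, 0=black..255=white) into a
    1-bit mask, a single foreground value, and a downsampled background."""
    height, width = len(image), len(image[0])
    # Mask: 1 where the pixel is dark enough to count as text/line art.
    mask = [[1 if px < threshold else 0 for px in row] for row in image]

    dark = [px for row, mrow in zip(image, mask) for px, m in zip(row, mrow) if m]
    light = [px for row, mrow in zip(image, mask) for px, m in zip(row, mrow) if not m]
    foreground = sum(dark) // len(dark) if dark else 0
    fill = sum(light) // len(light) if light else 255

    # Background: paint over the text pixels, then block-average to downsample.
    painted = [[fill if m else px for px, m in zip(row, mrow)]
               for row, mrow in zip(image, mask)]
    background = []
    for y in range(0, height, factor):
        out_row = []
        for x in range(0, width, factor):
            block = [painted[yy][xx]
                     for yy in range(y, min(y + factor, height))
                     for xx in range(x, min(x + factor, width))]
            out_row.append(sum(block) // len(block))
        background.append(out_row)
    return mask, foreground, background


def recombine(mask, foreground, background, factor=2):
    """Rebuild a full-size image: foreground where the mask is set,
    upsampled background everywhere else."""
    return [[foreground if m else background[y // factor][x // factor]
             for x, m in enumerate(mrow)]
            for y, mrow in enumerate(mask)]


page = [
    [200, 200, 200, 200],
    [200,  10,  10, 200],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]
mask, fg, bg = split_layers(page)
print(mask)
print(fg, bg)
print(recombine(mask, fg, bg) == page)
```

The point is that the full-resolution part (the mask) is only one bit per pixel, while the part that carries color can be stored at a fraction of the resolution without blurring the text edges.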

I believe the Internet Archive’s archive-pdf-tools is the only functional open source implementation of MRC encoding.

One commercial tool that may do some kind of MRC compression is the suite of tools known as “GDPicture” (the company behind that has merged with competitor Orpalis making things even more confusing). They do advertise supporting MRC compression. I had a brief phone call with a sales engineer, who wasn’t super familiar with this feature but confirmed they had it, and gave me an overview of the products in general. There is a page at avepdf.com that is “powered by GDPicture MRC Compression SDK” that will let you apply MRC compression to existing PDFs for free… but only a couple an hour, so I haven’t managed to totally wrap my head around it. Hypothetically, then, the PassportPDF cloud SDK from the same company would give me access to “GDPicture MRC Compression” — but I haven’t yet managed to figure out how. (But see if you can at the API reference?). Figuring out what is available from proprietary projects can sometimes seem even more challenging than from open source.

The market-leader Abbyy also says they support MRC, including via an SDK? One of the first or most popular commercial tools to apply this technique may have been called “LuraTech”; I’m not sure of the current status of that software.

Evaluating Internet Archive recode_pdf, compared to alternatives

When I ran internet archive’s recode_pdf (with arg --bg-downsample 3 and otherwise default arguments) on full-resolution TIFFs, it resulted in PDFs that were about 6% the size of a PDF I made from a full-res JPG! Or about 50% the size of the PDFs we make from 1200px JPGs — still a significant reduction. Looking at them side-by-side… in one of my samples the MRC-compressed PDFs did have some visible artifacts, but text is still sharp. In two other cases, no visible artifacts.

I tried to test the free trial at avepdf.com, but the extreme rate limit and cumbersome manual browser process made it hard to test a lot. I tested with PDFs that included lossless full-res PNG images, to avoid any lossy=>lossy quality issues. My initial reaction is that the text seems noticeably less sharp in the avepdf MRC-compressed PDF, even at “low” compression level, but if you zoom in, the text seems to get sharp again, which I don’t understand. My impression of image quality is of course subjective, and it’s hard to compare. avepdf MRC compression at “low” or “medium” compression seems to be approximately the same size as my recode_pdf output.

If we end up not using MRC, then our 1200px JPG PDFs would be maybe ~2x the size of the recode_pdf full-resolution MRC PDFs. I learned from Merlijn’s presentation that PDF actually supports embedded JPEG2000 (jp2) instead of JPEG (which their MRC technique uses), and that jp2 may compress smaller for the same quality. Switching to jp2 instead of jpeg and playing around with maximum compression without artifacts across my sample size… I can get my 1200px JPG PDFs to be about the same size as the recode_pdf full-res MRC compressed PDFs, although of course at reduced resolution.

note on dpi and PDF page size and variation

PDF as a format is based on a 72 dots-per-inch (dpi) standard grid, with objects sometimes measured in actual inches. (It was a format meant for encoding things to be printed physically!).

You can embed an image of a given resolution, say 500×500 pixels in a PDF, but say it should take up however many “inches” you want, and it will be scaled on display. And the page size can be a given number of “inches” high and wide, which will determine how big it displays on a screen in most viewers.

The TIFF format also has a dpi value embedded in it, which sort of says how big in inches the TIFF (or the thing photographed) was. Some of the tools I tested detected the dpi from the source TIFF, and used it to determine the PDF page size. Others did not, and used a default or guessed size.

Many tools allow you to pass a dpi argument that they will use to determine the “page size” in the resulting PDF. In my understanding this should not affect actual image resolution, or much of anything other than initial zoom level or size of page if printed. If it does with a given tool, I don’t understand what is going on.
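The relationship between pixels, dpi, and PDF page size described here can be sketched in a few lines of Python. The function name and the example dpi values are illustrative assumptions; the only fixed fact is that PDF units are 1/72 of an inch.

```python
PDF_UNITS_PER_INCH = 72


def page_size_in_pdf_units(width_px: int, height_px: int, dpi: float):
    """PDF page size (in 1/72 inch units) for an image of the given pixel
    dimensions, if each inch of page is declared to hold `dpi` pixels."""
    return (width_px * PDF_UNITS_PER_INCH / dpi,
            height_px * PDF_UNITS_PER_INCH / dpi)


# The same 1200x1600 pixel image at two declared dpi values: the pixels are
# untouched, only the nominal page size changes.
print(page_size_in_pdf_units(1200, 1600, 150))  # (576.0, 768.0) -> 8 x 10.67 in
print(page_size_in_pdf_units(1200, 1600, 300))  # (288.0, 384.0) -> 4 x 5.33 in
```

So a lower declared dpi gives a physically larger page for the same image, which is why the dpi argument shows up as initial zoom level or printed size rather than as image quality.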

In my tests, I generally did not supply an explicit dpi value, to have one less knob to twiddle. So resulting PDF page sizes can vary.

Available Software to make text PDF from hOCR+images

Source Test Material and Methodology

To try out different tools and techniques, I started with three somewhat representative images from our collection.

A fairly ordinary page of single-column clear text from a book
A page where the photo has text more at an angle and contains figures and several text blocks
A graphical advert that has text headlines and blocks in several places and sizes

Note on embedded thumbnails: Our original TIFFs in our actual repository often have an embedded second image, a tiny ~100px thumbnail. (did you know TIFFs can contain more than one image file?). This is something software involved in some of our photographing workflow at some points in history did without us totally knowing. It can really mess up various image tools, including some included in these tests (I had some really confusing errors at some points, thanks to @MerlijnWajer for helping me out.). So the first thing I did was extract just the first image with vips copy original.tiff just_one_image.tiff (verified with imagemagick identify, which will tell you how many images are in there). (This may also have stripped some metadata, but preserved DPI metadata)
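If you want to check for extra embedded images without any image library, a classic TIFF stores its images as a chain of “image file directories” (IFDs), and you can simply walk the chain. This is a minimal Python sketch under some assumptions: it handles classic TIFF only (not BigTIFF), it counts only the top-level chain (images stored as sub-IFDs would not be counted), and the function name is made up for illustration.

```python
import struct


def count_tiff_images(data: bytes) -> int:
    """Count the top-level IFDs (images) in the bytes of a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")

    count = 0
    seen = set()
    while offset != 0:
        if offset in seen:
            raise ValueError("IFD chain loops back on itself")
        seen.add(offset)
        # Each IFD: 2-byte entry count, 12 bytes per entry, 4-byte next offset.
        (num_entries,) = struct.unpack_from(endian + "H", data, offset)
        (offset,) = struct.unpack_from(endian + "I", data, offset + 2 + 12 * num_entries)
        count += 1
    return count
```

Usage would be something like count_tiff_images(open("original.tiff", "rb").read()); anything above 1 means there is more than one image in the file.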

1	3680×5684	60M	(STANDARD) dhc6a4r.tiff	(source)
2	3260×5185	48M	(DIAG) wg8ie02.tiff	(source)
3	4330×5760	71M	(ADVERT) 2y60cl2.tiff	(source)

Tesseract — can create PDF with text layer directly

So you can ask tesseract to do the OCR and output PDF with text layers.

You have very little control of the raster image in the output: tesseract will convert your TIFF to a JPG (no control over JPG compression level), of the same resolution as the TIFF you used as input. This results in a pretty large PDF file, for our one-page samples 3.5M-5M per page, which is a lot when we consider we will want PDFs for books hundreds of pages long.

You want to give tesseract the full-resolution TIFF for best OCR, but maybe want to use smaller files in the PDF. Or maybe you want to manually correct the OCR output before making a PDF?

One obvious option is having Tesseract generate an HOCR file with OCR-positional info, and using another tool to combine the HOCR with a raster image into a PDF. But, it turns out, no other tools I found actually render the tesseract-produced HOCR with text positioned as well as Tesseract itself does.

It made me wish tesseract had an option to take the HOCR (that you have perhaps edited), and combine it with images (of your choice of resolution and compression quality) to make a PDF, using tesseract’s superior HOCR-rendering. It turns out, things along those lines have been suggested, but rejected by tesseract developers who don’t want to get into the general business of creating PDFs.

Instead, they introduced a feature to create a “text-only” PDF — a very small PDF that actually only has the invisible glyphless text layer. The idea, as shown in that ticket, is that you can then use external tools to merge that with images or a PDF with raster images, to create the actual PDF with your choice of raster image and text layer.

I did get this to work pretty well, with qpdf as my merge utility. I merged the tesseract (invisible) text-only PDF with my “legacy” PDF of 1200px-wide JPGs, using these commands:

$ tesseract source.tiff source.tesseract_text_only -l eng -c textonly_pdf=1 pdf

$ qpdf image_only.pdf --underlay source.tesseract_text_only.pdf -- output_image_plus_text.pdf

One caveat — PDF pages have an inherent page size (usually expressed in inches, believe it or not). If the two PDFs you are merging are exactly the same size, that’s fine. If the text-only PDF is bigger than the image one (in PDF inches), that’s fine — that qpdf command will scale it down to match, and the output is just right. But if the text-only PDF is smaller, that qpdf command will just embed it in the middle, and the embedded invisible text won’t be properly aligned with the visible text on raster image.

You can supply dpi arguments to most PDF-creating utilities (including tesseract and recode_pdf below), which basically just affect the PDF inches size set on the resulting PDF. So you just want to make sure to do this to ensure the text-only PDF is at least as large in PDF-inches as the image PDF.

This works, but doesn’t accommodate the use case where we might want to correct errors in the OCR by editing an HOCR file, before producing the PDF. I haven’t found any way to take advantage of tesseract’s superior layout of OCRd text in the PDF, while correcting the OCR content before the PDF is produced. You can of course edit the PDF directly, but this is cumbersome, and doesn’t get you a corrected HOCR file you can use for other purposes too.
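If you were scripting this, the two commands above could be wrapped roughly as follows. This is only a sketch: the function names are invented, the optional dpi is passed with tesseract’s --dpi option, and the page sizes for the sanity check are assumed to be supplied by the caller (in inches), since reading them out of a PDF would need a PDF library.

```python
import subprocess


def text_only_pdf_cmd(tiff, out_base, lang="eng", dpi=None):
    """argv for tesseract producing <out_base>.pdf with only invisible text."""
    cmd = ["tesseract", tiff, out_base, "-l", lang]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]  # lower dpi -> larger page in PDF inches
    return cmd + ["-c", "textonly_pdf=1", "pdf"]


def underlay_cmd(image_pdf, text_pdf, out_pdf):
    """argv for qpdf placing the text-only PDF under the image PDF."""
    return ["qpdf", image_pdf, "--underlay", text_pdf, "--", out_pdf]


def text_page_covers_image(text_size_in, image_size_in):
    """True if the text-only page is at least as large as the image page in
    both dimensions, the case where the merge lines up as described above."""
    return (text_size_in[0] >= image_size_in[0]
            and text_size_in[1] >= image_size_in[1])


def add_text_layer(tiff, image_pdf, out_pdf, scratch_base, dpi=None):
    """Run both steps; requires tesseract and qpdf to be installed."""
    subprocess.run(text_only_pdf_cmd(tiff, scratch_base, dpi=dpi), check=True)
    subprocess.run(underlay_cmd(image_pdf, scratch_base + ".pdf", out_pdf), check=True)
```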

I’m probably going to need HOCR anyway for other purposes. You can have tesseract produce PDF and HOCR in one go if you want. (Btw it turns out tesseract can also produce alto although I’m not sure where this is documented).

tesseract dhc6a4r.tiff scratch/test.tesseract -l eng -c textonly_pdf=1 pdf hocr

Beware that if you produce individual tesseract PDFs with text content and try to combine them… you’ll wind up with duplicate copies of tesseract’s “glyphless font” embedded, one per each source PDF. I haven’t found a good way to merge/de-duplicate them, but I think the embedded glyphless font is only 572 bytes?

Other tools can take the hocr tesseract produces, and use it to position a text layer on a PDF… with mixed results. None currently do as well as tesseract’s own PDF positioning. It turns out going from tesseract hOCR to correctly positioned text on PDF is not a trivial operation?

archive-pdf-tools: recode_pdf — a sophisticated, and supported, tool

The Internet Archive’s archive-pdf-tools is a currently maintained, well-written package in python, extracted from their own workflow and shared. It grew out of an effort at the Archive, begun in 2020, to move to an open source pipeline.

The recode_pdf command takes a TIFF and HOCR, and renders a PDF with text layer, and compressed with the sophisticated MRC compression. It may be the only open-source implementation of MRC compression technique.

It has quite a few non-python dependencies. The installation directions specified for ubuntu worked well for me on ubuntu. One C dependency, jbig2enc, does not exist in the standard Ubuntu package manager. It built from source fine for me on ubuntu, but that gives me some challenges for trying to get it installed on heroku. jbig2enc also has a non-standard-location apt package and a snap, as well as a brew package (I think from former Code4Libber Misty De Meo?). jbig2enc appears mostly unmaintained (although it does have an occasional trivial new PR merged, so it’s not totally abandoned); but there also appear to be a variety of forks out there with different bugfixes/patches, so I’m not sure all those sources are actually the same code!

I am having a bit of trouble installing archive-pdf-tools reliably on MacOS, but that may be corrected soon or may be my own fault.

recode_pdf’s rendering of text from the hOCR file delivered by tesseract — is currently not as good rendering as tesseract itself does when making PDFs for my samples. I describe my observations in this issue filed at archive-pdf-tools.

In fact, archive-pdf-tools’s HOCR rendering is ported from tesseract (and writes PDFs directly with raw bytes, not using a PDF library of any kind). So why isn’t its rendering/positioning as good? Not yet clear.

This inferior HOCR rendering is unfortunate, because this is otherwise for sure the most mature/supported open-source HOCR rendering solution I found, which does do a better job of positioning than any other open source code I found. It’s also the only working open-source MRC compression implementation I found.

It was interesting to see the MRC compression. The output PDFs, which have as many pixels as our full-size source images (but under increased lossy compression), for the most part really do look just as good as much larger bytesize PDFs, while being very small on disk. (There are compression artifacts in some samples though). The archive-pdf-tools MRC-compressed PDFs are about 10% of the size of tesseract’s PDFs created with full-size JPGs. For our two high-text images they were about 50% of the size of our 1200px wide JPG PDFs; for the graphical image with less text, it was about 80% the size of our 1200px-wide JPG PDF.

As this is the only open-source implementation known for MRC compression, it would be nice to be able to apply it de-coupled from the HOCR rendering. There has been some discussion and work on creating a pdfcomp executable for this, but it seems to still be ongoing. I have not managed to figure out how to test it myself yet. (It’s not clear to me if you are going to have quality problems giving it PDF input that is already JPG lossy compressed, or if this won’t matter in the end).

recode_pdf --bg-downsample 3 --from-imagestack source.tiff --hocr-file source.tesseract.hocr -o output.recode_pdf.pdf

While I was running only on one page at a time, I believe if you are running on multiple pages, recode_pdf wants a single HOCR file, with multiple pages, in the right order to correspond to the order of TIFF input arguments.

(Note, it turns out you CAN use recode_pdf without jbig2enc, by telling it to use a different inferior compression algorithm, with --mask-compression ccitt. For my three samples, this resulted in 13-25% larger file output. In the case of the mostly graphical one only, it made the PDF output larger than my 1200px JPG output.)

ocropus/hocr-tools: hocr-pdf (python) — has some problems

The hocr-tools package in python includes an hocr-pdf command that is intended to combine HOCR and JPGs to produce PDFs with text layers.

I installed hocr-tools 1.3.0 on my MacOS laptop with simple pip install hocr-tools.

The way hocr-pdf takes its input is a bit confusing: you need to run it on a directory which includes only source files, where a corresponding JPG and HOCR have the same name but for suffix. (JPEG must end in .jpg not .jpeg!)

hocr-pdf ./directory > output.pdf

The apache-licensed source code creates a PDF using a python PDF generation library, which is different from some code (such as archive-pdf-tools) that writes raw PDF bytes. So it may be a good place to look to understand the/an algorithm, possibly for porting to another language, if you want to use a PDF library rather than write raw PDF bytes. I considered this at one point; I’m not sure if (eg) ruby’s prawn has analogous features to everything being used from the python PDF library, and I’m not sure how hard it would be.

It did not like it when I tried using with JPG with different smaller resolution than the TIFF the HOCR was created from — it produced wrongly scaled output. There are some tools/scripts available to resize HOCR coordinates (javascript, ruby), that I believe would be what you’d need to do this.
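The core of such a resize is small: multiply every bbox coordinate by the ratio between the new and old image widths. Here is a deliberately naive Python sketch of that idea. It is not a port of the javascript or ruby scripts mentioned above; it rewrites anything in the file that looks like a bbox, and it leaves other pixel-based properties (such as baseline offsets or x_size) untouched, which a real tool would also need to handle.

```python
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def scale_hocr_bboxes(hocr: str, factor: float) -> str:
    """Return the hOCR text with every 'bbox x0 y0 x1 y1' scaled by `factor`
    and rounded to whole pixels."""
    def scaled(match):
        values = [round(int(v) * factor) for v in match.groups()]
        return "bbox " + " ".join(str(v) for v in values)

    return BBOX_RE.sub(scaled, hocr)


# Example: hOCR made from a 3680px-wide TIFF, image shrunk to 1200px wide.
line = "<span class='ocrx_word' title='bbox 300 400 520 460; x_wconf 96'>Hello</span>"
print(scale_hocr_bboxes(line, 1200 / 3680))
```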

To begin with as a demonstration, though, I just used it with a full-size JPG converted from the source TIFF at same resolution.

I did not get great results: the page sizes were weird. For the standard and graphical pages, the image was cut off, not entirely in the PDF (it seems to insist on 8.5″/11″ aspect ratio/page size?). For the intermediate “diagonal” page, the page just took up a portion of the canvas; it was too small. The text still did line up with the image, but it seems like perhaps we are not meeting some assumptions about DPI, or there are other bugs in how the tool calculates PDF page size. I have not yet spent time to report these problems on Github Issues, because other problems encountered probably make this tool unsuitable for me anyway.

In all cases, the HOCR rendering is… OK. I would say it is about the same quality as archive-pdf-tools’s, although it is not identical to archive-pdf-tools, even from the same HOCR file. Apparently HOCR positioning is non-trivial.

On the “diagonal” page, hocr-pdf didn’t make the lines too high like recode_pdf — but it seemed incapable of including angled lines at all, the lines are rendered straight, which makes them not match up with the actual image text. (Try selecting the line “The liquor…” at the bottom). This seems to make it pretty unsuitable for use with our actual input corpus.
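For context on what rendering an angled line involves: hOCR lines can carry a baseline property, which tesseract emits on ocr_line elements as a slope and an offset relative to the line’s bbox. A renderer that wants to follow the tilt has to turn that slope into a rotation. The sketch below only shows that one step, in Python, with an invented function name; it makes no claim about what hocr-pdf or recode_pdf do internally, and it assumes the simple two-number (linear) form of baseline.

```python
import math


def line_angle_degrees(title: str) -> float:
    """Angle implied by the slope in an hOCR 'baseline <slope> <offset>'
    property, measured in image coordinates (y grows downward).
    Returns 0.0 when the title has no usable baseline."""
    for chunk in title.split(";"):
        parts = chunk.split()
        if len(parts) >= 3 and parts[0] == "baseline":
            slope = float(parts[1])
            return math.degrees(math.atan(slope))
    return 0.0


# Hand-written example titles, not taken from the test images.
print(line_angle_degrees("bbox 300 400 1200 460; baseline 0.002 -8"))
print(line_angle_degrees("bbox 300 400 1200 700; baseline -0.25 -12"))
```

A renderer that ignores this value and always draws text horizontally would produce exactly the kind of mismatch described above on a tilted page.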

`hocr-pdf` also strangely bloated the size of resulting PDFs. Creating a PDF from a JPG that was 3.3M, the resulting PDF was 4.1M! (Compare to tesseract-produced PDF of 3.5M, which makes sense, adding just 200K for textual info). And the PDFs it created generated lots of warning-complaints from poppler and other pdftools.

eloops/hocr2pdf (js) — proof of concept without great rendering

When looking for any open source HOCR rendering code I could find, I found this package on github. At the time I found it, it hadn’t had a commit in many years, and from the commit history and repo activity it was unclear if it had ever really been used at scale, and it didn’t have a license on it. At that early point, if it was working code in Javascript (which uses a PDF-generating library instead of writing raw PDF bytes), I was potentially interested in porting its logic to ruby.

I got the author’s email address from the commit history, and emailed them to inquire. Stephen Poole kindly got back to me to confirm this was basically a proof-of-concept that was never used for real work. Stephen kindly added an MIT license in case I wanted to use it.

Curious, I wanted to test it on my test images and hocr. I was able to get it to work after realizing it needed an old version of the cheerio HTML parser, and fixing up the example in the README.

It didn’t do a great job of rendering. Trying to highlight-select lines, it was often impossible to select a line continuously, perhaps because the words on the line ended up with very different heights and baseline positions. It was not able to render the diagonal text diagonally in the diagonal example. (Try selecting “This effect, especially as regards purples” in the diagonal file to see both issues).

An interesting example, mostly demonstrating that positioning rendered HOCR even as well as archive-pdf-tools does is not necessarily trivial.

Exactimage hocr2pdf — didn’t work for me at all

At first I imagined I was going to find a compiled executable available through package managers that simply combined hocr and images to make a PDF, as if this were a normal thing.

At first that’s what it looked like the ExactImage hocr2pdf tool was. Available via “brew install exactimage” or “apt-get install exactimage”.

The problem is… it didn’t work for me.

At first I had trouble getting it to take my inputs at all, it said “Error reading input file.” If I opened the TIFF in MacOS preview and re-exported as a TIFF again, I could get it to read it.

But it produced weird PDFs with no scanned images at all, and just a portion of the HOCR text rendered visibly in giant font.

It is an old package that doesn’t seem to be getting maintenance; the docs suggest it was written for use with HOCR from the (also non-maintained) cuneiform OCR package. Either I don’t understand how to use it, or HOCR has changed enough over time/between vendors that it can’t handle contemporary tesseract HOCR.

hocr2pdf -i scan.tiff -o test.pdf < ocr.hocr

pdfbeads (ruby) — a historical artifact, of unclear current utility

Researching this stuff, I found mention of this mythical project “pdfbeads”, which was written, in ruby, over 10 years ago, and appeared to be targeted at creating “ebook” PDFs from scans. There was a lot of energy in this domain back then, and at one point this was a well-known package with implementations of some things not found elsewhere.

It did/does both HOCR rendering and a kind of compression that seems to be similar to MRC, if not being MRC, although it’s not referred to as such in pdfbeads code or docs.

I am not certain when it was first written, because it was originally in a “rubyforge” repo, and rubyforge is gone, along with its commit history and discussions that were there, which is sad. Some “forks” of pdfbeads exist on github, but none of them copied history from the original rubyforge (svn?) repo. Some claim to do things like “update for ruby 2.0”. For instance d235j/pdfbeads (which has a version number of 1.1.1), and ifad/pdfbeads (which has a version number of 1.0.11).

OK, the weird thing is… pdfbeads got a rubygems release 1.1.3 in Jan 2022, only a year ago, the first rubygems release since 2014. I have no idea if the repo this release came from is public, or really where to find the code for this release (other than in the gem package); rubygem metadata for “homepage” still points to rubyforge!

But a CHANGELOG file is captured in the rubygem package, which rubygems.org conveniently shows us in a diff, so we can see what features have been added/changed in the latest release.

The READMEs found in all those locations do have an email address for the pdfbeads author, Alexey Kryukov. I tried emailing him for info (and if there is a public repo), but haven’t heard back.

I was initially interested in pdfbeads because I thought it might have a useful ruby implementation of HOCR rendering (writing direct raw PDF bytes, it looks like), and because I thought it might be the only other identified open source implementation of MRC-style compression!

pdfbeads input methods are kind of confusing — not sure if it wants an HOCR file per image, or one combined one like archive-pdf-tools. I tested it on just one image/hocr at a time. Input files can’t have more than one . (period) in them, which had me stuck for a bit. It will leave a lot of intermediate files around, so is best run in a scratch or per-work directory. Using latest 1.1.3 release from rubygems.

 pdfbeads dhc6a4r.tiff dhc6a4r.hocr -f -o dhc6a4r.pdfbeads.pdf

Whatever compression it’s supposed to be doing isn’t working at all for me. That output a 15M PDF, which is 5x the size of the PDF tesseract outputs from the same TIFF input! So… negative compression for me? Extracting images from the produced PDF shows it was making multiple image overlays MRC-style (and that it decided to downsample the pixel resolution from the source TIFF, by different factors for different images, maybe depending on DPI), but I guess its algorithm just didn’t work well with my input? Maybe it expects black and white input only?

There is probably something I don’t understand about how it is intended to be used. I have found it hard to find instructions/documentation (here’s an HTML doc page at some historical version?), and hard for me to understand.

The HOCR rendering was okay on some pages, but had some serious problems on others. On our image test #2, with “diagonal” text, the diagonal angle of the rendered lines was correct, but they were wrongly vertically offset from their true positions by about half a line? And our #3 graphical image, the line beginning “for home users” was just plain missing, although other lines were positioned well?

Overall, I’m not sure what’s going on with this code.

Ocrmypdf — a high-level tool for adding OCR to PDFs, usually with tesseract

For completeness, I thought I’d mention Ocrmypdf, because it is something that’s actually still maintained/developed (which seems to be unusual in this field!), with a lot of functionality.

It seems focused on the use case of having PDFs of scans, say from a photocopy machine, and wanting to have a “just works” tool that takes that as input and leaves you with a text layer. It’s sort of a high-level integration of lots of other tools to try to give you this one-click solution. It itself is written in python.

While it does have its own python implementation of HOCR rendering/positioning, by default it uses a “sandwich” mode to have tesseract position the OCR’d text, and does not use its own HOCR renderer. It does say its own HOCR renderer “has the best compatibility with Mozilla’s PDF.js viewer”, but also warns it doesn’t currently handle non-Latin Unicode properly.

It does not do MRC compression, but is interested in it, and began talking to Merlijn from Internet Archive about it, which led to the archive-pdf-tools attempts at the pdfcomp tool.

I didn’t spend too much time actually investigating this, when I saw that it by default just used tesseract for text rendering, and didn’t implement MRC. I haven’t actually tested its built-in HOCR rendering; I only just noticed now that OcrMyPDF docs suggest you might want to use it for “better compatibility with Mozilla’s PDF.js viewer”?

jbrinley/HocrConverter (python) — one more

I didn’t take the time to actually play with this one yet, but for completeness — former Code4Libber Jon Brinley has some 13-year-old python code at https://github.com/jbrinley/HocrConverter/blob/master/HocrConverter.py, which also links to a blog post of his at https://xplus3.net/2009/04/02/convert-hocr-to-pdf/

A copyright notice in OcrMyPDF source for Jon Brinley suggests maybe their implementation came first from here? Maybe.

10-15 years ago, people were doing a lot of work in this area that just kind of… stalled out?

OCFL and “source of truth” — two options

jrochkind — Tue, 21 Mar 2023 18:15:35 +0000

One great thing about conferences is how different sessions can play off each other, and how lots of people interested in the same thing are in the same place (virtual or real) at the same time, to bounce ideas off each other.

I found both of those things coming into play to help elucidate what I think is an important issue in how software might use the Oxford Common File Layout (OCFL). This was prompted by the Code4Lib 2023 session The Oxford Common File Layout – Understanding the specification, institutional use cases and implementations, with presentations by Tom Wrobel, Stefano Cossu, and Arran Griffith (recorded video here).

OCFL is a specification for laying files out in a disk-like storage system, in a way that is suitable for long-term preservation. It has a standard simple layout that is both human- and machine-readable, and that would allow someone (some software) at a future point to reconstruct digital objects and metadata from the record left on disk.

The role of OCFL in a software system: Two choices

After the conference presentation, Matt Lincoln from JStor Labs asked a question in Slack chat that had been rising up in my mind too, but which Matt said more clearly than was in my mind at the time! This prompted a discussion on Slack, largely but not entirely between me and Stefano Cossu, which I found to be very productive, and which I’m going to detail here with my own additional glosses, but first let’s start with Matt’s question.

(I will insert slack links to quotes in this piece; you probably can’t see the sources unless you are a member of the Code4Lib workspace).

For the OCFL talk, I’m still unclear what the relationship is/can/will be in these systems between the database supporting the application layer, and the filesystem with all the OCFL-laid-out objects. Does DB act as a source of truth and OCFL as a copy? OCFL as source of truth and DB as cache? No db at all, and just r/w directly to OCFL? If I’m a content manager and edit an item’s metadata in the app’s web interface, does that request get passed to a DB and THEN to OCFL? Is the web app reading/writing directly to the OCFL filesystem without mediating DB representation? Something else?
Matt Lincoln

I think Matt, utilizing the helpful term “source of truth”, accurately identifies two categories of use of OCFL in a software system — and in fact, that different people in the OCFL community — even different presenters in this single OCFL conference presentation — had been taking different paths, and maybe assuming that everyone else was on the same page as them, or at least not frequently drawing out the difference and consequences of these two paths.

Stefano Cossu, one of the presenters from the OCFL talk at Code4Lib, described it this way in a Slack response:

IMHO OCFL can either act as a source from which you derive metadata, or a final destination for preservation derived from a management or access system, that you don’t want to touch until disaster hits. It all depends on how your ideal information flow is. I believe Fedora is tied to OCFL which is its source of truth, upon which you can build indices and access services, but it doesn’t necessarily need to be that way.
Stefano Cossu

It turns out that both paths are challenging in different ways; there is no magic bullet. I think this is a foundational question for the software engineering of systems that use OCFL for preservation, with significant implications on the practice of digital preservation as a whole.

First, let’s say a little bit more about what the paths are.

“OCFL as a source of truth”

If you are treating OCFL as a “source of truth”, the files stored in OCFL are the main primary location of your data.

When the software wants to add, remove, or change data, it will probably happen to the OCFL first, or at any rate won’t be considered a successful change until it is reflected in OCFL.

There might be other layers on top providing alternate access to the OCFL, some kind of “index” to OCFL for faster and/or easier access to the data, but these are considered “derivative”, and can always be re-created from just the OCFL. The OCFL is “the data”, everything else is “derivative” and can be re-created by an automated process from the OCFL on disk.
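To make “can always be re-created from just the OCFL” a bit more concrete, here is a minimal sketch of rebuilding a throwaway index by walking a storage root. The method name `rebuild_index` and the shape of the index are my own assumptions for illustration; the sketch only relies on each object root holding a `0=ocfl_object_*` marker file and an `inventory.json` with `id`, `head`, and `versions` keys, and it does no validation or digest checking.

```ruby
require 'json'

# Hypothetical helper: walk an OCFL storage root and rebuild a simple
# in-memory index of object id => head version and logical file paths.
def rebuild_index(storage_root)
  index = {}
  Dir.glob('**/inventory.json', base: storage_root).each do |relative_path|
    object_root = File.dirname(File.join(storage_root, relative_path))
    # Version directories also hold inventory copies; only object roots
    # carry the "0=ocfl_object_*" marker file, so skip everything else.
    next if Dir.glob('0=ocfl_object_*', base: object_root).empty?

    inventory = JSON.parse(File.read(File.join(object_root, 'inventory.json')))
    head = inventory['head']
    # "state" maps a content digest to the logical paths that have it.
    state = inventory.dig('versions', head, 'state') || {}
    index[inventory['id']] = { head: head, files: state.values.flatten.sort }
  end
  index
end
```

Because the index is built from nothing but the files on disk, it can be thrown away and regenerated at any time, which is the property that makes it “derivative”.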

This may be what some of the OCFL designers were assuming everyone would do; as we’ll see, it makes certain things possible, and provides the highest level of confidence in our preservation activities.

“OCFL off to the side”

Alternately, you might write an application more or less using standard architectures for writing (eg) web applications. The data is probably in a relational database system (rdbms) like postgres or MySQL, or some other data store meant for supporting application development.

When the application makes a change to the data, it’s made to the primary data store.

Then the data is “mirrored” to OCFL, possibly after every change, or possibly periodically. The OCFL can be thought of as a kind of “backup”: a backup in a specific standard format meant to support long-term preservation and interoperability. I’m calling this “off to the side”; Stefano above calls it “final destination”. In either case it is contrasted with “source of truth”.

It’s possible you haven’t stored all the data the application uses to OCFL, only the data you want to back up “for long-term preservation purposes”. (Stefano later suggests this is their practice, in fact). Maybe there is some data you think is necessary only for the particular present application’s functionalities (say, to support back-end accounts and workflows), which you think of as accidental, ephemeral, contextual, or system-specific and non-standard, and which you don’t see any use in storing for long-term preservation.

In this path, if ALL you have is the OCFL, you aren’t intending that you can necessarily stand your actual present application back up. Maybe you didn’t store all the data you’d need for that; maybe you don’t have existing software capable of translating the OCFL back to the form the application actually needs it in to function. Or if you are intending that, the challenge of accomplishing it is greater, as we’ll see.

So why would you do this? Well, let’s start with that.

Why not OCFL as a source of truth?

There’s really only one reason — because it makes application development a lot harder. What do I mean by “a lot harder”? I mean, it’s going to take more development time, and more development care and decisions, you’re going to have more trouble achieving reasonable performance in a large-scale system — and you’re going to make more mistakes, have more bugs and problems, more initial deliveries that have problems. It’s not all “up-front” cost or known cost, but as you continue to develop the system, you’re going to keep struggling with these things. You honestly have increased chance of failure.

Why?

In the Slack thread, Stefano Cossu spoke up for OCFL to be a “final destination”, not the “source of truth” for the daily operating software:

I personally prefer OCFL to be the final destination, since if it’s meant to be for preservation, you don’t want to “stir” the medium by running indexing and access traffic, increasing the chances of corruption.
Stefano Cossu

If you’re using it as the actual data store for a running application, instead of leaving it off to the side as a backup, it perhaps increases the chances of bugs affecting data reliability.

The problem with that setup [OCFL as source of truth] is that a preservation system has different technical requirements from an access system. E.g. you may not want store (and index) versioning information in your daily-churn system. Or you may want to use a low-cost, low-performance medium for preservation
Stefano Cossu

OCFL is designed to rebuild knowledge (not only data, but also the semantic relationships between resources) without any supporting software. That’s what I intend for long-term preservation. In order to do that, you need to serialize everything in a way that is very inefficient for daily use.
Stefano Cossu

The form that OCFL prescribes is cumbersome to use for ordinary daily functionality. It makes it harder to achieve the goals you want for your actually running software.

I think Stefano is absolutely right about all of this, by the way, and also thank him for skillfully and clearly delineating a perspective that may, explicitly or not, actually be somewhat against the stream of some widespread OCFL assumptions.

One aspect of the cumbersomeness is that writes to OCFL need to be “synchronized” with regard to concurrency — the contents of a new version written to OCFL are as deltas on the previous version, so if another version is added while you are working on preparing your additional version — your version will be wrong. You need to use some form of locking, whether optimistic or naive pessimistic locks.

Whereas a relational database system is built on decades of work to ensure ACID (atomicity, consistency, isolation, durability) with regard to writes, while also trying to optimize performance within these constraints (which can be a real tension) — with OCFL we don’t have the built-up solutions (tools and patterns) for this to the same extent.
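As a rough illustration of the optimistic-locking idea, here is a small sketch. It uses a hypothetical in-memory `VersionedObject` class as a stand-in for an OCFL object; a real implementation would have to do the same head check against the inventory on disk, which is exactly the part without well-worn tooling.

```ruby
# Hypothetical in-memory stand-in for an OCFL object; a real implementation
# would read and write inventory.json on disk instead.
class VersionedObject
  class StaleHeadError < StandardError; end

  attr_reader :head, :versions

  def initialize
    @head = 0
    @versions = {}
    @mutex = Mutex.new
  end

  # Optimistic lock: the caller states which head it built its delta on.
  # If another writer added a version in the meantime, refuse the write.
  def add_version(expected_head:, state:)
    @mutex.synchronize do
      unless expected_head == @head
        raise StaleHeadError, "expected v#{expected_head}, head is v#{@head}"
      end

      @head += 1
      @versions[@head] = state
      @head
    end
  end
end

obj = VersionedObject.new
seen = obj.head # two writers both read the same head
obj.add_version(expected_head: seen, state: { 'a.txt' => 'one' })
begin
  obj.add_version(expected_head: seen, state: { 'b.txt' => 'two' })
rescue VersionedObject::StaleHeadError
  # The second writer has to re-read the head and rebuild its version.
end
```

The second write is refused rather than silently recorded as a delta against a version that is no longer the latest.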

Application development gets a lot harder

In general, building a (say) web app on a relational database system is a known problem with a huge corpus of techniques, patterns, shared knowledge, and toolsets available. A given developer may be more or less experienced or skilled; different developers may disagree on optimal choices in some cases. But those choices are being made from a very established field, with deep shared knowledge on how to build applications rapidly (cheaply), with good performance and reliability.

When we switch to OCFL as the primary “source of truth” for an app, we in some ways are charting new territory and have to figure out and invent the best ways to do certain things, with much less support from tooling, the “literature” (even including blogs you find on google etc), and a much smaller community of practice.

The Fedora repository platform is in some sense meant to be a kind of “middleware” to make this lift easier. In its version 6 incarnation, its own internal data store is OCFL. It doesn’t give you a user-facing app. It gives you a “middleware” you can access over a more familiar HTTP API with clear semantics, and you don’t have to deal with the underlying OCFL (or in previous incarnations other internal formats) yourself. (Seth Erickson’s ocfl_index could be thought of as similar peer “middleware” in some ways, although it’s read-only, it doesn’t provide for writing).

But it’s still not the well-trodden path of rapid web application development on top of an rdbms.

I think that the samvera (née hydra) community really learned this to some extent the hard way: trying to build on top of this novel architecture really raised the complexity, cost, and difficulty of implementing the user-facing application (with implications on succession, hiring, and retention too). I’m not saying this happened because the Fedora team did something wrong; I’m saying a novel architecture like this inherently and necessarily raises the difficulty over a well-trodden architectural path. (Although it’s possible to recognize the challenge and attempt to ameliorate it with features that make things easier on developers, it’s not possible to eliminate it.)

Some samvera peer institutions have left the Fedora-based architecture, I think as a result of this experience. Where I work at Science History Institute, we left sufia/hydra/samvera to write a closer to “just plain Rails app”, and I believe it successfully and seriously increased our capacity to meet organizational and business needs within our available software engineering capacity. I personally would be really reluctant to go back to attempting to use Fedora and/or OCFL as a “source of truth”, instead of more conventional web app data storage patterns.

So… that’s why you might not… but what do you lose?

What do you lose without OCFL as source of truth?

The trade-off is real though. I think some of the assumptions about what OCFL provides are actually based on assuming OCFL is the source of truth in your application.

Mike Kastellec’s Code4Lib presentation just before the OCFL one, on How to Survive a Disaster [Recovery] really got me thinking about backups and reliability.

Many of us have heard (or worse, found out ourselves the hard way) the adage: You don’t really know if you have a good backup unless you regularly go through the practice of recovery using it, to test it. Many have found that what they thought was their backup — was missing, was corrupt, or was not in a format suitable for supporting recovery. Because they hadn’t been verifying it would work for recovery, they were just writing to it but not using it for anything.

(Where I work, we try to regularly use our actual backups as the source of sync’ing from a production system to a staging system, in part as a method of incorporating backup recovery verification into our routine).

How is a preservation copy analogous? If your OCFL is not your source of truth, but just “off to the side” as a “preservation copy” — it can easily be a similar “write-only” copy. How do you know what you have there is sufficient to serve as a preservation copy?

Just as with backups, there are (at least) two categories of potential problem: It could be there are bugs in your synchronization routines, such that what you thought was being copied to OCFL was not, or not on the schedule you thought, or was getting corrupted or lost in transit. But the other category, even worse — it could be that your design had problems, and what you chose to sync to OCFL left out some crucial things that these future consumers of your preservation copy would have needed to fully restore and access the data. Stefano also wrote:

We don’t put everything in OCFL. Some resources are not slated for long-term preservation. (or at least, we may not in the future, but we do now)

If you are using the OCFL as your daily “source of truth”, you at least know the data you have stored in OCFL is sufficient to run your current system. Or at least you haven’t noticed any bugs with it yet, and if anyone notices any you’ll fix them ASAP.

The goal of preservation is that some future system will be able to use these files to reconstruct the objects and metadata in a useful way… It’s good to at least know it’s sufficient for some system, your current system. If you are writing to OCFL and not using it for anything… it reminds us of writing to a backup that you never restore from. How do you know it’s not missing things, by bug or by misdesign?

Do you even intend the OCFL to be sufficient to bring up your current system (I think some do, some don’t, some haven’t thought about it), and if you do, how do you know it meets your intents?

OCFL and Completeness and Migrations

The OCFL web page lists as one of its benefits (which I think can also be understood as design goals for OCFL):

Completeness, so that a repository can be rebuilt from the files it stores

If OCFL is your application’s “source of truth”, you have this necessarily, in the sense of that almost being the definition of OCFL being the “source of truth”. (Maybe suggesting at least some OCFL designers were assuming it as source of truth.)

But if your OCFL is “off to the side”… do you even have that? I guess it depends on if you intended the OCFL to be transformable back to your application’s own internal source of truth, and if that intention was successful. If we’re talking about data from your application being written “off to the side” to OCFL, and then later transformed back to your application — I think we’re talking about what is called “round-tripping” the data.

There was another Code4Lib presentation about repository migration at Stanford. In the Slack discussion happening about that presentation, Stanford’s Justin Coyne and Mike Giarlo wrote:

I don’t recommend “round trip mappings”. I was a developer on this project. It’s very challenging to not lose data when going from A -> B -> A
Justin Coyne

We spent sooooo much time on getting these round-trip mappings correct. Goodness gracious.
Mike Giarlo

So, if you want to make your OCFL “off to the side” provide this quality of completeness via round-trippability, you probably have to be focusing on it intentionally, and then it’s still going to be really hard, maybe one of the hardest (most time-consuming, most buggy) aspects of your application, or at least its persistence layer.

I found this presentation about repository migration really connecting my neurons to the OCFL discussion generally. When I thought about this I realized, well, that makes sense, woah, is one description of “preservation” activities actually: a practice of trying to plan and provide for unknown future migrations not yet fully spec’d?

So, while we were talking about repository migrations on Slack, and how challenging the data migrations were (several conf presentations dealt with data migrations in repositories) Seth Erickson made a point about OCFL:

One of the arguments for OCFL is that the repository software should upgradeable/changeable without having to migrate the data… (that’s the aspiration, anyways)
Seth Erickson

If the vision is that with nothing more than an OCFL storage system, we can point new software to it and be up and running without a data migration — I think we can see this is basically assuming OCFL as the “source of truth”, and also talking about the same thing the OCFL webpage calls “completeness” again.

And why is this vision aspirational? Well, to begin with, we don’t actually have very many repository systems that use OCFL as a source of truth. We may only have Fedora — that is, systems that use Fedora as middleware. Or maybe ocfl_index too, although it being only read-only and also middleware that doesn’t necessarily have user-facing software built on it yet, it’s probably currently a partial entry at most.

If we had multiple systems that could already do this, we’d be a lot more confident it would work out — but of course, the expense and difficulty of building a system using OCFL as the “source of truth” is probably a large part of why we don’t!

OK, do we at least have multiple systems based on fedora? Well… yes. Even before Fedora was based on OCFL, it would hypothetically be possible to upgrade/change repository software without a data migration if both source and target software were based on Fedora… except, in fact, it was not possible to do this between Samvera sufia/hydra and Islandora, despite both being based on fedora, because even though they both used fedora, their metadata stored in Fedora (or OCFL) was not consistent. A whole giant topic we’re not going to cover here, except to point out it’s a huge challenge for that vision of “completeness” providing for software changes without data migration, a huge challenge that we have seen in practice, without necessarily seeing a success in practice. (Even within hyrax alone, there are currently two different possible fedora data layouts, using traditional activefedora with “wings” adapter or instead valkyrie-fedora adapter, requiring data migration between them!)

And if we think of the practice of preservation as being trying to maximize chances of providing for migration to future unknown systems with unknown needs… then we see it’s all aspirational (that far-future digital preservation is an aspirational endeavor is of course probably not a controversial thing to say either).

But the little bit of paradox here is that while “completeness” makes it more likely you will be able to easily change systems without data loss, the added cost of developing systems that achieve “completeness” via OCFL as “source of truth” means — you will probably have much fewer, if any, choices of suitable systems to change to, or resources available to develop them!

So… what do we do? Can we split the difference?

I think the first step is acknowledging the issue, the tension here between completeness via “OCFL as source-of-truth” and, well, ease of software development. There is no magic answer that optimizes everything, there are trade-offs.

That quality of “completeness” of data (“source of truth”) is going to make your software much more challenging to develop. Take longer, take more skill, have more chance of problems and failures. And another way to say this is: Within a given amount of engineering resources, you will be delivering fewer features that matter to your users and organization, because you are spending more of your resources on implementing on a more challenging architecture.

What you get out of this is aspirationally increased chances of successful preservation. This doesn’t mean you shouldn’t do it; digital preservation is necessarily aspirational. I’m not sure how one balances this cost and benefit (it might well be different for different institutions), but I think we should be careful not to be routinely under-estimating the cost or over-estimating the size or confidence of benefits from the “source of truth” approach. Undoubtedly many institutions will still choose to develop with OCFL as a source of truth, especially using middleware intended to ease the burden, like Fedora.

I will probably not be one of them at my current institution — the cost is just too high for us, we can’t give up the capacity to relatively rapidly meet other organizational and user needs. But I’d like to look at incorporating OCFL as “off to the side” preservation copy anyway in the future.

(And Stefano and I are definitely not the only ones considering this or doing it. Many institutions are using an “off to the side” “final destination” approach to preservation copies, if not with OCFL, then with some of its progenitors or peers like BagIt or Stanford’s MOAB. The “off to the side” approach is not unusual, and for good reasons! We can acknowledge it and talk about it without shame!)

If you are developing instead with OCFL as an “off to the side” (or “final destination”) copy, are there things you can do to try to get closer to the benefits of OCFL as “source of truth”?

The main thing I can think of involves “round-trippability”

Yes, commit to storing all of your objects and metadata necessary to restore a working current system in your OCFL
And commit to storing it round-trippably
One way to ensure/enforce this would be — every time you write a new version to OCFL, run a job that serializes those objects and metadata to OCFL, and back to your internal format, and verify that it is still equivalent. Verify the round-trip.
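The round-trip check in the list above might be sketched roughly like this. The method names are hypothetical, and JSON stands in for whatever serialization you actually store in OCFL; the point is only the shape of the check: export, re-import, compare.

```ruby
require 'json'

# Hypothetical stand-ins: a real app would map its own models to whatever
# serialization it stores in OCFL, not necessarily JSON.
def serialize_for_preservation(record)
  JSON.generate(record)
end

def restore_from_preservation(serialized)
  JSON.parse(serialized)
end

# Returns true only if the record survives export and re-import unchanged.
def round_trips?(record)
  restore_from_preservation(serialize_for_preservation(record)) == record
end

round_trips?({ 'title' => 'Lab notebook', 'pages' => 12 }) # => true
round_trips?({ title: 'Lab notebook' })                    # => false
```

The second call fails because symbol keys come back as strings. That is a small example of the kind of quiet loss in an A -> B -> A mapping that a verification job like this is meant to catch.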

Round-trippability doesn’t just happen on its own, and ensuring it will definitely significantly increase the cost of your development. As the Stanford folks said from experience, round-trippability is a headache and a major cost! But it could conceivably get you a lot of the confidence in “completeness” that “source of truth” OCFL gets you. And as it still is “off to the side”, it still allows you to write your application using whatever standard (or innovative in different directions) architectures you want; you don’t have the novel data persistence architecture design involved in all of your feature development to meet user and business needs.

This will perhaps arrive at a better cost/benefit balance for some institutions.

There may be other approaches or thoughts, this is hopefully the beginning of a long conversation and practice.

Escaping/encoding URI components in ruby 3.2

jrochkind — Tue, 14 Feb 2023 21:13:08 +0000

Thanks to zverok_kha’s awesome writeup of Ruby changes, I noticed a new method released in ruby 3.2: CGI.escapeURIComponent

This is the right thing to use if you have an arbitrary string that might include characters not legal in a URI/URL, and you want to include it as a path component or part of the query string:

require 'cgi'

url = "https://example.com/some/#{ CGI.escapeURIComponent path_component }" + 
  "?#{CGI.escapeURIComponent my_key}=#{CGI.escapeURIComponent my_value}"

The docs helpfully refer us to RFC3986, a rare citation in the wild world of confusing and vaguely-described implementations of escaping (to various different standards and mistakes) for URLs and/or HTML
This will escape / as %2F, meaning you can use it to embed a string with / in it inside a path component, for better or worse
This will escape a space ( ) as %20, which is correct and legal in either a query string or a path component
There is also a reversing method available CGI.unescapeURIComponent

What if I am running on a ruby previous to 3.2?

Two things in standard library probably do the equivalent thing. First:

require 'cgi'
CGI.escape(input).gsub("+", "%20")

CGI escape but take the +s it encodes space characters into, and gsub them into the more correct %20. This will not be as performant because of the gsub, but it works.

This, I noticed once a while ago, is what ruby aws-sdk does… well, except it also unescapes %7E back to ~, which does not need to be escaped in a URI. But… generally… it is fine to percent-encode ~ as %7E. Or copy what aws-sdk does, hoping they actually got it right to be equivalent?

Or you can use:

require 'erb'
ERB::Util.url_encode(input)

But it’s kind of weird to have to require the ERB templating library just for URI escaping. (and would I be shocked if ruby team moves erb from “default gem” to “bundled gem”, or further? Causing you more headache down the road? I would not). (btw, ERB::Util.url_encode leaves ~ alone!)

Do both of these things do exactly the same thing as CGI.escapeURIComponent? I can’t say for sure, see discussion of CGI.escape and ~ above. Sure is confusing. (there would be a way to figure it out, take all the chars in various relevant classes in the RFC spec and test them against these different methods. I haven’t done it yet).
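A first pass at that comparison could look something like the sketch below. It only checks printable ASCII characters one at a time, so it is not a full test against the RFC 3986 character classes (multi-byte input is not covered). The constant and method names are just for illustration.

```ruby
require 'cgi'
require 'erb'

# Every printable ASCII character, as one-character strings.
PRINTABLE_ASCII = (0x20..0x7E).map(&:chr)

# Returns the characters for which the two pre-3.2 approaches disagree.
def escaping_differences(chars = PRINTABLE_ASCII)
  chars.reject do |char|
    CGI.escape(char).gsub('+', '%20') == ERB::Util.url_encode(char)
  end
end

puts escaping_differences.inspect

# On ruby 3.2+, also compare the new method against ERB's.
if CGI.respond_to?(:escapeURIComponent)
  versus_new = PRINTABLE_ASCII.reject do |char|
    CGI.escapeURIComponent(char) == ERB::Util.url_encode(char)
  end
  puts versus_new.inspect
end
```

I’d expect any disagreement to be limited to ~, and which way that goes may depend on your ruby version, so run it on the ruby you actually deploy on.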

What about URI.escape?

In old code I encounter, I often see places using URI.escape to prepare URI query string values…

# don't do this, don't use URI.escape
url = "https://example.com?key=#{ URI.escape value }"

# not this either, don't use URI.escape
url = "https://example.com?" + 
   query_hash.collect { |k, v| "#{URI.escape k}=#{URI.escape v}"}.join("&")

This was never quite right, in that URI.escape was a huge mess… intending to let you pass in whole URLs that were not legal URLs in that they had some illegal characters that needed escaping, and it would somehow parse them and then escape the parts that needed escaping… this is a fool’s errand and not something it’s possible to do in a clear consistent and correct way.

But… it worked out okay because the output of URI.escape overlapped enough with (the new RFC 3986-based) CGI.escapeURIComponent that it mostly (or maybe even always?) worked out. URI.escape did not escape a /… but it turns out / is probably actually legal in a query string value anyway, it’s optional to escape it to %2F in a query string? I think?

And people used it in this scenario, I’d guess, because its name made it sound like the right thing? Hey, I want to escape something to put it in a URI, right? And then other people copied from code they saw, etc.

But URI.escape was an unpredictable bad idea from the start, and was deprecated by ruby, then removed entirely in ruby 3.0!

When it went away, it was a bit confusing to figure out what to replace it with. Because if you asked, sometimes people would say “it was broken and wrong, there is nothing to replace it”, which is technically true… but the code escaping things for inclusion in, eg, query strings, still had to do that… and then the “correct” behavior for this actually only existed in the ruby stdlib in the erb module (?!?) (where few had noticed it before URI.escape went away)… and CGI.escapeURIComponent which is really what you wanted didn’t exist yet?

Why is this so confusing and weird?

Why was this functionality in ruby stdlib non-existent/tucked away? Why are there so many slightly different implementations of “uri escaping”?

Escaping is always a confusing topic in my experience — and a very very confusing thing to debug when it goes wrong.

The long history of escaping in URLs and HTML is even more confusing. Like, turning a space into a + was specified for application/x-www-form-urlencoded format (for encoding an HTML form as a string for use as a POST body)… and people then started using it in url query strings… but I think possibly that was never legal, or perhaps the specifications were incomplete/inconsistent on it.

But it was so commonly done that most things receiving URLs would treat a literal + as an encoded space… and then some standards were retroactively changed to allow it for compatibility with common practice… maybe. I’m not even sure I have this right.
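You can see the form-encoding convention for + in the ruby stdlib itself. This is just a small illustration using methods that exist in `uri` and `cgi`; it says nothing about what any particular server on the receiving end will do.

```ruby
require 'uri'
require 'cgi'

# Form-style encoding turns a space into "+" ...
URI.encode_www_form_component('a b') # => "a+b"

# ... and form-style decoding turns "+" back into a space.
URI.decode_www_form_component('a+b') # => "a b"
CGI.unescape('a+b')                  # => "a b"

# So a literal plus sign has to travel as %2B to survive decoding.
CGI.unescape('a%2Bb')                # => "a+b"
```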

And then, as with the history of the web in general, there has been a progression of standards slightly altering this behavior, leapfrogging with actual common practice, where technically illegal things became common and accepted, and then standards tried to cope… and real world developers had trouble understanding there might be different rules for legal characters/escaping in HTML vs URIs vs application/x-www-form-urlencoded strings vs HTTP headers…. and then language stdlib implementers (including but not limited to ruby) implemented things with various understandings according to various RFCs (or none, or buggy), documented only with words like “Escapes the string, replacing all unsafe characters with codes.” (unsafe according to what standard? For what purpose?)

PHEW.

It being so confusing, lots of people haven’t gotten it right — I swear that AWS S3 uses different rules for how to refer to spaces in filenames than AWS MediaConvert does, such that I couldn’t figure out how to get AWS MediaConvert to actually input files stored on S3 with spaces in them, and had to just make sure to not use spaces in filenames on S3 destined for MediaConvert. But maybe I was confused! But honestly I’ve found it’s best to avoid spaces in filenames on S3 in general, because S3 docs and implementation can get so confusing and maybe inconsistent/buggy on how/when/where they are escaped. Because like we’re saying…

Escaping is always confusing, and URI escaping is really confusing.

Which is I guess why the ruby stdlib didn’t actually have a clearly labelled provided-with-this-intention way to escape things for use as a URI component until ruby 3.2?

Just use CGI.escapeURIComponent in ruby 3.2+, please.
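A minimal sketch of what that looks like in practice. The helper name escape_uri_component and the sample values are hypothetical; the fallback to ERB::Util.url_encode on rubies older than 3.2 is my assumption about a reasonable substitute, not something the stdlib prescribes.

```ruby
require "cgi"
require "erb"

# Hypothetical helper: prefer CGI.escapeURIComponent (ruby 3.2+),
# fall back to ERB::Util.url_encode on older rubies.
def escape_uri_component(value)
  if CGI.respond_to?(:escapeURIComponent)
    CGI.escapeURIComponent(value.to_s)
  else
    ERB::Util.url_encode(value.to_s)
  end
end

# Space becomes %20 (not +), and "/" and "?" are escaped too.
puts escape_uri_component("a b/c?d") # => a%20b%2Fc%3Fd

# Typical use: escape each piece separately, then assemble the URL yourself.
filename = "annual report.pdf"
puts "/files/#{escape_uri_component(filename)}" # => /files/annual%20report.pdf
```

Note that you escape the individual component, never the whole assembled URL, since escaping the whole thing would also mangle the separators you put there on purpose.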

What about using the Addressable gem?

When the horrible URI.escape disappeared and people that had been wrongly using it to escape strings for use as URI components needed some replacement and the ruby stdlib was confusing (maybe they hadn’t noticed ERB::Util.url_encode or weren’t confident it did the right thing and gee I wonder why not), some people turned to the addressable gem.

This gem for dealing with URLs does provide ways to escape strings for use in URLs… it actually provides two different algorithms depending on whether you want to use something in a path component or a query component.

require 'addressable'

Addressable::URI.encode_component(query_param_value, Addressable::URI::CharacterClasses::QUERY)

Addressable::URI.encode_component(path_component, Addressable::URI::CharacterClasses::PATH)

Note Addressable::URI::CharacterClasses::QUERY vs Addressable::URI::CharacterClasses::PATH? Two different routines? (Both by the way escape a space to %20 not +).

I think that while some things need to be escaped in (eg) a path component and don’t need to be in a query component, the specs also allow some things that don’t need to be escaped to be escaped in both places, such that you can write an algorithm that produces legally escaped strings for both places, which I think is what CGI.escapeURIComponent is. Hopefully we’re in good hands.

On Addressable, neither the QUERY nor PATH variant escapes /, but CGI.escapeURIComponent does escape it to %2F. PHEW.

You can also call Addressable::URI.encode_component with no second arg, in which case it seems to use CharacterClasses::RESERVED + CharacterClasses::UNRESERVED from this list as the set of characters to leave unescaped. Whereas PATH is, it looks like there, equivalent to UNRESERVED with SOME of RESERVED (SUB_DELIMS but only some of GEN_DELIMS), and QUERY is just PATH plus ? as not needing escaping…. (CGI.escapeURIComponent btw WILL escape ? to %3F).
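To illustrate the general idea of escaping against a character class, here is a plain-ruby sketch that percent-encodes everything outside an “allowed” set. The constant names, the exact sets, and the percent_encode helper are all my own assumptions, loosely modelled on RFC 3986 terms; this is not Addressable’s or the stdlib’s code.

```ruby
# Characters RFC 3986 calls "unreserved".
UNRESERVED = (("A".."Z").to_a + ("a".."z").to_a + ("0".."9").to_a + ["-", ".", "_", "~"]).freeze
SUB_DELIMS = ["!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "="].freeze

# Illustrative "allowed" sets for a path and a query component.
PATH_ALLOWED  = (UNRESERVED + SUB_DELIMS + [":", "@", "/"]).freeze
QUERY_ALLOWED = (PATH_ALLOWED + ["?"]).freeze

# Percent-encode every character of `string` that is not in `allowed`,
# byte by byte so multi-byte characters come out right.
def percent_encode(string, allowed)
  string.each_char.map do |char|
    next char if allowed.include?(char)
    char.bytes.map { |byte| format("%%%02X", byte) }.join
  end.join
end

sample = "a b/c?d"
puts percent_encode(sample, PATH_ALLOWED)  # => a%20b/c%3Fd
puts percent_encode(sample, QUERY_ALLOWED) # => a%20b/c?d
puts percent_encode(sample, UNRESERVED)    # => a%20b%2Fc%3Fd
```

The last line, allowing only unreserved characters, is the conservative choice that is legal in both a path and a query component, and it matches the escapeURIComponent results quoted above for space, / and ?.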

PHEW, right?

Anyhow

Anyhow, just use CGI.escapeURIComponent to… escape your URI components, just like it says on the lid.

Thanks to /u/f9ae8221b for writing it and answering some of my probably annoying questions on reddit and github.

attr_json 2.0 release: ActiveRecord attributes backed by JSON column

jrochkind — Thu, 09 Feb 2023 21:21:27 +0000

attr_json is a gem to provide attributes in ActiveRecord that are serialized to a JSON column, usually postgres jsonb, multiple attributes in a json hash, in a way that can be treated as much as possible like any other “ordinary” (database column) ActiveRecord attribute.

It supports arrays and nested models as hashes, and the embedded nested models can also be treated much as an ordinary “associated” record; for instance the CI build includes tests with cocoon, and I’ve had a report that it works well with stimulus nested forms, but I don’t currently know how to use those. (PR welcome for a test in build?)

An example:

# An embedded model, if desired
class LangAndValue
  include AttrJson::Model

  attr_json :lang, :string, default: "en"
  attr_json :value, :string
end

class MyModel < ActiveRecord::Base
   include AttrJson::Record

   # use any ActiveModel::Type types: string, integer, decimal (BigDecimal),
   # float, datetime, boolean.
   attr_json :my_int_array, :integer, array: true
   attr_json :my_datetime, :datetime

   attr_json :embedded_lang_and_val, LangAndValue.to_type
end

model = MyModel.create!(
  my_int_array: ["101", 2], # it'll cast like ActiveRecord
  my_datetime: DateTime.new(2001,2,3,4,5,6),
  embedded_lang_and_val: LangAndValue.new(value: "a sentence in default language english")
)

By default it will serialize attr_json attributes to a json_attributes column (this can also be specified differently), and the above would be serialized like so:

{
  "my_int_array": [101, 2],
  "my_datetime": "2001-02-03T04:05:06Z",
  "embedded_lang_and_val": {
    "lang": "en",
    "value": "a sentence in default language english"
  }
}

Oh, attr_json also supports some built-in construction of postgres jsonb contains (“@>“) queries, with proper rails type-casting, through embedded models with keypaths:

MyModel.jsonb_contains(
  my_datetime: Date.today,
  "embedded_lang_and_val.lang" => "de"
) # an ActiveRecord::Relation, you can chain on whatever as usual
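For a rough feel of what a keypath-based contains query has to do, here is a stdlib-only sketch that expands dotted keypaths into the nested JSON document a postgres @> comparison would be made against. The helper name keypaths_to_nested is hypothetical, and this is only an illustration of the idea, not attr_json’s actual implementation (which also handles the type-casting).

```ruby
require "json"

# Hypothetical helper: expand {"a.b" => 1} into {"a" => {"b" => 1}}.
def keypaths_to_nested(conditions)
  conditions.each_with_object({}) do |(keypath, value), nested|
    *parents, leaf = keypath.to_s.split(".")
    # Walk (and create as needed) the intermediate hashes.
    target = parents.inject(nested) { |hash, key| hash[key] ||= {} }
    target[leaf] = value
  end
end

containment = keypaths_to_nested({
  "my_int_array" => [101],
  "embedded_lang_and_val.lang" => "de"
})

puts JSON.generate(containment)
# => {"my_int_array":[101],"embedded_lang_and_val":{"lang":"de"}}
```

A JSON document like that is what ends up on the right-hand side of a `json_attributes @> ...` condition.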

And it supports in-place mutations of the nested models, which I believe is important for them to work “naturally” as ruby objects.

my_model.embedded_lang_and_val.lang = "de"
my_model.embedded_lang_and_val_change 
# => will correctly return changes in terms of models themselves
my_model.save!

There are some other gems in this “space” of ActiveRecord attribute json serialization, with different fits for different use cases, created either before or after I created attr_json — but none provide quite this combination of features — or, I think, have architectures that make this combination feasible (I could be wrong!). Some to compare are jsonb_accessor, store_attribute, and store_model.

One use case where I think attr_json really excels is when using Rails Single-Table Inheritance, where different sub-classes may have different attributes.

And especially for a “content management system” type of use case, where on top of that single-table inheritance polymorphism, you can have complex hierarchical data structures, in an inheritance hierarchy, where you don’t actually want or need the complexity of an actual normalized rdbms schema for the data that has both some polymorphism and some heterogeneity. We get some aspects of a schema-less json-document-store, but embedded in postgres, without giving up rdbms features or ordinary ActiveRecord affordances.

Slow cadence, stability and maintainability

While the 2.0 release includes a few backwards incompats, it really should be an easy upgrade for most if not everyone. And it comes three and a half years after the 1.0 release. That’s a pretty good run.

Generally, I try to really prioritize backwards compatibility and maintainability, doing my best to avoid anything that could provide backwards incompat between major releases, and trying to keep major releases infrequent. I think that’s done well here.

I know that management of rails “plugin” dependencies can end up a nightmare, and I feel good about avoiding this with attr_json.

attr_json was actually originally developed for Rails 4.2 (!!), and has kept working all the way to Rails 7. The last attr_json 1.x release actually supported (in same codebase) Rails 5.0 through Rails 7.0 (!), and attr_json 2.0 supports 6.0 through 7.0. (also grateful to the quality and stability of the rails attributes API originally created by sgrif).

I think this successfully makes maintenance easier for downstream users of attr_json, while also demonstrating success at prioritizing maintainability of attr_json itself: it hasn’t needed a whole lot of work on my end to keep working across Rails releases. Occasionally changes to the test harness are needed when a new Rails version comes out, but I actually can’t think of any changes needed to implementation itself for new Rails versions, although there may have been a few.

Because, yeah, it is true that this is still basically a one-maintainer project. But I’m pleased it has successfully gotten some traction from other users: 390 github “stars” is respectable if not huge, with occasional Issues and PR’s from third parties. I think this is a testament to its stability and reliability, rather than to any (almost non-existent) marketing I’ve done.

“Slow code”?

In working on this and other projects, I’ve come to think of a way of working on software that might be called “slow code”. To really get stability and backwards compatibility over time, one needs to be very careful about what one introduces into the codebase in the first place. And very careful about getting the fundamental architectural design of the code solid in the first place — coming up with something that is parsimonious (few architectural “concepts”) and consistent and coherent, but can handle what you will want to throw at it.

This sometimes leads me to holding back on satisfying feature requests, even if they come with pull requests, even if it seems like “not that much code” — if I’m not confident it can fit into the architecture in a consistent way. It’s a trade-off.

I realize that in many contemporary software development environments, it’s not always possible to work this way. I think it’s a kind of software craftsmanship for shared “library” code (mostly open source) that… I’m not sure how much our field/industry accommodates development with (and the development of) this kind of craftsmanship these days. I appreciate working for a non-profit academic institute that lets me develop open source code in a context where I am given the space to attend to it with this kind of care.

The 2.0 Release

There aren’t actually any huge changes in the 2.0 release, mostly it just keeps on keeping on.

Mostly, 2.0 tries to make things adhere even closer and more consistently to what is expected of Rails attributes.

The “Attributes” API was still brand new in Rails 4.2 when this project started, but now that it has shown itself solid and mature, we can always create a “cover” Rails attribute in the ActiveRecord model, instead of making it “optional” as attr_json originally did. Which provides for some code simplification.

Some of the rough edges that were sanded involved making Time/Date attributes timezone-aware in the way Rails usually does transparently. And with some underlying Rails bugs/inconsistencies having been long-fixed in Rails, they can now store milliseconds in JSON serialization rather than just whole seconds too.

I try to keep a good CHANGELOG, which you can consult for more.

The 2.0 release is expected to be a very easy migration for anyone on 1.x. If anyone on 1.x finds it challenging, please get in touch in a github issue or discussion, I’d like to make it easier for you if I can.

For my Library-Archives-Museums Rails people….

The original motivation for this came from trying to move off samvera (nee hydra) sufia/hyrax to an architecture that was more “Rails-like”. But I realized that the way we wanted to model our data in a digital collections app along the lines of sufia/hyrax would be rather too complicated to do with a reasonably normalized rdbms schema.

So… can we model things in the database in JSON — similar to how valkyrie-postgres would actually model things in postgres — but while maintaining an otherwise “Rails-like” development architecture? The answer: attr_json.

So, you could say the main original use case for attr_json was to persist a “PCDM“-ish data model ala sufia/hyrax, those kinds of use cases, in an rdbms, in a way that supported performant SQL queries (minimal queries per page, avoiding n+1 queries), in a Rails app using standard Rails tools and conventions, without an enormously complex expansive normalized rdbms schema.

While the effort to base hyrax on valkyrie is still ongoing, in order to allow postgres vs fedora (vs other possible future stores) to be a swappable choice in the same architecture — I know at least some institutions (like those of the original valkyrie authors) are using valkyrie in homegrown app directly, as the main persistence API (instead of ActiveRecord).

In some sense, valkyrie-postgres (in a custom app) vs attr-json (in a custom app) are two paths to “step off” the hyrax-fedora architecture. They both result in similar things actually stored in your rdbms (and we both chose postgres, for similar reasons, including I think good support for json(b)). They both have advantages and disadvantages. Valkyrie-postgres kind of intentionally chooses not to use ActiveRecord (at least not in controllers/views etc, not in your business logic), one advantage of such is to get around some of the known widely-commented upon deficiencies and complaints with Rails standard ActiveRecord architecture.

Whereas I followed a different path with attr_json — how can we store things in postgres similarly, but while still using ActiveRecord in a very standard Rails way — how can we make it as standard a Rails way as possible? This maintains the disadvantages people sometimes complain about Rails architecture, but with the benefit of sticking to the standard Rails ecosystem, having less “custom community” stuff to maintain or figure out (including fewer lines of code in attr-json), being more familiar or accessible to Rails-experienced or trained developers.

At least that’s the idea, and several years later, I think it’s still working out pretty well.

In addition to attr_json, I wrote a layer on top to provide some parts on top of attr_json, that I thought would be both common and somewhat tricky in writing a pcdm/hyrax-ish digital collections app as “standard Rails as much as it makes sense”. This is kithe and it hasn’t had very much uptake. The only other user I’m aware of (who is using only a portion of what kithe provides; but kithe means to provide for that as a use case) is Eric Larson at https://github.com/geobtaa/geomg.

However, meanwhile, attr_json itself has gotten quite a bit more uptake — from wider Rails developer community, not our library-museum-archives community. attr_json’s 390 github stars isn’t that big in the wider world of things, but it’s pretty big for our corner of the world. (Compare to 160 for hyrax or 721 for blacklight). That the people using attr_json, and submitting Issues or Pull Requests largely aren’t library-museum-archives developers, I consider positive and encouraging, that it’s escaped the cultural-heritage-rails bubble, and is meeting a more domain-independent or domain-neutral need, at a lower level of architecture, with a broader potential community.

A tiny donation to rubyland.news would mean a lot

jrochkind — Fri, 23 Dec 2022 15:17:51 +0000

I started rubyland.news in 2016 because it was a thing I wanted to see for the ruby community. I had been feeling a shrinking of the ruby open source collaborative community, it felt like the room was emptying out.

If you find value in Rubyland News, just a few dollars contribution on my Github Sponsors page would be so appreciated.

I’ve been solely responsible for its development, and editorial and technical operations. I think it’s been a success. I don’t have analytics, but it seems to be somewhat known and used.

Rubyland.news has never been a commercial project. I have never tried to “monetize” it. I don’t even really highlight my personal involvement much. I have in the past occasionally had modest paid sponsorship barely enough to cover expenses, but decided it wasn’t worth the effort.

I have and would never provide any kind of paid content placement, because I think that would be counter to my aims and values — I have had offers, specifically asking for paid placement not labelled as such, because apparently this is how the world works now, but I would consider that an unethical violation of trust.

It’s purely a labor of love, in attempted service to the ruby community, building what I want to see in the world as an offering of mutual aid.

So why am I asking for money?

The operations of Rubyland News don’t cost much, but they do cost something. A bit more since Heroku eliminated free dynos.

I currently pay for it out of my pocket, and mostly always have modulo occasional periods of tiny sponsorship. My pockets are doing just fine, but I do work for an academic non-profit, so despite being a software engineer the modest expenses are noticeable.

Sure, I could run it somewhere cheaper than heroku (and eventually might have to) — but I’m doing all this in my spare time, I don’t want to spend an iota more time or psychic energy on (to me) boring operational concerns than I need to. (But if you want to volunteer to take care of setting up, managing, and paying for deployment and operations on another platform, get in touch! Or if you are another platform that wants to host rubyland news for free!)

It would be nice to not have to pay for Rubyland News out of my pocket. But also, some donations would, as much as be monetarily helpful, also help motivate me to keep putting energy into this, showing me that the project really does have value to the community.

I’m not looking to make serious cash here. If I were able to get just $20-$40/month in donations, that would about pay my expenses (after taxes, cause I’d declare if i were getting that much), I’d be overjoyed. Even 5 monthly sustainers at just $1 would really mean a lot to me, as a demonstration of support.

You can donate one-time or monthly on my Github Sponsors page. The suggested levels are $1 and $5.

(If you don’t want to donate or can’t spare the cash, but do want to send me an email telling me about your use of rubyland news, I would love that too! I really don’t get much feedback! jonathan at rubyland.news)

Thanks

Thanks to anyone who donates anything at all
also to anyone who sends me a note to tell me that they value Rubyland News (seriously, I get virtually no feedback — telling me things you’d like to be better/different is seriously appreciated too! Or things you like about how it is now. I do this to serve the community, and appreciate feedback and suggestions!)
To anyone who reads Rubyland News at all
To anyone who blogs about ruby, especially if you have an RSS feed, especially if you are doing it as a hobbyist/community-member for purposes other than business leads!
To my current single monthly github sponsor, for $1, who shall remain unnamed because they listed their sponsorship as private
To anyone contributing in their own way to any part of open source communities for reasons other than profit, sometimes without much recognition, to help create free culture that isn’t just about exploiting each other!

vite-ruby for JS/CSS asset management in Rails

jrochkind — Tue, 29 Nov 2022 20:43:13 +0000

I recently switched to vite and vite-ruby for managing my JS and CSS assets in Rails. I was switching from a combination of Webpacker and sprockets — I moved all of my Webpacker and most of my sprockets to vite.

Note that vite-ruby has smooth ready-made integrations for Padrino, Hanami, and jekyll too, and possibly hook points for integrations with arbitrary ruby, plus could always just use vite without vite-ruby — but I’m using vite-ruby with Rails.

I am finding it generally pretty agreeable, so I thought I’d write up some of the things I like about it for others. And a few other notes.

I am definitely definitely not an expert in Javascript build systems (or JS generally), which both defines me as an audience for build tools, but also means I don’t always know how these things might compare with other options. The main other option I was considering was jsbundling-rails with esbuild and cssbundling-rails with SASS, but I didn’t get very far into the weeds of checking those out.

I moved almost all my JS and (S)CSS into being managed/built by vite.

My context

I work on a monolith “full stack” Rails application, with a small two-developer team.

I do not do any very fancy Javascript — this is not React or Vue or anything like that. It’s honestly pretty much “JQuery-style” (although increasingly I try to do it without jquery itself using just native browser API, it’s still pretty much that style).

Nonetheless, I have accumulated non-trivial Javascript/NPM dependencies, including things like video.js, @shopify/draggable, fontawesome (v4), openseadragon. I need package management and I need building.

I also need something dirt simple. I don’t really know what I’m doing with JS, my stack may seem really old-fashioned, but here it is. Webpacker had always been a pain, I started using it to have something to manage and build NPM packages, but was still mid-stream in trying to switch all my sprockets JS over to webpacker when it was announced webpacker was no longer recommended/maintained by Rails. My CSS was still in sprockets all along.

Vite

One thing to know about vite is that it’s based on the idea of using different methods in dev vs production to build/serve your JS (and other managed assets). In “dev”, you ordinarily run a “vite server” which serves individual JS files, whereas for production you “build” more combined files.

Vite is basically an integration that puts together tools like esbuild and (in production) rollup, as well as integrating optional components like sass — making them all just work. It intends to be simple and provide a really good developer experience where doing simple best practice things is simple and needs little configuration.

vite-ruby tries to make that “just works” developer experience as good as Rubyists expect when used with ruby too — it intends to integrate with Rails as well as webpacker did, just doing the right thing for Rails.

Things I am enjoying with vite-ruby and Rails

You don’t need to run a dev server (like you do with jsbundling-rails and cssbundling-rails)
- If you don’t run the vite dev server, you’ll wind up with auto-built vite on-demand as needed, same as webpacker basically did.
- This can be slow, but it works and is awesome for things like CI without having to configure or set up anything. If there have been no changes to your source, it is not slow, as it doesn’t need to re-build.
- If you do want to run the dev server for much faster build times, hot module reload, better error messages, etc, vite-ruby makes it easy, just run ./bin/vite dev in a terminal.
If you DO run the dev server — you have only ONE dev-server to run, that will handle both JS and CSS
- I’m honestly really trying to avoid the foreman approach taken by jsbundling-rails/cssbundling-rails, because of how it makes accessing the interactive debugger at a breakpoint much more complicated. Maybe with only one dev server (that is optional), I can handle running it manually without a procfile.

Handling SASS and other CSS with the same tool as JS is pretty great generally — you can even @import CSS from a javascript file, and also @import plain CSS too to aggregate into a single file server-side (without sass). With no non-default configuration, it just works, and will spit out stylesheet tags, and it means your css/sass is going through the same processing whether you import it from .js or .css.
- I handle fontawesome 4 this way. Include "font-awesome": "^4.7.0" in my package.json, then @import "font-awesome/css/font-awesome.css"; just works, and from either a .js or a .css file. It actually spits out not only the fontawesome CSS file, but also all the font files referenced from it and included in the npm package, in a way that just works. Amazing!!
- Note how you can reference things from NPM packages with just package name. On google for some tools you find people doing contortions involving specifically referencing node-modules, I’m not sure if you really have to do this with latest versions of other tools but you def don’t with vite, it just works.

In general, I really appreciate vite’s clear opinionated guidance and focus on developer experience. Understanding all the options from the docs is not as hard because there are fewer options, but it does everything I need it to. vite-ruby successfully carries this into ruby/Rails, its documentation is really good, without being enormous. In Rails, it just does what you want, automatically.

Vite supports source maps for SASS!
- Not currently on by default, you have to add a simple config.
- Unfortunately sass sourcemaps are NOT supported in production build mode, only in dev server mode. (I think I found a ticket for this, but can’t find it now)
- But that’s still better than the official Rails options? I don’t understand how anyone develops SCSS without sourcemaps!
  - But even though sprockets 4.x finally supported JS sourcemaps, it does not work for SCSS! Even though there is an 18-month-old PR to fix it, it goes unreviewed by Rails core and unmerged.
  - Possibly even more surprisingly, SASS sourcemaps don’t seem to work for the newer cssbundling-rails=>sass solution either. https://github.com/rails/cssbundling-rails/issues/68
  - Previous to this switch, I was still using sprockets old-style “comments injected into CSS built files with original source file/line number” — that worked. But to give that up, and not get working scss sourcemaps in return? I think that would have been a blocker for me against cssbundling-rails/sass anyway… I feel like there’s something I’m missing, because I don’t understand how anyone is developing sass that way.

If you want to split up your js into several built files (“chunks”), I love how easy it is. It just works. Vite/rollup will do it for you automatically for any dynamic runtime imports, which it also supports, just write import with parens, inside a callback or whatever, just works.

Things to be aware of

vite and vite-ruby by default will not create .gz variants of built JS and CSS
- Depending on your deploy environment, this may not matter, maybe you have a CDN or nginx that will automatically create a gzip and cache it.
- But in eg default heroku Rails deploy, it really really does. Default Heroku deploy uses the Rails app itself to deliver your assets. The Rails app will deliver content-encoding gzip if it’s there. If it’s not… when you switch to vite from webpacker/sprockets, you may now be delivering uncompressed JS and CSS with no other changes to your environment, with non-trivial performance implications but ones you may not notice.
- Yeah, you could probably configure your CDN you hopefully have in front of your heroku app static assets to gzip for you, but you may not have noticed.
- Fortunately it’s pretty easy to configure
  - For me, I do some kind of ugly JS to configure it only when I’m not using dev-mode autoBuild (in dev but without running a vite dev server), because it really slows down autoBuild
  - Since I migrated over, the vite-plugin-rails plugin also does it by default. (I’m not using that, actually)
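The configuration mentioned above lives on the vite (JS) side. Purely to illustrate what “.gz variants” means, here is a ruby sketch that writes a gzip-compressed sibling next to each built JS/CSS file, using stdlib Zlib. This is a different technique from a vite plugin; the helper name write_gzip_variants and the throwaway directory are hypothetical.

```ruby
require "zlib"
require "tmpdir"

# Hypothetical helper: writes a gzip-compressed sibling (app.js -> app.js.gz)
# next to every matching file under `dir`, and returns the new paths.
def write_gzip_variants(dir, extensions: %w[.js .css])
  assets = Dir.glob(File.join(dir, "**", "*")).select do |path|
    File.file?(path) && extensions.include?(File.extname(path))
  end

  assets.sort.map do |path|
    gz_path = "#{path}.gz"
    Zlib::GzipWriter.open(gz_path, Zlib::BEST_COMPRESSION) do |gz|
      gz.write(File.binread(path))
    end
    gz_path
  end
end

# Try it against a throwaway directory standing in for a build output dir.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "application.js"), "console.log('hello');\n" * 100)
  write_gzip_variants(dir).each do |gz_path|
    puts "#{File.basename(gz_path)}: #{File.size(gz_path)} bytes"
  end
end
```

A server that supports pre-compressed assets can then serve application.js.gz with content-encoding gzip instead of compressing on every request.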

There are some vite NPM packages involved (vite itself as well as some vite-ruby plugins), as well as the vite-ruby gem, and you have to keep them up to date in sync. You don’t want to be using a new version of vite NPM packages with too-old gem, or vice versa. (This is kind of a challenge in general with ruby gems with accompanying npm packages)
- But vite_ruby actually includes a utility to check this on boot and complain if they’ve gotten out of sync! As well as tools for syncing them! Sweet!
- But that can be a bit confusing sometimes if you’re running CI after an accidentally-out-of-sync upgrade, and all your tests are now failing with the failed sync check. But no big deal.

Things I like less

vite-ruby itself doesn’t seem to have a CHANGELOG or release notes, which I don’t love.
Vite is a newer tool written for modern JS, it mostly does not support CommonJS/node require, preferring modern import. In some cases that I can’t totally explain require in dependencies seems to work anyway… but something related to this stuff made it apparently impossible for me to import an old not-very-maintained dependency I had been importing fine in Webpacker. (I don’t know how it would have done with jsbundling-rails/esbuild). So all is not roses.

Am I worried that this is a third-party integration not blessed by Rails?

The vite-ruby maintainer ElMassimo is doing an amazing job. It is currently very well-maintained software, with frequent releases, quick turnaround from bug report to release, and ElMassimo is very responsive in github discussions.

But it looks like it is just one person maintaining. We know how open source goes. Am I worried that in the future some release of Rails might break vite-ruby in some way, and there won’t be a maintainer to fix it?

I mean… a bit? But let’s face it… Rails officially blessed solutions haven’t seemed very well-maintained for years now either! The three year gap of abandonware between the first sprockets 4.x beta and final release, followed by more radio silence? The fact that for a couple years before webpacker was officially retired it seemed to be getting no maintenance, including requiring dependency versions with CVE’s that just stayed that way? Not much documentation (ie Rails Guide) support for webpacker ever, or jsbundling-rails still?

One would think it might be a new leaf with css/jsbundling-rails… but I am still baffled by there being no support for sass sourcemaps in cssbundling-rails and sass! Official rails support doesn’t necessarily get you much “just works” DX when it comes to asset handling for years now.

Let’s face it, this has been an area where being in the Rails github org and/or being blessed by Rails docs has been no particular reason to expect maintenance or expect you won’t have problems down the line anyway. it’s open source, nobody owes you anything, maintainers spend time on what they have interest to spend time on (including time to review/merge/maintain other’s PR’s — which is def non-trivial time!) — it just is what it is.

While the vite-ruby code provides a pretty great integrated-into-Rails DX, it’s also actually mostly pretty simple code, especially when it comes to the Rails touch points most at risk of Rails breaking; it’s not doing anything too convoluted.

So, you know, you take your chances, I feel good about my chances compared to a css/jsbundling-rails solution. And if someday I have to switch things over again, oh well — Rails just pulled webpacker out from under us quicker than expected too, so you take your chances regardless!

(thanks to colleague Anna Headley for first suggesting we take a look at vite in Rails!)

Using engine_cart with Rails 6.1 and Ruby 3.1

jrochkind — Mon, 13 Jun 2022 14:40:09 +0000

Rails does not seem to generally advertise ruby version compatibility, but it seems to be the case that Rails 6.1, I believe, works with Ruby 3.1, as long as you manually add three dependencies to your Gemfile.

gem "net-imap"
gem "net-pop"
gem "net-smtp"

(Here’s a somewhat cryptic gist from one (I think) Rails committer with some background. Although it doesn’t specifically and clearly tell you to add these dependencies for Rails 6.1 and ruby 3.1… it won’t work unless you do. You can find other discussion of this on the net.)

Or you can instead add one line to your Gemfile, opting in to using the pre-release mail gem 2.8.0.rc1, which includes these dependencies for ruby 3.1 compatibility. Mail is already a Rails dependency; but pre-release gems (whose version numbers end in something including letters after a third period) won’t be included by bundler unless you mention a pre-release version (whose version number ends in…) explicitly in Gemfile.

gem "mail", ">= 2.8.0.rc1"

Once mail 2.8.0 final is released, if I understand what’s going on right, you won’t need to do any of this: since it won’t be a pre-release version, bundler will just use it when you bundle update a Rails app, and it expresses the dependencies you need for ruby 3.1, so Rails 6.1 will Just Work with ruby 3.1. Phew! I hope it gets released soon (been about 7 weeks since 2.8.0.rc1).
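If you want to poke at how rubygems itself classifies these version numbers, here is a small sketch using Gem::Version and Gem::Requirement. It only shows the classification and ordering; how bundler’s resolver then decides whether to consider pre-releases is not shown here.

```ruby
require "rubygems" # normally already loaded, shown for clarity

rc    = Gem::Version.new("2.8.0.rc1")
final = Gem::Version.new("2.8.0")

# A version with letters in it counts as a pre-release,
# and sorts before the corresponding final release.
puts rc.prerelease?    # => true
puts final.prerelease? # => false
puts rc < final        # => true

# A requirement that names a pre-release version is itself flagged
# as a pre-release requirement.
requirement = Gem::Requirement.new(">= 2.8.0.rc1")
puts requirement.prerelease?          # => true
puts requirement.satisfied_by?(rc)    # => true
puts requirement.satisfied_by?(final) # => true

# An ordinary requirement is not.
puts Gem::Requirement.new(">= 2.7").prerelease? # => false
```

Since 2.8.0 final also satisfies ">= 2.8.0.rc1", the Gemfile line above keeps working unchanged once the final release is out.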

Engine cart

Engine_cart is a gem for dynamically creating Rails apps at runtime for use in CI build systems, mainly to test Rails engine gems. It’s in use in some collaborative open source communities I participate in. While it has plusses (actually integration testing real app generation) and minuses (kind of a maintenance nightmare it turns out), I don’t generally recommend it. If you haven’t heard of it before and are wondering “Does jrochkind think I should use this for testing engine gems in general?”, this is not an endorsement. In general it can add a lot of pain.

But it’s in use in some projects I sometimes help maintain.

How do you get a build using engine_cart to successfully test under Rails 6.1 and ruby 3.1? Since if it were “manual” you’d have to add a line to a Gemfile…

It turns out you can create a ./spec/test_app_templates/Gemfile.extra file, with the necessary extra gem calls:

gem "net-imap"
gem "net-pop"
gem "net-smtp"

# OR, above OR below, don't need both

gem "mail", ">= 2.8.0.rc1"

I think ./spec/test_app_templates/Gemfile.extra is a “magic path” used by engine_cart… or if the app I’m working on is setting it, I can’t figure out why/how! But I also can’t quite figure out why/if engine_cart is defaulting to it…
Adding this to your main project Gemfile is not sufficient; it needs to be in Gemfile.extra.
Some projects I’ve seen have a line in their Gemfile using eval_gemfile and referencing the Gemfile.extra… which I don’t really understand… and does not seem to be necessary to me… I think maybe it’s leftover from past versions of engine_cart best practices?
To be honest, I don’t really understand how/where the Gemfile.extra is coming in, and I haven’t found any documentation for it in engine_cart. So if this doesn’t work for you… you probably just haven’t properly configured engine_cart to use the Gemfile.extra in that location, which the project I’m working on has done in some way?

Note that you may still get an error produced in build output at some point of generating the test app:

run  bundle binstubs bundler
rails  webpacker:install
You don't have net-smtp installed in your application. Please add it to your Gemfile and run bundle install
rails aborted!
LoadError: cannot load such file -- net/smtp

But it seems to continue and work anyway!

None of this should be necessary when mail 2.8.0 final is released, it should just work!

The above is of course always including those extra dependencies, for all builds in your matrix, when they are only necessary for Rails 6.1 (not 7!) and ruby 3.1. If you’d instead like to guard it to only apply for that build, and your app is using the RAILS_VERSION env variable convention, this seems to work:

# ./spec/test_app_templates/Gemfile.extra
#
# Only necessary until mail 2.8.0 is released, allow us to build with engine_cart
# under Rails 6.1 and ruby 3.1, by opting into using pre-release version of mail
# 2.8.0.rc1
#
# https://github.com/mikel/mail/pull/1472

if ENV['RAILS_VERSION'] && ENV['RAILS_VERSION'] =~ /^6\.1\./ && RUBY_VERSION =~ /^3\.1\./
  gem "mail", ">= 2.8.0.rc1"
end
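If the regexes feel brittle, the same guard can be sketched with Gem::Version comparisons instead. This is an alternative, not something engine_cart provides: the helper name is made up, and it assumes RAILS_VERSION holds a plain version number like 6.1.4 rather than a constraint like "~> 6.1.0" (Gem::Version.new raises ArgumentError on those).

```ruby
# Hypothetical helper for Gemfile.extra: true only for the
# Rails 6.1 + ruby 3.1 combination that needs the mail pre-release.
def mail_prerelease_needed?(rails_version, ruby_version = RUBY_VERSION)
  return false if rails_version.nil? || rails_version.empty?

  # Compare only major.minor, e.g. "6.1.4" => [6, 1]
  rails = Gem::Version.new(rails_version).segments.first(2)
  ruby  = Gem::Version.new(ruby_version).segments.first(2)
  rails == [6, 1] && ruby == [3, 1]
end

# In Gemfile.extra this would be used as:
#   gem "mail", ">= 2.8.0.rc1" if mail_prerelease_needed?(ENV["RAILS_VERSION"])
```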

Rails7 connection.select_all is stricter about its arguments in a backwards-incompatible way: TypeError: Can’t Cast Array

jrochkind — Mon, 28 Mar 2022 19:51:04 +0000

I have code that wanted to execute some raw SQL against an ActiveRecord database. It is complicated and weird multi-table SQL (involving a postgres recursive CTE), so none of the specific-model-based API for specifying SQL seemed appropriate. It also needed to take some parameters, that needed to be properly escaped/sanitized.

At some point I decided that the right way to do this was with Model.connection.select_all , which would create a parameterized prepared statement.

Was I right? Is there a better way to do this? The method is briefly mentioned in the Rails Guide (demonstrating it is public API!), but without many details about the arguments. It has very limited API docs, just doc’d as: select_all(arel, name = nil, binds = [], preparable: nil, async: false), “Returns an ActiveRecord::Result instance.” No explanation of the type or semantics of the arguments.

In my code working on Rails previous to 7, the call looked like:

MyModel.connection.select_all(
  "select complicated_stuff WHERE something = $1",
  "my_complicated_stuff_name",
  [[nil, value_for_dollar_one_sub]],
  preparable: true
)

Yeah, that value for the binds is weird: a duple-array within an array, where the first value of the duple-array is just nil? This isn’t documented anywhere, I probably got that from somewhere… maybe one of the several StackOverflow answers.
I honestly don’t know what preparable: true does, or what difference it makes.

In Rails 7.0, this started failing with the error: TypeError: can’t cast Array.

I couldn’t find any documentation of that select_all method at all, or other discussion of this; I couldn’t find any select_all change mentioned in the Rails Changelog. I tried looking at actual code history but got lost. I’m guessing “can’t cast Array” refers to that weird binds value… but what is it supposed to be?

Eventually I thought to look for Rails tests of this method that used the binds argument, and managed to eventually find one!

So… okay, rewrote that with new binds argument like so:

bind = ActiveRecord::Relation::QueryAttribute.new(
  "something", 
  value_for_dollar_one_sub, 
  ActiveRecord::Type::Value.new
)

MyModel.connection.select_all(
  "select complicated_stuff WHERE something = $1",
  "my_complicated_stuff_name",
  [bind],
  preparable: true
)

Confirmed this worked not only in Rails 7, but all the way back to Rails 5.2 no problem.
I guess that way I was doing it previously was some legacy way of passing args that was finally removed in Rails 7?
I still don’t really understand what I’m doing. The first arg to ActiveRecord::Relation::QueryAttribute.new I made match the SQL column it was going to be compared against, but I don’t know if it matters or if it’s used for anything. The third argument appears to be an ActiveRecord Type… I just left it the generic ActiveRecord::Type::Value.new, which seemed to work fine for both integer or string values, not sure in what cases you’d want to use a specific type value here, or what it would do.
In general, I wonder if there’s a better way for me to be doing what I’m doing here? It’s odd to me that nobody else findable on the internet has run into this… even though there are stackoverflow answers suggesting this approach… maybe I’m doing it wrong?
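For queries with more than one placeholder, the same pattern should extend naturally. A sketch only: query_binds is a made-up helper name, it assumes it runs inside a Rails app where ActiveRecord is already loaded, and it assumes (as above) that the generic ActiveRecord::Type::Value is good enough for the values involved. The binds line up with $1, $2, … by their position in the array.

```ruby
# Hypothetical helper: turn an ordered hash of name => value into the
# binds array passed to select_all. The first pair goes with $1, the
# second with $2, and so on.
def query_binds(named_values)
  named_values.map do |name, value|
    ActiveRecord::Relation::QueryAttribute.new(
      name.to_s,
      value,
      ActiveRecord::Type::Value.new
    )
  end
end

# Usage, inside a Rails app:
#
#   MyModel.connection.select_all(
#     "select complicated_stuff WHERE something = $1 AND other = $2",
#     "my_complicated_stuff_name",
#     query_binds(something: "a", other: 42),
#     preparable: true
#   )
```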

But anyways, since this was pretty hard to debug, hard to find in docs or explanations on google, and I found no mention at all of this changing/breaking in Rails 7… I figured I’d write it up so someone else had the chance of hitting on this answer.

Exploring hosting video in a non-profit archival digital collections web app

jrochkind — Tue, 08 Mar 2022 18:44:42 +0000

We are a small independent academic institution that hosts “digital collections” of digitized historical materials on the web, using a custom in-house Rails app.

We’re getting into including some historical video content for the first time, and since I didn’t have much experience with video, it required me to figure out a few things that I’m going to document here. (Thanks to many folks from Code4Lib and Samvera communities who answered my questions, including mbklein, cjcolvar, and david.schober).

Some more things about us and our content:

Our content at least initially will be mostly digitized VHS (480p, fairly low-visual-quality content), although we could also eventually have some digitized 16mm film and other content that could be HD.
Our app is entirely cloud-hosted, mainly on heroku and AWS S3. We don’t manage any of our own servers at the OS level (say an EC2), and don’t want to except as a last resort (we don’t really have the in-house capacity to).

Our usage patterns are not necessarily typical for a commercial application! We have a lot (at least eventually we will) of fairly low-res/low-bitrate stuff (old VHSs!), it’s unclear how much will get viewed how often (a library/archives probably stores a lot more content as a proportion of content viewed than a commercial operation), and our revenue doesn’t increase as our content accesses do! So, warning to the general reader, our lens on things may or may not match a commercial enterprise’s. But if you are one of our peers in cultural heritage, you’re thinking, “I know, right?”

All of these notes are as of February 2022, if you are reading far in the future, some things may have changed. Also, I may have some things wrong, this is just what I have figured out, corrections welcome.

Standard Video Format: MP4 (h.264) with html5 video

I originally had in my head that maybe you needed to provide multiple video formats to hit all browsers — but it seems you don’t really anymore.

An MP4 using H.264 codec and AAC audio can be displayed in pretty much any browser of note with an html5 video tag.

There are newer/better/alternate video formats. Some terms (I’m not totally sure which of these are containers and which are codecs, and which containers can go with which codecs!) include: WebM, Ogg, Theora, H.265, VP8, and VP9. Some of these are “better” in various ways (whether better licenses or better quality video at same filesize), but none are yet supported across all common browsers.

You could choose to provide alternate formats so browsers that can use one of the newer formats will, but I think the days of needing to provide multiple formats to satisfy 99%+ of possible consumers are gone. (Definitely forget about flash).

The vast majority of our cultural heritage peers I could find are using MP4. (Except for the minority using an adaptive bitrate streaming format…).

Another thing to know about H.264-MP4 is that even at the same screen size/resolution, the bitrate, and therefore the filesize (per second of footage) can vary quite a bit between files. This is because of the (lossy) compression in H.264. Some original source material may just be more compressible than others; not sure how much this comes into play. What definitely comes into play is that when encoding, you can choose to balance higher compression level vs higher quality (similar to in JPG). Most encoding software lets you either set a maximum bitrate (sometimes using either a variable-bit-rate (VBR) or constant-bit-rate (CBR) algorithm), OR choose a “quality level” on some scale, in which case the encoder will compress differently from second to second of footage, at whatever level will give you that level of “quality”.
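With ffmpeg and libx264, those two strategies roughly correspond to -crf (quality level) versus -b:v with -maxrate/-bufsize (bitrate cap). A sketch that just builds the argument lists; the method names and the specific numbers are illustrative choices, not recommendations:

```ruby
# Build ffmpeg argument lists for the two encoding strategies.
# Nothing is executed here; pass the array to system(*args) to run it.

# Quality-targeted: -crf picks a quality level (lower number means higher
# quality and bigger files); the bitrate then varies with the footage.
def ffmpeg_quality_args(input, output, crf: 23)
  ["ffmpeg", "-i", input,
   "-c:v", "libx264", "-crf", crf.to_s,
   "-c:a", "aac",
   output]
end

# Bitrate-capped: aim for an average video bitrate and cap the peaks.
def ffmpeg_capped_args(input, output, kbps: 1500)
  ["ffmpeg", "-i", input,
   "-c:v", "libx264",
   "-b:v", "#{kbps}k", "-maxrate", "#{kbps}k", "-bufsize", "#{kbps * 2}k",
   "-c:a", "aac",
   output]
end

puts ffmpeg_quality_args("in.mp4", "out.mp4").join(" ")
puts ffmpeg_capped_args("in.mp4", "out.mp4", kbps: 1500).join(" ")
```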

An Aside: Tools: ffmpeg and ffprobe

ffmpeg (and its accompanying ffprobe analyzer) are amazing open source tools. They do almost anything. They are at the heart of most other video processing software.

ffprobe is invaluable for figuring out “what do I have here” if say given an mp4 file.

One thing I didn’t notice at first that is really neat is that both ffmpeg and ffprobe can take a URL as an input argument just about anywhere they can take a local file path. This can be very convenient if your video source material is, say, on S3. You don’t need to download it before feeding to ffmpeg or ffprobe, you can just give them a URL, leading to faster more efficient operations as ffmpeg/ffprobe take care of downloading only what they need as they need it. (ffprobe characterization can often happen from only the first portion of the file, for instance).
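From Ruby, that might look something like the following sketch. The method names are made up, and the URL in the usage comment is a placeholder; the only real requirement is that ffprobe is on your PATH and can reach the URL.

```ruby
require "json"
require "open3"

# ffprobe accepts a URL wherever it accepts a local file path.
def ffprobe_args(path_or_url)
  ["ffprobe", "-v", "error",
   "-print_format", "json",
   "-show_format", "-show_streams",
   path_or_url]
end

# Runs ffprobe and returns its JSON report as a Hash.
def probe(path_or_url)
  stdout, stderr, status = Open3.capture3(*ffprobe_args(path_or_url))
  raise "ffprobe failed: #{stderr}" unless status.success?

  JSON.parse(stdout)
end

# Overall bitrate in kbps, read from the "format" section of the report.
def overall_kbps(report)
  report.fetch("format").fetch("bit_rate").to_i / 1000
end

# Usage (placeholder URL):
#   report = probe("https://example.org/some/video.mp4")
#   puts overall_kbps(report)
```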

IF you need an adaptive bitrate streaming format: HLS will do

So then there are “streaming” formats, the two most popular being HLS or MPEG-DASH. While required if you are doing live-streaming, you also can use them for pre-recorded video (which is often called “video-on-demand” (VOD) in internet discussions/documentation, coming out of terminology used in commercial applications).

The main reason to do this is for the “adaptive bitrate” features. You can provide variants at various bitrates, and someone on a slow network (maybe a 3G cellphone) can still watch your video, just at lower resolution/visual quality. An adaptive bitrate streaming format can even change the resolution/quality in mid-view, if network conditions change. Perhaps you’ve noticed this watching commercial video over the internet.

Just like with static video, there seems to be one format which is supported in (nearly) all current browsers (not IE though): HLS. (Which sometimes/usually may actually be h.264, just reformatted for streaming use? Not sure). While the caniuse chart makes it look like only a few browsers support HLS, in fact the other browsers do via their support of the Media Source API (formerly called Media Source Extensions). Javascript code can be used to add HLS support to a browser via the Media Source API, and if you use a Javascript viewer like video.js or MediaElement.js, this just happens for you transparently. This is one reason people may be using such players instead of raw html5 tags alone. (I plan to use video.js).

Analogously to MP4, there are other adaptive bitrate streaming formats too, that may be superior in various ways to HLS, but don’t have as wide support. Like MPEG-DASH. At the moment, you can reach pretty much any browser in use with HLS, and of community/industry peers I found using an adaptive bitrate streaming format, HLS was the main one in use, usually presented without alternative sources.

But do we even need it?

At first I thought that one point of HLS was allowing better/more efficient seeking: When the viewer wants to jump to the middle of the video you don’t want to force the browser to download the whole video. But in fact, after more research I believe current browsers can seek in a static mp4 just fine using HTTP byte-range requests (important they are on a server that supports such!), so long as the MP4 was prepared with what ffmpeg calls faststart. (This is important! You can verify if a video is properly prepared by running mediainfo on it and looking for the IsStreamable line in output; or by jumping through some hoops with ffmpeg).
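The underlying idea of faststart is that the MP4’s index (the moov box) is moved ahead of the media data (the mdat box), so a player can begin and seek after fetching only the front of the file. If you’d rather not install mediainfo, a rough pure-Ruby check is to walk the top-level boxes and see which comes first. A minimal sketch: the method names are mine, and it only looks at top-level box order, nothing else about the file.

```ruby
require "stringio"

# Returns the four-character types of the top-level MP4 boxes, in file order.
# Each box starts with a 4-byte big-endian size followed by a 4-byte type.
def top_level_boxes(io)
  boxes = []
  loop do
    header = io.read(8)
    break if header.nil? || header.bytesize < 8

    size, type = header.unpack("Na4")
    header_size = 8
    if size == 1 # real size is in a 64-bit field right after the type
      extended = io.read(8)
      break if extended.nil? || extended.bytesize < 8

      size = extended.unpack1("Q>")
      header_size = 16
    end

    boxes << type
    break if size.zero?          # size 0 means the box runs to end of file
    break if size < header_size  # malformed, stop rather than loop forever

    io.seek(size - header_size, IO::SEEK_CUR) # skip the box payload
  end
  boxes
end

# True if the moov box comes before the mdat box.
def faststart?(io)
  boxes = top_level_boxes(io)
  moov = boxes.index("moov")
  mdat = boxes.index("mdat")
  !moov.nil? && (mdat.nil? || moov < mdat)
end

# Usage (placeholder filename):
#   File.open("video.mp4", "rb") { |f| puts faststart?(f) }
```

If the check comes back false, remuxing with ffmpeg -i in.mp4 -c copy -movflags +faststart out.mp4 rewrites the file with moov at the front without re-encoding.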

It’s a bit hacky to do seeking with just http-hosted mp4s, but user-agents seem to have gotten pretty good at it. You may want to be sure any HTTP caching layer you are using can appropriately pass through and cache HTTP byte-range requests.

So, as far as I know, that leaves the actual dynamic adaptive bitrate feature as the reason to use HLS. But you need that only if the MP4’s you’d be using otherwise are a higher bitrate than the lowest bitrate you’d be preparing for inclusion in your HLS bundle. Like, if your original MP4 is only 500 kbps, and that’s the lowest bitrate you’d be including in your HLS package anyway… your original MP4 is already viewable on as slow a connection as your HLS preparation would be.

What is the lowest bitrate typically included as an option in HLS? I’ve found it hard to find advice on what variants to prepare for HLS distribution! In Avalon’s default configuration for creating HLS with ffmpeg, the lowest bitrate variant is 480p at 500 kbps. For more comparisons, if you turn on simulation of a slow connection in Chrome Dev Tools, Chrome says “Slow 3G” is about 400 kbps, and “Fast 3G” is about 1500 kbps. The lowest bitrate offered by AWS MediaConvert’s HLS presets is 400 kbps (for 270p or 360p) or 600kbps at 480p.

(In some cases these bitrates may be video-only; audio can add another 100-200kbps, depending on what quality you encode it at.)

I think if your original MP4 is around 500 kbps, there’s really no reason at all to use HLS. As it approaches 1500 kbps (1.5 Mbps)… you could consider creating HLS with a 500kbps variant, but also probably get away with serving most users adequately at the original bitrate (sorry to the slower end of cell phone network). As you start approaching 3Mbps, I’d start considering HLS, and if you have HD 720p or 1080p content (let alone 4K!) which can get up to 6 Mbps bitrates and even much higher — I think you’d probably be leaving users on not-the-fastest connections very frustrated without HLS.

This is me doing a lot of guessing, cause I haven’t found a lot of clear guidance on this!

As our originals are digitizations of VHS (and old sometimes degraded VHS at that), and started out pretty low-quality, our initial bitrates are fairly low. In one of our sample collections, the bitrates were around 1300 kbps: I’d say we probably don’t need HLS? Some of our digitized film was digitized in SD at around 2300 kbps: meh? But we had a couple films digitized at 1440p and 10 Mbps: okay, probably want to either downsample the access MP4, or use HLS.
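One way to write that rule of thumb down, as a sketch. The cutoffs are just the approximate guesses from the paragraphs above turned into hard numbers, and the method name and labels are made up; treat the boundaries as fuzzy.

```ruby
# Rough guidance on whether an MP4 at this average bitrate wants an HLS
# alternative. Thresholds are approximate guesses, not established practice.
def hls_advice(kbps)
  if kbps <= 500
    :unnecessary          # already as light as the lowest typical HLS variant
  elsif kbps <= 1500
    :probably_unnecessary # fine for most viewers, rough on slow cell networks
  elsif kbps <= 3000
    :borderline
  else
    :worth_it             # or downsample the access MP4 instead
  end
end

puts hls_advice(1300)    # probably_unnecessary
puts hls_advice(2300)    # borderline
puts hls_advice(10_000)  # worth_it
```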

MOST of our cultural heritage peers do not yet seem to be using HLS. In a sampling of videos found on DPLA, almost all were being delivered as MP4 (and usually fairly low-quality videos at under 2 Mbps, so that’s fine!). However, most of our samvera peers using avalon are probably using HLS.

So how do you create and serve HLS?

Once you create it, it’s actually just static files on disk, you can serve with a plain old static HTTP web server. You don’t need any kind of fancy media server to serve HLS! (Another misconception I started out not being sure of, that maybe used to be different, in the days when “RTP” and such were your only adaptive streaming options). An HLS copy on disk is one (or usually several) .m3u8 manifest files, and a lot of .ts files with chunked data referenced by the manifests.
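To make “just static files” concrete, here is a sketch that generates the text of a top-level (master) .m3u8 manifest pointing at per-variant manifests. The variant bitrates, resolutions and paths are invented for illustration; real manifests often carry more attributes (like CODECS) than this minimal one.

```ruby
# Build the text of an HLS master playlist. Each variant gets an
# EXT-X-STREAM-INF line (BANDWIDTH is in bits per second) followed by the
# URI of that variant's own playlist.
def master_playlist(variants)
  lines = ["#EXTM3U"]
  variants.each do |variant|
    lines << "#EXT-X-STREAM-INF:BANDWIDTH=#{variant[:kbps] * 1000},RESOLUTION=#{variant[:resolution]}"
    lines << variant[:uri]
  end
  lines.join("\n") + "\n"
end

puts master_playlist([
  { kbps: 500,  resolution: "640x480", uri: "480p_500k/index.m3u8" },
  { kbps: 1500, resolution: "640x480", uri: "480p_1500k/index.m3u8" }
])
```

The player fetches this file first, picks a variant based on measured bandwidth, then fetches that variant’s manifest and the .ts chunks it lists, all with ordinary HTTP GETs.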

You can (of course) use ffmpeg to create HLS. A lot of people do that happily, but it doesn’t work well for our cloud deployment — creating HLS takes too long for us to want to do it in a Rails background job on a heroku worker dyno, and we don’t want to be in the business of running our own servers otherwise.

Another thing some of our peers use is the wowza media server. We didn’t really look at that seriously either, I think our peers using it are at universities with enormous campus-wide site licenses for running it on-premises (which we don’t want to do), there might be a cloud-hosted SaaS version… but I just mention this product for completeness in case you are interested, it looked too “enterprisey” for our needs, and we didn’t significantly investigate.

The solutions we found and looked at that fit into the needs we had for our cloud-hosted application were:

Use AWS Elemental MediaConvert to create the HLS variants, host on S3 and serve from there (probably with CloudFront, paying AWS data egress fees for video watched). This is sort of like a cloud alternative to ffmpeg, you tell it exactly what HLS variants you want.
AWS Elemental MediaPackage ends up working more like a “video server”: you just give it your originals, and get back an HLS URL, you leave the characteristics of the HLS variants up to the black box, and it creates them apparently on-the-fly as needed. You don’t pay for storage of any HLS variants, but do pay a non-trivial fee for every minute of video processed (potentially multiple times if it expires from cache and is watched again) on top of the AWS egress fees.
CloudFlare Stream is basically CloudFlare’s version of MediaPackage. They charge by second of footage instead of byte (for both storage and the equivalent of egress bandwidth), and it’s not cheap… whether it’s cheaper or more expensive than MediaPackage can depend on the bitrate of your material, and the usage patterns of viewing/storing. For our low-bitrate unpredictable-usage patterns, it looks to me likely to end up more expensive? But not sure.
Just upload them all to youtube and/or vimeo, and serve from there? Kind of crazy but it just might work? Could still be playing on our web pages, but they’re actually pointing at youtube or vimeo hosting the video…. While this has the attraction of being the only SaaS/PaaS solutions I know of that won’t have metered bandwidth (you don’t pay per minute viewed)… there are some limitations too. I couldn’t really find any peers doing this with historical cultural heritage materials.

I’m going to talk about each of these in somewhat more detail below, especially with regard to costs.

First a word on cost estimating of video streaming

One of our biggest concerns with beginning to include video in digital collections is cost. Especially because, serving out of S3/Cloudfront, or most (all?) other cloud options, we pay data egress costs for every minute of video viewed. As a non-profit educational institution, the more the video gets viewed, the more we’re achieving our mission — but our revenue doesn’t go up with minutes viewed, and it can really add up.

So that’s one of the things we’re most interested in comparing between different HLS options.

But comparing it requires guessing at a bunch of metrics. Some that are easier to guess are: How many hours of video we’ll be storing; and How many hours we’ll be ingesting per month. Some services charge in bytes, some in hours; to convert from hours to bytes requires us to guess our average bitrate, which we can take a reasonable stab at. (Our digitized VHS is going to be fairly low quality, maybe 1500-2500 kbps).
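The hours-to-bytes conversion is simple arithmetic, sketched here with decimal gigabytes (1 GB = 1,000,000,000 bytes). The method name is made up and the example bitrates are just the guesses mentioned above.

```ruby
# Decimal gigabytes taken up by a given duration at a given average bitrate.
def gigabytes_for(hours:, kbps:)
  bits = hours * 3600 * kbps * 1000.0
  bits / 8 / 1_000_000_000
end

puts gigabytes_for(hours: 1, kbps: 1500)    # 0.675
puts gigabytes_for(hours: 100, kbps: 2500)  # 112.5
```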

Then there are the numbers that are a lot harder to guess — how many hours of video will be viewed a month? 100 hours? 1000? 10,000? I am not really sure what to expect, it’s the least under our control, it could grow almost unboundedly and cost us a lot of money. Similarly, If we offer full-file downloads, how many GB of video files will be downloaded a month?

Well, I made some guesses, and I made a spreadsheet that tried to estimate costs of various platforms under various scenarios. (There are also probably assumptions I don’t even realize I’m making not reflected in the spreadsheet that will affect costs!). Our initial estimates are pretty small for a typical enterprise, just hosting maybe 100 hours of video, maybe 200 hours viewed a month? Low-res VHS digitized material. (Our budget is also fairly sensitive to what would be very small amounts in a commercial enterprise!)

You can see/copy the spreadsheet here, and I’ll put a few words about each below.

Serve MP4 files from S3

The base case! Just serve plain MP4 files from S3 (probably with CloudFront in front). Sufficient if our MP4 bitrates are 500 kbps, maybe up to around 1.5 Mbps.

Our current app architecture actually keeps three copies of all data — production, a backup, and a sync’d staging environment. So that’s some S3 storage charges, initially estimated at just full-cost standard S3. There are essentially no “ingest” costs (some nominal cost to replicate production to our cross-region backup).

Then there are the standard AWS data egress costs — Cloudfront not actually that different from standard S3, until you get into trying to do bulk reserved purchases or something, but we’ll just estimate at the standard rate.

The storage costs will probably be included in any other solution too, since we’ll probably still keep our “canonical” cop(ies) on S3 regardless.

HLS via AWS Elemental MediaConvert

AWS Elemental MediaConvert is basically a transcoding service — think of it like ffmpeg transcoding but AWS-hosted SaaS. Your source needs to be on S3 (well, technically it can be a public URL elsewhere), you decide what variants you want to create, they are written to an S3 bucket.

Bandwidth costs are exactly the same as our base MP4 S3 case, since we’d still be serving from Cloudfront, so it essentially scales up with traffic exactly the same as our base case, which is nice. (Hypothetically it could be somewhat less bandwidth depending on how many users receive lower-bitrate variants via HLS, but we just estimated that everyone would get the high-quality one, as an upper bound).

We pay a bit more for storage (have to store the HLS derivatives, just standard S3 prices).

Then we pay an ingest cost to create the HLS, that is actually charged per minute (rather than per GB) — for SD, if we avoid “professional tier” features and stay no more than 30 fps, $0.0075 per minute of produced video (meaning the more HLS variants you create, the more you pay).
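Since the charge is per minute of output, the ingest cost is just source minutes times number of variants times the rate. A sketch using the $0.0075 SD rate quoted above (check current AWS pricing before relying on it); the method name and the example workload are made up.

```ruby
# Estimated MediaConvert charge for creating HLS variants.
# rate_per_minute defaults to the SD rate quoted in the text.
def mediaconvert_ingest_cost(source_minutes:, variants:, rate_per_minute: 0.0075)
  (source_minutes * variants * rate_per_minute).round(2)
end

# e.g. 10 hours of new footage in a month, three HLS variants each
puts mediaconvert_ingest_cost(source_minutes: 10 * 60, variants: 3)  # 13.5
```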

Since we probably will be digitizing a fairly small package per month, and there is no premium over standard S3 for bandwidth (our least predictable cost), this may be a safe option?

Another point in its favor: we know of at least one peer using AWS Elemental MediaConvert (Northwestern University Libraries).

AWS Elemental MediaPackage

MediaPackage is basically an AWS media server offering. You can use it for live streaming, but we’re looking at the “Video on Demand” use case and pricing only. You just give it a source video, and it creates an HLS URL for it. (Among other possible streaming formats; HLS is the one we care about). You don’t (and I think can’t) tell it what bitrates/variants to create in the HLS stream, it just does what it thinks best. On the fly/on-demand, I think?

The pricing model includes a fairly expensive $0.05/GB packaging fee, that is, a fee to create the stream. (why not per-minute like MediaConvert? I don’t know). This is charged on-demand: Not until someone tries to watch a video. If multiple people are watching the same video at the same time, you’ll only pay the packaging fee once as it’ll be cached in CloudFront. But I don’t know if it’s clear exactly how long it will remain cached in CloudFront, and I don’t know how to predict my viewers’ usage patterns anyway: how much they’ll end up watching the same videos and taking advantage of cache, and how many views will result in packaging fees vs cached views.

So taking a worst-case estimate of zero cache utilization, MediaPackage basically adds a 50% premium to our bandwidth costs. These being our least predictable and unbounded costs, this is a bit risky — if we have a lot of viewed minutes, that don’t cache well, this could end up being much more expensive than MediaConvert approach.
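The effect of caching on that premium can be sketched as below. The $0.05/GB packaging fee is the one quoted above; the $0.10/GB egress price is a round placeholder chosen to make the arithmetic obvious, not a quoted AWS rate, and the method name is made up.

```ruby
# Estimated delivery cost through MediaPackage for a month of viewing.
# cache_hit_ratio is the share of viewed GB served from cache, which skips
# the packaging fee; 0.0 is the worst case.
def mediapackage_delivery_cost(viewed_gb:, egress_per_gb:, cache_hit_ratio: 0.0, packaging_per_gb: 0.05)
  packaged_gb = viewed_gb * (1.0 - cache_hit_ratio)
  (viewed_gb * egress_per_gb + packaged_gb * packaging_per_gb).round(2)
end

# Placeholder egress price of $0.10/GB, 100 GB viewed:
puts mediapackage_delivery_cost(viewed_gb: 100, egress_per_gb: 0.10)                        # 15.0
puts mediapackage_delivery_cost(viewed_gb: 100, egress_per_gb: 0.10, cache_hit_ratio: 0.5)  # 12.5
puts mediapackage_delivery_cost(viewed_gb: 100, egress_per_gb: 0.10, cache_hit_ratio: 1.0)  # 10.0
```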

But, you don’t pay any storage fees for the HLS derivatives at all. If you had a large-stored-volume and relatively small viewed-minutes, MediaPackage could easily end up cheaper than doing it yourself with MediaConvert (as you can see in our spreadsheet). Plus, there’s just less to manage or code, you really just give it an S3 source URL, and get back an HLS URL, the end.

CloudFlare Stream

CloudFlare Stream is basically Cloudflare’s alternative to MediaPackage. Similarly, it can be used for livestreaming, but we’re just looking at it for “video on demand”. Similarly, you basically just give it video and get back HLS URL (or possibly other streaming formats), without specifying the details.

The big difference is that CloudFlare meters per minute instead of per GB. That applies to storage of a copy of the “originals” in the Stream system, which is required, and which would be an extra copy for us, since we’re still going to want our canonical copy in AWS. (Don’t know if we can do everything we’d need/want with a copy only in CloudFlare Stream). And CloudFlare charges per minute for bandwidth from Stream too. (There is no ingest/packaging fee, and no cost for storage of derived HLS).

Since they charge per minute, how competitive it is really depends on your average bitrate: the higher your average bitrate, the better a deal CloudFlare is compared to AWS! At an average bitrate of more than 1500kbps, the CloudFlare bandwidth cost starts beating AWS; at 10 Mbps HD, it’s going to really beat it. But we’re looking at relatively low-quality SD under 1500kbps, so.
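The break-even point between per-minute and per-GB pricing is a one-line calculation. In this sketch both prices are assumptions for illustration ($1 per 1000 minutes delivered, and $0.085 per GB of egress); neither number comes from the text above, so check current price sheets before relying on them.

```ruby
# Average bitrate (kbps) at which a per-minute delivery price costs the same
# as a per-GB egress price. Above this bitrate, per-minute pricing wins.
def break_even_kbps(price_per_minute:, price_per_gb:)
  gb_per_minute = price_per_minute / price_per_gb
  kilobits_per_minute = gb_per_minute * 8_000_000 # 1 decimal GB = 8,000,000 kilobits
  kilobits_per_minute / 60
end

# Assumed prices; this comes out a little above 1500 kbps.
puts break_even_kbps(price_per_minute: 0.001, price_per_gb: 0.085).round
```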

Whether CloudFlare Stream is more or less expensive than one of the AWS approaches is going to depend not only on bandwidth, but on your usage patterns (how much are you storing, how much are you ingesting a month) — from a bit more expensive to, in the right circumstances, a lot less expensive.

CloudFlare Stream has pretty much the convenience of AWS MediaPackage, except that we need to deal with getting a separate copy of originals into CloudFlare, with prepaid storage limits (you need to kind of sign up for what storage limit you want). Which is actually kind of inconvenient.

What if we used YouTube or Vimeo though?

What if we host our videos on YouTube or Vimeo, and deliver them from there? Basically use them as a cloud hosted video server? I haven’t actually found any peers doing this with historical/cultural heritage/archival materials. But the obvious attraction is that these services don’t meter bandwidth, we could get out of paying higher egress/bandwidth as viewing usage goes up — our least predictable and potentially otherwise largest budget component.

The idea is that this would be basically invisible to the end-user, they’d still be looking at our digital collections app and an embedded viewer; ideally the items would not be findable in youtube/vimeo platform search or on google to a youtube/vimeo page. It would also be mostly invisible to content management staff, they’d ingest into our Digital Collections system same as ever, and our software would add to vimeo and get referencing links via vimeo API.

We’d just be using youtube or vimeo as a video server platform, really not too different from how one uses AWS MediaPackage or Cloudflare Stream.

Youtube is completely free, but, well, it’s youtube. It’s probably missing features we might want (I don’t think you can get a direct HLS link), has unclear/undocumented limits or risk of having your account terminated, you get what you pay for as far as support, etc.

Vimeo is more appealing. Features included (some possibly only in “pro” account and above) seem (am still at beginning of investigation) to include:

HLS URLs we could use with whatever viewers we wanted, same viewers we’d be using with any other HLS URL.
- Also note direct video download links, if we want, so we can avoid bandwidth/egress charges on downloads too!
Support for high-res videos, no problem, all the way up to 4K/HDR. (Although we don’t need that for this phase of VHS digitization)
“team” accounts where multiple staff accounts can have access to Vimeo management of our content. (Only 3 accounts on $20/month “pro”, 10 on “Business” and higher)
Unlisted/private settings that should keep our videos off of any vimeo searches or google. Even a “Hide from Vimeo” setting where the video cannot be viewed on a vimeo page at all, but only as embedded (say via HLS)!
- One issue, though, is that the HLS and video download links we do have probably won’t be signed/expiring — once someone has it, they have it and can share it (until/unless you delete the content from vimeo). This is probably fine for our public content use cases, but worth keeping in mind.
An API that looks reasonable and full-featured.

Vimeo storage/ingest limits

The way Vimeo does price tiers/metering is a bit odd, especially at the $20/month “Pro” level. It’s listed as being limited to ingesting 20GB/week, and 1TB/year. But I guess it can hold as much content as you want as long as you ingest it at no more than 20GB/week? Do I understand that right? For our relatively low-res SD content, let’s say at a 3 Mbps bitrate, 20GB/week is about 15 hours/week. At our current planned capacity, we wouldn’t be ingesting more than that as a rule, although it’s a bit annoying that if we did, as an unusual spike, our software would have to handle the weekly rate limit.

At higher plans, the vimeo limits are total storage rather than weekly ingest. The “Business” plan at $50/month has 5TB total storage. At a 3 Mbps bitrate, that’s around 3700 hours of content. At our current optimistic planned ingest capacity, it would take us over 10 years to fill that up. If it were HD content at 10 Mbps, 5TB is around 1100 hours of content, which we might reach in 4 or 5 years at our current planned ingest rate.
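Those hour figures come from dividing the storage limit by the bitrate. A sketch of the arithmetic, using decimal units (1 GB = 8,000,000,000 bits); the method name is made up.

```ruby
# Hours of video that fit in a storage or ingest limit at an average bitrate.
def hours_that_fit(gigabytes:, mbps:)
  bits = gigabytes * 8_000_000_000.0
  seconds = bits / (mbps * 1_000_000)
  seconds / 3600
end

puts hours_that_fit(gigabytes: 20, mbps: 3).round(1)   # 14.8  (weekly ingest on "Pro")
puts hours_that_fit(gigabytes: 5_000, mbps: 3).round   # 3704  (5TB at SD bitrates)
puts hours_that_fit(gigabytes: 5_000, mbps: 10).round  # 1111  (5TB at HD bitrates)
```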

The “Premium” plan lets you bump that up to 7TB for $75/month.

It’s certainly conceivable we could reach those limits — and even sooner if we increase our digitization/ingest capacity beyond what we’re starting with. I imagine at that point we’d have to talk to them about a custom “enterprise” plan, and hope they can work out something reasonable for a non-profit academic institution that just needs expanded storage limits.

I imagine we’d write our software so it could serve straight MP4 if the file wasn’t (yet?) in Vimeo, but would just use vimeo HLS (and download link?) URLs if it was.

It’s possible there will be additional unforeseen limitations or barriers once we get into an investigatory implementation, but this seems worth investigating.

So?

Our initial implementation may just go with static MP4 files, for our relatively low-bitrate SD content.

When we are ready to explore streaming (which could be soon after MVP), I think we’d probably explore Vimeo first. If not Vimeo… AWS MediaConvert is more “charted territory” as we have cooperating peers who have used it… but the possibility of MediaPackage or CloudFlare Stream being cheaper under some usage patterns is interesting. (And they are possibly somewhat simpler to implement). However, the chance of them being much more expensive under other usage patterns may be too risky. Predictability of budget is really high-value in the non-profit world, and that is a large part of our budgeting challenge here: costs are unpredictable when increased usage means increased costs due to metered bandwidth/egress.

Finding source install location of a loaded ruby gem at runtime

jrochkind — Mon, 10 Jan 2022 19:34:54 +0000

Not with a command line, but from within a ruby program, that has gems loaded … how do you determine the source install location of such a loaded gem?

I found it a bit difficult to find docs on this, so documenting for myself now that I’ve figured it out. Eg:

Gem.loaded_specs['device_detector'].full_gem_path
# => "/Users/jrochkind/.gem/ruby/2.7.5/gems/device_detector-1.0.5"
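
A slightly more defensive version, as a sketch: `Gem.loaded_specs` is a hash keyed by gem name, so asking for a gem that is not loaded returns nil. The helper name `loaded_gem_path` is made up for this example.

```ruby
require "rubygems"

# Returns the install directory of an already-loaded gem, or nil if no
# gem by that name is currently loaded (activated) in this process.
def loaded_gem_path(gem_name)
  spec = Gem.loaded_specs[gem_name]
  spec&.full_gem_path
end

# List every loaded gem with its version and source location.
Gem.loaded_specs.each do |name, spec|
  puts "#{name} #{spec.version}: #{spec.full_gem_path}"
end

# A gem that is not loaded gives nil rather than raising.
p loaded_gem_path("surely-not-a-real-gem-name")
```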

Blacklight: Automatic retries of failed Solr requests

jrochkind — Wed, 05 Jan 2022 20:42:40 +0000

Sometimes my Blacklight app makes a request to Solr and it fails in a temporary/intermittent way.

Maybe there was a temporary network interruption, resulting in a failed connection or timeout
Maybe Solr was overloaded and being slow, and timed out
- (Warning: Blacklight by default sets no timeouts, and is willing to wait forever for Solr, which you probably don’t want. How to set a timeout is under-documented, but set a read_timeout: key in your blacklight.yml to a number of seconds; or if you have RSolr 2.4.0+, set key timeout. Both will do the same thing, pass the value timeout to an underlying faraday client).
Maybe someone restarted the Solr being used live, which is not a good idea if you’re going for zero downtime, but maybe you aren’t that ambitious, or if you’re me maybe your SaaS Solr provider did it without telling you, to resolve the Log4Shell bug.
- And btw, if this happens, it can appear as a series of connection refused, 503 responses, and 404 responses, for maybe a second or three.

(By the way also note well: Your blacklight app may be encountering these without you knowing, even if you think you are monitoring errors. Blacklight by default will take pretty much all Solr errors, including timeouts, and rescue them, responding with an HTTP 200 status page with a message “Sorry, I don’t understand your search.” And HoneyBadger or other error monitoring you may be using will probably never know. I think this is broken and would like to fix it, but have been having trouble getting consensus and PR reviews to do so. You can fix it with some code locally, but that’s a separate topic, ANYWAY…)

So I said to myself, self, is there any way we could get Blacklight to automatically retry these sorts of temporary/intermittent failures, maybe once or twice, maybe after a delay? So there would be fewer errors presented to users (and fewer errors alerting me, after I fixed Blacklight to alert on em), in exchange for some users in those temporary error conditions waiting a bit longer for a page?

Blacklight talks to Solr via RSolr — can use 1.x or 2.x — and RSolr, if you’re using 2.x, uses faraday for its solr http connections. So one nice way might be to configure the Blacklight/RSolr faraday connection with the faraday retry middleware. (1.x rubydoc). (moved into its own gem in the recently released faraday 2.0).

Can you configure custom faraday middleware for the Blacklight faraday client? Yeesss…. but it requires making and configuring a custom Blacklight::Solr::Repository class, most conveniently by sub-classing the Blacklight class and overriding a private method. :( But it seems to work out quite well after you jump through some slightly kludgey hoops! Details below.

Questions for the Blacklight/Rsolr community:

Is this actually safe/forwards-compatible/supported, to be sub-classing Blacklight::Solr::Repository and over-riding build_connection with a call to super? Is this a bad idea?
Should Blacklight have its own supported and more targeted API for supplying custom faraday middleware generally (there are lots of ways this might be useful), or setting automatic retries specifically? I’d PR it, if there was some agreement about what it should look like and some chance of it getting reviewed/merged.
Is there anyone, anyone at all, who is interested in giving me emotional/political/sounding-board/code-review support for improving Blacklight’s error handling so it doesn’t swallow all connection/timeout/permanent configuration errors by returning an http 200 and telling the user “Sorry, I don’t understand your search”?

Oops, this may break in Faraday 2?

I haven’t actually tested this on the just-released Faraday 2.0, that was released right after I finished working on this. :( If faraday changes something that makes this approach infeasible, that might be added motivation to make Blacklight just have an API for customizing faraday middleware without having to hack into it like this.

The code for automatic retries in Blacklight 7

(and probably many other versions, but tested in Blacklight 7).

Here’s my whole local pull request if you find that more convenient, but I’ll also walk you through it a bit below and paste in frozen code.

There were some tricks to figuring out how to access and change the middleware on the existing faraday client returned by the super call; and how to remove the already-configured Blacklight middleware that would otherwise interfere with what we wanted to do (including an existing use of the retry middleware that I think is configured in a way that isn’t very useful or as intended). But overall it works out pretty well.

I’m having it retry timeouts, connection failures, 404 responses, and any 5xx response. Nothing else. (For instance it won’t retry on a 400 which generally indicates an actual request error of some kind that isn’t going to have any different result on retry).

I’m at least for now having it retry twice, waiting a fairly generous 300ms before first retry, then another 600ms before a second retry if needed. Hey, my app can be slow, so it goes.

Extensively annotated:

	# ./lib/scihist/blacklight_solr_repository.rb

	module Scihist
	  # Custom sub-class of stock blacklight, to override build_connection
	  # to provide custom faraday middleware for HTTP retries
	  #
	  # This may not be a totally safe forwards-compat Blacklight API
	  # thing to do, but the only/best way we could find to add-in
	  # Solr retries.
	  class BlacklightSolrRepository < Blacklight::Solr::Repository
	    # this is really only here for use in testing, skip the wait in tests
	    class_attribute :zero_interval_retry, default: false

	    # call super, but then mutate the faraday_connection on
	    # the returned RSolr 2.x+ client, to customize the middleware
	    # and add retry.
	    def build_connection(*_args, **_kwargs)
	      super.tap do |rsolr_client|
	        faraday_connection = rsolr_client.connection

	        # remove if already present, so we can add our own
	        faraday_connection.builder.delete(Faraday::Request::Retry)

	        # remove so we can make sure it's there AND added AFTER our
	        # retry, so our retry can successfully catch its exceptions
	        faraday_connection.builder.delete(Faraday::Response::RaiseError)

	        # add retry middleware with our own configuration
	        # https://github.com/lostisland/faraday/blob/main/docs/middleware/request/retry.md
	        #
	        # Retry at most twice, once after 300ms, then if needed after
	        # another 600 ms (backoff_factor set to result in that)
	        # Slow, but the idea is slow is better than an error, and our
	        # app is already kinda slow.
	        #
	        # Retry not only the default Faraday exception classes (including timeouts),
	        # but also Solr returning a 404 or any 5xx. Which gets converted to
	        # Faraday error because RSolr includes raise_error middleware already.
	        #
	        # Log retries. I wonder if there's a way to have us alerted if
	        # there are more than X in some time window Y…
	        faraday_connection.request :retry, {
	          interval: (zero_interval_retry ? 0 : 0.300),
	          # exponential backoff 2 means: 1) 0.300; 2) .600; 3) 1.2; 4) 2.4
	          backoff_factor: 2,
	          # But we only allow the first two before giving up.
	          max: 2,
	          exceptions: [
	            # default faraday retry exceptions
	            Errno::ETIMEDOUT,
	            Timeout::Error,
	            Faraday::TimeoutError,
	            Faraday::RetriableResponse, # important to include when overriding!
	            # we add some that could be Solr/jetty restarts, based
	            # on our observations:
	            Faraday::ConnectionFailed, # nothing listening there at all,
	            Faraday::ResourceNotFound, # HTTP 404
	            Faraday::ServerError # any HTTP 5xx
	          ],

	          retry_block: -> (env, options, retries_remaining, exc) do
	            Rails.logger.warn("Retrying Solr request: HTTP #{env["status"]}: #{exc.class}: retry #{options.max - retries_remaining}")
	            # other things we could log include `env.url` and `env.response.body`
	          end
	        }

	        # important to add this AFTER retry, to make sure retry can
	        # rescue and retry its errors
	        faraday_connection.response :raise_error
	      end
	    end
	  end
	end
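
To see how the interval and backoff_factor settings turn into actual wait times, here is a small plain-Ruby sketch. It assumes the retry middleware waits interval * backoff_factor**n before the n-th retry (counting from zero) and ignores any optional random jitter. `retry_delays` is a hypothetical helper written for this example, not part of Faraday.

```ruby
# Hypothetical helper: list the waits before each retry, assuming
# delay = interval * backoff_factor**retry_index with no random jitter.
def retry_delays(interval:, backoff_factor:, max:)
  (0...max).map { |retry_index| interval * backoff_factor**retry_index }
end

delays = retry_delays(interval: 0.300, backoff_factor: 2, max: 2)
puts delays.inspect
puts "worst-case added wait: #{delays.sum.round(3)}s"
```

With max: 2 only the first two steps of the backoff sequence are ever used, so a request that keeps failing adds a bit under a second of waiting before the error finally surfaces.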


Then in my local CatalogController config block, nothing more than:

config.repository_class = Scihist::BlacklightSolrRepository

I had some challenges figuring out how to test this. I ended up testing against a live running Solr instance, which my app’s test suite does sometimes (via solr_wrapper, for better or worse).

One test is just a simple smoke test that this thing still seems to function properly as a Blacklight::Solr::Repository without raising. And one tests the retry behavior for sample error responses:

	require "rails_helper"

	describe Scihist::BlacklightSolrRepository do
	  # a way to get a configured repository class…
	  let(:repository) do
	    Scihist::BlacklightSolrRepository.new(CatalogController.blacklight_config).tap do |repo|
	      # if we are testing retries, don't actually wait between them
	      repo.zero_interval_retry = true
	    end
	  end

	  # A simple smoke test against live solr hoping to be a basic test that the
	  # thing works like a Blacklight::Solr::Repository, our customization attempt
	  # hopefully didn't break it.
	  describe "ordinary behavior smoke test", solr: true do
	    before do
	      create(:public_work).update_index
	    end

	    it "can return results" do
	      response = repository.search
	      expect(response).to be_kind_of(Blacklight::Solr::Response)
	      expect(response.documents).to be_present
	    end
	  end

	  # We're actually going to use webmock to try to mock some error conditions
	  # to actually test retry behavior, not going to use live solr.
	  describe "retry behavior" do
	    let(:solr_select_url_regex) { /^#{Regexp.escape(ScihistDigicoll::Env.lookup!(:solr_url) + "/select")}/ }

	    describe "with solr 400 response" do
	      before do
	        stub_request(:any, solr_select_url_regex).to_return(status: 400, body: "error")
	      end

	      it "does not retry" do
	        expect {
	          repository.search
	        }.to raise_error(Blacklight::Exceptions::InvalidRequest)

	        expect(WebMock).to have_requested(:any, solr_select_url_regex).once
	      end
	    end

	    describe "with solr 404 response" do
	      before do
	        stub_request(:any, solr_select_url_regex).to_return(status: 404, body: "error")
	      end

	      it "retries twice" do
	        expect {
	          repository.search
	        }.to raise_error(Blacklight::Exceptions::InvalidRequest)

	        # one original request plus two retries
	        expect(WebMock).to have_requested(:any, solr_select_url_regex).times(3)
	      end
	    end
	  end
	end


Github Action setup-ruby needs to quote ‘3.0’ or will end up with ruby 3.1

jrochkind — Wed, 29 Dec 2021 03:04:25 +0000

You may be running builds in Github Actions using the setup-ruby action to install a chosen version of ruby, looking something like this:

    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: 3.0

A week ago, that would have installed the latest ruby 3.0.x. But as of the Christmas release of ruby 3.1, it will install the latest ruby 3.1.x.

The workaround and/or correction is to quote the ruby version number. If you actually want to get latest ruby 3.0.x, say:

      with:
        ruby-version: '3.0'

This is reported here, with reference to this issue on the Github Actions runner itself. It is not clear to me that this is any kind of a bug in the github actions runner, rather than just an unanticipated consequence of using a numeric value in YAML here. 3.0 is of course the same number as 3, it’s not obvious to me it’s a bug that the YAML parser treats them as such.
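
The underlying YAML behavior is easy to see from Ruby. This sketch uses Ruby's own YAML parser as a stand-in; the Actions runner uses a different parser, so take it only as an illustration of why an unquoted 3.0 is a number and a quoted '3.0' is a string.

```ruby
require "yaml"

unquoted = YAML.safe_load("ruby-version: 3.0")["ruby-version"]
quoted   = YAML.safe_load("ruby-version: '3.0'")["ruby-version"]

p unquoted.class   # Float: parsed as a number
p quoted.class     # String: kept exactly as written
p unquoted == 3    # true: as a number, 3.0 and 3 are the same value
p quoted           # "3.0": the string keeps its ".0"
```

Once the value is a number, anything downstream that turns it back into text is free to drop the ".0", at which point "3" can reasonably be read as "latest 3.x".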

Perhaps it’s a bug or mis-design in the setup-ruby action. But in lieu of any developers deciding it’s a bug… quote your 3.0 version number, or perhaps just quote all ruby version numbers with the setup-ruby task?

If your 3.0 builds started failing and you have no idea why — this could be it. It can be a bit confusing to diagnose, because I’m not sure anything in the Github Actions output will normally echo the ruby version in use? I guess there’s a clue in the “Installing Bundler” sub-head of the “Setup Ruby” task.

Of course it’s possible your build will succeed anyway on ruby 3.1 even if you meant to run it on ruby 3.0! Mine failed with LoadError: cannot load such file -- net/smtp, so if yours happened to do the same, maybe you got here from google. :) (Clearly net/smtp has been moved to a different status of standard gem in ruby 3.1, I’m not dealing with this further because I wasn’t intentionally supporting ruby 3.1 yet).

Note that if you are building with a Github actions matrix for ruby version, the same issue applies. Maybe something like:

      matrix:
        include:
          - ruby: '3.0'
    steps:
    - uses: actions/checkout@v2

    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: ${{ matrix.ruby }}

Setting and handling solr timeouts in Blacklight

jrochkind — Mon, 11 Oct 2021 16:39:18 +0000

When using the Blacklight gem for a Solr search front-end (most used by institutions in the library/cultural heritage sector), you may wish to set a timeout on how long to wait for Solr connection/response.

It turns out, if you are using Rsolr 2.x, you can do this by setting a read_timeout key in your blacklight.yml file. (This under-documented key is a general timeout, despite the name; I have not investigated with Rsolr 1.x).

But the way it turns into an exception and the way that exception is handled is probably not what you would consider useful for your app. You can then change this by over-riding the handle_request_error method in your CatalogController.

I am planning on submitting some PR’s to RSolr and Blacklight to improve some of these things.

Read on for details.

Why set a timeout?

It’s generally considered important to always set a timeout value on an external network request. If you don’t do this, your application may wait indefinitely for the remote server to respond, if the remote server is being slow or hung; or it may depend on underlying library default timeouts that may not be what you want.

What can happen to a Blacklight app that does not set a Solr timeout? We could have a Solr server that takes a really long time — or is entirely hung — on returning a response for one request, or many, or all of them.

Your web workers (eg puma or passenger) will be waiting a while for Solr. Either indefinitely, or maybe there’s a default timeout in the HTTP client (I’m actually not sure, but maybe 60s for net-http?). During this time, the web workers are busy, and unable to handle other requests. This will reduce the traffic capacity of your app; with a very slow or repeatedly misbehaving Solr, possibly catastrophically, leading to an app that appears unresponsive.

There may be some other part of the stack that will timeout waiting for the web worker to return a response (while the web worker is waiting for Solr). For instance, heroku is willing to wait a maximum of 30 seconds, and I think Passenger also has timeouts (although may default to as long as 10 minutes??). But this may be much longer than you really want your app to wait on Solr for reasons above, and when it does get triggered you’ll get a generic “Timed out waiting for app response” in your logs/monitoring, it won’t be clear the web worker was waiting on solr, making operational debugging harder.

How to set a Blacklight Solr timeout

A network connection to Solr in the Blacklight stack first goes through RSolr, which then (in RSolr 2.x) uses the Faraday ruby gem, which can use multiple http drivers but by default uses net-http from the stdlib.

For historical reasons, how to handle timeouts has been pretty under-documented (and sometimes changing) at all these levels! They’re not making it easy to figure out how to effectively set timeouts! It took the ruby community a bit of time to really internalize the importance of timeouts on HTTP calls.

So I did some research, in code and in manual tests.

Faraday timeouts?

If we start in the middle at Faraday, it’s not clearly documented… and may be http-adapter-specific? Faraday really doesn’t make this easy for us!

But from googling, it looks like Faraday generally means to support keys open_timeout (waiting for a network connection to open), and timeout (often waiting for a response to be returned, but really… everything else, and sometimes includes open_timeout too).

If you want some details….

For instance, if we look at the faraday adapter for http-rb, we can see that the faraday timeout option is passed to http-rb for each of connect, read, and write.

(Which really means if you set it to 5 seconds… it could wait 5 seconds for connect then another 5 seconds for write and another 5 seconds for read. http-rb actually provided a general/global timeout at one point, but faraday doesn’t take advantage of it.)

And then the http-rb adapter uses the open_timeout value just for connect and write. That is, setting both faraday options timeout and open_timeout to the same value would be redundant for the http-rb adapter at present. The http-rb adapter doesn’t seem to do anything with any other faraday timeout options.

If we look at the default net-http adapter… It’s really confusing! We have to look at this method in faraday generic too. But I confirmed by manual testing that net-http actually supports faraday read_timeout, write_timeout, and open_timeout (different values than http-rb), but will also use timeout as a default for any of them. (Again your actual end-to-end timeout can be sum of open/read/write. ).
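
As a rough illustration of that "timeout is a default for each phase" behavior, here is a sketch using only stdlib Net::HTTP. `apply_timeouts` is a hypothetical helper written for this example; it mimics the behavior described above and is not Faraday's actual code. The host and port are placeholders.

```ruby
require "net/http"

# Hypothetical helper: use a general :timeout as the default for each
# phase, letting a phase-specific key override it.
def apply_timeouts(http, opts)
  %i[open_timeout read_timeout write_timeout].each do |key|
    value = opts.fetch(key, opts[:timeout])
    http.public_send("#{key}=", value) if value
  end
  http
end

# Net::HTTP.new does not open a connection, so this is safe to run offline.
http = apply_timeouts(Net::HTTP.new("localhost", 8983), timeout: 5, open_timeout: 2)

worst_case = http.open_timeout + http.write_timeout + http.read_timeout
puts "open=#{http.open_timeout} write=#{http.write_timeout} read=#{http.read_timeout}"
puts "worst case end-to-end wait: #{worst_case}s"
```

The last line is the point of the parenthetical above: a single "5 second" setting can still allow noticeably more than 5 seconds end to end, since each phase gets its own allowance.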

It looks like different Faraday adapters can use different timeout values, but Faraday tries to make the basic timeout value at least do something useful/general for each adapter?

Most blacklight users are probably using the default net-http adapter (Curious to hear about anyone who isn’t?)

What will Blacklight actually pass to Faraday?

This gets confusing too!

Blacklight seems to take whatever keys you have set in your blacklight.yml for the given environment, and pass them to RSolr.connect. With one exception: you have to say http_adapter in blacklight.yml, which gets translated to adapter when passed to RSolr.

(I didn’t find the code that makes that blacklight_config invocation be your environment-specific hash from blacklight.yml, but I confirmed that’s what it is!)

What does RSolr 2.x do? It does not pass everything on to Faraday, but only certain allow-listed items, after translating. Confusingly, it’s only willing to pass on open_timeout, and also translate a read_timeout value from blacklight.yml to Faraday timeout.

Phew! So Blacklight/Rsolr only supports two timeout values to be passed to faraday:

open_timeout to Faraday open_timeout
read_timeout to Faraday timeout.
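
A sketch of that translation in plain Ruby, to make the mapping concrete. The settings below are made-up example values in the style of a blacklight.yml environment block, and `faraday_timeout_options` is a hypothetical stand-in; the real logic lives inside RSolr and is not reproduced here.

```ruby
require "yaml"

# Illustrative blacklight.yml-style settings for one environment.
# The url and the numbers are made up for the example.
settings = YAML.safe_load(<<~YML)
  url: http://localhost:8983/solr/example
  open_timeout: 2
  read_timeout: 10
  write_timeout: 5
YML

# Hypothetical stand-in for the allow-list described above: only
# open_timeout and read_timeout get through, and read_timeout is
# renamed to the general timeout option.
def faraday_timeout_options(settings)
  options = {}
  options[:open_timeout] = settings["open_timeout"] if settings["open_timeout"]
  options[:timeout]      = settings["read_timeout"] if settings["read_timeout"]
  options
end

p faraday_timeout_options(settings)
```

Note that the write_timeout key in the example settings is silently dropped, which is the limitation the next section is about.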

PR to Rsolr on timeout arguments?

I think ideally RSolr would pass on any of the values Faraday seems to recognize, at least with some adapters, for timeouts: read_timeout, open_timeout, write_timeout, as well as just timeout.

But to get from what it does now to there in a backwards compatible way… kind of impossible because of how it’s currently translating read_timeout to timeout. :(

I think I may PR one that just recognizes timeout too, while leaving read_timeout as a synonym with a deprecation warning telling you to use timeout? Still thinking this through.

What happens when a Timeout is triggered?

Here we have another complexity. Just as the timeout configuration values are translated on the way down the stack, the exceptions raised when a timeout happens are translated again on the way up, HTTP library => Faraday => RSolr => Blacklight.

Faraday basically has two exception classes it tries to normalize all underlying HTTP library timeouts to: Faraday::ConnectionFailed < Faraday::Error (for timeouts opening the connection) and Faraday::TimeoutError < Faraday::ServerError < Faraday::Error for other timeouts, such as read timeouts.

What happens with a connection timeout?

Faraday raises a Faraday::ConnectionFailed error. (For instance from the default Net::HTTP Adapter)
RSolr rescues it, and re-raises as an RSolr::Error::ConnectionRefused, which sub-classes the ruby stdlib Errno::ECONNREFUSED
Blacklight rescues that Errno::ECONNREFUSED, and translates it to a Blacklight::Exceptions::ECONNREFUSED, (which is still a sub-class of stdlib Errno::ECONNREFUSED)

That just rises up to your application, to give the end-user probably a generic error message, be logged, be caught by any error-monitoring services you have, etc. Or you can configure your application to handle these Blacklight::Exceptions::ECONNREFUSED errors in some custom way using standard Rails rescue_from functionality, etc.

This is all great, just what we expect from exception handling.

The one weirdness is that the exception suggests connection refused, when really it was a timeout, which is somewhat different… but Faraday itself doesn’t distinguish between those two situations, which some people have wanted to improve for a while now, but there isn’t much a client of Faraday can do about it in the meantime.

What happens with other timeouts?

Say, the network connection opened fine, but Solr is just being really slow returning a response (it totally happens) and exceeding a Faraday timeout value set.

The picture here is a bit less good.

Faraday will raise a Faraday::TimeoutError (eg from the net-http adapter).
RSolr does not treat this specially, but just rescues and re-raises it just like any other Faraday::Error as a generic RSolr::Error::Http
Blacklight will take it, just as any other RSolr::Error::Http, and rescue and re-raise it as a generic Blacklight::Exceptions::InvalidRequest
Blacklight does not allow this to just rise up through the app, but instead uses Rails rescue_from to register its own handler for it, a handle_request_error method.
The handle_request_error method will log the error, and then just display the current Blacklight controller “index” page (ie search form), with a message “Sorry, I don’t understand your search.”

This is… not great.

From a UX point of view, this is not good, we’re telling the user “sorry I don’t understand your search” when the problem was a Solr timeout… it makes it seem like there’s something the user did wrong or could do differently, but that’s not what’s going on.
- In fact that’s true for a lot of errors Blacklight catches this way. Solr is down? Solr collection doesn’t exist? Solr configuration has a mismatch with Blacklight configuration? All of these will result in this behavior, none of them are something the end-user can do anything about.
If you have an error monitoring service like Honeybadger, it won’t record this error, since the app handled it instead of letting it rise unhandled. So you may not even know this is going on.
If you have an uptime monitoring service, it might not catch this either, since the app is returning a 200. You could have an app pretty much entirely down and erroring for any attempt to do a search… but returning all HTTP 200 responses.
While Blacklight does log the error, it does it in a DIFFERENT way than Rails ordinarily does… you aren’t going to get a stack trace, or any other contextual information, it’s not really going to be clear what’s going on at all, if you notice it at all.

Not great. One option is to override the handle_request_error method in your own CatalogController to: 1) Disable this functionality entirely, don’t swallow the error with a “Sorry, I don’t understand your search” message, just re-raise it; and 2) unwrap the underlying Faraday::TimeoutError before re-raising, so that gets specifically reported instead of a generic “Blacklight::Exceptions::InvalidRequest”, so we can distinguish this specific situation more easily in our logs and error monitoring.

Here’s an implementation that does both, to be put in your catalog_controller.rb:

  # OVERRIDE of Blacklight method. Blacklight by default takes ANY kind
  # of Solr error (could be Solr misconfiguration or down etc), and just swallows
  # it, redirecting to Blacklight search page with a message "Sorry, I don't understand your search."
  #
  # This is wrong.  It's misleading feedback for user for something that is usually not
  # something they can do something about, and it suppresses our error monitoring
  # and potentially misleads our uptime checking.
  #
  # We just want to actually raise the error!
  #
  # Additionally, Blacklight/Rsolr wraps some errors that we don't want wrapped, mainly
  # the Faraday timeout error -- we want to be able to distinguish it, so will unwrap it.
  private def handle_request_error(exception)
    exception_causes = []
    e = exception
    until e.cause.nil?
      e = e.cause
      exception_causes << e
    end

    # Raise the more specific original Faraday::TimeoutError instead of
    # the generic wrapping Blacklight::Exceptions::InvalidRequest!
    if faraday_timeout = exception_causes.find { |e| e.kind_of?(Faraday::TimeoutError) }
      raise faraday_timeout
    end

    raise exception
  end
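
The unwrapping relies on ruby's Exception#cause, which is set automatically when one exception is raised while another is being rescued. Here is a standalone sketch of the same idea. The StandIn classes are made up so the example runs without Faraday, RSolr, or Blacklight installed; they only imitate the wrapping described above.

```ruby
# Stand-in error classes; they only imitate the layered wrapping.
class StandInTimeoutError < StandardError; end
class StandInHttpError < StandardError; end
class StandInInvalidRequest < StandardError; end

# Raise a timeout, then wrap it twice, the way each layer re-raises.
# Returns the outermost exception instead of letting it propagate.
def wrapped_timeout
  begin
    begin
      raise StandInTimeoutError, "execution expired"
    rescue StandInTimeoutError
      raise StandInHttpError, "wrapped by the Solr client layer"
    end
  rescue StandInHttpError
    raise StandInInvalidRequest, "wrapped by the search layer"
  end
rescue StandInInvalidRequest => e
  e
end

# Walk Exception#cause from the outermost error down to the original.
def cause_chain(exception)
  chain = []
  while (exception = exception.cause)
    chain << exception
  end
  chain
end

outer = wrapped_timeout
original = cause_chain(outer).find { |e| e.is_a?(StandInTimeoutError) }

puts cause_chain(outer).map(&:class).inspect
puts original.message
```

Re-raising the original object keeps its class and message, which is what lets logs and error monitoring tell a timeout apart from other request errors.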

PRs to RSolr and Blacklight for more specific exception?

RSolr and Blacklight both have a special error class for the connection failed/timed out condition. But they just lump Faraday::TimeoutError in with any other kind of error.

I think this logic is probably many years old, and pre-dates Faraday’s current timeout handling.

I think they should both have a new exception class which can be treated differently. Say RSolr::Error::Timeout and Blacklight::Exceptions::RepositoryTimeout?

I plan to make these PRs.

PR to Blacklight to disable that custom handle_request_error behavior

I think the original idea here was that something in the user’s query entry would trigger an exception. That’s what makes rescuing it and re-displaying it with the message “Sorry, I don’t understand your search” make some sense.

At the moment, I have no idea how to reproduce that, figure out a user-entered query that actually results in a Blacklight::Exceptions::InvalidRequest. Maybe it used to be possible to do in an older version of Solr but isn’t anymore? Or maybe it still is, but I just don’t know how?

But I can reproduce ALL SORTS of errors that were not about the user’s entry and which the end-user can do nothing about, but which still result in this misleading error message, and the error getting swallowed by Blacklight and avoiding your error- and uptime-monitoring services. Solr down entirely; Solr collection/core not present or typo’d; a mis-match between Solr configuration and Blacklight configuration, like Blacklight mentioning a Solr field that doesn’t actually exist.

All of these result in Blacklight swallowing the exception, and returning an HTTP 200 response with the message “Sorry, I don’t understand your search”. This is not right!

I think this behavior should be removed in a future Blacklight version.

I would like to PR such a thing, but I’m not sure if I can get it reviewed/merged?

Blacklight 7.x, deprecation of view overrides, paths forward

jrochkind — Thu, 02 Sep 2021 18:33:13 +0000

This post will only be of interest to those who use the blacklight ruby gem/platform, mostly my colleagues in library/cultural heritage sector.

When I recently investigated updating Blacklight in our Rails app to the latest 7.19.2, I encountered a lot of deprecation notices. They were related to code both in my local app and a plugin trying to override parts of Blacklight views — specifically the “constraints” (ie search limits/query “breadcrumbs” display) area in the code I encountered, I’m not sure if it applies to more areas of view customization.

Looking into this more to see if I could get a start on changing logic to avoid deprecation warnings — I had trouble figuring out any non-deprecated way to achieve the overrides. After more research, I think it’s not totally worked out how to make these overrides keep working at all in future Blacklight 8, and that this affects plugins including blacklight_range_limit, blacklight_advanced_search, geoblacklight, and possibly spotlight. Some solutions need to be found if these plugins are to be updated to keep working in future Blacklight 8.

I have documented what I found/understood, and some ideas for moving forward, hoping it will help start the community process of figuring out solutions to keep all this stuff working. I may not have gotten everything right or thought of everything, this is meant to help start the discussion, suggestions and corrections welcome.

This does get wordy, I hope you can find it useful to skip around or skim if it’s not all of interest. I believe the deprecations start around Blacklight 7.12 (released October 2020). I believe Blacklight 7.14 is the first version to support ruby 3.0, so anyone wanting to upgrade to ruby 3 will encounter these issues.

Background

Over blacklight’s 10+ year existence, it has been a common use-case to customize specific parts of Blacklight, including customizing what shows up on one portion of a page while leaving other portions ‘stock’. An individual local application can do this with its own custom code; it is also common in many of the shared blacklight “plug-in”/“extension” engine gems.

Blacklight had traditionally implemented its “view” layer in a typical Rails way, involving “helper” methods and view templates. Customizations and overrides, by local apps or plugins, were implemented by over-riding these helper methods and partials. This traditional method of helper and partial overrides is still described in the Blacklight project wiki (it possibly could use updating for recent deprecations/new approaches).

This view/helper/override approach has some advantages: It just uses standard ruby and Rails, not custom Blacklight abstractions; multiple different plugins can override the same method, so long as they all call “super”, to cooperatively add functionality; it is very flexible and allows overrides that “just work”.

It also has some serious disadvantages. Rails helpers and views are known in general for leading to “spaghetti” or “ball of mud” code, where everything ends up depending on everything/anything else, and it’s hard to make changes without breaking things.

In the context of shared gem code like Blacklight and its ecosystem, it can get even messier to not know what is meant to be public API for an override. Blacklight’s long history has different maintainers with different ideas, and varying documentation or institutional memory of intents can make it even more confusing. Several generations of ideas can be present in the current codebase for both backwards-compatibility and “lack of resources to remove it” reasons. It can make it hard to make any changes at all without breaking existing code, a problem we were experiencing with Blacklight.

One solution that has appeared for Rails is the ViewComponent gem (written by github, actually), which facilitates better encapsulation, separation of concerns, and clear boundaries between different pieces of view code. The current active Blacklight maintainers (largely from Stanford I think?) put in some significant work — in Blacklight 7.x — to rewrite some significant parts of Blacklight’s view architecture based on the ViewComponent gem. This is a welcome contribution to solving real problems! Additionally, they did some frankly rather heroic things to get this replacement with ViewComponent to be, as a temporary transition step, very backwards compatible, even to existing code doing extensive helper/partial overrides, which was tricky to accomplish and shows their concern for current users.

Normally, when we see deprecation warnings, we like to fix them, to get them out of our logs, and prepare our apps for the future version where deprecated behavior stops working entirely. To do otherwise is considered leaving “technical debt” for the future, since a deprecation warning is telling you that code will have to be changed eventually.

The current challenge here is that it’s not clear (at least to me) how to change the code to still work in current Blacklight 7.x and upcoming Blacklight 8.x. Which is a challenge both for running in current BL 7 without deprecation, and for the prospects of code continuing to work in future BL 8. I’ll explain more with examples.

Blacklight_range_limit (and geoblacklight): Add a custom “constraint”

blacklight_range_limit introduces new query parameters for range limit filters, not previously recognized by Blacklight, that look e.g. like &range[year_facet][begin]=1910. In addition to having these affect the actual Solr search, it also needs to display this limit (that Blacklight core is ignoring) in the “constraints” area above the search results:

To do this it overrides the render_constraints_filters helper method from Blacklight, through some fancy code effectively calling super to render the ordinary Blacklight constraints filters but then adding on its rendering of the constraints only blacklight_range_limit knows about. One advantage of this “override, call super, but add on” approach is that multiple add-ons can do it, and they don’t interfere with each other, so long as they all call super, and only want to add additional content, not replace pre-existing content.
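The “override, call super, but add on” pattern can be sketched in plain Ruby. This is only an illustration: the module names are made up, the helper returns an array of labels rather than HTML, and none of it is the actual Blacklight or blacklight_range_limit code.

```ruby
# Stand-in for the helper module Blacklight provides (simplified: returns
# labels instead of rendered HTML).
module CoreConstraintsHelper
  def render_constraints_filters(params)
    (params[:f] || {}).map { |field, value| "#{field}: #{value}" }
  end
end

# Hypothetical add-on: renders the stock filters via super, then appends
# the range constraints only it knows about.
module RangeLimitOverride
  def render_constraints_filters(params)
    ranges = (params[:range] || {}).map do |field, r|
      "#{field}: #{r[:begin]} to #{r[:end]}"
    end
    super(params) + ranges
  end
end

# A second hypothetical add-on doing the same thing independently.
module GeoOverride
  def render_constraints_filters(params)
    extra = params[:bbox] ? ["bbox: #{params[:bbox]}"] : []
    super(params) + extra
  end
end

# Stand-in for the view context that all helper modules get mixed into.
class ViewContext
  include CoreConstraintsHelper
  include RangeLimitOverride
  include GeoOverride
end
```

Because each override calls super and only appends, the two add-ons compose without knowing about each other; the order of the appended items just follows the order the modules were included.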

But overriding this helper method is deprecated in recent Blacklight 7.x. If Blacklight detects any override to this method (among other constraints-related methods), it will issue a deprecation notice, and also switch into a “legacy” mode of view rendering, so the override will still work.

OK, what if we wanted to change how blacklight_range_limit does this, to avoid triggering the deprecation warnings, and to have blacklight continue to use the “new” (rather than “legacy”) logic, that will be the logic it insists on using in Blacklight 8?

The new logic is to render with the new view_component, Blacklight::ConstraintsComponent.new. Which is rendered in the catalog/_constraints.html.erb partial. I guess if we want the rendering to behave differently in that new system, we need to introduce a new view component that is like Blacklight::ConstraintsComponent but behaves differently (perhaps a sub-class, or a class using delegation). Or, hey, that component takes some dependent view_components as args, maybe we just need to get the ConstraintsComponent to be given an arg for a different version of one of the _component args, not sure if that will do it.

It’s easy enough to write a new version of one of these components… but how would we get Blacklight to use it?

I guess we would have to override catalog/_constraints.html.erb. But this is unsatisfactory:

I thought we were trying to get out of overriding partials, but even if it’s okay in this situation…
It’s difficult and error-prone for an engine gem to override partials, you need to make sure it ends up in the right order in Rails “lookup paths” for templates, but even if you do this…
What if multiple things want to add on a section to the “constraints” area? Only one can override this partial, there is no way for a partial to call super.

So perhaps we need to ask the local app to override catalog/_constraints.html.erb (or generate code into it), and that code calls our alternate component, or calls the stock component with alternate dependency args.

This is already seeming a bit more complex and fragile than the simpler one-method override we did before, we have to copy-and-paste the currently non-trivial implementation in _constraints.html.erb, but even if we aren’t worried about that….
Again, what happens if multiple different things want to add on to what’s in the “constraints” area?
What if there are multiple places that need to render constraints, including other custom code? (More on this below). They all need to be identically customized with this getting-somewhat-complex code?

That multiple things might want to add on isn’t just theoretical, geoblacklight also wants to add some things to the ‘constraints’ area and also does it by overriding the render_constraints_filters method.

Actually, if we’re just adding on to existing content… I guess the local app could override catalog/_constraints.html.erb, copy the existing blacklight implementation, then just add at the END a call to both say <%= render(BlacklightRangeLimit::RangeConstraintsComponent) %> and then also <%= render(GeoBlacklight::GeoConstraintsComponent) %>… it actually could work… but it seems fragile, especially when we start dealing with “generators” to automatically create these in a local app for CI in the plugins, as blacklight plugins do?

My local app (and blacklight_advanced_search): Change the way the “query” constraint looks

If you just enter the query ‘cats’, “generic” out of the box Blacklight shows you your search with this as a sort of ‘breadcrumb’ constraint in a simple box at the top of the search:

My local app (in addition to changing the styling) changes that to an editable form to change your query (while keeping other facet etc filters exactly the same). Is this a great UX? Not sure! But it’s what happens right now:

It does this by overriding `render_constraints_query` and not calling super, replacing the standard implementation with my own.

How do we do this in the new non-deprecated way?

I guess again we have to either replace Blacklight::ConstraintsComponent with a new custom version… or perhaps pass in a custom component for query_constraint_component… this time we can’t just render and add on, we really do need to replace something.

What options do we have? Maybe, again, customizing _constraints.html.erb to call that custom component and/or custom-arg. And make sure any customization is consistent with any customization done by say blacklight_range_limit or geoblacklight, make sure they aren’t all trying to provide mutually incompatible custom components.

I still don’t like:

having to override a view partial (when before I only overrode a helper method), in local app instead of plugin it’s more feasible, but we still have to copy-and-paste some non-trivial code from Blacklight to our local override, and hope it doesn’t change
Pretty sensitive to implementation of Blacklight::ConstraintsComponent if we’re sub-classing it or delegating it. I’m not sure what parts of it are considered public API, or how frequently they are likely to change… if we’re not careful, we’re not going to have any more stable/reliable/forwards-compatible code than we did under the old way.
This solution doesn’t provide a way for custom code to render a constraints area with all customizations added by any add-ons, which is a current use case, see next section.

It turns out blacklight_advanced_search also customizes the “query constraint” (in order to handle the multi-field queries that the plugin can do), also by overriding render_constraints_query, so this exact use case affects that plug-in too, with a bit more challenge in a plugin instead of a local app.

I don’t think any of these solutions we’ve brainstormed are suitable and reliable.

But calling out to Blacklight function blocks too, as in spotlight….

In addition to overriding a helper method to customize what appears on the screen, traditionally custom logic in a local app or plug-in can call a helper method to render some piece of Blacklight functionality on screen.

For instance, the spotlight plug-in calls the render_constraints method in one of its own views, to include that whole “constraints” area on one of its own custom pages.

Using the legacy helper method architecture, spotlight can render the constraints including any customizations the local app or other plug-ins have made via their overriding of helper methods. For instance, when spotlight calls render_constraints, it will get the additional constraints that were added by blacklight_range_limit or geoblacklight too.

How would spotlight render constraints using the new architecture? I guess it would call the Blacklight view_component directly, render(Blacklight::ConstraintsComponent.new(.... But how does it manage to use any customizations added by plug-ins like blacklight_range_limit? Not sure. None of the solutions we brainstormed above seem to get us there.

I suppose (e.g.) spotlight could actually render the _constraints.html.erb partial, that becomes the one canonical standardized “API” for constraints rendering, to be customized in the local app and re-used every time constraints view is needed? That might work, but seems a step backwards to go toward view partial as API to me, I feel like we were trying to get away from that for good reasons, it just feels messy.

This makes me think new API might be required in Blacklight, if we are not to have reduction in “view extension” functionality for Blacklight 8 (which is another option, say, well, you just can’t do those things anymore, significantly trimming the scope of what is possible with plugins, possibly abandoning some plugins).

There are other cases where blacklight_range_limit for example calls helper methods to re-use functionality. I haven’t totally analyzed them. It’s possible that in some cases, the plug-in just should copy-and-paste hard-coded HTML or logic, without allowing for other actors to customize them. Examples of what blacklight_range_limit calls here include

New API? Dependency Injection?

Might there be some new API that Blacklight could implement that would make this all work smoother and more consistently?

If we want a way to tell Blacklight “use my own custom component instead of Blacklight::ConstraintsComponent”, ideally without having to override a view template, at first that made me think “Inversion of Control with Dependency Injection”? I’m not thrilled with this generic solution, but thinking it through….

What if there was some way the local app or plugin could do Blacklight::ViewComponentRegistration.constraints_component_class = MyConstraintsComponent, and then when blacklight wants to call it, instead of doing, like it does now, <%= render(Blacklight::ConstraintsComponent.new(search_state: stuff)) %>, it’d do something like: <%= render(Blacklight::ViewComponentRegistration.constraints_component_class.new(search_state: stuff)) %>.

That lets us “inject” a custom class without having to override the view component and every other single place it might be used, including new places from plugins etc. The specific arguments the component takes would have to be considered/treated as public API somehow.

It still doesn’t let multiple add-ons cooperate to each add a new constraint item though. I guess to do that, the registry could have an array for each thing….

Blacklight::ViewComponentRegistration.constraints_component_classes = [
  Blacklight::ConstraintsComponent,
  BlacklightRangeLimit::ConstraintsComponent,
  GeoBlacklight::ConstraintsComponent
]

# And then I guess we really need a convenience method for calling
# ALL of them in a row and concatenating their results....

Blacklight::ViewComponentRegistration.render(:constraints_component_class, search_state: stuff)

On the plus side, now something like spotlight can call that too to render a “constraints area” including customizations from BlacklightRangeLimit, GeoBlacklight, etc.
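To make the idea a bit more concrete, here is a minimal plain-Ruby sketch of what such a registry could look like. Everything in it is hypothetical: the ViewComponentRegistry name, the register and render methods, and the stand-in “components” (plain objects with a call method instead of real ViewComponents). Nothing like this exists in Blacklight as far as I know.

```ruby
# Hypothetical registry: an ordered list of component classes per named slot.
module ViewComponentRegistry
  @slots = Hash.new { |hash, key| hash[key] = [] }

  class << self
    # Add a component class to a slot, keeping registration order.
    def register(slot, component_class)
      @slots[slot] << component_class
    end

    # Instantiate every class registered for the slot with the same keyword
    # arguments and concatenate whatever each one renders.
    def render(slot, **args)
      @slots[slot].map { |klass| klass.new(**args).call }.join
    end
  end
end

# Stand-ins for view components: anything responding to #call will do here.
CoreConstraints = Struct.new(:search_state, keyword_init: true) do
  def call
    "<span>q: #{search_state[:q]}</span>"
  end
end

RangeConstraints = Struct.new(:search_state, keyword_init: true) do
  def call
    range = search_state[:range]
    range ? "<span>year: #{range[:begin]}-#{range[:end]}</span>" : ""
  end
end

ViewComponentRegistry.register(:constraints, CoreConstraints)
ViewComponentRegistry.register(:constraints, RangeConstraints)
```

Any caller, whether Blacklight core or something like spotlight, would then call ViewComponentRegistry.render(:constraints, search_state: ...) and get every registered piece, without knowing which plug-ins contributed.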

But I have mixed feelings about this, it seems like the kind of generic-universal yet-more-custom-abstraction thing that sometimes gets us in trouble and over-complexified. Not sure.

API just for constraints view customization?

OK, instead of trying to make a universal API for customizing “any view component”, what if we just focus on the actual use cases in front of us here? All the ones I’ve encountered so far are about the “constraints” area? Can we add custom API just for that?

It might look almost exactly the same as the generic “IoC” solution above, but on the Blacklight::ConstraintsComponent class…. Like, we want to customize the component Blacklight::ConstraintsComponent uses to render the ‘query constraint’ (for my local app and advanced search use cases), right now we have to change the call site for Blacklight::ConstraintsComponent.new every place it exists, to have a different argument… What if instead we can just:

Blacklight::ConstraintsComponent.query_constraint_component =
   BlacklightAdvancedSearch::QueryConstraintComponent

And ok, for these “additional constraint items” we want to add… in “legacy” architecture we overrode “render_constraints_filters” (normally used for facet constraints) and called super… but that’s just cause that’s what we had, really this is a different semantic thing, let’s just call it what it is:

Blacklight::ConstraintsComponent.additional_render_components <<
  BlacklightRangeLimit::RangeFacetConstraintComponent

Blacklight::ConstraintsComponent.additional_render_components <<
  GeoBlacklight::GeoConstraintsComponent

All those component “slots” would still need to have their initializer arguments be established as “public API” somehow, so you can register one knowing what args its initializer is going to get.
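A rough plain-Ruby sketch of how those class-level slots might behave, again with stand-ins only. The ConstraintsComponent class here is not Blacklight’s, the accessors are the hypothetical API being proposed, and the search_state: keyword is simply assumed to be the agreed initializer argument.

```ruby
# Stand-in for Blacklight::ConstraintsComponent with hypothetical
# class-level configuration slots.
class ConstraintsComponent
  class << self
    attr_writer :query_constraint_component

    # Replaceable slot: falls back to the stock query constraint.
    def query_constraint_component
      @query_constraint_component || DefaultQueryConstraint
    end

    # Additive slot: add-ons append, nobody replaces anybody else.
    def additional_render_components
      @additional_render_components ||= []
    end
  end

  def initialize(search_state:)
    @search_state = search_state
  end

  # Render the query constraint first, then every registered extra.
  def call
    parts = [self.class.query_constraint_component,
             *self.class.additional_render_components]
    parts.map { |klass| klass.new(search_state: @search_state).call }.join(" | ")
  end
end

class DefaultQueryConstraint
  def initialize(search_state:)
    @search_state = search_state
  end

  def call
    "You searched for: #{@search_state[:q]}"
  end
end

# Hypothetical local-app replacement for the query constraint.
class EditableQueryConstraint < DefaultQueryConstraint
  def call
    "[edit: #{@search_state[:q]}]"
  end
end

# Hypothetical add-on constraint item.
class RangeFacetConstraint < DefaultQueryConstraint
  def call
    "year #{@search_state[:range][:begin]}+"
  end
end

ConstraintsComponent.query_constraint_component = EditableQueryConstraint
ConstraintsComponent.additional_render_components << RangeFacetConstraint
```

The point of the sketch is that the call site never changes: whoever instantiates ConstraintsComponent gets the replaced query constraint and all the add-on items for free.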

Note this solves the spotlight case too, spotlight can just simply call render Blacklight::ConstraintsComponent.new(..., and it now does get customizations added by other add-ons, because they were registered with the Blacklight::ConstraintsComponent.

I think this API may meet all the use cases I’ve identified? Which doesn’t mean there aren’t some I haven’t identified. I’m not really sure what architecture is best here, I’ve just tried to brainstorm possibilities. It would be good to choose carefully, as we’d ideally find something that can work through many future Blacklight versions without having to be deprecated again.

Need for Coordinated Switchover to non-deprecated techniques

The way Blacklight implements backwards-compatible support for the constraints render, is if it detects anything in the app is overriding a relevant method or partial, it continues rendering the “legacy” way with helpers and partials.

So if I were to try upgrading my app to do something using a new non-deprecated method, while my app is still using blacklight_range_limit doing things the old way… it would be hard to keep them both working. If you have more than one Blacklight plug-in overriding relevant view helpers, it of course gets even more complicated.

It pretty much has to be all-or-nothing. Which also makes it hard for say blacklight_range_limit to do a release that uses a new way (if we figured one out) — it’s probably only going to work in apps that have changed ALL their parts over to the new way. I guess all the plug-ins could do releases that offered you a choice of configuration/installation instructions, where the host app could choose new way or old way.

I think the complexity of this makes it more realistic, especially based on actual blacklight community maintenance resources, that a lot of apps are just going to keep running in deprecated mode, and a lot of plugins only available triggering deprecation warnings, until Blacklight 8.0 comes out and the deprecated behavior simply breaks, and then we’ll need Blacklight 8-only versions of all the plugins, with apps switching everything over all at once.

If different plugins approach this in an uncoordinated fashion, each trying to invent a way to do it, they really risk stepping on each other’s toes and being incompatible with each other. I think really something has to be worked out as the Blacklight-recommended consensus/best practice approach to view overrides, so everyone can just use it in a consistent and compatible way. Whether that requires new API not yet in Blacklight, or a clear pattern with what’s in current Blacklight 7 releases.

Ideally all worked out by currently active Blacklight maintainers and/or community before Blacklight 8 comes out, so people at least know what needs to be done to update code. Many Blacklight users may not be using Blacklight 7.x at all yet (7.0 released Dec 2018) — for instance hyrax still uses Blacklight 6 — so I’m not sure what portion of the community is already aware this is coming up on the horizon.

I hope the time I’ve spent investigating and considering and documenting in this piece can be helpful to the community as one initial step, to understanding the lay of the land.

For now, silence deprecations?

OK, so I really want to upgrade to latest Blacklight 7.19.2, from my current 7.7.0. To just stay up to date, and to be ready for ruby 3.0. (My app def can’t pass tests on ruby 3 with BL 7.7; it looks like BL added ruby 3.0 support in BL 7.14.0? Which does already have the deprecations).

It’s not feasible right now to eliminate all the deprecated calls. But my app does seem to work fine, just with deprecation warnings.

I don’t really want to leave all those “just ignore them for now” deprecation messages in my CI and production logs though. They just clutter things up and make it hard to pay attention to the things I want to be noticing.

Can we silence them? Blacklight uses the deprecation gem for its deprecation messages; the gem is by cbeer, with logic taken out of ActiveSupport.

We could wrap all calls to deprecated methods in Deprecation.silence do…. including making a PR to blacklight_range_limit to do that? I’m not sure I like the idea of making blacklight_range_limit silent on this problem, it needs more attention at this point! Also I’m not sure how to use Deprecation.silence to affect that clever conditional check in the _constraints.html.erb template.

We could entirely silence everything from the deprecation gem with Deprecation.default_deprecation_behavior — I don’t love this, we might be missing deprecations we want?

The Deprecation gem API made me think there might be a way to silence deprecation warnings from individual classes with things like Blacklight::RenderConstraintsHelperBehavior.deprecation_behavior = :silence, but I think I was misinterpreting the API, there didn’t actually seem to be methods like that available in Blacklight to silence what I wanted in a targeted way.

Looking/brainstorming more in Deprecation gem API… I *could* change its behavior to its “notify” strategy that sends ActiveSupport::Notification events instead of writing to stdout/log… and then write a custom ActiveSupport::Notification subscriber which ignored the ones I wanted to ignore… ideally still somehow keeping the undocumented-but-noticed-and-welcome default behavior in test/rspec environment where it somehow reports out a summary of deprecations at the end…

This seemed too much work. I realized that the only things that use the Deprecation gem in my project are Blacklight itself and the qa gem (I don’t think it has caught on outside blacklight/samvera communities), and I guess I am willing to just silence deprecations from all of them, although I don’t love it.
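In practice, the global silencing might be a one-line initializer along these lines. Treat it as a sketch: the file name is made up, and the assignment form of Deprecation.default_deprecation_behavior with the :silence value is my reading of the deprecation gem’s API, so check it against the gem version you actually have.

```ruby
# config/initializers/silence_deprecations.rb (hypothetical file name)
#
# Silences everything reported through the deprecation gem, which in this
# app means Blacklight and qa. Assumes the gem accepts :silence as a global
# default behavior. The guard only keeps this file loadable if the gem is
# ever removed from the bundle.
Deprecation.default_deprecation_behavior = :silence if defined?(Deprecation)
```

This deliberately does not touch ActiveSupport::Deprecation, so Rails’ own deprecation warnings would keep showing up as usual.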

Notes on retrying all jobs with ActiveJob retry_on

jrochkind — Mon, 23 Aug 2021 14:49:24 +0000

I would like to configure all my ActiveJobs to retry on failure, and I’d like to do so with the ActiveJob retry_on method.

So I’m going to configure it in my ApplicationJob class, in order to retry on any error, maybe something like:

class ApplicationJob < ActiveJob::Base
  retry_on StandardError # other args to be discussed
end

Why use ActiveJob retry_on for this? Why StandardError?

Many people use backend-specific logic for retries, especially with Sidekiq. That’s fine!

I like the idea of using the ActiveJob functionality:

I currently use resque (more on challenges with retry here later), but plan to switch to something else at some point medium-term. Maybe sidekiq, but maybe delayed_job or good_job. (Just using the DB and not having a redis is attractive to me, as is open source). I like the idea of not having to redo this setup when I switch back-ends, or am trying out different ones.
In general, I like the promise of ActiveJob as swappable commoditized backends
I like what I see as good_job’s philosophy here, why have every back-end reinvent the wheel when a feature can be done at the ActiveJob level? That can help keep the individual back-end smaller, and less “expensive” to maintain. good_job encourages you to use ActiveJob retries I think.

Note, dhh is on record from 2018 saying he thinks setting up retries for all StandardError is a bad idea. But I don’t really understand why! He says “You should know why you’d want to retry, and the code should document that knowledge.” — but the fact that so many ActiveJob back-ends provide “retry all jobs” functionality makes it seem to me an established common need and best practice, and why shouldn’t you be able to do it with ActiveJob alone?

dhh thinks ActiveJob retry is for specific targeted retries maybe, and the backend retry should be used for generic universal ones? Honestly I don’t see myself doing much specific targeted retries, making all your jobs idempotent (important! Best practice for ActiveJob always!), and just having them all retry on any error seems to me to be the way to go, a more efficient use of developer time and sufficient for at least a relatively simple app.

One situation I have where a retry is crucial, is when I have a fairly long-running job (say it takes more than 60 seconds to run; I have some unavoidably!), and the machine running the jobs needs to restart. It might interrupt the job. It is convenient if it is just automatically retried — put back in the queue to be run again by restarted or other job worker hosts! Otherwise it’s just sitting there failed, never to run again, requiring manual action. An automatic retry will take care of it almost invisibly.

Resque and Resque Scheduler

Resque by default doesn’t support future-scheduled jobs. You can add them with the resque-scheduler plugin. But I had a perhaps irrational desire to avoid this: resque and its ecosystem have at different times had different amounts of maintenance/abandonment, and I’m (perhaps irrationally) reluctant to complexify my resque stack.

And do I need future scheduling for retries? For my most important use cases, it’s totally fine if I retry just once, immediately, with a wait: 0. Sure, that won’t take care of all potential use cases, but it’s a good start.

I thought even without resque supporting future-scheduling, I could get away with:

retry_on StandardError, wait: 0

Alas, this won’t actually work, it still ends up being converted to a future-schedule call, which gets rejected by the resque_adapter bundled with Rails unless you have resque-scheduler installed.

But of course, resque can handle wait: 0 semantically, if the code was willing to do it by queueing an ordinary resque job…. I don’t know if it’s a good idea, but this simple patch to Rails-bundled resque_adapter will make it willing to accept “scheduled” jobs when the time to be scheduled is actually “now”, just scheduling them normally, while still raising on attempts to future schedule. For me, it makes retry_on.... wait: 0 work with just plain resque.
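The idea behind that patch can be shown with a stand-in adapter. This is not the actual patch and not the real Rails resque_adapter: StandInAdapter and EnqueueNowWhenDue are made-up names, and the only assumption borrowed from ActiveJob is that enqueue_at receives the timestamp as epoch seconds.

```ruby
# Stand-in for an adapter that can enqueue immediately but refuses anything
# scheduled, as the bundled adapter does without resque-scheduler.
class StandInAdapter
  attr_reader :enqueued

  def initialize
    @enqueued = []
  end

  def enqueue(job)
    @enqueued << job
  end

  def enqueue_at(_job, _timestamp)
    raise NotImplementedError, "scheduling needs resque-scheduler"
  end
end

# The patch idea: a timestamp that is already due becomes an ordinary
# enqueue; a genuinely future timestamp falls through to the original
# method and still raises.
module EnqueueNowWhenDue
  def enqueue_at(job, timestamp)
    return enqueue(job) if timestamp <= Time.now.to_f

    super
  end
end

StandInAdapter.prepend(EnqueueNowWhenDue)
```

Using prepend keeps the original enqueue_at reachable through super, so real future scheduling keeps failing loudly instead of being silently run right away.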

Note: retry_on `attempts` count includes first run

So wanting to retry just once, I tried something like this:

# Will never actually retry
retry_on StandardError, attempts: 1

My job was never actually retried this way! It looks like the attempts count is the total number of times the job will be run, including the very first one before any “retries”! So attempts: 1 means “never retry” and does nothing. Oops. If you actually want to retry only once, in my Rails 6.1 app this is what did it for me:

# will actually retry once
retry_on StandardError, attempts: 2

(I think this means the default, attempts: 5 actually means your job can be run a total of 5 times– one original time and 4 retries. I guess that’s what was intended?)
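A toy model of that counting rule, in plain Ruby. This is not ActiveJob code, just a loop that treats attempts as the total number of executions allowed, which is the behavior I observed. Unlike ActiveJob, the toy swallows the final error instead of re-raising it.

```ruby
# Toy model only: `attempts` is the total number of runs, first run included.
def run_with_attempts(attempts:)
  executions = 0
  begin
    executions += 1
    yield executions
  rescue StandardError
    # Re-run the begin block only while we are under the total allowed.
    retry if executions < attempts
  end
  executions
end

always_fails = ->(_n) { raise "boom" }

run_with_attempts(attempts: 1, &always_fails) # one run, zero retries
run_with_attempts(attempts: 2, &always_fails) # two runs, i.e. one retry
```

Seen this way, “retry once” means attempts: 2, and the default of 5 means up to four retries after the original run.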

Note: job_id stays the same through retries, hooray

By the way, I checked, and at least in Rails 6.1, the ActiveJob#job_id stays the same on retries. If the job runs once and is retried twice more, it’ll have the same job_id each time, you’ll see three Performing lines in your logs, with the same job_id.

Phew! I think that’s the right thing to do, so we can easily correlate these as retries of the same jobs in our logs. And if we’re keeping the job_id somewhere to check back and see if it succeeded or failed or whatever, it stays consistent on retry.

Glad this is what ActiveJob is doing!

Logging isn’t great, but can be customized

Rails will automatically log retries with a line that looks like this:

Retrying TestFailureJob in 0 seconds, due to a RuntimeError.
# logged at `info` level

Eventually when it decides its attempts are exhausted, it’ll say something like:

Stopped retrying TestFailureJob due to a RuntimeError, which reoccurred on 2 attempts.
# logged at `error` level

This does not include the job-id though, which makes it harder than it should be to correlate with other log lines about this job, and follow the job’s whole course through your log file.

It’s also inconsistent with other default ActiveJob log lines, which include:

the Job ID in text
tags (Rails tagged logging system) with the job id and the string "[ActiveJob]". Because of the way the Rails code applies these only around perform/enqueue, retry/discard related log lines apparently end up not included.
The Exception message, not just the class, when there’s an exception.

You can see all the built-in ActiveJob logging in the nicely compact ActiveJob::LogSubscriber class. And you can see how the log line for retry is kind of inconsistent with eg perform.

Maybe this inconsistency has persisted so long in part because few people actually use ActiveJob retry, they’re all still using their backend’s backend-specific functionality? I did try a PR to Rails for at least consistent formatting (my PR doesn’t do tagging), not sure if it will go anywhere, I think blind PRs to Rails usually do not.

In the meantime, after trying a bunch of different things, I think I figured out the reasonable way to use the ActiveSupport::Notifications/LogSubscriber API to customize logging for the retry-related events while leaving it untouched from Rails for the others? See my solution here.

(Thanks to BigBinary blog for showing up in google and giving me a head start into figuring out how ActiveJob retry logging was working.)

(note: There’s also this: https://github.com/armandmgt/lograge_active_job But I’m not sure how working/maintained it is. It seems to only customize activejob exception reports, not retry and other events. It would be an interesting project to make an up-to-date activejob-lograge that applied to ALL ActiveJob logging, expressing every event as key/values and using lograge formatter settings to output. I think we see exactly how we’d do that, with a custom log subscriber as we’ve done above!)

Warning: ApplicationJob configuration won’t work for emails

You might think since we configured retry_on on ApplicationJob, all our bg jobs are now set up for retrying.

Oops! Not deliver_later emails.

The good_job README explains that ActiveJob mailer jobs don’t descend from ApplicationJob. (I am curious if there’s any good reason for this, it seems like it would be nice if they did!)

The good_job README provides one way to configure the built-in Rails mailer superclass for retries.

You could maybe also try setting delivery_job on that mailer superclass to use a custom delivery job (thanks again BigBinary for the pointer)… maybe one that subclasses the default class to deliver emails as normal, but let you set some custom options like retry_on? Not sure if this would be preferable in any way.
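For the first option, the configuration might look roughly like this. It is a sketch: the file name is made up, it assumes your app’s mailers enqueue ActionMailer::MailDeliveryJob (the delivery job class in Rails 6.0 and later defaults), and the wait/attempts values are just the ones used earlier in these notes.

```ruby
# config/initializers/mailer_retries.rb (hypothetical file name)
#
# deliver_later enqueues ActionMailer::MailDeliveryJob, which inherits from
# ActiveJob::Base rather than ApplicationJob, so it needs its own retry_on.
# The guard only keeps this file loadable where Action Mailer isn't present.
if defined?(ActionMailer::MailDeliveryJob)
  ActionMailer::MailDeliveryJob.retry_on StandardError, wait: 0, attempts: 2
end
```

If your app sets a custom delivery_job, the retry_on call would presumably need to go on that class instead.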

logging URI query params with lograge

jrochkind — Wed, 04 Aug 2021 19:35:57 +0000

The lograge gem for taming Rails logs by default will log the path component of the URI, but leave out the query string/query params.

For instance, perhaps you have a URL to your app /search?q=libraries.

lograge will log something like:

method=GET path=/search format=html…

The q=libraries part is completely left out of the log. I kinda want that part, it’s important.

The lograge README provides instructions for “logging request parameters”, by way of the params hash.

I’m going to modify them slightly to:

use the more recent custom_payload config instead of custom_options. (I’m not certain why there are both, but I think mostly for legacy reasons, and the newer custom_payload is what you should reach for?)
If we just put params in there, then a bunch of ugly ActionController::Parameters output shows up in the log if you have nested hash params. We could fix that with params.to_unsafe_h, but…
We should really use request.filtered_parameters instead to make sure we’re not logging anything that’s been filtered out with Rails 6 config.filter_parameters. (Thanks /u/ezekg on reddit). This also converts to an ordinary hash that isn’t ActionController::Parameters, taking care of previous bullet point.
(It kind of seems like lograge README could use a PR updating it?)




  config.lograge.custom_payload do |controller|
    exceptions = %w(controller action format id)
    {
      params: controller.request.filtered_parameters.except(*exceptions)
    }
  end



That gets us a log line that might look something like this:


method=GET path=/search format=html controller=SearchController action=index status=200 duration=107.66 view=87.32 db=29.00 params={"q"=>"foo"}



OK. The params hash isn’t exactly the same as the query string, it can include things not in the URL query string (like controller and action, that we have to strip above, among others), and it can in some cases omit things that are in the query string. It just depends on your routing and other configuration and logic.



The params hash itself is what default rails logs… but what if we just log the actual URL query string instead? Benefits:



it’s easier to search the logs for an exact, specific, known URL (which can get more complicated like /search?q=foo&range%5Byear_facet_isim%5D%5Bbegin%5D=4&source=foo or something). Which is something I sometimes want to do, say I got a URL reported from an error tracking service and now I want to find that exact line in the log.
I actually like having the exact actual URL (well, starting from path) in the logs. 
It’s a lot simpler, we don’t need to filter out controller/action/format/id etc. 
It’s actually a bit more concise? And part of what I’m dealing with in general using lograge is trying to reduce my bytes of logfile for papertrail!



Drawbacks?



if you had some kind of structured log search (I don’t at present, but I guess could with papertrail features by switching to json format?), it might be easier to do something like “find a /search with q=foo and source=ef, without worrying about other params”
To the extent that params hash can include things not in the actual url, is that important to log like that?
….?
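On the first drawback: the query string in a logged path is still machine-parseable, so a script can recover individual params from it. A small sketch using only Ruby’s standard library; the log line format is assumed to be lograge’s key=value output as shown above, and logged_query_params is a made-up helper name.

```ruby
require "uri"

# Pull the path=... value out of a key=value log line and decode its query
# string into a hash (if a key repeats, the last value wins in this sketch).
def logged_query_params(line)
  path = line[/\bpath=(\S+)/, 1]
  return {} unless path

  _, query = path.split("?", 2)
  URI.decode_www_form(query.to_s).to_h
end

line = "method=GET path=/search?q=foo&range%5Byear_facet_isim%5D%5Bbegin%5D=4&source=foo format=html"
params = logged_query_params(line)

# "find a /search with q=foo and source=foo, whatever else is in there"
matches = params["q"] == "foo" && params["source"] == "foo"
```

Nested params come back as flat keys like "range[year_facet_isim][begin]" here, since the standard library decoder does not expand them into nested hashes the way Rails does.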



Curious what other people think… am I crazy for wanting the actual URL in there, not the params hash?



At any rate, it’s pretty easy to do. Note we use filtered_path rather than fullpath to again take account of Rails 6 parameter filtering, and thanks again /u/ezekg: 


  config.lograge.custom_payload do |controller|
    {
      path: controller.request.filtered_path
    }
  end



This is actually overwriting the default path to be one that has the query string too:



method=GET path=/search?q=libraries format=html ...



You could of course add a different key fullpath instead, if you wanted to keep path as it is, perhaps for easier collation in some kind of log analyzing system that wants to group things by same path invariant of query string. 
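That variant only changes the hash key. To show it outside of a Rails app, this sketch uses stand-in structs for the controller and request (in a real app controller.request is an ActionDispatch::Request); FakeRequest, FakeController and the custom_payload lambda are made up for the illustration, and in the app the hash would simply go inside the config.lograge.custom_payload block.

```ruby
# Stand-ins so the block body can be exercised without Rails.
FakeRequest = Struct.new(:filtered_path)
FakeController = Struct.new(:request)

# Same shape as the custom_payload block above, but under a separate
# `fullpath` key, so the default query-free `path` stays as it is.
custom_payload = lambda do |controller|
  { fullpath: controller.request.filtered_path }
end

controller = FakeController.new(FakeRequest.new("/search?q=libraries"))
payload = custom_payload.call(controller)
```

The log line would then presumably carry both path=/search and a fullpath value with the query string, at the cost of a few more bytes per line.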



I’m gonna try this out!



Meanwhile, on lograge…



As long as we’re talking about lograge…. based on commit history, history of Issues and Pull Requests… the fact that CI isn’t currently running (travis.org grr) and doesn’t even try to test on Rails 6.0+ (although lograge seems to work fine)… one might worry that lograge is currently un/under-maintained….  No comment on a GH issue filed in May asking about project status. 



It still seems to be one of the more popular solutions to trying to tame Rails kind of out of control logs. It’s mentioned for instance in docs from papertrail and honeybadger, and many many other blog posts. 



What will its future be? 



Looking around for other possibilities, I found semantic_logger (rails_semantic_logger). It’s got similar features.  It seems to be much more maintained.  It’s got a respectable number of github stars, although not nearly as many as lograge, and it’s not featured in blogs and third-party platform docs nearly as much. 



It’s also a bit more sophisticated and featureful. For better or worse. For instance mainly I’m thinking of how it tries to improve app performance by moving logging to a background thread. This is neat… and also can lead to a whole new class of bug, mysterious warning, or configuration burden. 



For now I’m sticking to the more popular lograge, but I wish it had CI up that was testing with Rails 6.1, at least! 



Incidentally, trying to get Rails to log more compactly like both lograge and rails_semantic_logger do… is somewhat more complicated than you might expect, as demonstrated by the code in both projects that does it! Especially semantic_logger is hundreds of lines of somewhat baroque code split across several files. A refactor of logging around Rails 5 (I think?) to use ActiveSupport::LogSubscriber made it possible to customize Rails logging like this (although I think both lograge and rails_semantic_logger still do some monkey-patching too!), but in the end didn’t make it all that easy or obvious or future-proof. This may discourage too many other alternatives for the initial primary use case of both lograge and rails_semantic_logger: turn a rails action into one log line, with a structured format.



Notes on Cloudfront in front of Rails Assets on Heroku, with CORS
jrochkind — Wed, 23 Jun 2021 16:37:50 +0000

Heroku really recommends using a CDN in front of your Rails app static assets. Unlike in non-heroku circumstances, where a web server like nginx might be taking care of it, on heroku static assets will otherwise be served directly by your Rails app, consuming limited/expensive dyno resources. 



After evaluating a variety of options (including some heroku add-ons), I decided AWS Cloudfront made the most sense for us — simple enough, cheap, and we are already using other direct AWS services (including S3 and SES). 



While heroku has an article on using Cloudfront, which even covers Rails specifically, and even CORS issues specifically, I found it a bit too vague to get me all the way there. And while there are lots of blog posts you can find on this topic, I found many of them outdated (Rails has introduced new API; Cloudfront has also changed its configuration options!), or otherwise spotty/thin. 



So while I’m not an expert on this stuff, I’m going to tell you what I was able to discover, and what I did to set up Cloudfront as a CDN in front of Rails static assets running on heroku, although there’s really nothing at all specific to heroku here, if you have any other context where Rails is directly serving assets in production. 



First how I set up Rails, then Cloudfront, then some notes and concerns. Btw, you might not need to care about CORS here, but one reason you might is if you are serving any fonts (including font-awesome or other icon fonts!) from Rails static assets. 



Rails setup



In config/environments/production.rb


# set heroku config var RAILS_ASSET_HOST to your cloudfront
# hostname, will look like `xxxxxxxx.cloudfront.net`
config.asset_host = ENV['RAILS_ASSET_HOST']

config.public_file_server.headers = {
  # CORS:
  'Access-Control-Allow-Origin' => "*", 
  # tell Cloudfront to cache a long time:
  'Cache-Control' => 'public, max-age=31536000' 
}



Cloudfront Setup



I changed some things from default. The only one that seemed absolutely necessary (if you want CORS to work) was changing Allowed HTTP Methods to include OPTIONS. 



Click on “Create Distribution”. All defaults except:



Origin Domain Name:  your heroku app host like app-name.herokuapp.com
Origin protocol policy:  Switch to “HTTPS Only”. Seems like a good idea to ensure secure traffic between cloudfront and origin, no?
Allowed HTTP Methods: Switch to GET, HEAD, OPTIONS. In my experimentation, necessary for CORS from a browser to work — which AWS docs also suggest.
Cached HTTP Methods: Click “OPTIONS” too now that we’re allowing it, I don’t see any reason not to?
Compress objects automatically: yes. Sprockets is creating .gz versions of all your assets, but they’re going to be completely ignored in a Cloudfront setup either way.  (Is there a way to tell Sprockets to stop doing it? WHO KNOWS not me, it’s so hard to figure out how to reliably talk to Sprockets).  But we can get what it was trying to do by having Cloudfront compress stuff for us, seems like a good idea, Google PageSpeed will like it, etc?  
I noticed by experimentation that Cloudfront will compress CSS and JS (sometimes with brotli sometimes gz, even with the same browser, don’t know how it decides, don’t care), but is smart enough not to bother trying to compress a .jpg or .png (which already has internal compression). 
Comment field: If there’s a way to edit it after you create the distribution, I haven’t found it, so pick a good one!



Notes on CORS



AWS docs here and here suggest for CORS support you also need to configure the Cloudfront distribution to forward additional headers: Origin, and possibly Access-Control-Request-Headers and Access-Control-Request-Method. Which you can do by setting up a custom “cache policy”. Or maybe instead by setting the “Origin Request Policy”. Or maybe instead by setting custom cache header settings differently using the Use legacy cache settings option. It got confusing, and none of these settings seemed to be necessary to me for CORS to be working fine, nor could I see any of these settings making any difference in CloudFront behavior or what headers were included in responses. 



Maybe they would matter more if I were trying to use a more specific Access-Control-Allow-Origin than just setting it to *? But about that….



If you set Access-Control-Allow-Origin to a single host, MDN docs say you have to also return a Vary: Origin header. Easy enough to add that to your Rails config.public_file_server.headers.  But I couldn’t get Cloudfront to forward/return this Vary header with its responses. I tried all manner of cache policy settings, referring to AWS’s quite confusing documentation on the Vary header in Cloudfront and trying to do what it said, and couldn’t get it to happen. 



And what if you actually need more than one allowed origin?  Per the spec for Access-Control-Allow-Origin, as again explained by MDN, you can’t just include more than one in the header, the header is only allowed one: “If the server supports clients from multiple origins, it must return the origin for the specific client making the request.”  And you can’t do that with Rails static/global config.public_file_server.headers, we’d need to use and set up rack-cors instead, or something else. 
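To make concrete why a static header hash can’t do this, here is a rough sketch of the per-request decision that something like rack-cors would have to make for you. This is plain Ruby, not rack-cors configuration; `ALLOWED_ORIGINS`, its placeholder hostnames, and `cors_headers_for` are all made up for illustration:

```ruby
# Hypothetical allow-list; these hostnames are placeholders.
ALLOWED_ORIGINS = ["https://app.example.org", "https://staging.example.org"].freeze

# Builds the CORS-related response headers for one request's Origin header.
# Vary: Origin is always set, since the response differs by requesting origin.
def cors_headers_for(request_origin, allowed: ALLOWED_ORIGINS)
  headers = { "Vary" => "Origin" }
  # Echo back the one requesting origin, and only if it is on the list.
  headers["Access-Control-Allow-Origin"] = request_origin if allowed.include?(request_origin)
  headers
end

cors_headers_for("https://app.example.org")
```

The header value depends on each incoming request, which is exactly what a single global hash in config.public_file_server.headers has no way to express.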



So I just said, eh, * is probably just fine. I don’t think it actually involves any security issues for rails static assets to do this?  I think it’s probably what everyone else is doing? 



The only setup I needed for this to work was setting Cloudfront to allow OPTIONS HTTP method, and setting Rails config.public_file_server.headers to include 'Access-Control-Allow-Origin' => "*".



Notes on Cache-Control max-age



A lot of the existing guides don’t have you setting config.public_file_server.headers to include 'Cache-Control' => 'public, max-age=31536000'. 



But without this, will Cloudfront actually be caching at all? If with every single request to cloudfront, cloudfront makes a request to the Rails app for the asset and just proxies it — we’re not really getting much of the point of using Cloudfront in the first place, to avoid the traffic to our app! 



Well, it turns out yes, Cloudfront will cache anyway. Maybe because of the Cloudfront Default TTL setting? My Default TTL was left at the Cloudfront default, 86400 seconds (one day). So I’d think that maybe Cloudfront would be caching resources for a day when I’m not supplying any Cache-Control or Expires headers?  



In my observation, it was actually caching for less than this though. Maybe an hour? (Want to know if it’s caching or not? Look at headers returned by Cloudfront. One easy way to do this? curl -IXGET https://whatever.cloudfront.net/my/asset.jpg, you’ll see a header either x-cache: Miss from cloudfront or x-cache: Hit from cloudfront). 
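If you’d rather check from Ruby than from curl, something like the following should work. It’s a sketch using only the standard library; the method names are made up here, and it only recognizes the two x-cache values mentioned above, treating anything else as unknown:

```ruby
require "net/http"
require "uri"

# Interprets the value of the x-cache response header returned by Cloudfront.
def cache_status(x_cache)
  case x_cache.to_s
  when /\AHit from cloudfront/i then :hit
  when /\AMiss from cloudfront/i then :miss
  else :unknown
  end
end

# Makes a real GET request to the given asset URL and reports hit/miss.
def cloudfront_cache_status(url)
  response = Net::HTTP.get_response(URI(url))
  cache_status(response["x-cache"])
end
```

Calling `cloudfront_cache_status` twice in a row on the same asset URL is a quick way to see whether the second request comes back as a hit.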



Of course, Cloudfront doesn’t promise to cache for as long as it’s allowed to, it can evict things for its own reasons/policies before then, so maybe that’s all that’s going on. 



Still, Rails assets are fingerprinted, so they are cacheable forever, so why not tell Cloudfront that? Maybe more importantly, if Rails isn’t returning a Cache-Control header, then Cloudfront isn’t either to actual user-agents, which means they won’t know they can cache the response in their own caches, and they’ll keep requesting/checking it on every reload too, which is not great for your far too large CSS and JS application files!



So, I think it’s probably a great idea to set the far-future Cache-Control header with config.public_file_server.headers as I’ve done above. We tell Cloudfront it can cache for the max-allowed-by-spec one year, and this also (I checked) gets Cloudfront to forward the header on to user-agents who will also know they can cache. 
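In case the magic number looks arbitrary, 31536000 is just one 365-day year in seconds. A tiny sketch of the arithmetic; the constant and method names are made up for illustration:

```ruby
# One 365-day year expressed in seconds: 365 * 24 * 60 * 60 = 31_536_000
ONE_YEAR_IN_SECONDS = 365 * 24 * 60 * 60

# Builds the far-future Cache-Control value used in the Rails config above.
def far_future_cache_control(max_age = ONE_YEAR_IN_SECONDS)
  "public, max-age=#{max_age}"
end

far_future_cache_control
```

You could use the same expression directly in config/environments/production.rb if you prefer it to the literal number.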



Note on limiting Cloudfront Distribution to just static assets?



The CloudFront distribution created above will actually proxy/cache our entire Rails app, you could access dynamic actions through it too. That’s not what we intend it for, our app won’t generate any URLs to it that way, but someone could. 



Is that a problem?



I don’t know? 



Some blog posts try to suggest limiting it to only being willing to proxy/cache static assets instead, but this is actually a pain to do for a couple reasons:



Cloudfront has changed their configuration for “path patterns” since many blog posts were written (unless you are using “legacy cache settings” options), such that I’m confused about how to do it at all, if there’s a way to get a distribution to stop caching/proxying/serving anything but a given path pattern anymore?
Modern Rails with webpacker has static assets at both /assets and /packs, so you’d need two path patterns, making it even more confusing. (Why Rails why? Why aren’t packs just at public/assets/packs so all static assets are still under /assets?)



I just gave up on figuring this out and figured it isn’t really a problem that Cloudfront is willing to proxy/cache/serve things I am not intending for it?  Is it? I hope? 



Note on Rails asset_path helper and asset_host



You may have realized that Rails has both asset_path and asset_url helpers for linking to an asset. (And similar helpers with dashes instead of underscores in sass, and probably different implementations, via sass-rails)



Normally asset_path returns a relative URL without a host, and asset_url  returns a URL with a hostname in it. Since using an external asset_host requires we include the host with all URLs for assets to properly target CDN… you might think you have to stop using asset_path anywhere and just use asset_url… You would be wrong. 



It turns out if config.asset_host is set, asset_path starts including the host too. So everything is fine using asset_path.  Not sure if at that point it’s a synonym for asset_url? I think not entirely, because I think in fact once I set config.asset_host, some of my uses of asset_url actually started erroring and failing tests? And I had to actually only use asset_path? In ways I don’t really understand what’s going on and can’t explain it?



Ah, Rails.