Beware sinatra, rails 7.1, rack 3, resque bundler dependency resolution

tldr practical advice for google: If you use resque 2.6.0 or less, and Rails 7.1, and are getting an error: cannot load such file -- rack/showexceptions, you probably need to add gem "rack", "~> 2.0" to your Gemfile!

The latest version of the ruby gem sinatra, as I write this, is 3.1.0, and it does not yet support the recently released rack 3. It correctly specifies that in its gemspec, with gem "rack", "~> 2.2", ">= 2.2.4".

[And as of this writing, that is true in sinatra github main branch too, no work has been done to allow rack 3.x]

The new Rails 7.1 does work with and allow Rack 3.x, as well as still working with Rack 2.x: it allows any rack >= 2.2.4. (That means it claims compatibility with a future rack 4.x too, which seems dangerous, for reasons; read on.)

There is a version of sinatra that (wrongly) specifies working with rack 3.x: Sinatra 1.0 (Released March 2010!) specifies in its gemspec that it will work with any rack >= 1.0. They quickly corrected that in Sinatra 1.1a to say “~> 1.1”, meaning “1.x greater than or equal to 1.1 only”.

But sinatra 1.0 is still there in the repo, as a target for bundler dependency resolution, claiming to work fine with rack 3.x. By the way, sinatra 1.0 is wrong about that, it certainly does not work with rack 3.x. One error you might get from it is cannot load such file -- rack/showexceptions on boot, which is a lot better than a subtler error that only shows up at runtime, for sure!

Do you see where this is going?

I am in the process of updating my app to Rails 7.1. I didn’t even know my app had a sinatra dependency… but it turns out it did: my app uses the latest version of resque, 2.6.0, which has a dependency on sinatra >= 0.9.2.

So okay, poor bundler has to take this dependency tree and create a resolution for it. Rails 7.1 allows rack 2 or 3; resque 2.6.0 allows any sinatra at all; sinatra 1.0 allows any rack, but sinatra 3.1.0 only allows rack 2.x.

There are two possible resolutions that satisfy those restrictions (really more than two if you can use any old version of a dependency), but the one bundler picked was:

rack 3.0.8
sinatra 1.0

Which then failed CI because sinatra 1.0 doesn’t really work with rack 3.x.

The other possible resolution would have been rack 2.2.8 and sinatra 3.1.0.

That’s the one I actually want.

To help it along I just need to add gem "rack", "~> 2.0" to my Gemfile. This was a bit confusing to debug!
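To make the resolution problem concrete, here is a toy sketch in Python. It is not how bundler actually works, and it implements only a tiny subset of RubyGems version semantics. The helper names (parse, satisfies, resolutions) and the choice of just two candidate versions per gem are assumptions for illustration; the constraints themselves are the ones quoted above. It enumerates which rack/sinatra pairs satisfy every constraint, with and without a rack "~> 2.0" pin in the Gemfile.

```python
from itertools import product

def parse(version):
    """Turn '2.2.8' into a comparable tuple (2, 2, 8)."""
    return tuple(int(part) for part in version.split("."))

def satisfies(version, constraint):
    """Check one RubyGems-style constraint such as '>= 2.2.4' or '~> 2.2'.

    Simplified: only '>=' and '~>' with at least two segments are handled.
    """
    op, target = constraint.split()
    v, t = parse(version), parse(target)
    if op == ">=":
        return v >= t
    if op == "~>":
        # Pessimistic operator: at least `target`, but below the next bump
        # of the second-to-last segment ('~> 2.2' means >= 2.2 and < 3).
        upper = t[:-2] + (t[-2] + 1,)
        return t <= v < upper
    raise ValueError(f"unsupported operator: {op}")

# Candidate versions (a deliberately tiny, assumed subset of what exists).
RACK_VERSIONS = ["2.2.8", "3.0.8"]
# What each sinatra version's gemspec says about rack.
SINATRA_RACK_DEPS = {
    "1.0": [">= 1.0"],
    "3.1.0": ["~> 2.2", ">= 2.2.4"],
}
RAILS_RACK_DEP = [">= 2.2.4"]      # Rails 7.1 (actionpack)
RESQUE_SINATRA_DEP = [">= 0.9.2"]  # resque 2.6.0

def resolutions(gemfile_rack_constraints=()):
    """Return every (rack, sinatra) pair that satisfies all constraints."""
    found = []
    for rack, sinatra in product(RACK_VERSIONS, SINATRA_RACK_DEPS):
        rack_rules = (
            RAILS_RACK_DEP
            + SINATRA_RACK_DEPS[sinatra]
            + list(gemfile_rack_constraints)
        )
        rack_ok = all(satisfies(rack, c) for c in rack_rules)
        sinatra_ok = all(satisfies(sinatra, c) for c in RESQUE_SINATRA_DEP)
        if rack_ok and sinatra_ok:
            found.append((rack, sinatra))
    return found

print(resolutions())            # no pin in the Gemfile
print(resolutions(["~> 2.0"]))  # with the rack pin
```

Without the pin, the broken rack 3.0.8 + sinatra 1.0 pair is a legal answer. With it, every remaining pair uses rack 2.x; which sinatra gets chosen among those comes down to bundler preferring newer versions, which this sketch does not model.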

What is the problem? The danger of open-ended gem dependencies

So the problem here is sinatra 1.0 (ten years ago!) claiming it supported any rack version no matter how high! It should have said ~> 1.0 meaning “1.x, but not 2” — how could it possibly predict it would work with rack 2, or 3, or 4, or 9.0?

If sinatra 1.0 had put an upper bound on the version of rack it would work with, bundler would have done the ‘correct’ (to us humans) resolution out of the box, because the ‘wrong’ one it did would not have been available as satisfying all restrictions. Doing an open-ended spec like this leaves a bomb that can get someone more than a decade later, as it did here.

And Rails is still doing that! actionpack 7.1.x says it works with any rack >= 2.2.4. It ought to add in a < 4 there: it knows it works with rack 2.x and 3.x, but how can it predict it works with rack 4.x or 5.x, which don’t exist at all yet? It’s leaving the same bomb for bundler dependency resolution in the future that sinatra 1.0 did, and there’s no real way to fix it once the versions are out there.

Alternately, if sinatra released a version that did support rack 3, and said so, bundler would preferentially choose that version, with rack 3, and we wouldn’t have a problem. (Bundler’s dependency resolution is actually really amazing; it’s amazing how often it makes the “right” choice among many possible versions that would satisfy all dependency restrictions.) I’m not sure how much maintenance energy sinatra is getting, but eventually it’s going to have to get there or there’s going to be a conflict with something that has sinatra in its dependency tree and also has something that requires rack 3 in its dependency tree.

And more immediately… resque says it works with any sinatra >= 0.9.2 (released in 2009)…. but does it really? Who knows. Releasing a resque that says it needs, oh, sinatra >= 2.0 (released 2017) might help bundler come to a more satisfying dependency resolution… or could just result in bundler deciding to use an old version of resque so it can use an old version of sinatra which says (incorrectly!) it supports rack 3…. hard to predict. But maybe I’ll PR resque. But resque is also not exactly overflowing with maintenance applied to it these days…

Eventually I just need to switch away from resque. I have my eye on good_job.

S3 CORS headers proxied by CloudFront require HEAD not just GET?

I’m not totally sure what happened, but the tldr is that at the end of last week, our video.js-played HLS videos served from an S3 bucket via CloudFront appear to have started requiring us to list “HEAD” in the “AllowedMethods” for CORS configuration, in addition to the pre-existing “GET”.

I’m curious if anyone else has any insight into what’s going on there… I have some vague guesses at the end, but still don’t really have a handle on it.

Our setup: HLS video from S3 buckets

We use the open-source video.js to display some video, in the HLS format. Which involves linking to a .m3u8 manifest file, which is the first file the user-agent will request.

When implementing, we discovered that if the .m3u8 and other HLS files are on a different domain than the web page running the JS, you need the server hosting the HLS files to supply CORS headers. Makes sense, reasonable.

Our HLS files are on a public S3 bucket. We also have a simple CloudFront distribution in front of the public S3 bucket.

We set this CORS policy on the S3 bucket, probably one I just found/copy/pasted at some point. (CORS policies on S3 are now set, I think, only in JSON form; in the past they could be XML and you can find XML examples too). (warning, may not be sufficient)

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 43200
    }
]

And for a long time, that just worked. The S3 bucket responded with proper CORS headers for video.js to work. The CloudFront distribution appropriately cached/forwarded the response with those headers. (note * as allowed origin, so the cache is not origin-specific, which should be fine then!)

Last week it broke? How I investigated

Some time around Wednesday Oct 4-Thursday Oct 5th, our browser video display started breaking. In a very hard to reproduce way.

Some viewers got the error video.js gives when it can’t fetch the video source (a network failure, for instance, might give you this same error message):

“Media could not be loaded, either because the server or network failed or because the format is not supported.”

(and, by the way, this error could happen on new videos at new urls that didn’t exist 24 hours previous…)

Once a developer managed to reproduce this, looking in Dev Tools console in the browser, we could see a CORS error reported:

Access to XMLHttpRequest at ‘[url]’ has been blocked by CORS policy: No ‘Access-Control-Allow-Origin’ header is present on the requested resource.

It took me a bit to figure out how to investigate whether CORS headers were being returned appropriately or not. It turns out that S3, at least, only returns the CORS headers when an Origin header is present in the request, and it matches the CORS rules (the second condition, in this case, should be universal, as our allowed origin is *). Maybe this is how CORS always works?

So we could investigate like so, using verbose mode to see headers from a GET request:

curl -H "Origin: https://our-example.org" -v "https://some-s3-or-cloudfront/etc"

Doing this, I discovered that for some people a CloudFront request as above would return CORS headers (we’re looking for eg Access-Control-Allow-Origin: * in the response!), and other times it wouldn’t! CloudFront headers include an x-amz-cf-pop header, which reminded me, right, there are different CloudFront POPs different people could be connecting to… okay, so some CloudFront POPs are returning the CORS headers and others not? Which kind of violates my model of CloudFront; I thought POPs would be synchronized to always return the same content, but who knows.
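For repeated checks, the same probe can be scripted so that the HTTP method is always explicit and the POP is printed next to the CORS result. This is only a sketch using Python's standard library; the function names (probe, summarize) are made up here, and the URL and origin in the usage comment are the same placeholders as in the curl command.

```python
import urllib.request

def summarize(headers):
    """Pick out the two response headers this investigation cares about."""
    return {
        "allow_origin": headers.get("Access-Control-Allow-Origin"),
        "pop": headers.get("x-amz-cf-pop"),
    }

def probe(url, origin, method="GET"):
    """Send one request with an Origin header, using an explicit HTTP method."""
    request = urllib.request.Request(url, headers={"Origin": origin}, method=method)
    with urllib.request.urlopen(request) as response:
        return summarize(response.headers)

# Example (needs network access; URL and origin are placeholders):
# for method in ("GET", "HEAD"):
#     print(method, probe("https://some-s3-or-cloudfront/etc",
#                         "https://our-example.org", method))
```

A missing CORS header shows up as allow_origin being None, and running it from a few different networks gives a rough picture of which POPs answer which way.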

But okay then, was the S3 original source returning CORS headers?

Well, to make matters more confusing, I made a mistake which ultimately led me to the solution too. Instead of doing curl -v, I had originally been doing curl -I, which I had come to think of as “just show me the response headers not body”, but of course actually is a synonym for --head and tells curl to do an HTTP HEAD method request.

And I configured S3 to allow only GET method, so, no, when I did a HEAD request to the direct S3 source, no CORS headers were included, duh. If I did it with GET they were.
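The behavior observed so far can be written down as a toy model. This is only a sketch of what the curl experiments showed (no Origin header means no CORS headers; a method not listed in AllowedMethods means no CORS headers), not S3's actual implementation, and the function name cors_headers is made up for illustration.

```python
def cors_headers(rule, method, origin=None):
    """Toy model: which CORS headers would one bucket rule add to a response?"""
    if origin is None:
        return {}  # no Origin header on the request: nothing is added
    if method not in rule["AllowedMethods"]:
        return {}  # e.g. a HEAD request against a GET-only rule
    allowed = rule["AllowedOrigins"]
    if "*" in allowed:
        return {"Access-Control-Allow-Origin": "*"}
    if origin in allowed:
        return {"Access-Control-Allow-Origin": origin}
    return {}

get_only = {"AllowedMethods": ["GET"], "AllowedOrigins": ["*"]}

print(cors_headers(get_only, "GET", "https://our-example.org"))
print(cors_headers(get_only, "HEAD", "https://our-example.org"))
print(cors_headers(get_only, "GET"))
```

Under this model the GET request with an Origin gets the header, while the HEAD request and the Origin-less request get nothing, which matches what curl -v and curl -I showed against the bucket.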

I actually didn’t totally realize what was going on at first (really forgot that -I was a HEAD request to curl, not a GET where it only showed me response headers!)…. but something about this experience, and while googling seeing an occasional S3 CORS example that included HEAD as well as GET in allowed methods…

Led me to try just adding HEAD to my AllowedMethods… So now this is my public S3 bucket’s CORS policy:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET",
            "HEAD"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 43200
    }
]

And… this seemed to fix things? Along with clearing the CloudFront cache though, to make sure it wasn’t serving bad response headers from cache, so that could have played a role too.

At this point I really don’t understand what was going on, why/how I fixed it more or less accidentally, how I got lucky enough to fix it… or honestly if I even really did fix it?

What is going on anyway?

We have had this system in place for over a year, with no changes I know about — no changes to S3 or CloudFront configuration, or to video.js version. What changed?

I feel like the symptoms probably mean that CloudFront is sometimes doing a HEAD request to S3 for these files, and caching the response headers, and then using those cached response headers from a HEAD request on a GET request response… but why would it do that? And again why would it start encountering this situation now after a long time working fine?

At first I wondered, wait… we’ve had this setup for about a year… and we tell CloudFront to cache these responses (with content-unique URLs) for the HTTP max cache age of a year…. has our content just started to exceed its year max-age… so now CloudFront is maybe doing some conditional HEAD requests to S3 to see if its cache is still good (it is, Etag unchanged)… and for some reason it uses the CORS headers it gets back from there to update its cached headers, while still using its original cached body?

That seemed maybe plausible (if unclear whether it was a defensible thing for CloudFront to do), but then I remembered — no, we are seeing this problem too with new content and URLs that have only existed for less than 24 hours, so it can’t be a case of year-old content that CloudFront has been caching for a year.

I’m pretty mystified. Why did this start breaking now, after working for over a year with no known changes? Has something on S3’s end changed with how it executes CORS policies to produce CORS headers? Or something on CloudFront changed with how it forwards/caches them? Or something in browsers or video.js changed with regard to exactly what requests are made? (is the browser now making HEAD requests for this content, and requiring CORS headers on response, in places it didn’t before? But that doesn’t explain why CloudFront POPs were giving me unexpectedly inconsistent results to GET requests, sometimes including CORS headers in response sometimes not!)

AND I don’t really understand why I have to include HEAD in my S3 CORS policy at all — I hadn’t been expecting to need to authorize HEAD requests via CORS, I expected video.js would be doing GET requests, and that’s all I’d need to authorize.

So I seem to have fixed the problem… but I never like it when I don’t understand my own solution. Have I really fixed it, or will it just come back?

Googling I can not find anything that seems relevant to this at all. Should anyone using CloudFront in front of a public S3 bucket, where responses need origin * CORS headers — always include HEAD as well as GET in AllowedMethods? Is this really such a weird situation? Why can’t I find anyone talking about it? Is it for some reason special to HLS video?

So anyway, I blog this. Hoping that someone else running into a mysterious problem will find this post, when I could find nothing! And hoping for the even slimmer chance that someone will see this who thinks they know exactly what was going on and can explain it!

Investigating OCR and Text PDFs from Digital Collections

At the Science History Institute Digital Collections, we have a fairly small collection compared to some peers (~70,000 images) of historical materials. Many of those images are of text: Books, pamphlets, advertisements, memos, etc.

We haven’t previously done any OCR (Optical Character Recognition), but started thinking about doing that. In addition to using captured text for the site-wide search, it made sense to us to look to providing:

- “Search inside the book/work”, with results highlighted on in-browser page images, such as provided by this example from Internet Archive using their own viewer, or this example from NCSU using UniversalViewer (search for say “corn”, note how results are highlighted on the page image)
- Downloadable PDFs with a “text layer”: select-copyable and searchable text in the PDF, as in this corresponding example from Internet Archive.

Both of those use cases require not just text out of OCR, but text with position information so it can be overlaid on the page image.

We decided to investigate starting with the PDF-with-text-layer as the first product, because I (naively, I think!) believed this would be straightforward to do, and because we have user research indications that some portion of our users really love PDFs (which I think would be common among at least any user groups of academics, probably others too).

I had to do a lot of research to understand the technologies, techniques, and tools that are out there in this domain. So I capture my findings here in a giant blog post, partially to capture my own notes, and ideally to help give someone else a head start. (This recorded conference presentation from Merlijn Wajer at Internet Archive is also a good overview of the technical ecosystem! Merlijn and the IA are very central figures in what open source work is going on in this area!)

I was a bit bewildered to notice that few of our peers seemed to offer PDF downloads (especially with a text layer), as I’m pretty confident our collective users would want this. I then discovered the tooling is somewhat limited, and used to be worse. I’m not sure and am curious why our sector/field/industry hasn’t invested more in software development to create better tooling here!

tldr Summary of findings and plan

So after some initial research, I had discovered that several peers used the hOCR format to represent OCR-with-position output, to power their in-browser online “highlight a search result on a scanned page” interfaces.

I somewhat naively figured I could use my choice of OCR engines (the open source tesseract seemed most popular; or maybe Amazon Textract?) to produce hOCR, which I’d do on image ingest.

And then I imagined (and thought I saw confirmed based on a bit of googling), I’d have my choice of tools to combine the hOCR with raster images into PDFs. Ideally I could use a fast compiled-C tool that could easily be installed via apt on ubuntu.

It turns out that was over-optimistic. hOCR isn’t quite as widely inter-operably standard as hoped. Tools for rendering a PDF based on hOCR are in fact very limited (and mostly python). There is a field of abandoned and not-super-robust partial solutions.

In fact, I only found one piece of software that could do this job well, the Internet Archive’s archive-pdf-tools. And even it does not currently seem to do as good a job of positioning text in the PDF as tesseract itself does (although it intends to port tesseract’s logic). It’s also a bit difficult to install, and may not be installable on MacOS. It’s not super widely-used software.

Later I also discovered a clever kind of compression meant for this kind of textual PDF, called Mixed Raster Content (MRC). When this works, it can really reduce the file-size of otherwise enormous PDFs of hundreds of pages of page images. And in fact there is only one working open-source implementation of this as far as I can tell, again the Internet Archive’s archive-pdf-tools. (There are implementations in commercial software, I have not evaluated them, more later).

The rest of this post will be some lengthy notes going into my findings about PDF technology; and evaluation of all the software I could find and easily evaluate that would render hOCR to a PDF or be otherwise useful; and tips and tricks and difficulties in using that software.

But in the end, I identified basically only two realistic paths to get to PDFs with text layer from my scanned TIFFs.

1. Have tesseract output a “text-only” PDF (a tiny PDF that includes only invisible “text layer”), and then use another tool (such as qpdf) to combine it with raster images.
   - tesseract just does a better job of laying out text in the PDFs it outputs than any other tool I could find (and there aren’t many). Although archive-pdf-tools intends to match tesseract, with a port of tesseract’s logic even, it’s not currently doing so.
   - Optionally, you could take the output of this, and try to run it through archive-pdf-tools’ experimental pdfcomp tool, to apply the MRC compression to a PDF it did not itself create. (I haven’t yet figured out how to access/run this experimental tool)
   - If we aren’t doing MRC compression, look into using JP2 instead of JPEG; turns out PDF supports JP2, and it may be a smaller file for same quality.
   - I’m still going to need the hOCR for future online-search applications, so I’ll be having tesseract output both hOCR and the text-layer PDF, one way or another, and storing them both.
   - This does not give us the opportunity to manually correct hOCR. If we wanted to correct PDFs (perhaps for accessibility), it would have to be directly on the PDF, and would not apply corrections to other uses of the hOCR.
   - Details on this approach below in the tesseract section.
2. Have tesseract output hOCR (probably at time of ingest), and then use archive-pdf-tools’ recode_pdf tool to assemble the hOCR with raster images into PDFs.
   - At present, we won’t get quite as good text positioning as with tesseract.
   - It’s a bit harder to get installed (and may not be installable on our dev box). That would also apply if we tried to use pdfcomp for compression in path 1, but in path 1 it’s an optional add-on; here we’d have to get it solved from the get-go.
   - But the text positioning is better than anything else I found but tesseract, and we’ll get good MRC compression out of it too.
   - This would give us an opportunity to apply corrections to hOCR (perhaps for accessibility remediation) and (re-)generate PDFs accordingly, if we had a workflow and tooling for that.
   - recode_pdf starts with TIFFs, so the PDF-generation process is going to be a bit slower and more resource-intensive.
   - Details of this approach below in the archive-pdf-tools section.
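As a rough illustration of the first path, the two command lines might be assembled like this. It is a sketch under assumptions: it relies on tesseract's textonly_pdf config variable and qpdf's --overlay option as I understand them, the file names and function names are hypothetical, and it assumes a raster-only PDF of the same page (same dimensions) has already been made by some other tool.

```python
import subprocess

def tesseract_cmd(tiff_path, output_base):
    """Ask tesseract for a text-only PDF plus hOCR for one page image."""
    # `-c textonly_pdf=1` requests a PDF holding only the invisible text;
    # the trailing `pdf` and `hocr` config names request both output formats,
    # written to <output_base>.pdf and <output_base>.hocr.
    return ["tesseract", tiff_path, output_base, "-c", "textonly_pdf=1", "pdf", "hocr"]

def qpdf_overlay_cmd(image_pdf, text_pdf, out_pdf):
    """Stamp the text-only PDF's pages onto the matching raster-only pages."""
    return ["qpdf", image_pdf, "--overlay", text_pdf, "--", out_pdf]

def build_page_pdf(tiff_path, image_pdf, output_base):
    """Run both steps for a single page (assumes both tools are installed)."""
    subprocess.run(tesseract_cmd(tiff_path, output_base), check=True)
    combined = output_base + "-combined.pdf"
    subprocess.run(
        qpdf_overlay_cmd(image_pdf, output_base + ".pdf", combined), check=True
    )
    return combined

print(" ".join(tesseract_cmd("page.tif", "page")))
print(" ".join(qpdf_overlay_cmd("page-image.pdf", "page.pdf", "page-combined.pdf")))
```

Keeping the command construction in small functions makes it easy to log or dry-run the exact invocations before pointing them at real files.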

With either of those paths, it might be convenient to generate single-image single-page PDFs at file ingestion time, and then combine them into an aggregated multi-page PDF on demand. This makes it somewhat easier to deal with the fact that our app allows staff to add/remove or publish/unpublish individual pages on demand, which would invalidate generated PDFs. This approach would wind up with duplicated copies of an embedded font, but tesseract’s embedded “glyphless” font is only ~600 bytes, less than 1% of the likely sizes of outputs.

Anyway, these are pretty much the only options I came up with after much investigation of software that didn’t quite cut the mustard. It turns out going from hOCR to positioned text in a PDF is non-trivial; different tools do it differently, and some not as well as others. Other open source software investigated (there isn’t a lot!):

- ocropus/hocr-tools: a python package including a hocr-pdf tool for rendering PDF from hOCR. Didn’t do a great job positioning the hOCR, and was unable to handle positioning non-completely-horizontal lines diagonally, which tesseract and archive-pdf-tools were.
- eloops/hocr2pdf: a Javascript package that was meant by its original author as just a proof-of-concept experiment and hadn’t been touched in a while; did not do a good job of positioning.
- Exactimage hocr2pdf: At first appeared to be the compiled C hocr=>pdf tool I imagined existed. But it seems to be old unmaintained software, and I could not get it to work with contemporary tesseract hOCR.
- pdfbeads: Ancient ruby software that can hypothetically do hOCR positioning and a MRC-like compression. I could not get its MRC to work for me; its hOCR positioning was inferior to archive-pdf-tools’; and it’s weird zombie software with unclear mainline source repository.

You’ve now gotten the important bits of this post summarized. In the remainder of this post, we have more musings on state of the field, context of technologies available, and notes on individual software packages reviewed — it’s a LOT of stuff. I am not certain how useful others may find these notes on what I have discovered!

Other options? Commercial options? State of the market?

I just couldn’t find many tools for eg hOCR rendering — although there may be more in the .Net world. There are some relevant commercial offerings here, that deal with OCR and PDF generation. They are often Windows-only, and often GUI software meant for someone to be operating as part of a scanning workflow. I think the market may be “corporate document management”. Some (or maybe just one?) of them claim to do MRC compression. Some of them have cloud “SDKs”. (as far as actual local SDKs, the market seems to be only for .Net).

I got the feeling that there was a lot of collaborative open-source energy on these techniques, for purposes of “ebooks” and “scanned books” 10-15 years ago (around the time of Google Books introduction?), but that it sort of petered out. This does not seem to be something our library and archive institutions have invested in. Thanks to the Internet Archive for being the main player working in this field and releasing open source tools! (Here is a video from the Internet Archive’s Merlijn Wajer that explains their procedures and how in 2020 they moved to an open source stack here. It also serves as a great overview of the technologies and tools discussed in this blog post.)

With few open source options, I would be potentially willing to pay for an appropriate tool at the right price. But the publicly-available documentation and general “developer experience” of commercial tools tends to be even worse than open source, it’s very difficult to even figure out if it’s going to work for you. I have a few notes on commercial tools in the “MRC Compression” section below, but mostly I have not spent the time to understand the market.

OCR: Tesseract is the open source option

Optical Character Recognition, or OCR, is the process of taking an image, and extracting the text from it as text.

As far as I can tell, Tesseract is the only current widely used (or at all?) open source OCR option.

There were other packages at one time popular, but for instance I don’t believe “Cuneiform” is currently being maintained or getting much use. (Wikipedia says last cuneiform release was in 2011, so).

Tesseract is currently at version 5.x (5.0 released Nov 2021) — but Ubuntu 22 apt repo still only has the latest 4.x release. And when I tried asking library field peers, it seems most are currently still using latest 4.x release. Tesseract 4.0 actually introduced “a new neural network-based recognition engine” (although it still supports using models with the old engine too, I think), so earlier than 4.x would be a really different product, but 4.x already has you on the new engine.

Tesseract works with human-language-based models, so you have to tell it which languages you expect in a document (you can tell it more than one). It has official support for a lot of human languages (including some historical early-modern ones). It does not, as far as I know, have official support for handwriting (rather than type-set) recognition.

It is also possible to train your own models for tesseract, and some people may be sharing non-official trained models for certain kinds of materials. I am not certain if I’ve seen any such that use the new “neural network-based recognition engine”, and at any rate I haven’t spent any time investigating this area.

On ubuntu, you can install tesseract with apt-get install tesseract-ocr, and install individual language packs with eg apt-get install tesseract-ocr-deu (you need to look up the appropriate tesseract language code). On MacOS, you can install tesseract with brew install tesseract, and install all supported language packs at once with brew install tesseract-lang.

For officially supported language packs, there are “FAST” and “BEST” model variants available. The distribution packages above will install the “FAST” packages. The “FAST” packages are smaller on disk and intended to result in much faster operation, with only slightly decreased accuracy. If you want to install and use the “BEST” packages instead… I am not sure how, and have not spent time with them or comparing.

Other OCR options? Commercial? AWS Textract?

I looked briefly at AWS Textract. It only handles six major European languages. BUT it claims to be able to recognize hand-writing? We def have hand-written items in the collection, would be big if it worked well.

It has all sorts of fancy tools for recognizing structured text on various types of business documents (invoices, business cards) that are mostly not of concern to us. It does not produce hOCR, but does produce its own JSON format that maybe could be converted to hOCR, although a converter isn’t included in this project I found of other hOCR conversions.

If I understand the pricing properly, at $15/1000 pages it’s quite expensive. We estimated the price of CPU time on heroku using tesseract to be 100x less.
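To put rough numbers on that, here is the back-of-envelope arithmetic for a one-time pass over the whole collection. It assumes the approximate ~70,000 image count mentioned earlier, one billed “page” per image, and takes the “100x less” estimate at face value; the variable names are just for illustration.

```python
IMAGES = 70_000                   # approximate collection size
TEXTRACT_USD_PER_1000_PAGES = 15  # price as understood above

# Multiply before dividing so the arithmetic stays exact.
textract_total = IMAGES * TEXTRACT_USD_PER_1000_PAGES / 1000
tesseract_total = textract_total / 100  # "100x less" estimate

print(f"Textract:  ${textract_total:,.2f}")
print(f"tesseract: ${tesseract_total:,.2f}")
```

So on these assumptions the whole collection is on the order of a thousand dollars with Textract versus roughly ten dollars of CPU time with tesseract.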

Perhaps we’d investigate in the future to expand our processes to OCR’ing handwriting. But first the lower-hanging fruit.

There are other commercial OCR solutions, including Google Cloud Vision, and lots and lots of Windows-based “document management workflow” solutions, that I haven’t really even looked at.

Note on Accessibility and OCR

Automatically OCR’d text does not necessarily produce an “accessible” copy, say for people with vision impairments. While current OCR results from eg tesseract are surprisingly good, and provide a good product for “searchability”, they still include too many errors to be simply read as a primary text, as you can see if you look at the text alone.

Additionally, in PDF form, I am told for accessibility for many purposes the text really needs to be “tagged” in a way that simple OCR will not produce.

It is almost certain that we do not have the resources to produce this level of accessibility for the tens of thousands of page images in our corpus. While adding machine-generated OCR may increase accessibility somewhat for some purposes, contexts, and users, it definitely is going to leave a lot of people out, people who have vision impairments among others.

Another possible intervention is that we could provide a clear function for users to request accessible/remediated PDF (or other) copies on a per-work basis. It’s still not totally clear how we’d best provide that, whether we’d do remediation in-house (and using what tools and workflows), or perhaps send PDFs to vendors to produce accessible copies (this is not cheap, it looks like maybe $1/page or more for PDFs with accessible tagging, although I’m not certain).

In my imagination, a well-engineered process for remediating OCR might involve producing hOCR, then correcting the hOCR that is then used to (re-)generate PDFs. This way the corrections would also apply to other uses of the (h)OCR such as indexing for collection-wide search in Solr, or for search-inside-the-page with highlighting features offered via a web browser.

However, the tooling for this seems to be pretty limited, and this kind of workflow does not in fact seem to be common. hocrjs is a possibly still-maintained tool for viewing hOCR in a browser. It could be a building block for making a GUI for reviewing/fixing hOCR (which maybe Internet Archive has for their own use, see this video?). Here is a more full-featured proof-of-concept for actually editing/correcting hOCR. Alternately, hocr-proofreader seems to be a proof-of-concept not “finished” into actually supporting some kinds of review and editing of hOCR. While the notes suggest it’s not ‘finished’, it is a very impressive proof-of-concept; check out the demo!

Of course, even if you corrected typos in the hOCR, that wouldn’t necessarily give you enough for the accessible “tagging” in the PDF. (Is there even a format that can capture OCR-with-position and all the semantics necessary to produce PDF tagging too? I don’t think it’s hOCR. The state of the ecosystem is underwhelming here).

A more realistic approach for the existing eco-system might be remediating a PDF as PDF (either sending to a vendor, or in-house with tools like Acrobat Pro or Abbyy FineReader) — and then extracting the (corrected) text from it as hOCR, to put into our system for other uses. The Internet Archive archive-hocr-tools project has a script that can extract a text layer from PDF to hocr, although it’s not mentioned in project readme (I might PR this), I’m not sure how I found it!

Some PDF tech details

What is a “PDF with text-layer” anyway?

PDFs don’t actually have “layers” or “text layers”. But this is shorthand for a PDF that includes actual computer-readable text in addition, in these cases, to a “raster” (pixel-based) image of a photograph of a physical text.

The PDF text isn’t in a “layer”, it’s just individual pieces of text positioned in the PDF. PDF actually has a “rendering mode” (constant 3) for non-displayed text. (See this StackOverflow for some discussion).

In the kind of PDFs we’re talking about there are non-displayed text objects positioned in the same place/size as the words in the picture, so you can select (to copy and paste) text, and it looks like you are selecting the image itself. And you can “search within the document” in a PDF viewer, and it will highlight your results, looking like it’s highlighting the image itself.
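For a feel of what that looks like inside the file, here is a sketch that builds the content-stream operators for one invisible word. It uses the standard PDF text operators (BT/ET, Tr, Tf, Td, Tj), but it is only a fragment, not a complete PDF: the font resource name F1, the coordinates, and the function name are assumptions, and real tools also scale and rotate the text to match the word in the image.

```python
def invisible_text_ops(text, x, y, font_name="F1", font_size=12):
    """Build PDF content-stream operators that place `text` invisibly at (x, y)."""
    # Backslashes and parentheses must be escaped inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return (
        "BT\n"                             # begin text object
        "3 Tr\n"                           # rendering mode 3: neither fill nor stroke
        f"/{font_name} {font_size} Tf\n"   # select font and size
        f"{x} {y} Td\n"                    # move to the word's position
        f"({escaped}) Tj\n"                # show (invisibly) the text
        "ET"                               # end text object
    )

print(invisible_text_ops("corn", 72, 720))
```

A viewer still treats that text as selectable and searchable, which is what makes the selection appear to land on the image underneath.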

Even though the text is not displayed, it needs to be associated with a font. There are fonts that are “built-in” to PDF, but they can only display characters in traditional “Latin-1” character sets. Displaying text in this pre-Unicode-asendance format is a bit tricky if you are trying to do it yourself with raw bytes. Fonts in a PDF can be embedded in the PDF itself — and typically are for this sort of thing — to make sure the text can be displayed (or possibly even interpreted at all?) on a machine without the chosen font installed.

Text that isn’t even going to be displayed can get by with just a bare-bones stub of a font, a “glyphless” font, since it doesn’t actually have to display; it just needs to be encoded as machine-readable text. Tesseract, for instance, seems to use its own TrueType “glyphless font” that weighs only 572 bytes. It has in the past sometimes had to be tweaked; almost anything you want to do with a PDF ends up being non-trivial to do reliably.

HOCR and Alto: Formats to Represent OCR data with positions

You could do an OCR operation and just get text out. But if you want to overlay the text on top of the scanned image for select-copy-paste or search-result-highlighting, you need position information too.

Are there standard interchangeable formats that encode this information? Yes…. sort of.

The most popular one seems to be hOCR. It literally is an HTML document, with <p> for paragraphs and <div>s and <span>s, that embeds positional and other information in title attributes. (Flashback to “HTML microformats” for anyone else? Never mind).
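To make the “positions in title attributes” idea concrete, here is a small python sketch that pulls word bounding boxes out of an hOCR fragment using only the standard library. The sample markup is hand-written in the style tesseract produces (class `ocrx_word`, a `bbox x0 y0 x1 y1` property in pixels); the class and function names are my own, and hOCR from other producers may well differ.

```python
from html.parser import HTMLParser

SAMPLE_HOCR = """
<div class='ocr_page' title='image "page.tiff"; bbox 0 0 2480 3508'>
  <span class='ocr_line' title='bbox 300 400 900 460'>
    <span class='ocrx_word' title='bbox 300 400 520 460; x_wconf 96'>The</span>
    <span class='ocrx_word' title='bbox 540 400 900 460; x_wconf 91'>liquor</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Find the 'bbox x0 y0 x1 y1' property in a semicolon-separated title."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(n) for n in parts[1:5])
    return None


class WordBoxParser(HTMLParser):
    """Collect (word text, bbox) pairs from ocrx_word elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._pending_bbox = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("class") == "ocrx_word":
            self._pending_bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # The first non-blank text after an ocrx_word start tag is the word.
        if self._pending_bbox is not None and data.strip():
            self.words.append((data.strip(), self._pending_bbox))
            self._pending_bbox = None


parser = WordBoxParser()
parser.feed(SAMPLE_HOCR)
for text, bbox in parser.words:
    print(text, bbox)
```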

When I asked around for colleagues to see what they were using to power online on-scanned-page search-highlighting, the answer was hOCR. tesseract can output hocr. There were several tools I found that could take hOCR as input.

The thing is… it’s unclear how well hOCR actually serves as a mutually-intelligible interchange format. Going back to 2016, there has been some concern that hOCR allows too much variation and hOCR from different producers may not truly be mutually intelligible. I think some of the tools I found that take “hOCR” as input may really only work with tesseract hOCR, and maybe even only certain versions of tesseract.

At the moment, there seem to be very few pieces of currently-maintained software that produce hOCR directly. (tesseract and… maybe there’s another open source package called Kraken? And a couple other barely- or non-maintained little-used open source packages).

As far as I can tell, most proprietary/commercial solutions cannot read or write hOCR; they mostly use their own proprietary XML formats, if anything. Hypothetically you could translate from and to hOCR, and for some formats there are tools that claim to do so. Github cneud/ocr-conversion is a repository of scripts to convert between various OCR-position formats; it contains scripts to convert FROM several vendor formats (including Abbyy) to hOCR, but not usually the reverse.

There is another similar format, endorsed by the Library of Congress, called ALTO, which some think is technically superior, but it doesn’t seem to be supported by very many (any?) tools. (Tesseract can output ALTO, although it isn’t very well documented).

The end result is that this field isn’t quite as standards-based inter-operable as I had hoped/assumed.

MRC compression

So, raster (pixel) images are big, especially when you have hundreds of them. In our current application, we’re making PDFs out of 1200 pixel JPGs (made at default JPG compression level). The PDF for one particular 700-page book is 325MB. That’s a big file.

You could reduce the resolution or image quality. But 1200 pixels is already only ~150dpi for an 8.5″/11″ page, and increasing JPG compression may introduce noticeable artifacts in some images, although we could experiment with this more. (If you do want to reduce byte size, do you get better perceived quality for the reduced size with less resolution or more JPG compression? I suspect keeping the resolution but increasing compression is the way to go, but I’m not sure).

However, it turns out someone (maybe these guys in 1998?) invented a very clever way to apply higher compression with less loss of perceived quality, specifically for the kinds of images likely in scanned books or scanned text. Called “Mixed Raster Content” (MRC), or “hyper-compression”, it involves separating the page “background” (which can be highly compressed) from any embedded graphics and text (which can’t be compressed as much without noticeable problems, especially the text), storing them as separate images with separate levels of compression and/or resolutions, then combining them back together with a “mask”, in a way that PDF technology supports.

More information on MRC can be found on Wikipedia, this vendor’s marketing page, this other vendor’s marketing page, or the Internet Archive’s archive-pdf-tools README. Merlijn from Internet Archive also explains it in a conference presentation.

My sense is that MRC compression is more of a technique than an exact algorithm. Different implementations may do it differently, and have output that can be more or less successful. There can be bugs or areas for improvement, that can differ between tools. The different layers can be split purely by automated image processing, but also can use the (eg) hOCR file to identify regions with text that need higher fidelity than backgrounds.
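The “combine them back together with a mask” step can be sketched in a few lines of python. This is only a toy model of the idea, not how archive-pdf-tools or any vendor actually does it: images are plain lists of grayscale values, the function name and the `bg_factor` parameter are hypothetical, and only the background is stored at reduced resolution.

```python
def mrc_composite(mask, foreground, background, bg_factor=1):
    """Rebuild a page from MRC-style layers.

    mask and foreground are full-resolution grids; background is stored
    downsampled by bg_factor and is upsampled here by pixel repetition.
    """
    page = []
    for y, mask_row in enumerate(mask):
        row = []
        for x, is_foreground in enumerate(mask_row):
            if is_foreground:
                # Text/graphics pixel: take it from the sharp foreground layer.
                row.append(foreground[y][x])
            else:
                # Everything else: take it from the heavily reduced background.
                row.append(background[y // bg_factor][x // bg_factor])
        page.append(row)
    return page


mask = [[1, 0, 0, 0],
        [0, 0, 0, 1]]
foreground = [[0, 0, 0, 0],
              [0, 0, 0, 0]]      # black "ink" wherever the mask is set
background = [[200, 220]]        # a 4x2 paper texture stored as 2x1
print(mrc_composite(mask, foreground, background, bg_factor=2))
```

The savings come from the fact that the background, which covers most of the page, tolerates far more aggressive downsampling and compression than the text pixels do.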

I believe the Internet Archive’s archive-pdf-tools is the only functional open source implementation of MRC encoding.

One commercial tool that may do some kind of MRC compression is the suite of tools known as “GDPicture” (the company behind that has merged with competitor Orpalis making things even more confusing). They do advertise supporting MRC compression. I had a brief phone call with a sales engineer, who wasn’t super familiar with this feature but confirmed they had it, and gave me an overview of the products in general. There is a page at avepdf.com that is “powered by GDPicture MRC Compression SDK” that will let you apply MRC compression to existing PDFs for free… but only a couple an hour, so I haven’t managed to totally wrap my head around it. Hypothetically, then, the PassportPDF cloud SDK from the same company would give me access to “GDPicture MRC Compression” — but I haven’t yet managed to figure out how. (But see if you can at the API reference?). Figuring out what is available from proprietary projects can sometimes seem even more challenging than from open source.

The market-leader Abbyy also says they support MRC, including via an SDK? One of the first or most popular commercial tools to apply this technique may have been called “LuraTech”; I’m not sure of the current status of that software.

Evaluating Internet Archive recode_pdf, compared to alternatives

When I ran Internet Archive’s recode_pdf (with arg --bg-downsample 3 and otherwise default arguments) on full-resolution TIFFs, it resulted in PDFs that were about 6% the size of a PDF I made from a full-res JPG! Or about 50% the size of the PDFs we make from 1200px JPGs, still a significant reduction. Looking at them side-by-side… in one of my samples the MRC-compressed PDFs did have some visible artifacts, but text is still sharp. In two other cases, no visible artifacts.

I tried to test the free trial at avepdf.com — the extreme rate limit and cumbersome manual browser process made it hard to test a lot. I tested with PDFs that included lossless full-res PNG images, to avoid any lossy=>lossy quality issues. My initial reaction is that the text seems noticeably less sharp in the avepdf MRC-compressed PDF, even at “low” compression level — but if you zoom in, the text seems to get sharp again, which I don’t understand. My subjective impression of image quality is of course subjective, it’s hard to compare. avepdf MRC compression at “low” or “medium” compression seem to be approximately the same size as my recode_pdf output.

If we end up not using MRC, then our 1200px JPG PDFs would be maybe ~2x the size of the recode_pdf full-resolution MRC PDFs. I learned from Merlijn’s presentation that PDF actually supports embedded JPEG2000 (jp2) instead of JPEG (which their MRC technique uses), and that jp2 may compress smaller for the same quality. Switching to jp2 instead of jpeg and playing around with maximum compression without artifacts across my sample size… I can get my 1200px JPG PDFs to be about the same size as the recode_pdf full-res MRC compressed PDFs, although of course at reduced resolution.

note on dpi and PDF page size and variation

PDF as a format is based on a 72 dots-per-inch (dpi) standard grid, with objects sometimes measured in actual inches. (It was a format meant for encoding things to be printed physically!).

You can embed an image of a given resolution, say 500×500 pixels in a PDF, but say it should take up however many “inches” you want, and it will be scaled on display. And the page size can be a given number of “inches” high and wide, which will determine how big it displays on a screen in most viewers.

The TIFF format also has a dpi value embedded in it, which sort of says how big in inches the TIFF (or the thing photographed) was. Some of the tools I tested detected the dpi from the source TIFF, and used it to determine the PDF page size. Others did not, and used a default or guessed size.

Many tools allow you to pass a dpi argument that they will use to determine the “page size” in the resulting PDF. In my understanding this should not affect actual image resolution, or much of anything other than initial zoom level or size of page if printed. If it does with a given tool, I don’t understand what is going on.
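The arithmetic behind this is small enough to sketch in python. The function names are mine, and the sketch assumes a tool derives page size simply as pixels divided by dpi, then converts inches to the 72-per-inch PDF grid; a given tool may round or default differently.

```python
PDF_UNITS_PER_INCH = 72


def pdf_page_size(pixels_wide, pixels_high, dpi):
    """Page size implied by an image's pixel dimensions and a dpi value."""
    width_in = pixels_wide / dpi
    height_in = pixels_high / dpi
    return {
        "inches": (width_in, height_in),
        "pdf_units": (width_in * PDF_UNITS_PER_INCH, height_in * PDF_UNITS_PER_INCH),
    }


def implied_dpi(pixels, inches):
    """The dpi you effectively have when `pixels` are spread over `inches`."""
    return pixels / inches


# The same 1200x1500 pixel image, declared at two different dpi values:
print(pdf_page_size(1200, 1500, 150))  # a larger page
print(pdf_page_size(1200, 1500, 300))  # half the page size, same pixels
# 1200 pixels across an 8.5 inch wide page:
print(round(implied_dpi(1200, 8.5)))
```

Changing the dpi changes only the declared page size; the embedded image keeps exactly the same pixels.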

In my tests, I generally did not supply an explicit dpi value, to have one less knob to twiddle. So resulting PDF page sizes can vary.

Available Software to make text PDF from hOCR+images

Source Test Material and Methodology

To try out different tools and techniques, I started with three somewhat representative images from our collection.

A fairly ordinary page of single-column clear text from a book
A page where the photo has text more at an angle and contains figures and several text blocks
A graphical advert that has text headlines and blocks in several places and sizes

Note on embedded thumbnails: Our original TIFFs in our actual repository often have an embedded second image, a tiny ~100px thumbnail. (Did you know TIFFs can contain more than one image file?). This is something software involved in some of our photographing workflow at some points in history did without us totally knowing. It can really mess up various image tools, including some included in these tests (I had some really confusing errors at some points; thanks to @MerlijnWajer for helping me out). So the first thing I did was extract just the first image with vips copy original.tiff just_one_image.tiff (verified with imagemagick identify, which will tell you how many images are in there). (This may also have stripped some metadata, but preserved DPI metadata).
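If you want to check for this without imagemagick, a TIFF’s images are stored as a chain of “image file directories” that can be counted with a few lines of standard-library python. This is a sketch under assumptions: it handles classic TIFF only (not BigTIFF), the function name is mine, and it assumes the extra thumbnail is stored as a second top-level directory rather than in a nested sub-directory, which it would not see.

```python
import struct


def count_tiff_images(data):
    """Count top-level image file directories (IFDs) in classic TIFF bytes."""
    byte_order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if byte_order is None:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(byte_order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")
    count = 0
    seen = set()
    while offset != 0 and offset not in seen:  # `seen` guards against loops
        seen.add(offset)
        (num_entries,) = struct.unpack_from(byte_order + "H", data, offset)
        # Each directory entry is 12 bytes; the offset of the next
        # directory (or 0 for "no more") follows the last entry.
        (offset,) = struct.unpack_from(
            byte_order + "I", data, offset + 2 + 12 * num_entries
        )
        count += 1
    return count
```

Usage would be something like `count_tiff_images(open("original.tiff", "rb").read())`, expecting 2 before the vips step and 1 after.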

Tesseract — can create PDF with text layer directly

So you can ask tesseract to do the OCR and output PDF with text layers.

You have very little control of the raster image in the output: tesseract will convert your TIFF to a JPG (no control over JPG compression level), of the same resolution as the TIFF you used as input. This results in a pretty large PDF file. For our one-page samples: 3.5M-5M per page, which is a lot when we consider we will want PDFs for books hundreds of pages long.

You want to give tesseract the full-resolution TIFF for best OCR, but maybe want to use smaller files in the PDF. Or maybe you want to manually correct the OCR output before making a PDF?

One obvious option is having Tesseract generate an HOCR file with OCR-positional info, and using another tool to combine the HOCR with a raster image into a PDF. But, it turns out, no other tools I found actually render the tesseract-produced HOCR with text positioned as well as Tesseract itself does.

It made me wish tesseract had an option to take the HOCR (that you have perhaps edited), and combine it with images (of your choice of resolution and compression quality) to make a PDF, using tesseract’s superior HOCR-rendering. It turns out, things along those lines have been suggested, but rejected by tesseract developers who don’t want to get into the general business of creating PDFs.

Instead, they introduced a feature to create a “text-only” PDF — a very small PDF that actually only has the invisible glyphless text layer. The idea, as shown in that ticket, is that you can then use external tools to merge that with images or a PDF with raster images, to create the actual PDF with your choice of raster image and text layer.

I did get this to work pretty well, with qpdf as my merge utility. I merged the tesseract (invisible) text-only PDF with my “legacy” PDF of 1200px-wide JPGs, using these commands:

$ tesseract source.tiff source.tesseract_text_only -l eng -c textonly_pdf=1 pdf

$ qpdf image_only.pdf --underlay source.tesseract_text_only.pdf -- output_image_plus_text.pdf

One caveat — PDF pages have an inherent page size (usually expressed in inches, believe it or not). If the two PDFs you are merging are exactly the same size, that’s fine. If the text-only PDF is bigger than the image one (in PDF inches), that’s fine — that qpdf command will scale it down to match, and the output is just right. But if the text-only PDF is smaller, that qpdf command will just embed it in the middle, and the embedded invisible text won’t be properly aligned with the visible text on raster image.

You can supply dpi arguments to most PDF-creating utilities (including tesseract and recode_pdf below), which basically just affect the PDF inches size set on the resulting PDF. So you just want to make sure to do this to ensure the text-only PDF is larger in PDF-inches.
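Picking a safe dpi for the text-only PDF is just the page-size arithmetic run backwards. A hedged python sketch: the function name is hypothetical, it assumes the text-only PDF’s page size comes out as TIFF pixels divided by the dpi you supply, and it takes the image PDF’s page size in PDF units (72 per inch) as an input you have already looked up.

```python
def max_text_layer_dpi(tiff_pixels, image_pdf_size):
    """Largest dpi at which the text-only page is still at least as big
    as the image PDF page in both dimensions.

    tiff_pixels:    (width, height) of the TIFF given to the OCR step
    image_pdf_size: (width, height) of the image PDF page, in PDF units
    """
    (px_w, px_h), (pt_w, pt_h) = tiff_pixels, image_pdf_size
    # dpi = pixels / inches, with inches = PDF units / 72
    return min(px_w / (pt_w / 72), px_h / (pt_h / 72))


# A 3000x3750 pixel TIFF, and an image PDF whose page is 576x720 units (8x10 in):
print(max_text_layer_dpi((3000, 3750), (576, 720)))
```

Any dpi at or below the returned value keeps the text-only page at least as large as the image page, which is the case the underlay command above handles well.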

This works, but doesn’t accommodate the use case where we might want to correct errors in the OCR by editing an HOCR file, before producing the PDF. I haven’t found any way to take advantage of tesseract’s superior layout of OCRd text in the PDF, while correcting the OCR content before the PDF is produced. You can of course edit the PDF directly, but this is cumbersome, and doesn’t get you a corrected HOCR file you can use for other purposes too.

I’m probably going to need HOCR anyway for other purposes. You can have tesseract produce PDF and HOCR in one go if you want. (Btw it turns out tesseract can also produce alto although I’m not sure where this is documented).

tesseract dhc6a4r.tiff scratch/test.tesseract -l eng -c textonly_pdf=1 pdf hocr

Beware that if you produce individual tesseract PDFs with text content and try to combine them… you’ll wind up with duplicate copies of tesseract’s “glyphless font” embedded, one per source PDF. I haven’t found a good way to merge/de-duplicate them, but I think the embedded glyphless font is only 572 bytes?

Other tools can take the hocr tesseract produces, and use it to position a text layer on a PDF… with mixed results. None currently do as well as tesseract’s own PDF positioning. It turns out going from tesseract hOCR to correctly positioned text on PDF is not a trivial operation?

archive-pdf-tools: recode_pdf — a sophisticated, and supported, tool

The Internet Archive’s archive-pdf-tools is a currently maintained, well-written package in python, extracted from their own workflow and shared. It grew out of an effort at the Archive, begun in 2020, to move to an open source pipeline.

The recode_pdf command takes a TIFF and HOCR, and renders a PDF with text layer, and compressed with the sophisticated MRC compression. It may be the only open-source implementation of MRC compression technique.

It has quite a few non-python dependencies. Installation directions specified for ubuntu worked well for me on ubuntu. One C dependency, jbig2enc, does not exist in the standard Ubuntu package manager. It built from source fine for me on ubuntu, but that gives me some challenges for trying to get it installed on heroku. jbig2enc also has a non-standard-location apt package and a snap, as well as a brew package (I think from former Code4Libber Misty De Meo?). jbig2enc appears mostly unmaintained (although it does have occasional trivial new PR merged, it’s not totally abandoned); but also appears to have a variety of forks out there with different bugfixes/patches, so I’m not sure all those sources are actually the same code!

I am having a bit of trouble installing archive-pdf-tools reliably on MacOS, but that may be corrected soon or may be my own fault.

recode_pdf’s rendering of text from the hOCR file delivered by tesseract is currently not as good as the rendering tesseract itself does when making PDFs for my samples. I describe my observations in this issue filed at archive-pdf-tools.

In fact, archive-pdf-tools’s HOCR rendering is ported from tesseract (and writes PDFs directly with raw bytes, not using a PDF library of any kind). So why isn’t its rendering/positioning as good? Not yet clear.

This inferior HOCR rendering is unfortunate, because this is otherwise for sure the most mature/supported open-source HOCR rendering solution I found, which does do a better job of positioning than any other open source code I found. It’s also the only working open-source MRC compression implementation I found.

It was interesting to see the MRC compression. The output PDFs, which have as many pixels as our full-size source images (but under increased lossy compression), for the most part really do look just as good as much larger bytesize PDFs, while being very small on disk. (There are compression artifacts in some samples though). The archive-pdf-tools MRC-compressed PDFs are about 10% of the size of tesseract’s PDFs created with full-size JPGs. For our two high-text images they were about 50% of the size of our 1200px wide JPG PDFs; for the graphical image with less text, it was about 80% the size of our 1200px-wide JPG PDF.

As this is the only open-source implementation known for MRC compression, it would be nice to be able to apply it de-coupled from the HOCR rendering. There has been some discussion and work on creating a pdfcomp executable for this, but it seems to still be ongoing. I have not managed to figure out how to test it myself yet. (It’s not clear to me if you are going to have quality problems giving it PDF input that is already JPG lossy compressed, or if this won’t matter in the end).

recode_pdf --bg-downsample 3 --from-imagestack source.tiff --hocr-file  source.tesseract.hocr -o output.recode_pdf.pdf

While I was running only on one page at a time, I believe if you are running on multiple pages, recode_pdf wants a single HOCR file, with multiple pages, in the right order to correspond to the order of TIFF input arguments.

(Note, it turns out you CAN use recode_pdf without jbig2enc, by telling it to use a different inferior compression algorithm, with --mask-compression ccitt. For my three samples, this resulted in 13-25% larger file output. In the case of the mostly graphical one only, it made the PDF output larger than my 1200px JPG output.)

ocropus/hocr-tools: hocr-pdf (python) — has some problems

The hocr-tools package in python includes an hocr-pdf command that is intended to combine HOCR and JPGs to produce PDFs with text layers.

I installed hocr-tools 1.3.0 on my MacOS laptop with simple pip install hocr-tools.

The way hocr-pdf takes its input is a bit confusing: you need to run it on a directory which includes only source files, where a corresponding JPG and HOCR have the same name but for suffix. (JPEG must end in .jpg not .jpeg!)

hocr-pdf ./directory > output.pdf

The apache-licensed source code creates a PDF using a python PDF generation library, which is different from some code (such as archive-pdf-tools) that writes raw PDF bytes. So it may make it a good place to look to understand the/an algorithm, possibly for porting to another language, if you want to use a PDF library rather than write raw PDF bytes. I considered this at one point; I’m not sure if (eg) ruby’s prawn has analogous features to all those being used from the python PDF library, or how hard it would be.

It did not like it when I tried using with JPG with different smaller resolution than the TIFF the HOCR was created from — it produced wrongly scaled output. There are some tools/scripts available to resize HOCR coordinates (javascript, ruby), that I believe would be what you’d need to do this.
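For a sense of what such a resize script has to do, here is a minimal python sketch that scales every `bbox` property in an hOCR string by a constant factor. It is not one of the linked tools, and it is deliberately incomplete: other pixel-based properties that tesseract emits in title attributes (I believe things like `baseline` offsets and `x_size`) are left untouched, and a regex over the whole document would also rewrite any OCR’d text that happened to look like a bbox. The function name is hypothetical.

```python
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def scale_hocr_bboxes(hocr, factor):
    """Multiply every 'bbox x0 y0 x1 y1' in the hOCR text by `factor`."""

    def _scale(match):
        scaled = (round(int(n) * factor) for n in match.groups())
        return "bbox " + " ".join(str(n) for n in scaled)

    return BBOX_RE.sub(_scale, hocr)


word = "<span class='ocrx_word' title='bbox 300 400 520 460; x_wconf 96'>The</span>"
# e.g. hOCR made from a 2400px-wide TIFF, to be used with a 1200px-wide JPG:
print(scale_hocr_bboxes(word, 1200 / 2400))
```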

To begin with as a demonstration, though, I just used it with a full-size JPG converted from the source TIFF at same resolution.

I did not get great results — the page sizes were weird. For the standard and graphical pages, the image was cut off, not entirely in the PDF — it seems to insist on 8.5″/11″ aspect ratio/page size? For the intermediate “diagonal” page, the page just took up a portion of the canvas, it was too small. The text still did line up with the image, but it seems like perhaps some assumptions about DPI we are not meeting, or other bugs in how the tool calculates PDF page size. I have not yet spent time to report these problems on Github Issues, because other problems encountered probably make this tool unsuitable for me anyway.

In all cases, the HOCR rendering is… OK. I would say it is about the same quality as archive-pdf-tools’s, although it is not identical to archive-pdf-tools, even from the same HOCR file. Apparently HOCR positioning is non-trivial.

On the “diagonal” page, hocr-pdf didn’t make the lines too high like recode_pdf — but it seemed incapable of including angled lines at all, the lines are rendered straight, which makes them not match up with the actual image text. (Try selecting the line “The liquor…” at the bottom). This seems to make it pretty unsuitable for use with our actual input corpus.

hocr-pdf also strangely bloated the size of resulting PDFs. Creating a PDF from a JPG that was 3.3M, the resulting PDF was 4.1M! (Compare to tesseract-produced PDF of 3.5M, which makes sense, adding just 200K for textual info). And the PDFs it created generated lots of warning-complaints from poppler and other pdftools.

eloops/hocr2pdf (js) — proof of concept without great rendering

When looking for any open source HOCR rendering code I could find, I found this package on github. At the time I found it, it hadn’t had a commit in many years, and from the commit history and repo activity it was unclear if it had ever really been used at scale, and it didn’t have a license on it. At that early point, if it was working code in Javascript (which uses a PDF-generating library instead of writing raw PDF bytes), I was potentially interested in porting its logic to ruby.

I got the author’s email address from the commit history, and emailed them to inquire. Stephen Poole kindly got back to me to confirm this was basically a proof-of-concept that was never used for real work. Stephen kindly added an MIT license in case I wanted to use it.

Curious, I wanted to test it on my test images and hocr. I was able to get it to work after realizing it needed an old version of the cheerio HTML parser, and fixing up the example in the README.

It didn’t do a great job of rendering. Trying to highlight-select lines, it was often impossible to select a line continuously, perhaps because the words on the line ended up with very different heights and baseline positions. It was not able to render the diagonal text diagonally in the diagonal example. (Try selecting “This effect, especially as regards purples” in the diagonal file to see both issues).

An interesting example, mostly demonstrating that positioning rendered HOCR even as well as archive-pdf-tools does is not necessarily trivial.

Exactimage hocr2pdf — didn’t work for me at all

At first I imagined I was going to find a compiled executable available through package managers that simply combined hocr and images to make a PDF, as if this were a normal thing.

At first that’s what it looked like the ExactImage hocr2pdf tool was. Available via “brew install exactimage” or “apt-get install exactimage”.

The problem is… it didn’t work for me.

At first I had trouble getting it to take my inputs at all, it said “Error reading input file.” If I opened the TIFF in MacOS preview and re-exported as a TIFF again, I could get it to read it.

But it produced weird PDFs with no scanned images at all, and just a portion of the HOCR text rendered visibly in giant font.

It is an old package that doesn’t seem to be getting maintenance; the docs suggest it was written for use with HOCR from the (also non-maintained) cuneiform OCR package. Either I don’t understand how to use it, or HOCR has changed over time/between vendors that it can’t handle contemporary tesseract HOCR.

hocr2pdf -i scan.tiff -o test.pdf < ocr.hocr

pdfbeads (ruby) — a historical artifact, of unclear current utility

Researching this stuff, I found mention of this mythical project “pdfbeads”, which was written, in ruby, over 10 years ago, and appeared to be targeted at creating “ebook” PDFs from scans. There was a lot of energy in this domain back then, and at one point this was a well-known package with implementations of some things not found elsewhere.

It did/does both HOCR rendering and a kind of compression that seems to be similar to MRC, if not being MRC, although it’s not referred to as such in pdfbeads code or docs.

I am not certain when it was first written, because it was originally in a “rubyforge” repo, and rubyforge is gone, along with its commit history and discussions that were there, which is sad. Some “forks” of pdfbeads exist on github, but none of them copied history from the original rubyforge (svn?) repo. Some claim to do things like “update for ruby 2.0”. For instance d235j/pdfbeads (which has a version number of 1.1.1), and ifad/pdfbeads (which has a version number of 1.0.11).

OK, the weird thing is… pdfbeads got a rubygems release 1.1.3 in Jan 2022, only a year ago, the first rubygems release since 2014. I have no idea if the repo this release came from is public, or really where to find the code for this release (other than in the gem package); rubygem metadata for “homepage” still points to rubyforge!

But a CHANGELOG file is captured in the rubygem package, which rubygems.org conveniently shows us in a diff, so we can see what features have been added/changed in the latest release.

The READMEs found in all those locations do have an email address for the pdfbeads author, Alexey Kryukov. I tried emailing him for info (and if there is a public repo), but haven’t heard back.

I was initially interested in pdfbeads because I thought it might have a useful ruby implementation of HOCR rendering (writing direct raw PDF bytes, it looks like), and because I thought it might be the only other identified open source implementation of MRC-style compression!

pdfbeads input methods are kind of confusing — not sure if it wants an HOCR file per image, or one combined one like archive-pdf-tools. I tested it on just one image/hocr at a time. Input files can’t have more than one . (period) in them, which had me stuck for a bit. It will leave a lot of intermediate files around, so is best run in a scratch or per-work directory. Using latest 1.1.3 release from rubygems.

 pdfbeads dhc6a4r.tiff dhc6a4r.hocr -f -o dhc6a4r.pdfbeads.pdf

Whatever compression it’s supposed to be doing isn’t working at all for me. That output a 15M PDF, which is 5x the size of the PDF tesseract outputs from the same TIFF input! So… negative compression for me? Extracting images from the produced PDF shows it was making multiple image overlays MRC-style (and that it decided to downsample the pixel resolution from the source TIFF, by different factors for different images, maybe depending on DPI), but I guess its algorithm just didn’t work well with my input? Maybe it expects black and white input only?

There is probably something I don’t understand about how it is intended to be used. I have found it hard to find instructions/documentation (here’s an HTML doc page at some historical version?), and hard for me to understand.

The HOCR rendering was okay on some pages, but had some serious problems on others. On our image test #2, with “diagonal” text, the diagonal angle of the rendered lines was correct, but they were wrongly vertically offset from their true positions by about half a line? And our #3 graphical image, the line beginning “for home users” was just plain missing, although other lines were positioned well?

Overall, I’m not sure what’s going on with this code.

Ocrmypdf — a high-level tool for adding OCR to PDFs, usually with tesseract

For completeness, I thought I’d mention Ocrmypdf, because it is something that’s actually still maintained/developed (which seems to be unusual in this field!), with a lot of functionality.

It seems focused on the use case of having PDFs of scans, say from a photocopy machine, and wanting to have a “just works” tool that takes that as input and leaves you with a text layer. It’s sort of a high-level integration of lots of other tools to try to give you this one-click solution. It itself is written in python.

While it does have its own python implementation of HOCR rendering/positioning, by default it uses a “sandwich” mode to have tesseract position the OCR’d text, and does not use its own HOCR renderer. It does say its own HOCR renderer “has the best compatibility with Mozilla’s PDF.js viewer”, but also warns it doesn’t currently handle non-Latin Unicode properly.

It does not do MRC compression, but is interested in it, and began talking to Merlijn from Internet Archive about it, which led to the archive-pdf-tools attempts at the pdfcomp tool.

I didn’t spend too much time actually investigating this, when I saw that it by default just used tesseract for text rendering, and didn’t implement MRC. I haven’t actually tested its built-in HOCR rendering; I only just noticed now that OcrMyPDF docs suggest you might want to use it for “better compatibility with Mozilla’s PDF.js viewer”?

jbrinley/HocrConverter (python) — one more

I didn’t take the time to actually play with this one yet, but for completeness — former Code4Libber Jon Brinley has some 13-year-old python code at https://github.com/jbrinley/HocrConverter/blob/master/HocrConverter.py, which also links to a blog post of his at https://xplus3.net/2009/04/02/convert-hocr-to-pdf/

A copyright notice in OcrMyPDF source for Jon Brinley suggests maybe their implementation came first from here? Maybe.

10-15 years ago, people were doing a lot of work in this area that just kind of… stalled out?

OCFL and “source of truth” — two options

Some great things about conferences are how different sessions can play off each other, and how lots of people interested in the same thing are in the same place (virtual or real) at the same time, to bounce ideas off each other.

I found both of those things coming into play to help elucidate what I think is an important issue in how software might use the Oxford Common File Layout (OCFL). Prompted by the Code4Lib 2023 session The Oxford Common File Layout – Understanding the specification, institutional use cases and implementations, with presentations by Tom Wrobel, Stefano Cossu, and Arran Griffith. (recorded video here).

OCFL is a specification for laying files out in a disk-like storage system, in a way that is suitable for long-term preservation. It uses a standard simple layout that is both human- and machine-readable, and would allow someone (some software) at a future point to reconstruct digital objects and metadata from the record left on disk.

The role of OCFL in a software system: Two choices

After the conference presentation, Matt Lincoln from JStor Labs asked a question in Slack chat that had been rising up in my mind too, but which Matt said more clearly than was in my mind at the time! This prompted a discussion on Slack, largely but not entirely between me and Stefano Cossu, which I found to be very productive, and which I’m going to detail here with my own additional glosses, but first let’s start with Matt’s question.

(I will insert slack links to quotes in this piece; you probably can’t see the sources unless you are a member of the Code4Lib workspace).

For the OCFL talk, I’m still unclear what the relationship is/can/will be in these systems between the database supporting the application layer, and the filesystem with all the OCFL-laid-out objects. Does DB act as a source of truth and OCFL as a copy? OCFL as source of truth and DB as cache? No db at all, and just r/w directly to OCFL? If I’m a content manager and edit an item’s metadata in the app’s web interface, does that request get passed to a DB and THEN to OCFL? Is the web app reading/writing directly to the OCFL filesystem without mediating DB representation? Something else?
Matt Lincoln

I think Matt, utilizing the helpful term “source of truth”, accurately identifies two categories of use of OCFL in a software system — and in fact, that different people in the OCFL community — even different presenters in this single OCFL conference presentation — had been taking different paths, and maybe assuming that everyone else was on the same page as them, or at least not frequently drawing out the difference and consequences of these two paths.

Stefano Cossu, one of the presenters from the OCFL talk at Code4Lib, described it this way in a Slack response:

IMHO OCFL can either act as a source from which you derive metadata, or a final destination for preservation derived from a management or access system, that you don’t want to touch until disaster hits. It all depends on how your ideal information flow is. I believe Fedora is tied to OCFL which is its source of truth, upon which you can build indices and access services, but it doesn’t necessarily need to be that way.
Stefano Cossu

It turns out that both paths are challenging in different ways; there is no magic bullet. I think this is a foundational question for the software engineering of systems that use OCFL for preservation, with significant implications on the practice of digital preservation as a whole.

First, let’s say a little bit more about what the paths are.

“OCFL as a source of truth”

If you are treating OCFL as a “source of truth”, the files stored in OCFL are the primary location of your data.

When the software wants to add, remove, or change data, it will probably happen to the OCFL first, or at any rate won’t be considered a successful change until it is reflected in OCFL.

There might be other layers on top providing alternate access to the OCFL, some kind of “index” to OCFL for faster and/or easier access to the data, but these are considered “derivative”, and can always be re-created from just the OCFL. The OCFL is “the data”, everything else is “derivative” and can be re-created by an automated process from the OCFL on disk.

This may be what some of the OCFL designers were assuming everyone would do; as we’ll see, it makes certain things possible, and provides the highest level of confidence in our preservation activities.

“OCFL off to the side”

Alternately, you might write an application more or less using standard architectures for writing (eg) web applications. The data is probably in a relational database system (rdbms) like postgres or MySQL, or some other data store meant for supporting application development.

When the application makes a change to the data, it’s made to the primary data store.

Then the data is “mirrored” to OCFL. Possibly after every change, or possibly periodically. The OCFL can be thought of as a kind of “backup”: a backup in a specific standard format meant to support long-term preservation and interoperability. I’m calling this “off to the side”, Stefano above calls it “final destination”, in either case contrasted with “source of truth”.

It’s possible you haven’t stored all the data the application uses to OCFL, only the data you want to back up “for long-term preservation purposes”. (Stefano later suggests this is their practice, in fact). Maybe there is some data you think is necessary only for the particular present application’s functionalities (say, to support back-end accounts and workflows), which you think of as accidental, ephemeral, contextual, or system-specific and non-standard, and which you don’t see any use in storing for long-term preservation.

In this path, if ALL you have is the OCFL, you aren’t intending that you can necessarily stand your actual present application back up: maybe you didn’t store all the data you’d need for that; maybe you don’t have existing software capable of translating the OCFL back to the form the application actually needs it in to function. Or if you are intending that, the challenge is greater to accomplish it, as we’ll see.

So why would you do this? Well, let’s start with that.

Why not OCFL as a source of truth?

There’s really only one reason — because it makes application development a lot harder. What do I mean by “a lot harder”? I mean, it’s going to take more development time, and more development care and decisions, you’re going to have more trouble achieving reasonable performance in a large-scale system — and you’re going to make more mistakes, have more bugs and problems, more initial deliveries that have problems. It’s not all “up-front” cost or known cost, but as you continue to develop the system, you’re going to keep struggling with these things. You honestly have increased chance of failure.

Why?

In the Slack thread, Stefano Cossu spoke up for OCFL to be a “final destination”, not the “source of truth” for the daily operating software:

I personally prefer OCFL to be the final destination, since if it’s meant to be for preservation, you don’t want to “stir” the medium by running indexing and access traffic, increasing the chances of corruption.
Stefano Cossu

If you’re using it as the actual data store for a running application, instead of leaving it off to the side as a backup, it perhaps increases the chances of bugs affecting data reliability.

The problem with that setup [OCFL as source of truth] is that a preservation system has different technical requirements from an access system. E.g. you may not want store (and index) versioning information in your daily-churn system. Or you may want to use a low-cost, low-performance medium for preservation
Stefano Cossu

OCFL is designed to rebuild knowledge (not only data, but also the semantic relationships between resources) without any supporting software. That’s what I intend for long-term preservation. In order to do that, you need to serialize everything in a way that is very inefficient for daily use.
Stefano Cossu

The form that OCFL prescribes is cumbersome to use for ordinary daily functionality. It makes it harder to achieve the goals you want for your actually running software.

I think Stefano is absolutely right about all of this, by the way, and also thank him for skillfully and clearly delineating a perspective that may, explicitly or not, actually be somewhat against the stream of some widespread OCFL assumptions.

One aspect of the cumbersomeness is that writes to OCFL need to be “synchronized” with regard to concurrency. The contents of a new version written to OCFL are expressed as deltas on the previous version, so if another version is added while you are preparing your own additional version, your version will be wrong. You need to use some form of locking, whether optimistic or naive pessimistic locks.

Whereas a relational database system is built on decades of work to ensure ACID (atomicity, consistency, isolation, durability) with regard to writes, while also trying to optimize performance within these constraints (which can be a real tension) — with OCFL we don’t have the built-up solutions (tools and patterns) for this to the same extent.
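To make the optimistic flavor of locking concrete, here is a minimal sketch. It is not an OCFL library: SketchObjectStore, VersionConflict, and commit are hypothetical names, and an in-memory array stands in for the versions of one OCFL object. The only point is the shape of the check: a writer remembers which head version its delta was computed against, and the commit is refused if the head has moved since.

```ruby
# Hypothetical in-memory stand-in for the versions of ONE OCFL object.
# Nothing here reads or writes a real OCFL storage root.
class VersionConflict < StandardError; end

class SketchObjectStore
  def initialize
    @versions = []      # each entry is the full state hash for one version
    @mutex = Mutex.new  # guards only the short check-and-append step
  end

  # Number of the newest version; 0 means nothing has been written yet.
  def head
    @versions.length
  end

  def current_state
    @versions.last || {}
  end

  # Append a new version whose changes were prepared against version `based_on`.
  # Raises VersionConflict if another version was added in the meantime.
  def commit(based_on:, changes:)
    @mutex.synchronize do
      unless based_on == head
        raise VersionConflict, "expected head v#{based_on}, found v#{head}"
      end
      @versions << current_state.merge(changes)
      head
    end
  end
end

store = SketchObjectStore.new
seen = store.head                                   # two writers both read v0
store.commit(based_on: seen, changes: { "title" => "First" })

begin
  store.commit(based_on: seen, changes: { "title" => "Second" })
rescue VersionConflict => e
  puts "second writer must re-read and retry: #{e.message}"
end
```

A real implementation would also have to make the check-and-append step atomic across processes and machines, which is exactly the kind of thing an rdbms already gives you.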

Application development gets a lot harder

In general, building a (say) web app on a relational database system is a known problem with a huge corpus of techniques, patterns, shared knowledge, and toolsets available. A given developer may be more or less experienced or skilled; different developers may disagree on optimal choices in some cases. But those choices are being made from a very established field, with deep shared knowledge on how to build applications rapidly (cheaply), with good performance and reliability.

When we switch to OCFL as the primary “source of truth” for an app, we in some ways are charting new territory and have to figure out and invent the best ways to do certain things, with much less support from tooling, the “literature” (even including blogs you find on google etc), and a much smaller community of practice.

The Fedora repository platform is in some sense meant to be a kind of “middleware” to make this lift easier. In its version 6 incarnation, its own internal data store is OCFL. It doesn’t give you a user-facing app. It gives you a “middleware” you can access over a more familiar HTTP API with clear semantics, and you don’t have to deal with the underlying OCFL (or in previous incarnations other internal formats) yourself. (Seth Erickson’s ocfl_index could be thought of as similar peer “middleware” in some ways, although it’s read-only; it doesn’t provide for writing).

But it’s still not the well-trodden path of rapid web application development on top of an rdbms.

I think that the samvera (née hydra) community really learned this to some extent the hard way: trying to build on top of this novel architecture really raised the complexity, cost, and difficulty of implementing the user-facing application (with implications on succession, hiring, and retention too). I’m not saying this happened because the Fedora team did something wrong, I’m saying a novel architecture like this inherently and necessarily raises the difficulty over a well-trodden architectural path. (Although it’s possible to recognize the challenge and attempt to ameliorate it with features that make things easier on developers, it’s not possible to eliminate it.)

Some samvera peer institutions have left the Fedora-based architecture, I think as a result of this experience. Where I work at Science History Institute, we left sufia/hydra/samvera to write something closer to a “just plain Rails app”, and I believe it successfully and seriously increased our capacity to meet organizational and business needs within our available software engineering capacity. I personally would be really reluctant to go back to attempting to use Fedora and/or OCFL as a “source of truth”, instead of more conventional web app data storage patterns.

So… that’s why you might not… but what do you lose?

What do you lose without OCFL as source of truth?

The trade-off is real though. I think some of the assumptions about what OCFL provides are actually based on assumptions of OCFL as source of truth in your application.

Mike Kastellec’s Code4Lib presentation just before the OCFL one, on How to Survive a Disaster [Recovery] really got me thinking about backups and reliability.

Many of us have heard (or worse, found out ourselves the hard way) the adage: You don’t really know if you have a good backup unless you regularly go through the practice of recovery using it, to test it. Many have found that what they thought was their backup — was missing, was corrupt, or was not in a format suitable for supporting recovery. Because they hadn’t been verifying it would work for recovery, they were just writing to it but not using it for anything.

(Where I work, we try to regularly use our actual backups as the source of sync’ing from a production system to a staging system, in part as a method of incorporating backup recovery verification into our routine).

How is a preservation copy analogous? If your OCFL is not your source of truth, but just “off to the side” as a “preservation copy” — it can easily be a similar “write-only” copy. How do you know what you have there is sufficient to serve as a preservation copy?

Just as with backups, there are (at least) two categories of potential problem: It could be there are bugs in your synchronization routines, such that what you thought was being copied to OCFL was not, or not on the schedule you thought, or was getting corrupted or lost in transit. But the other category, even worse — it could be that your design had problems, and what you chose to sync to OCFL left out some crucial things that these future consumers of your preservation copy would have needed to fully restore and access the data. Stefano also wrote:

We don’t put everything in OCFL. Some resources are not slated for long-term preservation. (or at least, we may not in the future, but we do now)

If you are using the OCFL as your daily “source of truth”, you at least know the data you have stored in OCFL is sufficient to run your current system. Or at least you haven’t noticed any bugs with it yet, and if anyone notices any you’ll fix them ASAP.

The goal of preservation is that some future system will be able to use these files to reconstruct the objects and metadata in a useful way… It’s good to at least know it’s sufficient for some system, your current system. If you are writing to OCFL and not using it for anything… it reminds us of writing to a backup that you never restore from. How do you know it’s not missing things, by bug or by misdesign?

Do you even intend the OCFL to be sufficient to bring up your current system (I think some do, some don’t, some haven’t thought about it), and if you do, how do you know it meets your intents?

OCFL and Completeness and Migrations

The OCFL web page lists as one of its benefits (which I think can also be understood as design goals for OCFL):

Completeness, so that a repository can be rebuilt from the files it stores

If OCFL is your application’s “source of truth”, you have this necessarily, in the sense of that almost being the definition of OCFL being the “source of truth”. (Maybe suggesting at least some OCFL designers were assuming it as source of truth).

But if your OCFL is “off to the side”… do you even have that? I guess it depends on if you intended the OCFL to be transformable back to your application’s own internal source of truth, and if that intention was successful. If we’re talking about data from your application being written “off to the side” to OCFL, and then later transformed back to your application — I think we’re talking about what is called “round-tripping” the data.

There was another Code4Lib presentation about repository migration at Stanford. In the Slack discussion happening about that presentation, Stanford’s Justin Coyne and Mike Giarlo wrote:

I don’t recommend “round trip mappings”. I was a developer on this project. It’s very challenging to not lose data when going from A -> B -> A
Justin Coyne

We spent sooooo much time on getting these round-trip mappings correct. Goodness gracious.
Mike Giarlo

So, if you want to make your OCFL “off to the side” provide this quality of completeness via round-trippability, you probably have to be focusing on it intentionally, and then it’s still going to be really hard, maybe one of the hardest (most time-consuming, most buggy) aspects of your application, or at least its persistence layer.

I found this presentation about repository migration really connecting my neurons to the OCFL discussion generally. When I thought about this I realized, well, that makes sense, woah, is one description of “preservation” activities actually: a practice of trying to plan and provide for unknown future migrations not yet fully spec’d?

So, while we were talking about repository migrations on Slack, and how challenging the data migrations were (several conf presentations dealt with data migrations in repositories) Seth Erickson made a point about OCFL:

One of the arguments for OCFL is that the repository software should upgradeable/changeable without having to migrate the data… (that’s the aspiration, anyways)
Seth Erickson

If the vision is that with nothing more than an OCFL storage system, we can point new software to it and be up and running without a data migration — I think we can see this is basically assuming OCFL as the “source of truth”, and also talking about the same thing the OCFL webpage calls “completeness” again.

And why is this vision aspirational? Well, to begin with, we don’t actually have very many repository systems that use OCFL as a source of truth. We may only have Fedora — that is, systems that use Fedora as middleware. Or maybe ocfl_index too, although it being only read-only and also middleware that doesn’t necessarily have user-facing software built on it yet, it’s probably currently a partial entry at most.

If we had multiple systems that could already do this, we’d be a lot more confident it would work out — but of course, the expense and difficulty of building a system using OCFL as the “source of truth” is probably a large part of why we don’t!

OK, do we at least have multiple systems based on fedora? Well… yes. Even before Fedora was based on OCFL, it would hypothetically be possible to upgrade/change repository software without a data migration if both source and target software were based on Fedora… except, in fact, it was not possible to do this between Samvera sufia/hydra and Islandora, despite both being based on fedora, because even though they both used fedora, their metadata stored in Fedora (or OCFL) was not consistent. A whole giant topic we’re not going to cover here, except to point out it’s a huge challenge for that vision of “completeness” providing for software changes without data migration, a huge challenge that we have seen in practice, without necessarily seeing a success in practice. (Even within hyrax alone, there are currently two different possible fedora data layouts, using traditional activefedora with “wings” adapter or instead valkyrie-fedora adapter, requiring data migration between them!)

And if we think of the practice of preservation as being trying to maximize chances of providing for migration to future unknown systems with unknown needs… then we see it’s all aspirational (that far-future digital preservation is an aspirational endeavor is of course probably not a controversial thing to say either).

But the little bit of paradox here is that while “completeness” makes it more likely you will be able to easily change systems without data loss, the added cost of developing systems that achieve “completeness” via OCFL as “source of truth” means — you will probably have much fewer, if any, choices of suitable systems to change to, or resources available to develop them!

So… what do we do? Can we split the difference?

I think the first step is acknowledging the issue, the tension here between completeness via “OCFL as source-of-truth” and, well, ease of software development. There is no magic answer that optimizes everything, there are trade-offs.

That quality of “completeness” of data (“source of truth”) is going to make your software much more challenging to develop. Take longer, take more skill, have more chance of problems and failures. And another way to say this is: Within a given amount of engineering resources, you will be delivering fewer features that matter to your users and organization, because you are spending more of your resources on implementing on a more challenging architecture.

What you get out of this is aspirationally increased chances of successful preservation. This doesn’t mean you shouldn’t do it, digital preservation is necessarily aspirational. I’m not sure how one balances this cost and benefit (it might well be different for different institutions), but I think we should be careful not to be routinely under-estimating the cost or over-estimating the size or confidence of benefits from the “source of truth” approach. Undoubtedly many institutions will still choose to develop OCFL as a source of truth, especially using middleware intended to ease the burden, like Fedora.

I will probably not be one of them at my current institution — the cost is just too high for us, we can’t give up the capacity to relatively rapidly meet other organizational and user needs. But I’d like to look at incorporating OCFL as “off to the side” preservation copy anyway in the future.

(And Stefano and I are definitely not the only ones considering this or doing it. Many institutions are using an “off to the side” “final destination” approach to preservation copies, if not with OCFL, then with some of its progenitors or peers like BagIt or Stanford’s MOAB. The “off to the side” approach is not unusual, and for good reasons! We can acknowledge it and talk about it without shame!)

If you are developing instead with OCFL as a “off to the side” (or “final destination”), are there things you can do to try to get closer to the benefits of OCFL as “source of truth”?

The main thing I can think of involves “round-trippability”

Yes, commit to storing all of your objects and metadata necessary to restore a working current system in your OCFL
And commit to storing it round-trippably
One way to ensure/enforce this would be — every time you write a new version to OCFL, run a job that serializes those objects and metadata to OCFL, and back to your internal format, and verify that it is still equivalent. Verify the round-trip.
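The round-trip verification idea in the list above can be sketched in a few lines. This is only an illustration under loud assumptions: serialize_for_preservation and restore_from_preservation are hypothetical names, and a plain JSON string stands in for whatever would actually be written into OCFL content files. What it shows is the check itself: go internal → preserved → internal, and compare with what you started with.

```ruby
require 'json'

# Hypothetical serializer pair. A real application would write to and read
# from files in an OCFL object instead of an in-memory JSON string.
def serialize_for_preservation(record)
  JSON.generate(record)
end

def restore_from_preservation(serialized)
  JSON.parse(serialized)
end

# True when the record survives internal -> preserved -> internal unchanged.
def round_trips?(record)
  restore_from_preservation(serialize_for_preservation(record)) == record
end

lossless = { "title" => "Lab notebook", "pages" => 12 }

# Symbol keys and Time values do not come back from JSON as what went in,
# so this record is silently altered by the trip.
lossy = { title: "Lab notebook", created: Time.at(0) }

puts round_trips?(lossless)  # => true
puts round_trips?(lossy)     # => false
```

The second record is the interesting one: nothing raises, the data just comes back different, which is the kind of loss you only find if you actually run the round trip.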

Round-trippability doesn’t just happen on its own, and ensuring it will definitely significantly increase the cost of your development. As the Stanford folks said from experience, round-trippability is a headache and a major cost! But it could conceivably get you a lot of the confidence in “completeness” that “source of truth” OCFL gets you. And as it still is “off to the side”, it still allows you to write your application using whatever standard (or innovative in different directions) architectures you want; you don’t have the novel data persistence architecture design involved in all of your feature development to meet user and business needs.

This will perhaps arrive at a better cost/benefit balance for some institutions.

There may be other approaches or thoughts, this is hopefully the beginning of a long conversation and practice.

Escaping/encoding URI components in ruby 3.2

Thanks to zverok_kha’s awesome writeup of Ruby changes, I noticed a new method released in ruby 3.2: CGI.escapeURIComponent

This is the right thing to use if you have an arbitrary string that might include characters not legal in a URI/URL, and you want to include it as a path component or part of the query string:

require 'cgi'

url = "https://example.com/some/#{ CGI.escapeURIComponent path_component }" + 
  "?#{CGI.escapeURIComponent my_key}=#{CGI.escapeURIComponent my_value}"

The docs helpfully refer us to RFC3986, a rare citation in the wild world of confusing and vaguely-described implementations of escaping (to various different standards and mistakes) for URLs and/or HTML
This will escape / as %2F, meaning you can use it to embed a string with / in it inside a path component, for better or worse
This will escape a space ( ) as %20, which is correct and legal in either a query string or a path component
There is also a reversing method available CGI.unescapeURIComponent

What if I am running on a ruby previous to 3.2?

Two things in standard library probably do the equivalent thing. First:

require 'cgi'
CGI.escape(input).gsub("+", "%20")

CGI escape but take the +s it encodes space characters into, and gsub them into the more correct %20. This will not be as performant because of the gsub, but it works.

This, I noticed once a while ago, is what ruby aws-sdk does… well, except it also unescapes %7E back to ~, which does not need to be escaped in a URI. But… generally… it is fine to percent-encode ~ as %7E. Or copy what aws-sdk does, hoping they actually got it right to be equivalent?

Or you can use:

require 'erb'
ERB::Util.url_encode(input)

But it’s kind of weird to have to require the ERB templating library just for URI escaping. (and would I be shocked if ruby team moves erb from “default gem” to “bundled gem”, or further? Causing you more headache down the road? I would not). (btw, ERB::Util.url_encode leaves ~ alone!)

Do both of these things do exactly the same thing as CGI.escapeURIComponent? I can’t say for sure, see discussion of CGI.escape and ~ above. Sure is confusing. (there would be a way to figure it out, take all the chars in various relevant classes in the RFC spec and test them against these different methods. I haven’t done it yet).
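A rough sketch of that comparison, limited to printable ASCII, might look like the following. It only includes CGI.escapeURIComponent when the running ruby has it, so it also works on rubies before 3.2. The ESCAPERS, escapings_for, and disagreements names are made up for this sketch, and it says nothing about multi-byte or non-printable input, which would need their own test sets.

```ruby
require 'cgi'
require 'erb'

# Every escaping routine discussed above that exists on the running ruby.
ESCAPERS = {
  "CGI.escape + gsub"    => ->(s) { CGI.escape(s).gsub("+", "%20") },
  "ERB::Util.url_encode" => ->(s) { ERB::Util.url_encode(s) }
}
if CGI.respond_to?(:escapeURIComponent)
  ESCAPERS["CGI.escapeURIComponent"] = ->(s) { CGI.escapeURIComponent(s) }
end

# Map of routine name => escaped output, for one input string.
def escapings_for(input)
  ESCAPERS.transform_values { |escaper| escaper.call(input) }
end

# The inputs on which the routines do NOT all produce the same output.
def disagreements(inputs)
  inputs.select { |input| escapings_for(input).values.uniq.length > 1 }
end

printable_ascii = (0x20..0x7E).map(&:chr)

# Prints one line per character the routines disagree on (possibly none).
disagreements(printable_ascii).each do |char|
  puts "#{char.inspect}: #{escapings_for(char)}"
end
```

Whatever it prints is specific to the ruby version it runs on, which is sort of the point: the ~ question above is exactly the kind of thing that could differ between versions.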

What about URI.escape?

In old code I encounter, I often see places using URI.escape to prepare URI query string values…

# don't do this, don't use URI.escape
url = "https://example.com?key=#{ URI.escape value }"

# not this either, don't use URI.escape
url = "https://example.com?" + 
   query_hash.collect { |k, v| "#{URI.escape k}=#{URI.escape v}"}.join("&")

This was never quite right, in that URI.escape was a huge mess… intending to let you pass in whole URLs that were not legal URLs in that they had some illegal characters that needed escaping, and it would somehow parse them and then escape the parts that needed escaping… this is a fool’s errand and not something it’s possible to do in a clear consistent and correct way.

But… it worked out okay because the output of URI.escape overlapped enough with (the new RFC 3986-based) CGI.escapeURIComponent that it mostly (or maybe even always?) worked out. URI.escape did not escape a /… but it turns out / is probably actually legal in a query string value anyway, it’s optional to escape it to %2F in a query string? I think?

And people used it in this scenario, I’d guess, because its name made it sound like the right thing? Hey, I want to escape something to put it in a URI, right? And then other people copied from code they saw, etc.

But URI.escape was an unpredictable bad idea from the start, and was deprecated by ruby, then removed entirely in ruby 3.0!

When it went away, it was a bit confusing to figure out what to replace it with. Because if you asked, sometimes people would say “it was broken and wrong, there is nothing to replace it”, which is technically true… but the code escaping things for inclusion in, eg, query strings, still had to do that… and then the “correct” behavior for this actually only existed in the ruby stdlib in the erb module (?!?) (where few had noticed it before URI.escape went away)… and CGI.escapeURIComponent which is really what you wanted didn’t exist yet?
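For code that has to run on rubies before 3.2, the old URI.escape query-string pattern from above could be rewritten along these lines, using the ERB::Util.url_encode that was in stdlib all along. build_query is a hypothetical helper name, not anything from stdlib; the important part is that each key and value is escaped separately as a component, and the whole URL is never passed through an escaper.

```ruby
require 'erb'

# Escape every key and every value on its own, then join them.
# The "?", "=" and "&" separators are added AFTER escaping, so they stay literal.
def build_query(query_hash)
  query_hash.map do |key, value|
    "#{ERB::Util.url_encode(key.to_s)}=#{ERB::Util.url_encode(value.to_s)}"
  end.join("&")
end

url = "https://example.com?" + build_query({ "q" => "a b", "path" => "x/y" })
puts url
# => https://example.com?q=a%20b&path=x%2Fy
```

On ruby 3.2+ the same shape works with CGI.escapeURIComponent swapped in for ERB::Util.url_encode.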

Why is this so confusing and weird?

Why was this functionality in ruby stdlib non-existent/tucked away? Why are there so many slightly different implementations of “uri escaping”?

Escaping is always a confusing topic in my experience — and a very very confusing thing to debug when it goes wrong.

The long history of escaping in URLs and HTML is even more confusing. Like, turning a space into a + was specified for application/x-www-form-urlencoded format (for encoding an HTML form as a string for use as a POST body)… and people then started using it in url query strings… but I think possibly that was never legal, or perhaps the specifications were incomplete/inconsistent on it.

But it was so commonly done that most things receiving URLs would treat a literal + as an encoded space… and then some standards were retroactively changed to allow it for compatibility with common practice… maybe. I’m not even sure I have this right.
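You can see the form-encoding convention directly in ruby’s stdlib. This little sketch only uses URI.encode_www_form_component, URI.decode_www_form_component, and CGI.unescape, all of which follow the application/x-www-form-urlencoded rules: a space goes out as +, and a + coming in is read back as a space, so a real plus sign has to travel as %2B.

```ruby
require 'uri'
require 'cgi'

# Form encoding writes a space as "+", and a literal plus as "%2B".
form_encoded = URI.encode_www_form_component("a b+c")
puts form_encoded   # => a+b%2Bc

# Form decoding reverses that: "+" becomes a space again.
puts URI.decode_www_form_component(form_encoded)   # => a b+c

# CGI.unescape follows the same convention, so a "+" that was meant
# literally (and not escaped to %2B) silently turns into a space.
puts CGI.unescape("1+1=2")   # => 1 1=2
```

So whether a + in a URL means “plus” or “space” depends on which set of rules the thing on the receiving end decided to apply.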

And then, as with the history of the web in general, there has been a progression of standards slightly altering this behavior, leapfrogging with actual common practice, where technically illegal things became common and accepted, and then standards tried to cope… and real world developers had trouble understanding there might be different rules for legal characters/escaping in HTML vs URIs vs application/x-www-form-urlencoded strings vs HTTP headers… and then language stdlib implementers (including but not limited to ruby) implemented things with various understandings according to various RFCs (or none, or buggy), documented only with words like “Escapes the string, replacing all unsafe characters with codes.” (unsafe according to what standard? For what purpose?)

PHEW.

It being so confusing, lots of people haven’t gotten it right — I swear that AWS S3 uses different rules for how to refer to spaces in filenames than AWS MediaConvert does, such that I couldn’t figure out how to get AWS MediaConvert to actually input files stored on S3 with spaces in them, and had to just make sure to not use spaces in filenames on S3 destined for MediaConvert. But maybe I was confused! But honestly I’ve found it’s best to avoid spaces in filenames on S3 in general, because S3 docs and implementation can get so confusing and maybe inconsistent/buggy on how/when/where they are escaped. Because like we’re saying…

Escaping is always confusing, and URI escaping is really confusing.

Which is I guess why the ruby stdlib didn’t actually have a clearly labelled provided-with-this-intention way to escape things for use as a URI component until ruby 3.2?

Just use CGI.escapeURIComponent in ruby 3.2+, please.

What about using the Addressable gem?

When the horrible URI.escape disappeared and people that had been wrongly using it to escape strings for use as URI components needed some replacement and the ruby stdlib was confusing (maybe they hadn’t noticed ERB::Util.url_encode or weren’t confident it did the right thing and gee I wonder why not), some people turned to the addressable gem.

This gem for dealing with URLs does provide ways to escape strings for use in URLs… it actually provides two different algorithms depending on whether you want to use something in a path component or a query component.

require 'addressable'

Addressable::URI.encode_component(query_param_value, Addressable::URI::CharacterClasses::QUERY)

Addressable::URI.encode_component(path_component, Addressable::URI::CharacterClasses::PATH)

Note Addressable::URI::CharacterClasses::QUERY vs Addressable::URI::CharacterClasses::PATH? Two different routines? (Both by the way escape a space to %20 not +).

I think that while some things need to be escaped in (eg) a path component and don’t need to be in a query component, the specs also allow some things that don’t need to be escaped to be escaped in both places, such that you can write an algorithm that produces legally escaped strings for both places, which I think is what CGI.escapeURIComponent is. Hopefully we’re in good hands.

On Addressable, neither the QUERY nor PATH variant escapes /, but CGI.escapeURIComponent does escape it to %2F. PHEW.

You can also call Addressable::URI.encode_component with no second arg, in which case it seems to escape CharacterClasses::RESERVED + CharacterClasses::UNRESERVED from this list. Whereas PATH is, it looks like there, equivalent to UNRESERVED with SOME of RESERVED (SUB_DELIMS but only some of GENERAL_DELIMS), and QUERY is just path plus ? as needing escaping…. (CGI.escapeURIComponent btw WILL escape ? to %3F).

PHEW, right?

Anyhow

Anyhow, just use CGI.escapeURIComponent to… escape your URI components, just like it says on the lid.

Thanks to /u/f9ae8221b for writing it and answering some of my probably annoying questions on reddit and github.

attr_json 2.0 release: ActiveRecord attributes backed by JSON column

attr_json is a gem to provide attributes in ActiveRecord that are serialized to a JSON column, usually postgres jsonb, with multiple attributes in a json hash, in a way that can be treated as much as possible like any other “ordinary” (database column) ActiveRecord attribute.

It supports arrays and nested models as hashes, and the embedded nested models can also be treated much as an ordinary “associated” record. For instance, the CI build tests with cocoon, and I’ve had a report that it works well with stimulus nested forms, but I don’t currently know how to use those. (PR welcome for a test in build?)

An example:

# An embedded model, if desired
class LangAndValue
  include AttrJson::Model

  attr_json :lang, :string, default: "en"
  attr_json :value, :string
end

class MyModel < ActiveRecord::Base
   include AttrJson::Record

   # use any ActiveModel::Type types: string, integer, decimal (BigDecimal),
   # float, datetime, boolean.
   attr_json :my_int_array, :integer, array: true
   attr_json :my_datetime, :datetime

   attr_json :embedded_lang_and_val, LangAndValue.to_type
end

model = MyModel.create!(
  my_int_array: ["101", 2], # it'll cast like ActiveRecord
  my_datetime: DateTime.new(2001,2,3,4,5,6),
  embedded_lang_and_val: LangAndValue.new(value: "a sentence in default language english")
)

By default it will serialize attr_json attributes to a json_attributes column (this can also be specified differently), and the above would be serialized like so:

{
  "my_int_array": [101, 2],
  "my_datetime": "2001-02-03T04:05:06Z",
  "embedded_lang_and_val": {
    "lang": "en",
    "value": "a sentence in default language english"
  }
}

Oh, attr_json also supports some built-in construction of postgres jsonb contains (“@>“) queries, with proper rails type-casting, through embedded models with keypaths:

MyModel.jsonb_contains(
  my_datetime: Date.today,
  "embedded_lang_and_val.lang" => "de"
) # an ActiveRecord::Relation, you can chain on whatever as usual

And it supports in-place mutations of the nested models, which I believe is important for them to work “naturally” as ruby objects.

my_model.embedded_lang_and_val.lang = "de"
my_model.embedded_lang_and_val_change 
# => will correctly return changes in terms of models themselves
my_model.save!

There are some other gems in this “space” of ActiveRecord attribute json serialization, with different fits for different use cases, created either before or after I created attr_json — but none provide quite this combination of features — or, I think, have architectures that make this combination feasible (I could be wrong!). Some to compare are jsonb_accessor, store_attribute, and store_model.

One use case where I think attr_json really excels is when using Rails Single-Table Inheritance, where different sub-classes may have different attributes.

And especially for a “content management system” type of use case, where on top of that single-table inheritance polymorphism, you can have complex hierarchical data structures, in an inheritance hierarchy, where you don’t actually want or need the complexity of an actual normalized rdbms schema for the data that has both some polymorphism and some heterogeneity. We get some aspects of a schema-less json-document-store, but embedded in postgres, without giving up rdbms features or ordinary ActiveRecord affordances.

Slow cadence, stability and maintainability

While the 2.0 release includes a few backwards incompats, it really should be an easy upgrade for most if not everyone. And it comes three and a half years after the 1.0 release. That’s a pretty good run.

Generally, I try to really prioritize backwards compatibility and maintainability, doing my best to avoid anything that could provide backwards incompat between major releases, and trying to keep major releases infrequent. I think that’s done well here.

I know that management of rails “plugin” dependencies can end up a nightmare, and I feel good about avoiding this with attr_json.

attr_json was actually originally developed for Rails 4.2 (!!), and has kept working all the way to Rails 7. The last attr_json 1.x release actually supported (in same codebase) Rails 5.0 through Rails 7.0 (!), and attr_json 2.0 supports 6.0 through 7.0. (also grateful to the quality and stability of the rails attributes API originally created by sgrif).

I think this successfully makes maintenance easier for downstream users of attr_json, while also demonstrating success at prioritizing maintainability of attr_json itself: it hasn’t needed a whole lot of work on my end to keep working across Rails releases. Occasionally changes to the test harness are needed when a new Rails version comes out, but I actually can’t think of any changes needed to implementation itself for new Rails versions, although there may have been a few.

Because, yeah, it is true that this is still basically a one-maintainer project. But I’m pleased it has successfully gotten some traction from other users: 390 github “stars” is respectable if not huge, with occasional Issues and PR’s from third parties. I think this is a testament to its stability and reliability, rather than to any (almost non-existent) marketing I’ve done.

“Slow code”?

In working on this and other projects, I’ve come to think of a way of working on software that might be called “slow code”. To really get stability and backwards compatibility over time, one needs to be very careful about what one introduces into the codebase in the first place. And very careful about getting the fundamental architectural design of the code solid in the first place — coming up with something that is parsimonious (few architectural “concepts”) and consistent and coherent, but can handle what you will want to throw at it.

This sometimes leads me to holding back on satisfying feature requests, even if they come with pull requests, even if it seems like “not that much code” — if I’m not confident it can fit into the architecture in a consistent way. It’s a trade-off.

I realize that in many contemporary software development environments, it’s not always possible to work this way. I think it’s a kind of software craftsmanship for shared “library” code (mostly open source) that… I’m not sure how much our field/industry accommodates development with (and the development of) this kind of craftsmanship these days. I appreciate working for a non-profit academic institute that lets me develop open source code in a context where I am given the space to attend to it with this kind of care.

The 2.0 Release

There aren’t actually any huge changes in the 2.0 release, mostly it just keeps on keeping on.

Mostly, 2.0 tries to make things adhere even closer and more consistently to what is expected of Rails attributes.

The “Attributes” API was still brand new in Rails 4.2 when this project started, but now that it has shown itself solid and mature, we can always create a “cover” Rails attribute in the ActiveRecord model, instead of making it “optional” as attr_json originally did. Which provides for some code simplification.

Some rough edges that were sanded involved making Time/Date attributes timezone-aware in the way Rails usually does transparently. And with some underlying Rails bugs/inconsistencies having been long-fixed in Rails, they can now store milliseconds in JSON serialization too, rather than just whole seconds.
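To make the whole-seconds vs milliseconds difference concrete, here is a plain-Ruby sketch. It is not attr_json’s implementation, just an illustration of what is lost when a timestamp is serialized without a fractional part; the timestamp value is made up.

```ruby
require "time"

# Made-up timestamp with a sub-second part (789 milliseconds).
timestamp = Time.utc(2001, 2, 3, 4, 5, 6, 789_000)

whole_seconds = timestamp.iso8601    # fractional part dropped
with_millis   = timestamp.iso8601(3) # three fractional digits kept

puts whole_seconds # 2001-02-03T04:05:06Z
puts with_millis   # 2001-02-03T04:05:06.789Z

# Round-tripping shows what survives serialization in each case.
puts Time.iso8601(whole_seconds).usec # 0
puts Time.iso8601(with_millis).usec   # 789000
```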

I try to keep a good CHANGELOG, which you can consult for more.

The 2.0 release is expected to be a very easy migration for anyone on 1.x. If anyone on 1.x finds it challenging, please get in touch in a github issue or discussion, I’d like to make it easier for you if I can.

For my Library-Archives-Museums Rails people….

The original motivation for this came from trying to move off samvera (nee hydra) sufia/hyrax to an architecture that was more “Rails-like”, while realizing that the way we wanted to model our data in a digital collections app along the lines of sufia/hyrax would be rather too complicated to do with a reasonably normalized rdbms schema.

So… can we model things in the database in JSON — similar to how valkyrie-postgres would actually model things in postgres — but while maintaining an otherwise “Rails-like” development architecture? The answer: attr_json.

So, you could say the main original use case for attr_json was to persist a “PCDM“-ish data model ala sufia/hyrax, those kinds of use cases, in an rdbms, in a way that supported performant SQL queries (minimal queries per page, avoiding n+1 queries), in a Rails app using standard Rails tools and conventions, without an enormously complex expansive normalized rdbms schema.

While the effort to base hyrax on valkyrie is still ongoing, in order to allow postgres vs fedora (vs other possible future stores) to be a swappable choice in the same architecture — I know at least some institutions (like those of the original valkyrie authors) are using valkyrie in homegrown app directly, as the main persistence API (instead of ActiveRecord).

In some sense, valkyrie-postgres (in a custom app) vs attr-json (in a custom app) are two paths to “step off” the hyrax-fedora architecture. They both result in similar things actually stored in your rdbms (and we both chose postgres, for similar reasons, including I think good support for json(b)). They both have advantages and disadvantages. Valkyrie-postgres kind of intentionally chooses not to use ActiveRecord (at least not in controllers/views etc, not in your business logic); one advantage of this is getting around some of the known, widely-commented-upon deficiencies and complaints with Rails standard ActiveRecord architecture.

Whereas I followed a different path with attr_json — how can we store things in postgres similarly, but while still using ActiveRecord in a very standard Rails way — how can we make it as standard a Rails way as possible? This maintains the disadvantages people sometimes complain about Rails architecture, but with the benefit of sticking to the standard Rails ecosystem, having less “custom community” stuff to maintain or figure out (including fewer lines of code in attr-json), being more familiar or accessible to Rails-experienced or trained developers.

At least that’s the idea, and several years later, I think it’s still working out pretty well.

In addition to attr_json, I wrote a layer on top to provide some parts on top of attr_json, that I thought would be both common and somewhat tricky in writing a pcdm/hyrax-ish digital collections app as “standard Rails as much as it makes sense”. This is kithe and it hasn’t had very much uptake. The only other user I’m aware of (who is using only a portion of what kithe provides; but kithe means to provide for that as a use case) is Eric Larson at https://github.com/geobtaa/geomg.

However, meanwhile, attr_json itself has gotten quite a bit more uptake — from wider Rails developer community, not our library-museum-archives community. attr_json’s 390 github stars isn’t that big in the wider world of things, but it’s pretty big for our corner of the world. (Compare to 160 for hyrax or 721 for blacklight). That the people using attr_json, and submitting Issues or Pull Requests largely aren’t library-museum-archives developers, I consider positive and encouraging, that it’s escaped the cultural-heritage-rails bubble, and is meeting a more domain-independent or domain-neutral need, at a lower level of architecture, with a broader potential community.

A tiny donation to rubyland.news would mean a lot

I started rubyland.news in 2016 because it was a thing I wanted to see for the ruby community. I had been feeling a shrinking of the ruby open source collaborative community, it felt like the room was emptying out.

If you find value in Rubyland News, just a few dollars contribution on my Github Sponsors page would be so appreciated.

I wanted to make people writing about ruby and what they were doing with it visible to each other and to the community, in order to try to (re)build/preserve/strengthen a self-conception as a community, connect people to each other, provide entry to newcomers, and just make it easier to find ruby news.

I’ve been solely responsible for its development, and editorial and technical operations. I think it’s been a success. I don’t have analytics, but it seems to be somewhat known and used.

Rubyland.news has never been a commercial project. I have never tried to “monetize” it. I don’t even really highlight my personal involvement much. I have in the past occasionally had modest paid sponsorship barely enough to cover expenses, but decided it wasn’t worth the effort.

I have and would never provide any kind of paid content placement, because I think that would be counter to my aims and values — I have had offers, specifically asking for paid placement not labelled as such, because apparently this is how the world works now, but I would consider that an unethical violation of trust.

It’s purely a labor of love, in attempted service to the ruby community, building what I want to see in the world as an offering of mutual aid.

So why am I asking for money?

The operations of Rubyland News don’t cost much, but they do cost something. A bit more since Heroku eliminated free dynos.

I currently pay for it out of my pocket, and mostly always have modulo occasional periods of tiny sponsorship. My pockets are doing just fine, but I do work for an academic non-profit, so despite being a software engineer the modest expenses are noticeable.

Sure, I could run it somewhere cheaper than heroku (and eventually might have to) — but I’m doing all this in my spare time, I don’t want to spend an iota more time or psychic energy on (to me) boring operational concerns than I need to. (But if you want to volunteer to take care of setting up, managing, and paying for deployment and operations on another platform, get in touch! Or if you are another platform that wants to host rubyland news for free!)

It would be nice to not have to pay for Rubyland News out of my pocket. But also, some donations would, as much as be monetarily helpful, also help motivate me to keep putting energy into this, showing me that the project really does have value to the community.

I’m not looking to make serious cash here. If I were able to get just $20-$40/month in donations, that would about pay my expenses (after taxes, cause I’d declare if i were getting that much), I’d be overjoyed. Even 5 monthly sustainers at just $1 would really mean a lot to me, as a demonstration of support.

You can donate one-time or monthly on my Github Sponsors page. The suggested levels are $1 and $5.

(If you don’t want to donate or can’t spare the cash, but do want to send me an email telling me about your use of rubyland news, I would love that too! I really don’t get much feedback! jonathan at rubyland.news)

Thanks

Thanks to anyone who donates anything at all
also to anyone who sends me a note to tell me that they value Rubyland News (seriously, I get virtually no feedback — telling me things you’d like to be better/different is seriously appreciated too! Or things you like about how it is now. I do this to serve the community, and appreciate feedback and suggestions!)
To anyone who reads Rubyland News at all
To anyone who blogs about ruby, especially if you have an RSS feed, especially if you are doing it as a hobbyist/community-member for purposes other than business leads!
To my current single monthly github sponsor, for $1, who shall remain unnamed because they listed their sponsorship as private
To anyone contributing in their own way to any part of open source communities for reasons other than profit, sometimes without much recognition, to help create free culture that isn’t just about exploiting each other!

vite-ruby for JS/CSS asset management in Rails

I recently switched to vite and vite-ruby for managing my JS and CSS assets in Rails. I was switching from a combination of Webpacker and sprockets — I moved all of my Webpacker and most of my sprockets to vite.

Note that vite-ruby has smooth ready-made integrations for Padrino, Hanami, and jekyll too, and possibly hook points for integrations with arbitrary ruby, plus could always just use vite without vite-ruby — but I’m using vite-ruby with Rails.

I am finding it generally pretty agreeable, so I thought I’d write up some of the things I like about it for others. And a few other notes.

I am definitely definitely not an expert in Javascript build systems (or JS generally), which both defines me as an audience for build tools, but also means I don’t always know how these things might compare with other options. The main other option I was considering was jsbundling-rails with esbuild and cssbundling-rails with SASS, but I didn’t get very far into the weeds of checking those out.

I moved almost all my JS and (S)CSS into being managed/built by vite.

My context

I work on a monolith “full stack” Rails application, with a small two-developer team.

I do not do any very fancy Javascript — this is not React or Vue or anything like that. It’s honestly pretty much “JQuery-style” (although increasingly I try to do it without jquery itself using just native browser API, it’s still pretty much that style).

Nonetheless, I have accumulated non-trivial Javascript/NPM dependencies, including things like video.js, @shopify/draggable, fontawesome (v4), openseadragon. I need package management and I need building.

I also need something dirt simple. I don’t really know what I’m doing with JS, my stack may seem really old-fashioned, but here it is. Webpacker had always been a pain, I started using it to have something to manage and build NPM packages, but was still mid-stream in trying to switch all my sprockets JS over to webpacker when it was announced webpacker was no longer recommended/maintained by Rails. My CSS was still in sprockets all along.

Vite

One thing to know about vite is that it’s based on the idea of using different methods in dev vs production to build/serve your JS (and other managed assets). In “dev”, you ordinarily run a “vite server” which serves individual JS files, whereas for production you “build” more combined files.

Vite is basically an integration that puts together tools like esbuild and (in production) rollup, as well as integrating optional components like sass — making them all just work. It intends to be simple and provide a really good developer experience where doing simple best practice things is simple and needs little configuration.

vite-ruby tries to make that “just works” developer experience as good as Rubyists expect when used with ruby too — it intends to integrate with Rails as well as webpacker did, just doing the right thing for Rails.

Things I am enjoying with vite-ruby and Rails

You don’t need to run a dev server (like you do with jsbundling-rails and cssbundling-rails)
- If you don’t run the vite dev server, you’ll wind up with auto-built vite on-demand as needed, same as webpacker basically did.
- This can be slow, but it works and is awesome for things like CI without having to configure or set up anything. If there have been no changes to your source, it is not slow, as it doesn’t need to re-build.
- If you do want to run the dev server for much faster build times, hot module reload, better error messages, etc, vite-ruby makes it easy, just run ./bin/vite dev in a terminal.
If you DO run the dev server — you have only ONE dev-server to run, that will handle both JS and CSS
- I’m honestly really trying to avoid the foreman approach taken by jsbundling-rails/cssbundling-rails, because of how it makes accessing the interactive debugger at a breakpoint much more complicated. Maybe with only one dev server (that is optional), I can handle running it manually without a procfile.

Handling SASS and other CSS with the same tool as JS is pretty great generally — you can even @import CSS from a javascript file, and also @import plain CSS too to aggregate into a single file server-side (without sass). With no non-default configuration, it just works, and will spit out stylesheet <link> tags, and it means your css/sass is going through the same processing whether you import it from .js or .css.
- I handle fontawesome 4 this way. Include "font-awesome": "^4.7.0" in my package.json, then @import "font-awesome/css/font-awesome.css"; just works, and from either a .js or a .css file. It actually spits out not only the fontawesome CSS file, but also all the font files referenced from it and included in the npm package, in a way that just works. Amazing!!
- Note how you can reference things from NPM packages with just package name. On google for some tools you find people doing contortions involving specifically referencing node-modules, I’m not sure if you really have to do this with latest versions of other tools but you def don’t with vite, it just works.

In general, I really appreciate vite’s clear opinionated guidance and focus on developer experience. Understanding all the options from the docs is not as hard because there are fewer options, but it does everything I need it to. vite-ruby successfully carries this into ruby/Rails, and its documentation is really good, without being enormous. In Rails, it just does what you want, automatically.

Vite supports source maps for SASS!
- Not currently on by default, you have to add a simple config.
- Unfortunately sass sourcemaps are NOT supported in production build mode, only in dev server mode. (I think I found a ticket for this, but can’t find it now)
- But that’s still better than the official Rails options? I don’t understand how anyone develops SCSS without sourcemaps!
  - But even though sprockets 4.x finally supported JS sourcemaps, it does not work for SCSS! Even though there is an 18-month-old PR to fix it, it goes unreviewed by Rails core and unmerged.
  - Possibly even more surprisingly, SASS sourcemaps don’t seem to work for the newer cssbundling-rails=>sass solution either. https://github.com/rails/cssbundling-rails/issues/68
  - Previous to this switch, I was still using sprockets old-style “comments injected into CSS built files with original source file/line number” — that worked. But to give that up, and not get working scss sourcemaps in return? I think that would have been a blocker for me against cssbundling-rails/sass anyway… I feel like there’s something I’m missing, because I don’t understand how anyone is developing sass that way.

If you want to split up your js into several built files (“chunks”), I love how easy it is. It just works. Vite/rollup will do it for you automatically for any dynamic runtime imports, which it also supports: just write import with parens, inside a callback or whatever, and it just works.

Things to be aware of

vite and vite-ruby by default will not create .gz variants of built JS and CSS
- Depending on your deploy environment, this may not matter, maybe you have a CDN or nginx that will automatically create a gzip and cache it.
- But in eg default heroku Rails deploy, it really really does. Default Heroku deploy uses the Rails app itself to deliver your assets. The Rails app will deliver content-encoding gzip if it’s there. If it’s not… when you switch to vite from webpacker/sprockets, you may now be delivering uncompressed JS and CSS with no other changes to your environment, with non-trivial performance implications but ones you may not notice.
- Yeah, you could probably configure your CDN you hopefully have in front of your heroku app static assets to gzip for you, but you may not have noticed.
- Fortunately it’s pretty easy to configure
  - For me, I do some kind of ugly JS to configure it only when I’m not using dev-mode autoBuild (in dev but without running a vite dev server), because it really slows down autoBuild
  - Since I migrated over, the vite-plugin-rails plugin also does it by default. (I’m not using that, actually)
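If you’re not sure whether your build is producing .gz variants, a quick check is easy to script. This is a rough sketch in plain Ruby: assets_missing_gzip is a hypothetical helper (not part of vite-ruby), and public/vite is an assumed output directory that you should adjust to match your own configuration.

```ruby
# Hypothetical helper (not part of vite-ruby): list built .js/.css files
# that have no pre-compressed ".gz" sibling next to them.
def assets_missing_gzip(build_dir)
  Dir.glob(File.join(build_dir, "**", "*.{js,css}"))
     .reject { |path| File.exist?("#{path}.gz") }
     .sort
end

# "public/vite" is an assumed build output directory; adjust as needed.
missing = assets_missing_gzip("public/vite")
puts "#{missing.length} built asset(s) without a .gz variant"
missing.each { |path| puts "  #{path}" }
```

Run it after a production-mode build; a non-empty list on a deploy where the Rails app serves its own assets suggests they are going out uncompressed.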

There are some vite NPM packages involved (vite itself as well as some vite-ruby plugins), as well as the vite-ruby gem, and you have to keep them up to date in sync. You don’t want to be using a new version of vite NPM packages with too-old gem, or vice versa. (This is kind of a challenge in general with ruby gems with accompanying npm packages)
- But vite_ruby actually includes a utility to check this on boot and complain if they’ve gotten out of sync! As well as tools for syncing them! Sweet!
- But that can be a bit confusing sometimes if you’re running CI after an accidentally-out-of-sync upgrade, and all your tests are now failing with the failed sync check. But no big deal.

Things I like less

vite-ruby itself doesn’t seem to have a CHANGELOG or release notes, which I don’t love.
Vite is a newer tool written for modern JS, it mostly does not support CommonJS/node require, preferring modern import. In some cases that I can’t totally explain require in dependencies seems to work anyway… but something related to this stuff made it apparently impossible for me to import an old not-very-maintained dependency I had been importing fine in Webpacker. (I don’t know how it would have done with jsbundling-rails/esbuild). So all is not roses.

Am I worried that this is a third-party integration not blessed by Rails?

The vite-ruby maintainer ElMassimo is doing an amazing job. It is currently very well-maintained software, with frequent releases, quick turnaround from bug report to release, and ElMassimo is very responsive in github discussions.

But it looks like it is just one person maintaining. We know how open source goes. Am I worried that in the future some release of Rails might break vite-ruby in some way, and there won’t be a maintainer to fix it?

I mean… a bit? But let’s face it… Rails officially blessed solutions haven’t seemed very well-maintained for years now either! The three year gap of abandonware between the first sprockets 4.x beta and final release, followed by more radio silence? The fact that for a couple years before webpacker was officially retired it seemed to be getting no maintenance, including requiring dependency versions with CVE’s that just stayed that way? Not much documentation (ie Rails Guide) support for webpacker ever, or jsbundling-rails still?

One would think it might be a new leaf with css/jsbundling-rails… but I am still baffled by there being no support for sass sourcemaps in cssbundling-rails and sass! Official rails support doesn’t necessarily get you much “just works” DX when it comes to asset handling for years now.

Let’s face it, this has been an area where being in the Rails github org and/or being blessed by Rails docs has been no particular reason to expect maintenance or expect you won’t have problems down the line anyway. It’s open source, nobody owes you anything, maintainers spend time on what they have interest to spend time on (including time to review/merge/maintain others’ PR’s, which is def non-trivial time!). It just is what it is.

While the vite-ruby code provides a pretty great Rails-integrated DX, it’s also actually mostly pretty simple code, especially when it comes to the Rails touch points most at risk of Rails breaking; it’s not doing anything too convoluted.

So, you know, you take your chances, I feel good about my chances compared to a css/jsbundling-rails solution. And if someday I have to switch things over again, oh well — Rails just pulled webpacker out from under us quicker than expected too, so you take your chances regardless!

(thanks to colleague Anna Headley for first suggesting we take a look at vite in Rails!)

Using engine_cart with Rails 6.1 and Ruby 3.1

Rails does not seem to generally advertise ruby version compatibility, but it seems to be the case that Rails 6.1, I believe, works with Ruby 3.1, as long as you manually add three dependencies to your Gemfile.

gem "net-imap"
gem "net-pop"
gem "net-smtp"

(Here’s a somewhat cryptic gist from one (I think) Rails committer with some background. Although it doesn’t specifically and clearly tell you to add these dependencies for Rails 6.1 and ruby 3.1… it won’t work unless you do. You can find other discussion of this on the net.)

Or you can instead add one line to your Gemfile, opting in to using the pre-release mail gem 2.8.0.rc1, which includes these dependencies for ruby 3.1 compatibility. Mail is already a Rails dependency; but pre-release gems (whose version numbers end in something including letters after a third period) won’t be included by bundler unless you mention a pre-release version (whose version number ends in…) explicitly in Gemfile.

gem "mail", ">= 2.8.0.rc1"

Once mail 2.8.0 final is released, if I understand what’s going on right, you won’t need to do any of this, since it won’t be a pre-release version bundler will just use it when bundle updateing a Rails app, and it expresses the dependencies you need for ruby 3.1, and Rails 6.1 will Just Work with ruby 3.1. Phew! I hope it gets released soon (been about 7 weeks since 2.8.0.rc1).
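The pre-release distinction can be poked at directly with the Gem::Version and Gem::Requirement classes that ship with RubyGems. This is only a sketch of how versions and requirements are classified, not of Bundler’s resolver itself; the ">= 2.7" requirement is a made-up stand-in for an ordinary non-pre-release constraint.

```ruby
require "rubygems"

rc    = Gem::Version.new("2.8.0.rc1")
final = Gem::Version.new("2.8.0")

puts rc.prerelease?    # true: the version contains letters
puts final.prerelease? # false
puts rc < final        # true: the release candidate sorts before the final

# Made-up ordinary constraint vs. the explicit pre-release one from the Gemfile line.
ordinary_req = Gem::Requirement.new(">= 2.7")
explicit_req = Gem::Requirement.new(">= 2.8.0.rc1")

puts ordinary_req.prerelease?           # false: does not mention a pre-release
puts explicit_req.prerelease?           # true: mentions one explicitly
puts explicit_req.satisfied_by?(final)  # true: the final release also satisfies it
```

The last line is why the Gemfile entry can stay in place after 2.8.0 final comes out: ">= 2.8.0.rc1" is satisfied by the final release too.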

Engine cart

Engine_cart is a gem for dynamically creating Rails apps at runtime for use in CI build systems, mainly to test Rails engine gems. It’s in use in some collaborative open source communities I participate in. While it has plusses (actually integration testing real app generation) and minuses (kind of a maintenance nightmare it turns out), I don’t generally recommend it. If you haven’t heard of it before and are wondering “Does jrochkind think I should use this for testing engine gems in general?”, this is not an endorsement. In general it can add a lot of pain.

But it’s in use in some projects I sometimes help maintain.

How do you get a build using engine_cart to successfully test under Rails 6.1 and ruby 3.1? Since if it were “manual” you’d have to add a line to a Gemfile…

It turns out you can create a ./spec/test_app_templates/Gemfile.extra file, with the necessary extra gem calls:

gem "net-imap"
gem "net-pop"
gem "net-smtp"

# OR, above OR below, don't need both

gem "mail", ">= 2.8.0.rc1"

I think ./spec/test_app_templates/Gemfile.extra is a “magic path” used by engine_cart… or if the app I’m working on is setting it, I can’t figure out why/how! But I also can’t quite figure out why/if engine_cart is defaulting to it…
Adding this to your main project Gemfile is not sufficient, it needs to be in Gemfile.extra
Some projects I’ve seen have a line in their Gemfile using eval_gemfile and referencing the Gemfile.extra… which I don’t really understand… and does not seem to be necessary to me… I think maybe it’s leftover from past versions of engine_cart best practices?
To be honest, I don’t really understand how/where the Gemfile.extra is coming in, and I haven’t found any documentation for it in engine_cart. So if this doesn’t work for you… you probably just haven’t properly configured engine_cart to use the Gemfile.extra in that location, which the project I’m working on has done in some way?

Note that you may still get an error produced in build output at some point of generating the test app:

run  bundle binstubs bundler
rails  webpacker:install
You don't have net-smtp installed in your application. Please add it to your Gemfile and run bundle install
rails aborted!
LoadError: cannot load such file -- net/smtp

But it seems to continue and work anyway!

None of this should be necessary when mail 2.8.0 final is released, it should just work!

The above is of course always including those extra dependencies, for all builds in your matrix, when they are only necessary for Rails 6.1 (not 7!) and ruby 3.1. If you’d instead like to guard it to only apply for that build, and your app is using the RAILS_VERSION env variable convention, this seems to work:

# ./spec/test_app_templates/Gemfile.extra
#
# Only necessary until mail 2.8.0 is released, allow us to build with engine_cart
# under Rails 6.1 and ruby 3.1, by opting into using pre-release version of mail
# 2.8.0.rc1
#
# https://github.com/mikel/mail/pull/1472

if ENV['RAILS_VERSION'] && ENV['RAILS_VERSION'] =~ /^6\.1\./ && RUBY_VERSION =~ /^3\.1\./
  gem "mail", ">= 2.8.0.rc1"
end
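To see which build-matrix combinations that guard actually matches, the same condition can be pulled out into a small method and tried against sample values. mail_prerelease_needed? is a hypothetical name used only for this sketch; the regexes are the ones from the guard above.

```ruby
# Hypothetical wrapper around the same condition used in the Gemfile.extra guard.
def mail_prerelease_needed?(rails_version, ruby_version = RUBY_VERSION)
  !!(rails_version && rails_version =~ /^6\.1\./ && ruby_version =~ /^3\.1\./)
end

puts mail_prerelease_needed?("6.1.7", "3.1.2") # true
puts mail_prerelease_needed?("7.0.4", "3.1.2") # false
puts mail_prerelease_needed?("6.1.7", "3.0.4") # false
puts mail_prerelease_needed?(nil, "3.1.2")     # false (RAILS_VERSION not set)
puts mail_prerelease_needed?("6.1", "3.1.2")   # false (no trailing ".patch")
```

Note the last case: because the pattern requires a dot after 6.1, a RAILS_VERSION given as a bare "6.1" would not trigger the guard. If your CI matrix uses that form, the regex would need loosening.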

Rails 7 connection.select_all is stricter about its arguments in backwards incompat way: TypeError: Can’t Cast Array

I have code that wanted to execute some raw SQL against an ActiveRecord database. It is complicated and weird multi-table SQL (involving a postgres recursive CTE), so none of the specific-model-based API for specifying SQL seemed appropriate. It also needed to take some parameters, that needed to be properly escaped/sanitized.

At some point I decided that the right way to do this was with Model.connection.select_all , which would create a parameterized prepared statement.

Was I right? Is there a better way to do this? The method is briefly mentioned in the Rails Guide (demonstrating it is public API!), but without many details about the arguments. It has very limited API docs, just doc’d as: select_all(arel, name = nil, binds = [], preparable: nil, async: false), “Returns an ActiveRecord::Result instance.” No explanation of the type or semantics of the arguments.

In my code working on Rails previous to 7, the call looked like:

MyModel.connection.select_all(
  "select complicated_stuff WHERE something = $1",
  "my_complicated_stuff_name",
  [[nil, value_for_dollar_one_sub]],
  preparable: true
)

Yeah, that value for the binds is weird: a two-element array within an array, where the first element of the inner array is just nil? This isn’t documented anywhere; I probably got it from somewhere… maybe one of the several StackOverflow answers.
I honestly don’t know what preparable: true does, or what difference it makes.

In Rails 7.0, this started failing with the error: TypeError: can’t cast Array.

I couldn’t find any documentation of that select_all method at all, or other discussion of this; I couldn’t find any select_all change mentioned in the Rails Changelog. I tried looking at actual code history but got lost. I’m guessing “can’t cast Array” refers to that weird binds value… but what is it supposed to be?

Eventually I thought to look for Rails tests of this method that used the binds argument, and managed to eventually find one!

So… okay, rewrote that with new binds argument like so:

bind = ActiveRecord::Relation::QueryAttribute.new(
  "something",
  value_for_dollar_one_sub,
  ActiveRecord::Type::Value.new
)

MyModel.connection.select_all(
  "select complicated_stuff WHERE something = $1",
  "my_complicated_stuff_name",
  [bind],
  preparable: true
)

Confirmed this worked not only in Rails 7, but all the way back to Rails 5.2 no problem.
I guess the way I was doing it previously was some legacy way of passing args that was finally removed in Rails 7?
I still don’t really understand what I’m doing. The first arg to ActiveRecord::Relation::QueryAttribute.new I made match the SQL column it was going to be compared against, but I don’t know if it matters or if it’s used for anything. The third argument appears to be an ActiveRecord Type… I just left it the generic ActiveRecord::Type::Value.new, which seemed to work fine for both integer or string values, not sure in what cases you’d want to use a specific type value here, or what it would do.
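
For a query with more than one placeholder, I'd guess the same pattern just repeats, one bind per placeholder. Here's a minimal sketch of that, with the caveat that build_binds is a hypothetical helper of my own (not a Rails API), and that I'm assuming the binds are matched to $1, $2, … by their position in the array. The attribute_class and type keywords are there only so the helper can be tried outside a Rails app; their defaults are the same two classes used above.

```ruby
# Hypothetical helper, not part of Rails: turn [column_name, value] pairs into
# bind objects, in the same order as the $1, $2, ... placeholders in the SQL.
#
# The defaults reference ActiveRecord constants; ruby only evaluates them when
# you don't pass your own, so inside a Rails app you'd normally just omit them.
def build_binds(pairs,
                attribute_class: ActiveRecord::Relation::QueryAttribute,
                type: ActiveRecord::Type::Value.new)
  pairs.map { |name, value| attribute_class.new(name.to_s, value, type) }
end

# Assumed usage inside a Rails app (the second column name is made up):
#
#   binds = build_binds([["something", 42], ["other_thing", "foo"]])
#   MyModel.connection.select_all(
#     "select complicated_stuff WHERE something = $1 AND other_thing = $2",
#     "my_complicated_stuff_name",
#     binds,
#     preparable: true
#   )
```

I haven't verified whether passing a more specific type than ActiveRecord::Type::Value.new changes anything, so the sketch leaves it as the generic one.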
In general, I wonder if there’s a better way for me to be doing what I’m doing here? It’s odd to me that nobody else findable on the internet has run into this… even though there are StackOverflow answers suggesting this approach… maybe I’m doing it wrong?

But anyways, since this was pretty hard to debug, hard to find in docs or explanations on google, and I found no mention at all of this changing/breaking in Rails 7… I figured I’d write it up so someone else had the chance of hitting on this answer.
