Our progress on new digital collections app, and introducing kithe

In September, I wrote a post on a “Proposed Rails-based digital collections developer’s toolkit”

What has happened since then?

Yes we decided to go ahead with a rewrite of our digital collections app, with the new app not based on Hyrax or Valkryie, but a persistence layer based on ActiveRecord (making use of postgres-specific features were appropriate), and exposing ActiveRecord models to the app as a whole.

No, we are not going forward with trying to make that entire toolkit”, with all the components mentioned there.

But Yes, unlike Alberta, we are taking some functionality and putting it in a gem that can be shared between institutions and applications. That gem is kithe. It includes some sharable modeling/persistence code, like Valkyrie (but with a very different approach than Valkyrie), but also includes some additional fundamental components too.

Scaling back the ambition—and abstraction—a bit

The total architecture outlined in my original post was starting to feel overwhelming to me. After all, we also need to actually produce and launch an app for ourselves, on a “reasonable” timeline, with fairly high chance of success.  I left my conversation with U Alberta (which was quite useful, thank you to the Alberta team!), concerned about potential over-reach and over-abstraction. Abstraction always has a cost and building shared components is harder and more time-consuming than building a custom app.

But, then, also informed by my discussion with Alberta,  I realized we basically just had to build a Rails app, and this is something I knew how to do, and we could, as we progressed, jetison anything that didn’t seem actually beneficial for that goal or seem feasible at the moment. And, also after discussion with a supportive local team, my anxiety about the project went down quite a bit — we can do this.

Even when writing the original proposal, I knew that some elements might be traps. Building a generalized ACL permissions system in an rdbms-based web app… many have tried, many have fallen. :)  Generalized controllers are hard, because they are a piece very tightly tied to your particular app’s UI flows, which will vary.

So we’ve scaled back from trying to provide a toolkit which can also be “scaffolding” for a complete starter app.  The goals of the original thought-experiment proposal — a toolkit which provides  pieces developers put together when building their own app — are better approached, for now, by scaling back and providing fewer shared tools, which we can make really solid.

After all, building shared code is always harder than building code for your app. You have more use cases to figure out and meet, and crucially, shared code is harder to change because it’s (potentially) got cross-institutional dependents, which you have to not break. For the code I am putting into kithe, I’m trying to make it solidly constructed and well-polished. In purely local code,  I’m more willing to do something experimental and hacky — it’s easy enough (comparatively!) to change local app code later.  As with all software, get something out there that works, iterating, using what you learn. (It’s just that this is a lot harder to do with shared dependencies without pain!)

So, on October 1st, we decided to embark on this project. We’re willing to show you our fairly informal sketch of a work plan, if you’d like to look.

Introducing kithe

But we’re not just building a local app, we are also trying to create some shareable components. While the costs and risks of shared code and abstractions are real,  I ultimately decided that “just Rails” would not get us to the most maintainable code after all. (And of course nothing is really just Rails, you are always writing code and using non-Rails dependencies; it’s a matter of degree, how much your app seems like a “typical” Rails app to developers).

It’s just too hard to model the data we ourselves already needed (including nested/compound/repeated models) in “just” ActiveRecord, especially in a way that lets you work with it sanely as “just” ActiveRecord, and is still performant. (So we use attr_json, which I also developed, for a No-SQLy approach without giving up rdbms or ActiveRecord benefits including real foreign-key-based associations). And in another example, ActiveStorage was not flexible/powerful enough for our file-handling needs (which are of course at the core of our domain!), and I wasn’t enthused about CarrierWave either — it makes sense to me to make some solid high-quality components/abstractions for some of our fundamental business/domain concerns, while being aware of the risks/costs.

So I’ve put into kithe the components I thought seemed appropriate on several considerations:

  • Most valuable to our local development effort
  • Handling the “trickiest” problems, most useful to share
  • Handling common problems, most likely to be shareable; and it’s hard to build a suite of things that work together without some modelling/persistence assumptions, so got to start there.
  • I had enough understanding of the use-cases (local and community) that I thought I could, if I took a reasonable amount of extra time, produce something well-polished, with a good developer experience, and a relatively stable API.

That already includes, in maybe not 1.0-production-ready but used in our own in-progress app and released (well-tested and well-documented) in kithe:

  • A modeling and persistence layer tightly coupled to ActiveRecord, with some postgres-specific features, and recommending use of attr_json, for convenient “NoSQL”-like modelling of your unique business data (in common with existing samvera and valkyrie solutions, you don’t need to build out a normalized rdbms schema for your data). With models that are samvera/PCDM-ish (also like other community solutions).
    • Including pretty slick handling of “representatives”, dealing with the performance issues in figuring out representative to display with constant query time (using some pg-specific SQL to look up and set “leaf” representative on save).
    • Including UUIDs as actual DB pk/fks, but also a friendlier_id feature for shorter public URL identifiers, with logic to automatically create such if you wish.
  • A nice helper for building Rails forms with repeatable complex embedded values. Compare to the relevant parts of hydra-editor, but (I think) lighter and more flexible.
  • A flexible file-handling architecture based on shrine — meaning transparent cloud-storage support out of the box.
    • Along with a new derivatives architecture, which seems to me to have the right level of abstraction and affordances to provide a “polished” experience.
    • All file-handling support based on assuming expensive things happen in the background, and “direct upload” from browser pre-form-submit (possibly to cloud storage)

It will eventually include some solr/blacklight support, including a traject-based indexing setup, and I would like to develop an intervention in blacklight so after solr results are returned, it immediately fetches the “hit” records from ActiveRecord (with specified eager-loading), so you can write your view code in terms of your actual AR models, and not need to duplicate data to solr and logic for dealing with it. This latter is taken from the design of sunspot.

But before we get there, we’re going to spend a little bit of time on purely local features, including export/import routines (to get our data into the new app; with some solid testing/auditing to be confident we have), and some locally bespoke workflow support (I think workflow is something that works best just writing the Rails). 

We do have an application deployed as demo/staging, with a basic more-than-just-MVP-but-not-done-yet back-end management interface (note: it does not use Solr/Blacklight at all which I consider a feature), but not yet any non-logged-in end-user search front-end. If you’d like a guest login to see it, just ask.

Technical Evaluation So Far

We’ve decided to tie our code to Rails and ActiveRecord. Unlike Valkyrie, which provides a data-mapper/repository pattern abstraction, kithe expects the dependent code to use ActiveRecord APIs (along with some standard models and modelling enhancements kithe gives you).

This means, unlike Valkyrie, our solution is not “persistence-layer agnostic”. Our app, and any potential kithe apps, are tied to Rails/ActiveRecord, and can’t use fedora or other persistence mechanisms. We didn’t have much need/interest in that, we’re happy tying our application logic and storage to ActiveRecord/postgres, and perhaps later focusing on regularly exporting our data to be stored for preservation purposes in another format, perhaps in OCFL.

It’s worth noting that the data-mapper/repository pattern itself, along the lines valkyrie uses, is favored by some people for reasons other than persistence-swapability. In the Rails and ruby web community at large, there is a contingent that think the data-mapper/repository pattern is better than what Rails gives you, and gives you better architecture for maintainable code. Many of this contingent is big on hanami, and the dry-rb suite.  (I have never been fully persuaded by this contingent).

And to be sure, in building out our approach over the last 4 months, I sometimes ran right into the architectural issues with Rails “model-based” architecture and some of what it encourages like dreaded callbacks.  But often these were hypothetical problems, “What if someone wanted to do X,” rather than something I actually needed/wanted to do now. Take a breath, return to agility and “build our app”.

And a Rails/ActiveRecord-focused approach has huge advantages too. ActiveRecord associations and eager-loading support are very mature and powerful tools, that when exposed to the app as an API give you very mature, time-tested tools to build your app flexibly and performantly (at least for the architectures our community are used to, where avoiding n+1 queries still sometimes seems like an unsolved problem!).  You have a whole Rails ecosystem to rely on, which kithe-dependent apps can just use, making whatever choices they want (use reform or not?) as with most any Rails app, without having to work out as many novel approaches or APIs. (To be sure, kithe still provides some constraints and choices and novelty — it’s a question of degree).

Trying to build up an alternative based on data-mapper/repository, whether in hanami or valkyrie, I think you have a lot of work to do to be competitive with Rails mature solutions, sometimes reproducing features already in ActiveRecord or it’s ecosystem. And it’s not just work that’s “time implementing”, it’s work figuring out the right APIs and patterns. Hanami, for instance, is probably still not as mature, as Rails, or as easy to use for a newcomer.

By not having to spend time re-inventing things that Rails already has solutions for, I could spend time on our actual (digital collections) domain-specific components that I wasn’t happy with existing solutions for. Like spending time on creating shareable file handling and derivatives solutions that seem to me to be well-polished, and able to be used for flexible use-cases without feeling like you’re fighting the system or being surprised by it. Components that hopefuly can be re-used by other apps too.

I think schneem’s thoughts on “polish” are crucial reading when thinking about the true costs of shared abstractions in our community.  There is a cost to additional abstractions: in initial implementation, ongoing maintenance, developer on-boarding, and just figuring out the right architectures and APIs to provide that polish. Sometimes these costs are worthwhile in delivered benefits, of course.

I’d consider our kithe-based approach to be somewhere in between U Alberta’s approach and valkryie, in the dimension of “how close do we stick to and tie our line to ‘standard’ Rails”.

Unlike Hyrax, we are building our own app, not trying to use a shared app or “solution bundle” like Hyrax. I would suggest we share that aspect with both the U Alberta approach as well as the several institutions building valkyrie-not-hyrax apps. But if you’ve had good experiences with the over-time maintenance costs of Hyrax, you have a use case/context where Hyrax has worked well for you — then that’s great, and there’s never anything wrong with doing what has worked for you.

Overall, 4 months in, while some things have taken longer to implement than I expected, and some unexpected design challenges have been encountered — I’m still happy with the approach we are taking.

If you are considering a based-on-valkyrie-no-hyrax approach, I think you might be in a good position to consider a kithe approach too.

How do we evaluate success?


We want to have a replacement app launched in about a year.

I think we’re basically on target, although we might not hit it on the nose, I feel confident at this point that we’re going to succeed with a solid app, in around that timeline. (knock on wood).

When we were considering alternate approaches before committing to this one, we of course tried to compare how long this would take to various other approaches. This is very hard to predict, because you are trying to compare multiple hypotheticals, but we had to make some ballpark guesses (others may have other estimates).

Is this more or less time than it would have taken to migrate our sufia app to current hyrax? I think it’s probably taking more time to do it this new way, but I think migrating our sufia app to current hyrax (with all it’s custom functionality for current features) would not have been easy or quick — and we weren’t sure current hyrax was a place we wanted to end up.

Is it going to take more or less time than it would have taken to write an app on valkyrie, including any work we might contribute to valkyrie for features we needed? It’s always hard to guess these things, but I’d guess in the same ballpark, although I’m optimistic the “kithe” approach can lead to developer time-savings in the long-run.

(Of course, we hope if someone else wants to follow our path, they can re-use what’s now worked out in kithe to go quicker).

We want it to be an app whose long-term maintenance and continued development costs are good

In our sufia-based app, we found it could be difficult and time-consuming to add some of the features we needed. We also spent a lot of time trying to performance-tune to acceptable levels (and we weren’t alone), or figure out and work towards a manageable and cost-efficient cloud deployment architecture.

I am absolutely confident that our “kithe” approach will give us something with a lower TCO (“total cost of ownership”) than we had with sufia.

Will it be a lower TCO than if we were on the present hyrax (ignoring how to get there), with our custom features we needed? I think so, and that current hyrax isn’t different enough from sufia we are used to — but again this is necessarily a guess, and others may disagree. In the end, technical staff just has to make their best predictions based on experience (individual and community).  Hyrax probably will continue to improve under @no-reply’s steady leadership, but I think we have to make our decisions on what’s there now, and that potential rosey future also requires continued contribution by the community (like us) if it is to come to fruition, which is real time to be included in TCO too.   I’m still feeling good about the “write our own app” approach vs “solution bundle”.

Will we get a lower TCO than if we had a non-hyrax valkyrie-based app? Even harder to say. Valkryie has more abstractions and layers that have real ongoing maintenance costs (that someone has to do), but there’s an argument that those layers will lower your TCO over the long-term. I’m not totally persuaded by that argument myself, and when in doubt am inclined to choose the less-new-abstraction path, but it’s hard to predict the future.

One thing worth noting is the main thing that forced our hand in doing something with our existing sufia-based app is that it was stuck on an old version of Rails that will soon be out-of-support, and we thought it would have been time-consuming to update, one way or another.  (When Rails 6.0 is released, probably in the next few months, Rails maintenance policy says nothing before 5.2 will be supported.) Encouragingly, both kithe and attr_json dependency (also by me), are testing green on Rails 6.0 beta releases — and, I was gratified to see, didn’t take any code changes to do so, they just passed.  (Valkyrie 1.x requires Rails 5.1, but a soon-to-be-released 2.0 is planned to work fine up to Rails 6; latest hyrax requires Rails 5.1 as well, but the hyrax team would like to add 5.2 and 6 soon).

We want easier on-boarding of new devs for succession planning

All developers will leave eventually (which is one reason I think if you are doing any local development, a one-developer team is a bad idea — you are guaranteeing that at some point 100% of your dev team will leave at once).

We want it to be easier to on-board new developers. We share U Alberta’s goal that what we could call a “typical Rails developer” should be able to come on and maintain and enhance the app.

Are we there? Well, while our local app is relatively simple rails code (albeit using kithe API’s), the implementation of  kithe and attr_json, which a dev may have to delve into, can get a bit funky, and didn’t turn out quite as simple as I would have liked.

But when I get a bit nervous about this, I reassure myself remembering that:

  • a) Our existing sufia-based app is definitely high-barrier for new devs (an experience not unique to us), I think we can definitely beat that.
    • Also worth pointing out that when we last posted a position, we got no qualified applicants with samvera, or even Rails, experience. We did make a great hire though, someone who knew back-end web dev and knew how to learn new tools; it’s that kind of person that we ideally need our codebase to be accessible to, and the sufia-based one was not.
  • b) Recruiting and on-boarding new devs is always a challenge for any small dev shop, especially if your salaries are not seen as competitive.  It’s just part of the risk and challenge you accept when doing local development as a small shop on any platform. (Whether that is the right choice is out of scope for this post!)

I think our code is going to end up more accessible to actually-existing newly onboarded devs  than a customized hyrax-based solution would be. More than Valkyrie? I do think so myself, I think we have fewer layers of “specialty” stuff than valkyrie, but it’s certainly hard to be sure, and everyone must judge for themselves.

I do think any competent Rails consultancy (without previous LAM/samvera expertise) could be hired to deal with our kithe-based app no problem; I can’t really say if that would be true of a Valkyrie-based app (it might be); I do not personally have confidence it would be true of a hyrax-based app at this point, but others may have other opinions (or experience?).

Evaluating success with the community?

Ideally, we’d of course love it if some other institutions eventually developed with the kithe toolkit, with the potential for sharing future maintenance of it.

Even if that doesn’t happen, I don’t think we’re in a terrible place. It’s worth noting that there has been some non-LAM-community Rails dev interest in attr_json, and occasional PRs; I wouldn’t say it’s in a confidently sustainable place if I left, but I also think it’s code someone else could step into and figure out. It’s just not that many lines of code, it’s well-tested and well-documented, and and i’ve tried to be careful with it’s design — but take a look at and decide for yourself!. I can not emphasize enough my belief that if you are doing local development at all (and I think any samvera-based app has always been such), you should have local technical experts doing evaluation before committing to a platform — hyrax, valkyrie, kithe, entirely homegrown, whatever.

Even if no-one else develops with kithe itself, we’d consider it a success if some of the ideas from kithe influence the larger samvera and digital collections/repository communities. You are welcome to copy-paste-modify code that looks useful (It’s MIT licensed, have at it!). And even just take API ideas or architectural concepts from our efforts, if they seem useful.

We do take seriously participating in and giving back to the larger community, and think trying a different approach, so we and others can see how it goes, is part of that. Along with taking the extra time to do it in public and write things up, like this. And we also want to maintain our mutually-beneficial ties to samvera and LAM technologist communities; even if we are using different architectures, we still have lots of use-cases and opportunities for sharing both knowledge and code in common.

Take a look?

If you are considering development of a non-Hyrax valkyrie-based app, and have the development team to support that — I believe you have the development team to support a kithe-based approach too.

I would be quite happy if anyone took a look, and happy to hear feedback and have conversations, regardless of whether you end up using the actual kithe code or not. Kithe is not 1.0, but there’s definitely enough there to check it out and get a sense of what developing with it might be like, and whether it seems technically sound to you. And I’ve taken some time to write some good “guide” overview docs, both for potential “onboarding” of future devs here, and to share with you all.

We have a staging server for our in-development app based on kithe; if you’d like a guest login so you can check it out, just ask and I can share one with you.

Our local app also should also probably be pretty easy for you to get installed (with dependencies) from a git checkout, and just run it and see how it goes. See: https://github.com/sciencehistory/scihist_digicoll/

Hope to hear from you!


Browser dominance, standards setting, and WHATWG vs W3C

Reda Lemeden writes a warning note about what Chrome’s dominance means for the “Web as an open platform”, in “We Need Chrome No More.”

Lemeden doesn’t mention WHATWG, but in retrospect, I think the practical shifting of web-standards-setting from an at least possibly neutral standards body representing multiple interests (W3C) to a a body wholly controlled by browser-vendors (WHATWG)… may have been good for speed of “innovation” for a time, but was in the long-term not good for the “Web as an open platform” in Lemeden’s phrase.  Lemeden writes:

Making matters worse, the blame often lands on other vendors for “holding back the Web”. The Web is Google’s turf as it stands now; you either do as they do, or you are called out for being a laggard.

Indeed, I think it’s the structural politics of WHATWG that make that hard to counter. WHATWG was almost founded on the principles of “not being a laggard” and “doing what we do”. When there were several browser-vendors with roughly equal market power they could counter-balance each other, but when there’s an elephant in the room…

That is, the W3C folks that were accused of “holding back the web” while trying to keep standards setting from going to to the “faster” WHATWG… were perhaps correct all along.

People can disagree, but 10-15 years on, I think we’re overdue a larger discussion and retrospective evaluation of the consequences of the WHATWG “coup”. I haven’t seen much discussion of this yet.

On code-craft, and writing code for other programmers to use

The New Yorker this week has a profile of Google programmer pair Jeff Dean and Sanjay Ghemawat — if the annoying phrase “super star programmer” applies to anyone it’s probably these guys, who among other things conceived and wrote the original Google Map Reduce implementation–  that includes some comments I find unusually insightful about some aspects of the craft of writing code. I was going to say “for a popular press piece”, but really even programmers talking to each other don’t talk about this sort of thing much. I recommend the article, but was especially struck by this passage:

At M.I.T., [Sanjay’s] graduate adviser was Barbara Liskov, an influential computer scientist who studied, among other things, the management of complex code bases. In her view, the best code is like a good piece of writing. It needs a carefully realized structure; every word should do work. Programming this way requires empathy with readers. It also means seeing code not just as a means to an end but as an artifact in itself. “The thing I think he is best at is designing systems,” Craig Silverstein said. “If you’re just looking at a file of code Sanjay wrote, it’s beautiful in the way that a well-proportioned sculpture is beautiful.”

…“Some people,” Silverstein said, “their code’s too loose. One screen of code has very little information on it. You’re always scrolling back and forth to figure out what’s going on.” Others write code that’s too dense: “You look at it, you’re, like, ‘Ugh. I’m not looking forward to reading this.’ Sanjay has somehow split the middle. You look at his code and you’re, like, ‘O.K., I can figure this out,’ and, still, you get a lot on a single page.” Silverstein continued, “Whenever I want to add new functionality to Sanjay’s code, it seems like the hooks are already there. I feel like Salieri. I understand the greatness. I don’t understand how it’s done.”

I aspire to write code like this, it’s a large part of what motivates me and challenges me.

I think it’s something that (at least for most of us, I don’t know about Dean and Ghemawat), can only be approached and achieved with practice — meaning both time and intention. But I think many of the environments that most working programmers work in are not conducive to this practice, and in some cases are actively hostile to it.  I’m not sure what to think or do about that.

It is most important when designing code for re-use, when designing libraries to be used in many contexts and by many people.  If you are only writing code for a particular business “seeing code not just as a means to an end but as an artifact in itself” may not be what’s called for.  It really is a means to an end of the business purposes. Spending too much time on “the artifact itself”, I think, has a lot of overlap with what is often derisively called “bike-shedding”.  But when creating an artifact that is intended to be used by lots of other programmers in lots of other contexts to build things to meet their business purposes — say, a Rails… or a samvera — “empathy with readers” (which is very well-said, and very related to:) and creating an artifact where “it seems like the hooks are already there” are pretty much indispensable to creating something successful at increasing the efficiency and success of those developers using the code.

It’s also not easy even if it is your intention, but without the intention, it’s highly unlikely to happen by accident. In my experience TDD can (in some contexts) actually be helpful to accomplishing it — but only if you have the intention, if you start from developer use-cases, and if you do the “refactor” step of “red-green-refactor”.  Just “getting the tests to pass” isn’t gonna do it. (And from the profile, I suspect Dean and Ghemawat may not write tests at all — TDD is neither necessary nor sufficient).  That empathy part is probably necessary — understanding what other programmers are going to want to do with your code, how they are going to come to it, and putting yourself in their place, so you can write code that anticipates their needs.

I’m not sure what to do with any of this, but I was struck by the well-written description of what motivates me in one aspect of my programming work.

“Against software development”

Michael Arntzenius writes:

Beautiful code gets rewritten; ugly code survives.

Just so, generic code is replaced by its concrete instances, which are faster and (at first) easier to comprehend.

Just so, extensible code gets extended and shimmed and customized until under its own sheer weight it collapses, then replaced by a monolith that Just Works.

Just so, simple code grows, feature by creeping feature, layer by backward-compatible layer, until it is complicated.

So perishes the good, the beautiful, and the true.

In this world of local-optimum-seeking markets, aesthetics alone keep us from the hell of the Programmer-Archaeologist.

Code is limited primarily by our ability to manage complexity. Thus,

Software grows until it exceeds our capacity to understand it.

HackerNews discussion. 

Ruby Magic helps sponsor Rubyland News

I have been running the Rubyland.news aggregator for two years now, as just a hobby spare time thing. Because I wanted a ruby blog and news aggregator, and wasn’t happy with what was out there then,  and thought it would be good for the community to have it.

I am not planning or trying to make money from it, but it does have some modest monthly infrastructure fees that I like getting covered. So I’m happy to report that Ruby Magic has agreed to sponsor Rubyland.news for a modest $20/month for six months.

Ruby Magic is an email list you can sign up for for occasional emails about ruby. They also have an RSS feed, so I’ve been able to include them on Rubyland.news for some time.  I find their articles to often be useful introductions or refreshers to particular topics about ruby language fundamentals. (It tends not to be about Rails, I know some people appreciate some non-Rails-focused sources of ruby info).  Personally, I’ve been using ruby for years, and the way I got as comfortable with it as I am is by always asking “wait, how does that work then?” about things I run into, always being curious about what’s going on and what the alternatives are and what tools are available, starting with the ruby language itself and it’s stdlib.

These days, blogging, on a platform with an RSS feed too, seems to have become a somewhat rarer thing, so I’m also grateful that Ruby Magic articles are available through RSS feed, so I can include then in rubyland.news. And of course for the modest sponsorship of Rubyland.news, helping to pay infrastructure costs to keep the lights on.  As always, I value full transparency in any sponsorship of rubyland.news; I don’t intend it to affect any editorial policies (I was including Ruby Magic feed already); but I will continue to be fully transparent about any sponsorship arrangements and values, so you can judge for yourself (a modest $20/month from Ruby Magic; no commitment beyond a listing on About page, and this particular post you are reading now, which is effectively a sponsored post).

I also just realized I am two years into Rubyland.news. I don’t keep usage analytics (was too lazy to set it up, and not entirely clear how to do that in case where people might be consuming it as an RSS feed itself), although it’s got 156 followers on it’s twitter feed (all aggregated content is also syndicated to twitter, which I thought was a neat feature).  I’m honestly not sure how useful it is to anyone other than me, or what people changes people might want; feedback is welcome!

Some notes on what’s going on in ActiveStorage

I work in a library-archives-museum digital collections and preservation. This is of course a domain that is very file-centric (or “bytestream”-centric, as some might say). Keeping track of originals and their metadata (including digests/checksums), making lots of derivative files (or “variants” and/or “previews” as ActiveStorage calls them; of images, audio, video, or anything else)

So, building apps in this domain in Rails, I need to do a lot of things with files/bytestreams, ideally without having to re-invent wheels of basic bytestream management in rails, or write lots of boilerplate code. So I’m really interested in file attachment libraries for Rails. How they work, how to use them performantly and reliably without race conditions, how to use them flexibly to be able to write simple code to meet our business and user requirements.  I recently did a bit of a “deep dive” into some aspects of shrine;  now, I turn my attention to ActiveStorage.

The ActiveStorage guide (or in edge from master) is a great and necessary place to start (and you should read it before this; I love the Rails Guides), but there were some questions I had it didn’t answer. Here are some notes on just some things of interest to me related to the internals of ActiveStorage.

ActiveStorage is a-changing

One thing to note is that ActiveStorage has some pretty substantial changes in between the latest 5.2.1 release and master. Sadly there’s no way I could find to use github compare UI (which i love) limited just to the activestorage path in the rails repo.

If you check out Rails source, you can do: ​git diff v5.2.0...master activestorage. Not sure how intelligible you can make that output. You can also look at merged PR’s to Rails mentioning “activestorage” to try and see what’s been going on, some PR’s are more significant than others.

I’m mostly looking at 5.2.1, since that’s the one I’d be using were I use it (until Rails 6 comes out, I forget if we know when we might expect that?), although when I realize that things have changed, I make note of it.

The DB Schema

ActiveStorage requires no changes to the table/model of a thing that should have attached files. Instead, the attached files are implemented as ActiveRecord has_many (or the rare has_one in case of has_one_attached) associations to other table(s), using ordinary relational modeling designs.  Most of the fancy modelling/persistence/access features and APIs (esp in 5.2.1) are seem to be just sugar on top of ordinary AR associations (very useful sugar, don’t get me wrong).

ActiveStorage adds two tables/models.

The first we’ll look at is ActiveStorage::Blob, which actually represents a single uploaded file/bytestream/blob. Don’t be confused by “blob”, the bytestream itself is not in the db, rather there’s enough info to find it in whatever actual storage service you’ve configured. (local disk, S3, etc. Incidentally, the storage service configuration is app-wide, there’s no obvious way to use two different storage services in your app, for different categories of file).

The table backing ActiveStorage::Blob has a number of columns for holding information about the bytesteam.

  • id (ordinary Rails default pk type)
  • key: basically functions as a UID to uniquely identify the bytestream, and find it in the storage. Storages may translate this to actual paths or storage-specific keys differently, the Disk storage files in directories by key prefix, whereas the S3 service just uses the key without any prefixes.
    • The key is generated with standard Rails “secure token” functionality–pretty much just a good random 24 char token. 
    • There doesn’t appear to be any way to customize the path on storage to be more semantic, it’s just the random filing based on the random UID-ish key.
  • filename: the original filename of the file on the way in
  • content_type: an analyzed MIME/IANA content type
  • byte_size: what it says on the tin
  • metadata: a Json serialized hash of arbitrary additional metadata extracted on ingest by ActiveStorage. Default AS migrations just put this in a text column and use db-agnostic Rails functions to serialize/deserialize Json, they don’t try to use a json or jsonb column type.
  • created_at: the usual. There is no updated_at column, perhaps because these are normally expected to be immutable (which means not expected to add metadata after point of creation either?).

OK, so that table has got pretty much everything needed. So what’s the ActiveStorage::Attachment model?  Pretty much just a standard join table.  Using a standard Rails polymorphic association so it can associate an ActiveStorage::Blob with any arbitrary model of any class.  The purpose for this “extra” join table is presumably simply to allow you to associate one ActiveStorage::Blob with multiple domain objects. I guess there are some use cases for that, although it makes the schema somewhat more complicated, and the ActiveStorage inline comments warn you that “you’ll need to do your own garbage collecting” if you do that (A Blob won’t be deleted (in db or in storage) when you delete it’s referencing model(s), so you’ve got to, with your own code, make sure Blob’s don’t hang around not referenced by any models unless in cases you want them to).

These extra tables do mean there are two associations to cross to get from a record to it’s attached file(s).  So if you are, say, displaying a list of N records with their thumbnails, you do have an n+1 problem (or a 2n+1 problem if you will :) ). The Active Storage guide doesn’t mention this — it probably should — but AS some of the inline AS comment docs do, and scopes AS creates for you to help do eager loading.

Indeed a dynamically generated with_attached_avatar (or whatever your attachment is called) scope is nothing but a standard ActiveRecord includes  reaching across the join to the blog. (for has_many_attached or has_one_attached).

And indeed if I try it out in my console, the inclusion scope results in three db queries, in the usual way you expect ActiveRecord eager loading to work.

irb(main):019:0> FileSet.with_attached_avatar.all
  FileSet Load (0.5ms)  SELECT  "file_sets".* FROM "file_sets" LIMIT $1  [["LIMIT", 11]]
  ActiveStorage::Attachment Load (0.8ms)  SELECT "active_storage_attachments".* FROM "active_storage_attachments" WHERE "active_storage_attachments"."record_type" = $1 AND "active_storage_attachments"."name" = $2 AND "active_storage_attachments"."record_id" IN ($3, $4)  [["record_type", "FileSet"], ["name", "avatar"], ["record_id", 19], ["record_id", 20]]
  ActiveStorage::Blob Load (0.5ms)  SELECT "active_storage_blobs".* FROM "active_storage_blobs" WHERE "active_storage_blobs"."id" IN ($1, $2)  [["id", 7], ["id", 8]]
=> #<ActiveRecord::Relation [#<FileSet id: 19, title: nil, asset_data: nil, created_at: "2018-09-27 18:27:06", updated_at: "2018-09-27 18:27:06", asset_derivatives_data: nil, standard_data: nil>, #<FileSet id: 20, title: nil, asset_data: nil, created_at: "2018-09-27 18:29:00", updated_at: "2018-09-27 18:29:08", asset_derivatives_data: nil, standard_data: nil>]>

When is file created in storage, when are associated models created?

ActiveStorage expects your ordinary use case will be attaching files uploaded through a form user.avatar.attach(params[:avatar]), where params[:avatar] is a meaning you get the file as a ActionDispatch::Http::UploadedFile. You can also attach a file directly, in which case you are required to supply the filename (and optionally a content-type):  user.avatar.attach(io: File.open("whatever"), filename: "whatever.png").  Or you can also pass an existing ActiveStorage::Blob to ‘attach’.

In all of these case, ActiveStorage normalizes them to the same code path fairly quickly.

In Rails 5.2.1, if you call attach on an already persisted record, immediately (before any save), an ActiveStorage::Blob row and ActiveStorage::Attachment row have been persisted to the db, and the file has been written to your configured storage location.  There’s no need to call save on your original record, the update took place immediately. Your record will report it has (and of course ActiveStorage’s schema means no changes had to be saved for the row for your record itself — and your record does not think it has outstanding changes via changed?, since it does not).

If you call attach on a new (not yet persisted) record, the ActiveStorage::Blob row is _still_ created, and the bytestream is still persisted to your storage service. But an ActiveStorage::Attachment (join object) has not yet been created.  It will be when you save the record.

But if you just abandon the record without saving it… you have an ActiveStorage::Blob nothing is pointing to, along with the persisted bytestream in your storage service. I guess you’d have to periodically look for these and clean then up….

But master branch in Rails tries to improve this situation with a fairly sophisticated implementation of storing deltas prior to save. I’m not entirely sure if that applies to the “already persisted record” case too. In general, I don’t have a good grasp of how AS expects your record lifecycles to effect persistence of Blobs — like if the record you were attaching it to failed validation, is the Blob expected to be there anyway? Or how are you expected to have validation on the uploaded file itself (like only certain content types allowed, say). I believe the PR in Rails master is trying to improve all of that, I don’t have a thorough grasp of how successful it is at making things “just work” how you might expect, without leaving “orphaned” db rows or storage service files.



ActiveStorage stores the IANA Media Type (aka “MIME type” or “content type”) in the dedicated content_type column in ActiveStorage::Blob. It uses the marcel gem (from the basecamp team) to determine content type.  Marcel looks like it uses file-style magic bytes, but also uses the user-agent-supplied filename suffix or content-type when it decides it’s necessary — trusting the user-agent supplied content-type if all else fails.  It does not look like there is any way to customize this process;  likely most people wouldn’t need that, but I may be one of the few that maybe does. Compare to shrine’s ultra-flexible content-type-determination configuration.

For reasons I’m not certain of, ActiveStorage uses marcel to identify content-type twice.

When (in Rails 2.5.1) you call ​some_model.attach, it calls ActiveStorage::Blob#create_after_upload!, which calls ActiveStorage::Blob#build_after_upload, which calls ActiveStorage::Blob.upload, which sets the content_type attribute to the result of extract_content_type method, which calls marcel.

Additionally, ActiveStorage::Attachment (the join table) has an after_create_commit hook which calls :identify_blob, which calls blob.identify, defined in ActiveStorage::Blob::Identifiable mixin, which also ends up using marcel — only if it already hasn’t been identified (recorded by an identified key in the json serialized metadata column).   This second one only passes the first 4k of the file to marcel (perhaps because it may need to download it from remote storage), while the first one above seems to pass in the entire IO stream.

Normally this second marcel identify won’t be called at all, because the Blob model is already recorded as identified? as a result of the first one. In either case, the operations takes place in the foreground inline (not a bg job), although one of them in an after-commit hook with a second save. (Ah wait, I bet the second one is related to the direct upload feature which I haven’t dived into. Some inline comment docs would still be nice!)

In Rails master, we get an identify:false argument to attach, which can be used to skip which you can use to skip content-type-identification (it might just use the user-agent-supplied content-type, if any, in that case?)

Arbitrary Metadata

In addition to some file metadata that lives in dedicated database columns in the blob table, like content_type, recall that there is a metadata column with a serialized JSON hash, that can hold arbitrary metadata. If you upload an image, you’ll ordinarily find height and width values in there, for instance.  Which you can find eg with ‘model..avatar.metadata[“width”]’ or  ‘model.avatar.metadata[:width]’ (indifferent access, no shortcuts like ‘model.avatar.width’ though, so far as I know).

Where does this come from? It turns out ActiveStorage actually has a nice, abstract, content-type-specific, system for analyzer plugins.  It’s got a built-in one for images, which extracts height and width with MiniMagick, and one for videos, which uses ffprobe command line, part of ffmpeg.

So while this blog post suggests monkey-patching Analyzer::ImageAnalyzer to add in GPS metadata extracted from EXIF, in fact it oughta be possible in 5.2.1+ to use the analyzer plugin to add, remove, or replace analyzers to do your customization, no ugly forwards-compat-dangerous monkey-patching required.  So there are intentional API hooks here for customizing metadata extraction, pretty much however you like.

Unlike content-type-identification which is done inline on attach, metadata analysis is done by ActiveStorage in a background ActiveJob. ActiveStorage::Attachment (the join object, not the blog), has an after_create_commit hook (reminding us that ActiveStorage never expects you to re-use a Blob db model with an altered bytestream/file), which calls blob.analyze_later (unless it’s already been analyzed).   analyze_later simply launches a perform_later ActiveStorage::AnalyzeJob with the (in this case) ActiveStorage::Blob as an argument.  Which just calls analyze on the blob.

So it, at least in theory, this can accommodate fairly slow extraction, because it’s in the background. That does mean you could have an attachment which has not yet been analyzed; you can check to see if analyzation has happened yet with analyzed? — which in the end is just an analyzed: true key in the arbitrary json metadata hash. (Good reminder that ActiveRecord::Store exists, a convenience for making cover methods for keys in a serialized json hash).

This design does assume only one bg job per model that could touch the serialized json metadata column exists at a time — if there were two operating concurrency (even with different keys), there’d be a race condition where one of the sets of changes might get lost as both processes race to 1) load from db, 2) merge in changes to hash, 3) save serialization of merged to db.  So actually, as long as “identified: true” is recorded in content-type-extraction, the identification step probably couldn’t be a bg job either, without taking care of the race condition, which is tricky.

I suppose if you changed your analyzer(s) and needed to re-analyze everything, you could do something like ActiveStorage::Blob.find_each(&:analyze!). analyze! is implemented in terms of update!, so should persist it’s changes to db with no separate need to call save.


ActiveStorage calls “variants” what I would call “derivatives” or shrine (currently) calls “versions” — basically thumbnails, resizes, and other transformations of the original attachment.

ActiveStorage has a very clever way of handling these that doesn’t require any additional tracking in the db.  Arbitrary variants are created “on demand”, and a unique storage location is derived based on the transformation asked for.

If you call avatar.variant(resize: "100x100"), what’s returned is an ActiveStorage::Variant.  No new file has yet been created if this is the first time you asked for that. The transformation will be done when you call the processed method. (ActiveStorage recommends or expects for most use cases that this will be done in controller action meant to deliver that specific variant, so basically on-demand).   processed will first see if the variant file has already been created, by checking processed?. Which just checks if a file already exists in the storage with some key specific to the variant. The key specific to the variant is  “variants/#{blob.key}/#{Digest::SHA256.hexdigest(variation.key)}“. Gives it some prefixes/directory nesting, but ultimately makes a SHA256 digest of variation.key.  Which you can see the code in ActiveStorage::Variation, and follow it through ActiveStorage.verifier, which is just an instance of ActiveSupport::MessageVerifier — in the end we’re basically just taking a signed (and maybe encyrpted) digest of the serialization of the transformation arguments passed in in the first place,  `{ resize: “100×100” }`.

That is, basically through a couple of cryptographic digests and some crypto security too, were just taking the transformation arguments and turning them into a unique-to-those-arguments key (file path).

This has been refactored a bit in master vs 5.2.1 — and in master the hash that specifies the transformations, to be turned into a key, becomes anything supported by image_processing with either MiniMagick or vips processors instead of 5.2.1’s bespoke Minimagick-only wrapper. (And I do love me some vips, can be so much more performant for very large files).  But I think the basic semantics are fundamentally the same.

This is nice because we don’t need another database table/model to keep track of variants (don’t forget we already have two!) — we don’t in fact need to keep track of variants at all. When one is asked for, ActiveStorage can just check to see if it already exists in storage at the only key/path it necessarily would be at.

On the other hand, there’s no way to enumerate what variants we’ve already created, but maybe that’s not really something people generally need.

But also, as far as I can find there is no API to delete variants. What if we just created 100×100 thumbs for every product photo in our app, but we just realized that’s way too small (what is this, 2002?) and we really need something that’s 630×630. We can change our code and it will blithely create all those new 630×630 ones on demand. But what about all the 100x100s already created? They are there in our storage service (say S3).  Whatever ways there might be to find the old variants and delete them are going to be hacky, not to mention painful (it’s making a SHA256 digest to create filename, which is intentionally irreversible. If you want to know what transformation a given variant in storage represents, the only way is to try a guess and see if it matches, there’s no way to reverse it from just the key/path in storage).

Which seems like a common use case that’s going to come up to me? I wonder if I’m missing something. It almost makes me think you are intended to keep variants in a storage configured as a cache which deletes old files periodically (the variants system will just create them on demand if asked for again of course) — except the variants are stored in the same storage service as your originals, and you certainly don’t want to purge non-recently-used originals!  I’m not quite sure what people are doing with purging no-longer-used variants in the real world, or why it hasn’t come up if it hasn’t.

And something that maybe plenty of people don’t need, but I do — ability to create variants of files that aren’t images: PDFs, any sort of video or audio file, really any kind of file at all. There is a separate transformation system called previewing that can be used to create transformations of video and PDF out of the box — specifically to create thumbnails/poster images.  There is a plugin architecture, so I can maybe provide “previews” for new formats (like MS Word), or maybe I want to improve/customize the poster-image selection algorithm.

What I need aren’t actually “previews”, and I might need several of them. Maybe I have a video that was uploaded as an AVI, and I need to have variants as both mp4 and webm, and maybe choose to transcode to a different codec or even adjust lossy compression levels. Maybe I can still use ‘preview’ function nonetheless? Why is “preview” a different API than “variant” anyway? While it has a different name, maybe it actually does pretty much the same thing, but with previewer plugins? I don’t totally grasp what’s going on with previews, and am running out of steam.

I really gotta get down into the weeds with files in my app(s), in an ideal world, I would want to be able to express variants as blocks of whatever code I wanted calling out to whatever libraries I wanted, as long as the block returned an IO-like object, not just hashes of transformation-specifications. I guess one needs something that can be transformed into a unique key/path though. I guess one could imagine an implementation had blocks registered with unique keys (say, “webm”), and generated key/paths based on those unique keys.  I don’t think this is possible in ActiveStorage at the moment.

Will I use ActiveStorage? Shrine?

I suspect the intended developer-user of ActiveStorage is someone in a domain/business/app for which images and attachments  are kind of ancillary. Sure, we need some user avatars, maybe even some product images, or shared screenshots in our basecamp-like app. But we don’t care too much about the details, as long as it mostly works.  Janko of Shrine told me some users thought it was already an imposition to have to add a migration to add a data column to any model they wanted to attach to, when ActiveStorage has a generic migration for a couple generic tables and you’re done (nevermind that this means extra joins on every query whose results you’ll have to deal with attachments on!) — this sort of backs up that idea of the native of the large ActiveStorage target market.

On the other hand, I’m working in a domain where file management is the business/domain. I really want to have lots of control over all of it.

I’m not sure ActiveStorage gives it to me. Could I customize the key/paths to be a little bit more human readable and reverse-engineerable, say having the key begin with the id of the database model? (Which is useful for digital preservation and recovery purposes).Maybe? With some monkey-patches? Probably not?

Will ActiveStorage do what I need as far as no-boundaries flexibility to variant creation of video/audio/arbitrary file types?  Possibly with custom “previewer” plugin (even though a downsampled webm of an original .avi is really not a “preview”), if I’m willing to make all transformations expressable as a hash of specifications?  Without monkey-patching ActiveStorage? Not sure?

What if I have some really slow metadata generation, that I really don’t want to do inline/foreground?  I guess I could not use the built-in metadata extraction, but just make my own json column on some model somewhere (that has_one_attachment), and do it myself. Maybe I could do that variants too, with additional app-specific models for variants (that each have a has_one_attached with the variant I created).  I’d have to be careful to avoid adding too many more tables/joins for common use cases.

If I only had, say, paperclip and carrierwave, I might choose ActiveStorage anyway, cause they aren’t so flexible either. But, hey, shrine! So flexible! It still doesn’t do everything I need, and the way it currently handles variants/derivatives/versions isn’t suitable for me (not set up to support on-demand generation without race conditions, which I realize ironically ActiveStorage is) — but I think I’d rather build it on top of shrine, which is intended to let you build things on top of it, than ActiveStorage, where I’d likely have to monkey-patch and risk forwards-incompatible.

On the other hand, if ActiveStorage is “good enough” for many people… is there a risk that shrine won’t end up with enough user/maintainer community to stay sustainable? Sure, there’s some risk. And relatively small risk of ActiveStorage going away.  One colleague suggested to me that “history shows” once something is baked into Rails, it leads to a “slow death of most competitors”, and eventually more features in the baked-into Rails version. Maybe, but…. as it happens, I kind of need to architect a file attachment solution for my app(s) now.

As with all dependency and architectural choices, you pays yer money and you takes yer chances. It’s programming. At best, we hope we can keep things clearly delineated enough architecturally, that if we ever had to change file attachment support solutions, it won’t be too hard to change.  I’m probably going with shrine for now.

One thing that I found useful looking at ActiveStorage is some, apparently, “good enough” baselines for certain performance/architectural issues. For instance, I was trying to figure out a way to keep my likely bespoke derivatives/variants solution from requiring any additional tables/joins/preloads (as shrine out of the box now requires zero extra) — but if ActiveStorage requires two joins/preloads to avoid n+1, I guess it’s probably okay if I add one. Likewise, I wasn’t sure if it was okay to have a web architecture where every attachment image view is going to result in a redirect… but if that’s ActiveStorage’s solution, it’s probably good enough.

Notes on deep diving with byebug

When using byebug to investigate some code, as I did here, and regularly do to figure out a complex codebase (including but not limited to parts of Rails), a couple Rails-related tips.

If there are ActiveJobs involved, ‘config.active_job.queue_adapter = :inline’ is a good idea to make them easier to ‘byebug’.

If there are after_commit hooks involved (as there were here), turning off Rails transactional tests (aka “transactional fixtures” before Rails 5) is a good idea. Theoretically Rails treats after_commit more consistently now even with transactional tests, but I found debugging this one I was not seeing the real stuff until I turned off transactional tests.  In Rspec, you do this with ‘config.use_transactional_fixtures = false’  in the rails_helper.rb rspec config file.

What “Just standard Rails” means to the University of Alberta libraries

I recently had a chance to speak with the development team at the University of Alberta about their development of their jupiter digital repository app (live, github).

UAlberta had a sufia 6 app in production that was a pretty stock “institutional repository holding PDFs. Around Fall 2015, they started trying to “catch up” to sufia 7 with PCDM etc. — to get features they believed would make it easier to incorporate more ‘digital collections’ content, and to just avoid stale non-maintained dependencies.

In Summer 2017, after having spent almost two years trying to get on sufia 7, with mounting frustrations and still seeming far from the finish line — and after having hired a few non-library-archives-museum-experienced but experienced Rails developers — the University of Alberta libraries development team decided on a radical new approach. They decided it wasn’t clear what Sufia was giving them to justify the difficulty they were having with it. They started over, trying to keep things as close to “ordinary Rails” as possible.

At that time, Fedora still was an institutional requirement.  So they didn’t toss out all of the samvera stack. They decided that they’d chop off the trunk as close to the bottom as they could while still getting tools for working with fedora, and to them that meant a hydra-works dependency, but few other hyrax dependencies.  They basically started the app over.

Within about 6 months of that effort (Early spring 2018) with approximately two full-time developers, they were live with their app (jupiter repo), and have been happy with it so far. But they also still haven’t gotten to the originally planned content beyond the IR-type PDFs, the scanned monographs, newspapers, etc. And have had some developer turnover. (Hey, they’re currently hiring y’all).

The jupiter app implementation

My understanding of how their app works is based on just an hour conversation, plus a few hours spent looking at their source code and internal docs — I may get some things wrong!

Jupiter seems to be to be a pretty simple app, a fairly basic idea of an “institutional repository”.  Most of the items are single PDFs, without children/members.  The software does support some items being composed of a list of files — but no “child works”.  The metadata seems relatively simple; some repeatable strings, but no nested/compound objects (ie, an attribute whose values are multi-property objects themselves). While there is some self-deposit, there is no complicated workflow, basically just an edit form.

The permissions model is relatively simple. Matt Barnett, a lead developer for much of this process (who was there for our conversation, but left the team soon after that) told me that originally some internal stakeholders had been pushing for a more complex permissions model. But knowing what a tar-pit trying to implement ACLs could be, Matt pushed back, and they ultimately implemented a simple model: There are owners who can edit the things they own, and admins who can edit everything, and that’s about it.  By virtue of their campus SSO system, they got “shared accounts” for free, so people could log into a shared account when they needed to share edit privs.

They had been using hydra-deratives for their simple derivative needs (maybe just a single thumbnail for each PDF?), but when ActiveStorage, part of Rails, was released, they began switching the code to that (may or may not be merged into master/deployed in repo yet as this gets published).

Fedora is still there, modeled with hydra-works.  The indexing to solr is whatever was built into hydra-works. They just wrote their own straightforward forms with simple_form.  They also do a lot of CSV-based ingest, which they just wrote code for, like even sufia users would I think.

They use UUID primary keys.

Their app does index to solr — using the general ActiveFedora indexing methods, I think, solrizer and all.  You can see that their indexer is pretty stock, it mostly just calls “super”.

All of their objects exist as ActiveRecord “draft” objects while they are being edited, through more or less ordinary Rails mechanisms. When they have multi-valued fields, they use postgres json arrays, rather than actual normalized schema (which would suggest a different table). I’m not sure what they need to do to get this to work with forms and controller updates. These active record objects seem to use something custom for collection memberships, rather than actual active record associations. So in these regards it’s not quite a totally ordinary activerecord modelling.

The objects have a life in activerecord, but are mirrored to fedora at certain life cycle points — I believe this is also what gets them into solr (using samvera/active-fedora solr indexing code).  The public-facing front-end is based entirely on data from solr — but not using Blacklight, simply writing Rails code to issue queries and handle responses to Solr (with Rsolr I think).

A brief overview of their architecture, by Matt before he left, focusing especially on the persistence stuff (since that’s the least “rails”-y and most custom stuff), can be found in their public repo, here.   Warning, it is pretty negative about samvera/sufia/active_fedora, gird yourself. You can see there they have done a couple custom local things to make the ActiveFedora objects and classes to some extent mimic ActiveRecord, to make using them in Rails easier, trying to encapsulate the fedora-specific stuff inside API boundaries. While at a high level this is what ActiveFedora’s goal is — their implementation is simpler, smaller, less powerful and custom-fit to what they need. We can just say they’re happier with how their local implementation turned out. They also explicitly wrote it to support potential future elimination of fedora altogether.

Matt said if he had to do it over, he might have pushed harder on stripping fedora out too, and just having everything in postgres. And that is something the team still plans to look at seriously for the future.

So what does “just a rails app” mean?  And how do you deal with increased complexity of your requirements?

The most useful thing for me in the conversation was that Matt pushed back on my outline of a potential plan, saying I was still undertaking too much abstraction.

The U Alberta team suggested that I should keep it even simpler, with less DRY abstraction (and thus less tools that could be shared between institutions/apps), and more just building your app for what you need right now.  You can see some of this push-back, specifically in the context of what U Alberta needs, in another document he wrote before he left Alberta in the jupiter repo, on notes for future expansion. It is really worth reading,  to see an argument from even more extreme simplicity, from a developer experienced with Rails but not “infected” with “how libraries do things”   But beware, it’s not shy about our community shibboleths.

We developers (and we library folks) are really drawn the abstraction, generalization, and shared tools that meet as many needs as possible.  It can sometimes lead us astray. It is actually very common advice in software engineering to stick to what you  actually need today, for your app you are developing (you know your business/user needs and which are the highest priorities to business value, right?).  “Do the simplest thing that could possibly work”, “You aren’t gonna need it.” It keeps us honest.

However, I also think it’s possible to code yourself into a corner this way, where your app was fine for exactly what you needed then, but when you need one more thing… you can find you need to re-write large parts of it to accommodate.  In some ways this is my experience with current samvera stack, early fundamental architectural decisions pen us in when we have new use cases. That kind of problem stays smaller when you avoid  harder-to-change shared code, but I don’t it goes away entirely. Trying to play for the future always entails some “YAGNI” risk, but the more domain knowledge and experience you have… maybe you can do better at seeing where you are going and planning for it?

Just some of the specific component plans Matt was skeptical of…

attr_json vs. Just Plain ActiveRecord schemas

The jupiter app already has an activerecord implementation which isn’t strictly “ordinary” activerecord, in the sense they serialize multi-valued/repeatable fields to json arrays,  rather than needing a separate table for each attribute as an actual normalized schema would require. Plus the logic I don’t entirely understand but think might not be ordinary AR associations around collection and “community” membership.

So this already gets you away from the strict “ordinary Rails” path — I’m not sure how the JSON array fields are handled by form submission, or querying if you wanted to do querying (it’s possible all their querying is solr-based, which is familiar to samvera-land, and also unusual for “ordinary rails”).

At my institution, we already have the need for complex repeatable data–a simple example would be repeatable “inscription” notations, each of which has the text of the inscription and the location in the book.  So not just an array of strings, but perhaps an array of hashes.  Weiwei Shi (Digital Initiatives Applications Librarian) suggested in a follow-up message, “We can use the JSON data type to support a more complex data structure as needed” — that is, if I understand it, they are contemplating actual postgres representation somewhat similar to what I am with attr_json, if they end up needing complex json. Matt’s second document tries to draw a line between how they are doing things in “more-or-less completely standard Rails models” and the way I was proposing to do things — I’m not sure I actually see such a great distinction, the representations in postgres to me seem pretty similar, neither of which is standard Active Record patterns.

They do have each attribute in a separate column, whereas I was contemplating putting them all in a single json column. Their approach does have advantages for avoiding update race conditions (or needing optimistic locking to avoid them).  I perhaps should consider that, as an extra feature to attr_json. Although either way you get columns you can’t necessarily use ordinary ActiveRecord querying or form-based update with.

Everyone seems to doubt that attr_json is really going to work, ha. The skepticism towards newly invented non-trivial dependencies is justified, but I can point out attr_json is a lot fewer lines of code than ActiveFedora, or even Valkyrie —  I think it is a risk, but it’s scoped carefully and implemented robustly, and I can’t figure out any other way I’m confident would be simpler to actually meet our modeling needs — once you start doing this complex json stuff, I think you’ll find that it doesn’t behave like “ordinary rails” — for forms/updates, validations, etc. — and rather than hack it out on a case by case basis, it makes a lot of sense to me to solve the problem with something like attr_json, encapsulating the “not really like ordinary ActiveRecord” stuff as much as possible.

The other option of course would be an actual normalized schema, with one table per attribute. For our “inscriptions” that table might have two columns (text and location), for a simple repeatable alternate title it might only have one. It’s going to be a mess to try to prevent n+1 queries and keep db access performant.  I am encouraged I’m not on an insane track by the fact that even the U Alberta is using JSON serializations in postgres, not actually ordinary normalized data — I think as your data gets more complex (not just array of primitives, but need serialization as arrays of hashes), you’re really going to want something like attr_json.  But maybe I’m wrong.

And for better or worse, I have trouble giving up entirely on the idea of some shared tools to make things easier for others in the community too — because it’s fun and rewarding, and why should we all constantly re-invent the wheel? But it’s good to be reminded of the dangers that lie in that direction.


I’m not sure if Matt mentioned this specifically, but I realize I have added a lot of non “basic ActiveRecord” complexity to the data modelling my plan in order to support the PCDM-ish association modeling, where a work has “members” and the members can be either works of themselves (which can have multiple members) or single file objects, and they all need to be kept in order.

U Alberta’s app doesn’t have that. A work can have a list of files, the end.

At my institution I actually spent a while trying to convince stakeholders that we didn’t need that either, but it was the one thing I could make no headway on — eventually they convinced me we did, to accomplish our actual business goals.

If you need this, I can’t figure out any way to get there in an “ActiveRecord-first”-ish direction, except either single-table-inheritance or polymorphic associations.  Both of which are some of the oddest and least reliable corners of ActiveRecord. Of the two, I think STI is probably least weird and most likely to do more of standard use cases minimizing number of db queries. (Valkryie’s approach is somewhat similar in how it uses the DB to single-table inheritance, but without actually using that AR feature).


Matt thought that shrine might do more than ActiveStorage now, but history shows things built into Rails will probably expand and get better. (Yes, but it’s unclear to me how to make audio or video “variants” or derivatives with ActiveStorage, which my place of work predicts to need very shortly. If we are really ruthless about only what we need right now, are we going to have to just rewrite it as soon as we need another thing? There are no easy answers, “YAGNI” is simpler when it’s all about software you are writing yourself and not dependencies… but there are grey areas too).

But I’m not certain about this, after trying to help shrine developers enhance the versions/derivatives functionality to better support some flexibility we need as to storage locations and point-in-time of creation. The answer may just be trying to write an app which adds on locally to do exactly what it needs (whether in terms of shrine  or ActiveStorage), without trying to make a shareable general purpose tool?


Matt was very suspicious of using Blacklight at all, he found that it was quite easy for them to just write the UI they needed based on Solr responses. And their app certainly is as good as many sufia/hyrax apps (it even has an actual search-within-the-collection feature on collection pages, which I think our sufia 7 app didn’t, although I think latest hyrax does).

But remember my inability to entirely give up on the idea of a shareable toolkit? I really would like something that serves as “scaffolding” that gives you an app out of the box with basic file ingest, metadata edit, and search out of the box. And Blacklight is a way to do this. And I think my plan to segregate Blacklight from the rest of the app (it should be a dependency you can switch out) by immediately fetching records from postgres corresponding to solr search results — may be able to keep Blacklight from “infecting” the rest of the app with Blacklight-isms, as Matt was worried it could.

How simple is simple?

It was useful to have Matt call my bluff to some extent: What I have been hypothetically proposing isn’t really “just plain rails”.  But it’s a lot closer than current samvera, or even valkyrie.

I think some of the valkyrites think valkyrie’s departures from “ordinary Rails” are a a positive, that they can use different patterns to do better than Rails…  which is not a unique idea to them…  but I think is a bit hubristic, to think you can invent something better (and easier to onboard new developers with?) than Rails. (I also wonder if the valkyrites, if freed from the need to support fedora too, would simply use hanami?)

The same charges of hubris can be brought to my initial sketch of plans though — it was useful to be challenged from the “left” of “you’re still not simple enough” by Matt. I am so used to thinking about my in-formation plans as a/the simple alternative to, well, samvera or even valkyrie… it was a refreshing and needed corrective to be talking to Matt who thought my plans were still too much abstraction, not as simple as possible, not sticking close enough to implementing only what was needed for my business needs. On the one hand, it makes me worried he’s right; on the other, it makes me more comfortable to be in a nice middle ground of moderation with people advocating things on both sides or me, both heavier-weight and lighter-weight, sharing more code with the LAM digital collections community on one side, and sharing basically none on the other.

Really, “just plain rails” or “just plain [any code]” is to some extent a mirage, or an aspiration. We’re always writing code when we build a Rails app.  We’re always using some dependencies. While there can be a false economy in trying to share all your code in hopes of minimizing the amount of code that has to be written in aggregate (it often doesn’t work out that way because building good re-usable abstractions is hard) — there can also of course be a false economy in never using anyone elses dependency, and “not invented here” syndrome.  And if you’re writing it yourself — it’s writing abstraction layers that are potentially giving you not-worth-it complexity, whether you keep them in the app or make them into a gem. But abstraction layers are also what allow us to do complex things that we can still comprehend as humans — when it works. 

Software is still a craft. As Alberta needs to add additional features, with their aspirations to add a lot more digital-collections-like content — it’s going to take software craftsmanship to figure out how to keep it simple.  What I like about U Alberta’s approach is they realize this.  They realize they are an internal development shop, and need to let developers do what developers do — rather than have non-technical stakeholders making technical decisions for non-technical reasons.  (At one point someone said: After having been ‘burned’ before, they are very suspicious of using common/shared software, vs. just writing their app — which is part of their skepticism towards attr_json —  I think they’re not wrong).

One thing letting an internal development shop excel entails is figuring out how to recruit and retain a solid development team with limited budget, which is one reason Alberta is trying to be so ruthless about keeping it simple and “standard”.  One phrase I heard repeated was “industry-standard onboarding”, which I think also implies needing to be accessible to relatively “junior” new hires, which requires keeping your stack simple and standard. (That is, traditional-samvera or valkyrie-using institutions do not necessarily have any less challenge here and may have more, as for instance Steven Anderson of BPL argued)

(But I wonder if on-boarding a new developer to an existing project that has a very small dev team is going to be challenging across the industry!  I am not convinced that “Where the Rails community has a diversity of opinions on an approach, we should prefer the approach espoused by the Rails core team” (from a Matt/Alberta manifestoalways and necessarily leads to the simplest code or easiest to on-board new developers with. sometimes you can build a monster in the pursuit of not doing something novel…. the irony, right? But it’s always worth considering the trade-offs).

I definitely might end up re-orienting.  For instance, Matt reminded me of something I knew but tried to forget even when writing out my notes for a possible plan: A generalized permissions/ACL system is a craggy shore that many ships have crashed upon. Should I just write for my app the permissions we need instead? After doing some more business analysis to figure out what they are?  Perhaps. More broadly, if we end up trying to implement this “toolkit” and I’m running into troubles and worrying our reach exceeded our grasp — retreat to just the app good enough for what we need right now is always a valid escape hatch.

U Alberta’s story, where they’ve been working on this app with a very different approach for over a year, and so far are happy —  is another good data point reminding us that dissatisfaction with the samvera stack is not new, especially institutions that have developers with wider Rails experience have been suspicious of the value propositions of fedora and samvera for some time.  And that there are a variety of approaches being tried. We all need community to bounce our ideas off of and get feedback, especially those of us who operate in 2-4 person development shops need more than we may get internally. I’m so glad they were willing to spend some time talking to me.  And I highly encourage reading all of Matt/U Alberta’s somewhat iconoclastic analysis docs, as one way of considering other perspectives.  I’m not sure if I can find the time, but I’d kind of like to “onboard” myself into their codebase, and understand how it works better as one example.


Thanks to the whole U Alberta team, and especially Peter Binkley, Weiwei Shi, and Matt Barnett, for spending time explaining what they were up to to me. Thanks to Peter and Weiwei for reviewing this post for any terrible errors.  All remaining mistakes and wrong opinions are my own.

“Whatever Happened to the Semantic Web?”

I’ve been enjoying some of the computing history articles, and especially internet history articles, on twobithistory.org. But this one hits especially close to home I think, “Whatever Happened to the Semantic Web?”

The problem, in Swartz’ view, was the “formalizing mindset of mathematics and the institutional structure of academics” that the “semantic Webheads” brought to bear on the challenge. In forums like the World Wide Web Consortium (W3C), a huge amount of effort and discussion went into creating standards before there were any applications out there to standardize. And the standards that emerged from these “Talmudic debates” were so abstract that few of them ever saw widespread adoption. The few that did, like XML, were “uniformly scourges on the planet, offenses against hardworking programmers that have pushed out sensible formats (like JSON) in favor of overly-complicated hairballs with no basis in reality.” The Semantic Web might have thrived if, like the original web, its standards were eagerly adopted by everyone. But that never happened because—as has been discussedon this blog before—the putative benefits of something like XML are not easy to sell to a programmer when the alternatives are both entirely sufficient and much easier to understand…


The long effort to build the Semantic Web has been said to consist of four phases.7 The first phase, which lasted from 2001 to 2005, was the golden age of Semantic Web activity. Between 2001 and 2005, the W3C issued a slew of new standards laying out the foundational technologies of the Semantic future.

The most important of these was the Resource Description Framework (RDF). …

In 2006, Tim Berners-Lee posted a short article in which he argued that the existing work on Semantic Web standards needed to be supplemented by a concerted effort to make semantic data available on the web…  Berners-Lee’s article launched the second phase of the Semantic Web’s development, where the focus shifted from setting standards and building toy examples to creating and popularizing large RDF datasets. Perhaps the most successful of these datasets was DBpedia, a giant repository of RDF triplets extracted from Wikipedia articles….

…The third phase of the Semantic Web’s development involved adapting the W3C’s standards to fit the actual practices and preferences of web developers. By 2008, JSON had begun its meteoric rise to popularity…. issued a draft specification of JSON-LD in 2010. For the next few years, JSON-LD and an updated RDF specification would be the primary focus of Semantic Web work at the W3C….

….Today, work on the Semantic Web seems to have petered out. The W3C still does some work on the Semantic Web under the heading of “Data Activity,” which might charitably be called the fourth phase of the Semantic Web project. But it’s telling that the most recent “Data Activity” project is a study of what the W3C must do to improve its standardization process.13 Even the W3C now appears to recognize that few of its Semantic Web standards have been widely adopted and that simpler standards would have been more successful. The attitude at the W3C seems to be one of retrenchment and introspection, perhaps in the hope of being better prepared when the Semantic Web looks promising again….


And so the Semantic Web, as colorfully described by one blogger, is “as dead as last year’s roadkill.”14 At least, the version of the Semantic Web originally proposed by Tim Berners-Lee, which once seemed to be the imminent future of the web, is unlikely to emerge soon. That said, many of the technologies and ideas that were developed amid the push to create the Semantic Web have been repurposed and live on in various applications….

…So the problems that confronted the Semantic Web were more numerous and profound than just “XML sucks.” All the same, it’s hard to believe that the Semantic Web is truly dead and gone. Some of the particular technologies that the W3C dreamed up in the early 2000s may not have a future, but the decentralized vision of the web that Tim Berners-Lee and his follow researchers described in Scientific American is too compelling to simply disappear. Imagine a web where, rather than filling out the same tedious form every time you register for a service, you were somehow able to authorize services to get that information from your own website. Imagine a Facebook that keeps your list of friends, hosted on your own website, up-to-date, rather than vice-versa. Basically, the Semantic Web was going to be a web where everyone gets to have their own personal REST API, whether they know the first thing about computers or not. Conceived of that way, it’s easy to see why the Semantic Web hasn’t yet been realized. There are so many engineering and security issues to sort out between here and there. But it’s also easy to see why the dream of the Semantic Web seduced so many people.

Whatever Happened to the Semantic Web?



Oh my, I just realized he cited MY famous blog on linked data in a note. I did not realize that until I actually went and looked at all the footnotes. He cites me for the comment “as dead as last year’s road kill”, but I knew I wouldn’t say something like that! And I did not. I was citing a comment on HackerNews, which I properly quoted, cited, and linked to! It is not something I said or my opinion… exactly.  (since corrected).

The HackerNews comments on this article are… interesting.

Notes on study of shrine implementation

Developing software that is both simple and very flexible/composable is hard, especially in shared dependencies. Flexiblity and composability often lead to very abstract, hard to understand architecture. An architecture custom-fitted for particular use cases/domains has an easier time of remaining simple with few moving parts. I think this is a fundamental tension in software architecture.

shrine is a “File Attachment toolkit for Ruby applications”, developed with explicit goals of being more flexible than some of what came before. True to form, it’s internal architecture can be a bit confusing.

I want to work with shrine, and develop some new functionality based on it, related to versions/derivatives (hopefully for submission to shrine core), requiring some ‘under the hood’ work. When I want to understand some new complicated architecture (say, some part of Rails), one thing I do is trace through it with a debugger (while going back and forth with documentation and code-reading), and write down notes with a sort of “deep dive” tour through a particular code path. So that’s what I’ve done here, with shrine 2.12.0. It may or may not be useful to anyone else, part of the use for me is in writing it; but when I’ve done this before for other software others have found it useful, so I’ll publish it in case it is (and so I can keep finding it again later to refer to it myself, which I plan to do).

Some architectural overview

shrine uses a plugin system based on module mix-in overrides (basically, inheritance),  which is not my favorite form of extension (many others would agree). Most built-in shrine func is implemented as plugins, to support flexible configuration. This mixin-overridden-methods architecture can lead to some pretty tightly coupled and inter-dependent code, even in ostensibly independent plugins, and I think it has sometimes here.  Still, shrine has succeeded in already being more flexible than anything that’s come before (definitely including ActiveStorage). This is just part of the challenge of this kind of software development, I don’t think anyone else starting over is gonna get to a better overall place, I still think shrine is the best thing to work with at present if you need maximal flexibility in handling your uploaded assets.

Shrine has a design document that explains the different objects involved. I still found it hard to internalize a mental model, even with this document. After playing with shrine for a while, here’s my own current re-stating of some of the primary objects involved in shrine (hopefully my re-statement doesn’t have too many errors!).

An uploader (also called a “shrine” object, as the base class is just Shrine) is a  stateless object that knows how to take an IO stream and persist to some back-end.   You generally write a custom uploader class for your app, because a specific uploader is what has specifics about any validationtransformationmetadata extraction, etc, in ingesting a file. An uploader is totally  stateless though (or rather immutable, it may have some config state set on initialize) — it’s sort of a pipeline for going from an IO object to a persisted file.  When you write a custom uploader, it isn’t hard-coded to a particular persistent back-end, rather a specific storage object is injected into an individual uploader instance at runtime.

A shrine attacher is the object that has state for the file. An attacher knows about the model object the file is attached to (a specific attacher instance is associated with a specific model instance).  An attacher has two uploaders injected into it — one for the temporary cache storage and one for the permanent store storage. These are expected to be the same class of uploader, just with different storages injected.  An attacher has ORM plugins that handle actual persistance to the db, as well as tracking changes, and just everything that needs to be done regarding the state of a particular file attachment.

In a typical model, you can get access to the attacher instance for an asset called avatar from a method called avatar_attacher. The avatar method itself is essentially delegated through the attacher too. The attacher is the thing managing access and mutation of the attached files for the model.  If you ask for avatar_attacher.store or avatar_attacher.cache, you get back an uploader object corresponding to that form of storage — to be used to process and persist files to either of those storages.

How do those methods avatar and avatar_attacher wind up in the model?  A ruby module is mixed in to the model with those methods. Shrine calls this mix-in module an “attachment”. When you do include MyUploader::Attachment.new(:name_of_column) in your model, that’s returning an attachment module and mixing it into your model.  I find “attachment” not the most clear name for this, especially since shrine documentation also calls an individual file/bytestream an “attachment” sometimes, but there it is.

And finally, there’s the simple UploadedFile, which is simply a model object representing an uploaded file! It can let you get various information about the uploaded file, or access it (via stream, downloaded file, or url).  An UploadedFile is more or less immutable. It’s what you get returned to you from the (eg) avatar method itself.  An UploadedFile can be round-trip serialized to json — the json that is persisted in your model _data column. So an UploadedFile is basically the deserialized model representation of what’s in your _data column.

It’s important to remember that shrine uses a two-step file persistence approach. There is a temporary cache storage location that has files that may not pass validation and may not yet have been actually saved to a model (or may never be).  The file can be re-displayed to a user in a validation error when it’s in “cache” for instance. Then when the file is actually succesfully permanently persisted attached to a model, it’s in a different storage location, called the store.

Tracing what happens internally when you attach a file to an ActiveRecord model using shrine

Most of this will be relevant regardless of ActiveRecord, but I focused on an ActiveRecord implementation. My demonstration app used to step through uses a bog-standard Shrine uploader, with no plugins (but :activerecord).

class StandardUploader < Shrine
  plugin :activerecord

Just to keep things consistent, we attach to a model on the “standard_data” column, with accessor called “standard”.

  include StandardUploader::Attachment.new(:standard)

What is shrine doing under the hood, what are the different parts, when we assign a new file to the model?  We’ll first do model.standard = File.open("something"), and then model.save.

First model.standard = File.open("something")

The #standard= is provided by the attachment module mix-in, and it calls  asset_attacher.assign(io_object)

If it’s NOT a string, assign first does: `uploaded_file = [attacher.]cache!(value, action: :cache)` (What’s up with ‘not a string’? A string is assumed to be serialized json from a form representing an already existing file. The assign method assumes it’s either an IO object or serialized JSON from a form; there are other methods than `assign` to directly set an UploadedFile or what have you).

The cache! method calls uploaded_file = cache.upload(io)cache points to an instance of our StandardUploader configured to point at the configured ‘cache’ (temporary) storage, so we’re calling upload on an uploader.

[cache uploader]#upload calls processed to run the IO through any uploader-specific processing that is active on the “cache” stage.

Then it calls #store on itself, the uploader assigned as `cache`. “Uploads the file and returns an instance of Shrine::UploadedFile. By default the location of the file is automatically generated by #generate_location, but you can pass in `:location` to upload to a specific location. [ie path, the actual container storage location is fixed though]”  The implementation is via an indirection through #_store, which:

1.  calls get_metadata on itself (an uploader), which for a new IO object calls extract_metadata, which is overridden by custom metadata plugins. So metadata is normally assigned at the cache/assignment phase. This is perhaps so the metadata can be used in validation?  Not sure if there’s a way to make metadata be in the background, and/or be as part of the promotion step (when copying cache to store on save) instead. There’s some examples suggesting they are relevant here, but I don’t really understand them.

2. Calls #put on itself, the uploader. put by default does nothing but call #copy on the uploader, which actually calls #upload on the actual storage object itself (say a Shrine::Storage::FileSystem), to send the file to that storage adapter — in this case for the configured cache storage, since we started from cache on the attacher. (Some plugins may override put to do more than just call copy). 

3. Converts into a shrine UploadedFile object representing the persisted file, and returns that.

So at this point, after calling attacher.cache!, your file has been persisted to the temporary “cache” storage. attacher.cache! purely deals with the stateless uploader and persisting the file; next is making sure that is recorded in your model _data attribute.

[attacher].assign then does ‘[attacher.]set(uploaded_file)’, where uploaded_file is what was returned from the previous cache! call. set first stores the existing value (which could be nil or an an UploadedFile) in the attacher instance variable @old, (in part so it can be deleted from storage on model persistence, since it’s been replaced).  And then calls _set to convert the UploadedFile to a hash, and write it to the _data model attribute — so it’s there ready for persistence if/when the model is saved.

So after assignment (model.standard = File.open("whatever")), the file is persisted in your “cache” storage. The in-memory model has asset_data that points to it. But nothing about that is persisted to your model’s persistence/ORM.  If the model previously had a different file attached, it’s still there in the store storage.

Let’s see how persistence of the new file happens, by tracing the ActiveRecord ORM plugin specifically, when you call model.save.  First note the active_record plugin makes sure shrine’s validations get used by the model, so if they fail, ActiveRecord’s save is normally going to get a validation failure, and not go further. If we made it past there:

In an active_record before_save, it calls attacher.save if and only if the attacher is changed?, meaning has set the @old ivar of previous value (could be nil previous value, but the ivar is set). However, the default/core implementation of save doesn’t actually do anything — this seems mainly here as a place for shrine plugins to hook into actually “before_save”, in an ORM-agnostic way.  (Might have been less confusing to call it before_save, I dunno).  The file is not moved to the permanent storage (and the old file deleted from permanet storage) until after the model has been succesfully persisted.

Then ActiveRecord’s own save has happened — the file data representing the new file persisted in temporary cache has now been persisted to the database.

Then in an active_record after_commit, finalize is called on the attacher. finalize is only called if  @old  is set — so only if the attached file was changed, basically.

The [attacher.]finalize method itself immediately returns if there is no “@old” instance variable set. (So the check with changed? in the hook is actually redundant, even if you call finalize every time, it’ll just return. Unless plugins change this).

Then finalize calls [attacher.]replace. Which — if the @old instance variable is not nil (in which it’s an UploadedFile object), and the object was in the cache storage (it must be in store storage; checked simply by checking the storage_key in the data hash) deletes the old value. “replace” in this case actually means “delete old value” — it doesn’t do anything with the new value, whether the new value is in cache or store. (not to be confused with a different #replace method on UploadedFile, which actually only deals with uploading a new file. These are actually each two halves of what I’d think of as “replacement”, and perhaps would have best had entirely different names — especially cause they both sound similar to the different “swap” method). 

The finalize method removes the @old ivar from the attacher, so the attacher no longer thinks it has an un-persisted change. (would this maybe be safer AFTER the next step?)

finalize calls `_promote(action: :store) if cached?` — that is, if the current UploadedFile exists, and is associated with the cache store.   [attacher.]#_promote just immediately calls promote —  both of these methods can take an uploaded_file argument, but here they are not given one, and default to the current UploadedFile in this attacher, via get

[attacher.]promote does a `stored_file = store!(uploaded_file, **options)`.  Remember the `cache!` method above? `store!` is just the same, but on the uploader configured as `store` storage instead of `cache` storage — except this time we’re passing in an UploadedFile instead of some not-yet-imported io object. Metadata extraction isn’t performed a second time, because, get_metadata has special behavior for UploadedFile input, to just copy existing metadata instead of re-extracting it.

At this point, the file has been copied/moved to the ‘store’ storage — but another copy of the file may still exist in cache​ storage (in some cases where the cache and store storages are compatible, the file really was moved rather than copied though), and no state changes have been made at all to the model, either in-memory or persisted, to point to this new file in permanent storage.

So to deal with both those things, [attacher].promote calls [attacher.]swap, which is commented as “Calls #update, overriden in ORM plugins, and returns true if the attachment was successfully updated.” In fact, the over-ridden attacher.update in the activerecord plugin just calls super, and then saves the AR model with validate:false. (I am not a fan of the thing going around my validations, wonder what that’s about).

Default update(uploaded_file) just calls _set(uploaded_file).

_set pretty much just converts the UploadedFile to it’s serializable json, and then calls write.

write just sets the model attribute to the serializable data (it’s still not persisted, until it gets to the ORM-specific update, where as a last line the model with new data is persisted).

so I think attacher.swap actually just takes the UploadedFile, serializes it to the _data column in the model, and saves/persists the model. Not sure why this is called swap. I think it might be more clear as “update” — oops, but we already have an update, which is by default all that swap calls. I’m not sure the different intent between swap and update, when you should use one vs the other.  (This is maybe one place to intervene to try to use some kind of optimistic or pessimistic locking in some cases)

If swap returns a falsey value (meaning it failed), then promote will go and delete the file persisted to the store storage, to try and keep it from hanging around if  it wasn’t persisted to model.  I don’t totally understand in what cases swap will return a falsey value though. I guess the backgrounding plugin will make it return nil if it thinks the persisted data has changed in db (or the model has been deleted), so a promotion can’t be done.

overview cheatsheet

pseudo-code-ish chart of call stack of interesting methods, not real code

model.avatar=(io)   =>  avatar_attacher.assign(io)

↳ uploaded_file = avatar_attacher.cache!(io)

↳  avatar_attacher.cache.upload(io) => processes including extracting metadata and persists to storage, by calling avatar_attacher.cache.store(io)

↳ io = uploader.processed(io)

↳ io = uploader.store(io) => via uploader._store(io)

↳ get_metadata

↳ uploader.put(io) => actually file persists to storage

returns an UploadedFile

↳ avatar_attacher.set(uploaded_file)

↳ stores previous value in attacher ivar “@old”, puts serialized UploadedFile in-memory avatar_data attribute


an activerecord before_save triggers avatar_attacher.save iff attacher.changed? (has an @old ivar). Core attacher.save doesn’t do anything, but some plugins hook in.

activerecord does the save, and commit.

an active_record after_commit triggers avatar_attacher.finalize iff attacher.changed?

↳ attacher._promote/promote iff  attacher.changed?

↳ stored_file = avatar_attacher.store!( UploadedFile in-memory )

↳ see above at cache! — extra metadata, does other processing/transformation, persists file to store storage, updates in-memory UploadedFile and serialization.

 ↳ attacher.swap(newly persisted UploadedFile)

↳ attacher.update(newly persisted UploadedFile) => just calls _set(uploaded_file), which properly serializes it to in-memory data, and then in an activerecord plugin override, persists to db with activerecord.

Some notes

On method names/semantics

“Naming” things is often called (half-jokingly half-serious) one of the hardest problems in computer science, and that is truer the more abstract you get. Sometimes there just aren’t enough English words to go around, or words that correctly convey the meaning. In this architecture, I think both the replace methods probably should have been named something else to avoid confusion, as neither one does what I’d think of as a “replace” operation.

In general, if one needs to interact with some of these methods directly (rather than just through the existing plugins), either to develop a new plugin or to call some behavior directly without a plugin being involved — it’s not always clear to me which method to use. When I should use swap vs update , which in the base implementation kind of do the same thing, but which different plugins may change in different ways? I don’t understand the intended semantics, and the names aren’t helping me. (promote is similar, but with an UploadedFile which hasn’t yet been processed/persisted? Swap/update takes an UploadedFile which has already been persisted, for updating in model).

It is worth noting that all of these will both change the referenced attached file on a model and persist the whole model to the db. If you just want to set a new attached file in the in-memory model without persisting, you’d use “attacher.set(uploaded_file)” — which requires an UploadedFile object, not just an IO. Also if you call set multiple times without saving, only the penultimate one is in the @old variable — I’m not sure if that can lead to some persisted files not being properly deleted and being orphaned?

Shrine plugins do their thing by overriding methods in the core shrine — often the methods outlined above. Some particularly central/complicated plugins to look at are backgrounding and versions (although we’re hoping to change/replace “versions”) — they are very few lines of code, but so abstract I found it hard to wrap my head around.  I found that the understanding of what unadorned base shrine does above was necessary  to truly understand what these plugins were doing.

Are there ways to orphan attached files in shrine?  That is, a file still stored in a storage somewhere, but no longer referenced in a model?  For starters the “cache” storage is kind of designed to have orphaned files, and needs to have old files cleaned out periodically, like a “tmp” directory. While there is a plugin designed to try to clean up some files in “cache”, they can’t possibly catch everything — like a file in “cache” that was associated with a model that was never saved at all (perhaps cause of validation error) — so I personally wouldn’t bother with it, just assume you need to sweep cache, like the docs suggest you do.

Are there other ways for files to end up orphaned in shrine, including in the “store” storage? If an exception is raised at just the wrong time?  I’m not sure, but I’d like to investigate more. An orphaned file is gonna be really hard to discover and ever delete, I think.


Proposed Rails-based digital collections developer’s toolkit

In my last post, I explained my read of the lay of the samvera land, and why I’m interested in pursuing an alternate approach. We haven’t committed to the path I will outline, but are considering it.

First we should say that I am coming from the assumption of an institution that does want to do local development, to get full control over the application and be able to custom-fit it to their needs. Having a local development team requires resources and perhaps some organizational cultural changes to properly involve technical decision-makers in technical decisions. But it’s my own opinion that developing with hydra/samvera (or anything based on Rails) has always required a local development team; if you don’t want one, you would be better off looking at a more off-the-shelf, perhaps proprietary product.

So, reacting to experiences with what have seemed to me to be overly-complex architectures, one question to start from is — can a development team “just build a Rails app”?  Maybe, but I think some common patterns and needs in digital collections/repositories can be identified, to be turned into shared dependencies, to better organize the initial app and hopefully provide efficiencies for more apps over time.

If the design goal of an architecture of re-usable abstractions is to optimize between maximized code re-use (avoid re-inventing the wheel), and minimized architectural complexity (number of abstractions, complexity of abstractions, general number of novel mental-models) — I believe we can make a go of it by sticking as closely to Rails as possible, trying to add things carefully that have clear benefit for our domain.

One possibly counter-intuitive proposition here is that, rather than trying to share as much code as possible and minimize code in local apps, we can make end-to-end development more efficient and less costly by putting less code into shared dependencies, so we can spend more time making those shared modular components as well-designed for re-use and composability as possible.  Fighting with integration into a complex architecture (that includes many features your app doesn’t need) can easily erase any hypothetical gains from shared code.

So the proposal here is neither to provide a complete working app or “solution bundle” (which might be hyrax approach), nor to write a completely custom app with no domain-specific parts to be shared with the community (which might be what Steven Anderson was at one point suggesting for BPL, or what the “bespoke” Valkyrie-based apps lean towards at present), but instead to write a local app in tandem with some relatively small and tight shareable, re-usable, composable, modular components.  (This is similar to the original “hydra head” approach;  we think we can do this more succesfully as a result of what we’ve learned, and technical choices to keep things much more simpler, recognizing that a local development team will need to do development).

The hypothesis or proposition is that this can get us to an efficient cost-effective architecture for building a digital collections site ultimately quicker than trying to get there from iterating on existing code. Because we can try to be ruthless about minimizing abstractions and code complexity from the start, with the better knowledge of our domain we now have — avoiding legacy code that arguably already has some unneccessary complexity baked in at a fundamental level.

The trade-off, is that by starting from a “clean slate”, we have a long period of developing architecture with no immediate pay-off in adding features to our production app.  There is a hump of lowered productivity before reaching — if we are successful — the goal of overall increased productivity.  This is a real cost, and a real risk that our approach may not be as beneficial as we think — you have to take some risks to achieve great benefits.

This post will lay out a proposed plan in three parts:

  1. Some principles/goals, illustrating further what we mean by a Rails-based developer’s toolkit for our domain
  2. A high-level outline/blueprint of the architectural components I currently anticipate.
  3. Some analysis of the benefits and especially risks and potential reasons for failure of this plan.

If you are interested in this plan in any way, please do get in touch. 

Principles and Goals: What do we intend by “A Rails-based developer’s toolkit”?

The target audience is developers. Developers will use this tool to put together a digital collections/repository app.   It will not produce “shrinkwrap” software for non developers, this is tools for developers.

⇒ Realistically probably minimally requiring a team of 2-4 technical staff (likely including a devops/sysadminy role). I suspect this is what is generally required for a successful hyrax or other samvera project already/too.  By acknowledging there will be need for local development team, we can avoid the complexity of trying to make “declarative-config-only” layers.

⇒ “developers” doesn’t necessarily mean expert ruby/rails developers. We can try to target relatively beginner Rails devs, including those for whom this is their first Rails or ruby project. As with any endeavor, the more experience and skill you have, the more efficient and high-quality work you can do, but I am optimistic we can make the bar fairly low for getting started.

We try to add tools to make “just building a Rails app” easier for digital collections/repository domain, supplying components based on common needs in this domain.  But where possible, we do things “The Rails Way”, what is most natural or typical or common for Rails apps, providing enhancements to it rather than replacements.

⇒ Things people often do on top of Rails, such as “form objects” should be options that are ideally no harder to choose to use than with any other Rails app, but not required or assumed.

⇒ There’s already a lot to learn in just learning how to effectively build a performant, secure, and reliable app on Rails (or a web app in general). We will try to minimize extra abstractions and architectures.

⇒ Ideally making it feasible to add developers (or hire project-based consultancies) experienced in Rails but not our industry/domain to the project, and have them get up to speed relatively quickly (joining a legacy app is rarely easy even for experienced devs). And the path for beginner library developers learning to use the stack will, as much as possible, be learning Rails.

⇒ It’s not that we’re suggesting everything in Rails is designed perfectly or is always easy to work with or learn.  Rather we’re suggesting these kind of flexible APIs can be a hard design problem, and Rails is stable, mature, polished, well-understood, performance-tuned, with an ecosystem of tutorials and third-party code. Trying to reinvent “better” alternatives has significant development costs and will take time and iterations to reach Rails’ maturity, that are best avoided where possible.

This will be a library or toolkit of components, not a fully integrated “solution bundle” or framework. 

⇒ “You call a library, but a framework calls you“.  Rather than providing API to hook into and customize small parts of a large integrated stack, the library/component approach is to try and provide re-usable modular blocks (legos?) that you can put together to build what you (the developer) want.

⇒ In many cases, you’ll customize from the top down instead of from the bottom up. If you want to customize a view, you might override the entire view  and then re-build it, possibly using the same components the original was built with but composed and configured in different ways.  Instead of trying to figure out a way to ‘override’ a part inside without touching anything else.

⇒ Rather than be a forwards-compatibility nightmare, we think targetting this mode of work can actually reduce your local maintenance burden and upgrade paths, by having simpler more stable code in the shared dependencies. In our community experience, the alternate approach has not in practice seemed to result in better forwards-compatibility.

⇒ Out of the box without writing code, we do aim to give you something that runs, this isn’t just components dropped on the floor — but it’ll probably be a “proof of concept”, maybe not even an “MVP” you could deploy in production at all. Think Rails scaffolding, but specifically for our domain.  Many parts may need to be customized or added (and some parts will certainly need to be) to arrive at the app well-suited for your business and user needs, but scaffolding can be easier to get started with than a blank slate.

=> Again, we are assuming an institution interested in a local development team that can customize to your particular business and user needs. If you don’t want to do this, you might be better served by a more “shrinkwrap” ready to go (but less flexible/customizable) solution — that, IMO, is likely not historical or future samvera software either, or any Rails-based approach. It is likely a paid-development-and-support product, perhaps proprietary and/or hosted.

(Incidentally, this is in some ways a reversal of some things I myself thought some years ago when I participated in developing early versions of Blacklight. Then I thought any  generated code was a bad thing, because it could go out of date. And any copy-and-paste of code was a bad thing, for similar — the goal was to have the local app have as little code in it as possible.  I have come to realize while that goal seems appealing, achieving it succesfully is very hard, and a less successful attempt at it can actually be more costly, and it may counter-intuitively make sense to try to minimize the code in the shared dependency instead.

We prioritize solid modular building blocks over features.  There are very diverse needs in our domain. Rather than try to understand let alone build-in all features needed accross our community, we try to give development teams the tools to lower costs when developing what they have discovered about their local requirements and priorities. We will prioritize only the most common needs, and be hesitant of added complexity for less common needs — or needs we haven’t yet figured out how to meet with re-usable modular components.

⇒ This doesn’t get us out of understanding domain needs,  we still need to figure out the common needs in our domain that can be addressed by our tools, and if we’re really wrong we won’t successfully produce a toolkit that lowers development cost.

⇒ But by trying to focus on the most common needs where we can provide modular, composable tools — we have less code, and thus more resources per abstraction to try to design them successfully to be re-composable to your needs, flexible beyond what could have been planned in explicitly. Which takes skill and time, and is feasibly only by ruthlessly focusing on simplicity.  We will do our best to design our tools so it’s clear how you can use them to build additional things you need (a workflow engine?) on top. 

⇒ The number of “built-in” features and components and higher-level abstractions will, if the project is successful, probably grow over time. But you can’t build a good skyscraper without a good foundation, the base-level components need to be solidly designed for re-use, in order to build higher-level on top.

We are targeting a digital collections use case much like our organization’s. Staff-only editing, not self-deposit. Relatively small number of different pages.  There should be no problem scaling number of users or number of documents right from the start, but we are not scaling complexity of requirements, or user interfaces. Our needs are outwardly pretty simple (although the devil is in the details) — customize our metadata schemas and derivatives, let staff ingest and enter metadata, let users find and view and download items, in an app that works well with search engines and on the web in general, with modern best-practice performance characteristics. We are targeting other organizations with similar needs. 

⇒ While we aspire to provide a solid enough foundation you could build more complex requirements on top of this, the toolkit would be of course be a proportionally smaller aid in that situation. While we think the toolkit can grow to handle more complex use cases eventually, and we’re keeping that horizon in view, but it’s not the priority.

⇒ We think this simpler digital collections approach — in addition to conveniently matching our local needs — is often close to a strict subset of more complex needs. If you have complicated workflow needs (whether involving staff or patron work), you probably need most of our simple use case target plus more. Starting here can get us something that works for we think a lot of the community and can be initial steps for more.

⇒ The toolkit is primarily focused on providing tools to aid in the ingesting and management of digital assets and metadata.  I think this is what people have come to samvera for, and is in some sense the “hard part”. While end-user discovery is very important — and the toolkit will provide basic discovery — I think it’s in some ways an easier problem, with more well-understood practices and technique for development — especially if we can provide for it to be done in terms of ActiveRecord models.

We will think of developer and operational use cases from the start. We don’t want to have to be the experts in every end-user use case in the domain, but we want to try to be the experts in  developer and operational use cases in building apps in this domain.  If a design choice makes deployment or operations harder, it may not be a good choice. The users for this toolkit are developers, who use it to meet end-user needs.

⇒ We will plan for performance from the start (avoiding n+1 queries, providing clear paths to use standard Rails caching techniques, etc.).

⇒ We will plan for a clear “story” about deployment architecture from the start —  we will plan for cloud deployment from the start (such things as not requiring any persistent file systems–putting everything on eg S3–and not making assumptions about how many machines different services are divided upon) — but not multi-tenancy (too complicated). We will consider from the start ease of use and efficiency throughout your product life cycle.

⇒ With less code to document and maintain  in the toolkit, we hope it becomes more feasible to maintain good non-stale documentation, and components with very good long-term stability and compatibility — reducing total cost over time of both developing and maintaining not only the toolkit itself, but counter-intuitively an app based on the toolkit as well. In practice, less shared code can be more efficient use of both community and local developer resources than more.   

Outline of Architectural Components

Modelling/Persistence: json_attr-based

Use attr_json (which I wrote) to store object metadata in a schemaless fashion as serialized json. Supported nested/compound objects, with rails-style form support, dirty-tracking, and other features meant to let you use as you would an ordinary ActiveRecord model.  attr_json is at the heart of this plan.

Modelling will be based on PCDM (Collections; Objects; Files), but we may not feel the need to stick strictly to PCDM. For instance, we may make work->child relationship 1-to-many instead of n-to-m, if it significantly eases a robust implementation. Or allow an internal object representing a ‘file’ to be an element in the ‘member’ relation.

We may likely put all three core model types in one table using Rails Single Table Inheritance — that significantly eases ActiveRecord association modelling, if you want a Work/Object’s children to be both ordered, and polymorphically include other Work/Object’s or Files (again not strictly PCDM). attr_json addresses some of the ordinary disadvantages of STI, since varying data can be in the shared json column. (I believe valkyrie activerecord takes similar one-table approach, although it does not use AR associations — I think AR associations are a powerful tool I’m loathe to give up).

We will plan from the start to prioritize efficient rdbms querying over ontological purity. Dealing with avoiding n+1 (or worse) when doing expected things like displaying a list of objects/works with thumbnails, or displaying all a work’s members with thumbnails. In some cases we may resort to recursive CTEs; some solution will be built into the tool, this isn’t something to make developer-users figure out for themselves each implementation.

While attr_json doesn’t technically require postgres, we will likely require postgres for use of toolkit, so when providing out of the box solutions to common needs (including efficiency/performance), we can use features with database-specific elements like jsonb, recursive CTEs, and postgres full text search.

We may provide a six-alphanumeric-primary-key approach similar to sufia/hyrax/samvera, possibly implemented with postgres functions. One way or another, we need to migrate our existing data and keep URLs and possibly internal IDs consistent.


Basic CRUD functionality, and possibly not a whole lot more, will be provided by controllers that look an awful lot like the out-of-the-box Rails scaffolding controllers.

Additional hooks will likely be included for customizing certain things (authorization, strong params) — but the goal is to keep the controller simple (and unchanging-over-time) enough that a developer-user wanting to customize will be safe to simply copy-paste the entire controller and modify it appropriately. Any complex or domain-specific functionality in the toolkit will be in service/helper objects that can be re-used in a local controller too, rather than the controller itself.

Config: chamber

Where our toolkit code needs to allow deployment/institution-specific configuration, I’m leaning toward using chamber. The built-in Rails stuff is kind of weird, and has been changing a lot with every Rails version (because it’s kind of weird and they keep trying to make it less so).  But chamber is suitably flexible that it should be able to meet almost any consuming institution’s needs. We may provide some extensions to chamber.

Asset management: shrine

In addition to metadata modelling and persistence, handling bytestreams/digital assets is the other fundamental core part of our domain. Both originals (often preservation copies) and derivatives.

Shrine is a “file attachment toolkit” for ruby, with Rails integration included. Shrine was motivated by lack of flexibility in some additional file attachment solutions. The result is something that is accurately called a toolkit — while it can be a bit harder to get started with in a fresh Rails app, it provides composable components that can be rearranged to meet domain needs, which makes it very well-suited for our toolkit, where flexible asset-handling is important to reducing total cost of development. While Rails has introduced ActiveStorage as a built-in solution, we don’t think it’s flexible enough for our asset-centered domain.

Shrine already handles storing files in a back-end agnostic way (we will focus on S3 for production and optionally local file system for dev, but it should be feasible for you choose other shrine adapters with minimized changes to other code needed). Shrine already handles streaming, full-file, and URL access APIs regardless of back-end. Shrine already has some architecture built out for ultra-flexible derivatives, storing checksums, etc. (Reflection on what derivatives are already there should make it easier to build out, say, a UI for menu of downloads based on what derivatives you have created and tagged as a download).

The current Shrine derivatives architecture (which it calls “versions”) doesn’t quite match my idea of requirements: I want to be able to optionally store derivatives in  a different backend than originals (eg different S3 bucket), or even different buckets for different derivatives. The shrine author has given me some advice on how to achieve that, and it also aligns in some ways with existing discussion on desired rewrite of the shrine “versions” plugin. I will likely spend significant time on developing (and documenting) a new shrine derivatives/versions plugin meeting our needs, and hopefully to be contributed back to shrine possibly as new standard plugin.

Additionally, while shrine itself prioritizes solid fairly low-level API (to achieve it’s composability/flexibility goals), I think our toolkit needs some slightly higher-level API, probably also developed as a shrine plugin, that lets you define derivatives in a more declarative way, making it easy and more DRY to do things like create/re-create/delete named derivatives for already existing objects, or declaratively specify foreground/background/on-the-fly processing.

The toolkit will likely provide it’s own recommended combination(s) of shrine config, as ‘scaffolding’, either as toolkit-specific shrine plugins, generated code, or both. It will be easy to change much of this config locally.

A metadata properties configuration system

In some current samvera platforms, when you add a new metadata property, there can be a dozen+ different files — in some cases depending on local architecture decisions — you need to edit to mention the new property. This is not only annoying, but error-prone to try to keep changes consistent across all those files.

I think this is a big enough barrier to developer ease of use, common enough to almost all uses, that it deserves an architectural solution.

We will provide an extensible properties configuration system that lets you configure multiple things about the property in the original model. It might look something like this:

class Article < Work
  our_tookit_property "title" do
     attr_json :string, array: true
     rdf_predicate "http://purl.org/dc/terms/"

     indexing do
       to_field "title_text", first_only

     edit_form do
       position: 0
       simple_form wrapper: :small

     include_in_results_list: true
     include_in_detail_page: true

The biggest potential problem with such a “centralized” property configuration system is — what if you need different configuration in different situations? Say one form field type in one place and a different in another, or even multiple indexers for different contexts.

This system will be from the start designed to support that too — what is in the model could be considered defaults, but all components using these values from properties configuration will take the values in a clearly documented format such that a developer can alternately choose to pass in alternate values than what was in the model property registration.

Edit forms: simple_form

simple_form is a Rails form toolkit. We can use it to make forms ‘just work’ (with field-level error reporting etc), including for some of our custom toolkit features (nested models), in a composable and overrideable way.  We will follow in the footsteps of hyrax and many other software in using it, probably set up for Bootstrap 4.

We will provide a custom form builder that can use (in some cases automatically, from properties definitions, see above) a suite of custom inputs for features built into the toolkit or common to our domain. Including nested models, repeatable elements (perhaps using cocoon, which attr_json is already compatible with), vocabulary auto-complete (probably based on questioning_authority, but built out to save vocabularies and IDs in nested models, instead of just saving values; we may need to send some PRs to qa if it’s APIs aren’t suitable).

We’ll also provide custom simple_form inputs for file upload (supporting direct-to-s3 uploads in a shrine-compatible way; hopefully supporting browse_everything), and permissions editing (see below).

We will probably provide a wrapper form for all the inputs that can use information from the property registrations (see above) to specify order of inputs.

We’ll provide a custom simple_form input for setting relationships that lets you search for core model objects (collections, works, files) with an autocomplete-style UI, and assign them to associations.

If you want to customize this beyond the simple things you can do, you’ll just use your own form view where you compose the simple form inputs yourself. The built-in form can be considered scaffolding to quickly get started, although it may be sufficient for simple use cases.

Staff (back-end) UX

Staff UX needs go beyond just forms — for one thing you need a way to find/get things to click on to get their edit forms. We will provide some limited back-end UX: ability to list/search/sort collections/works/files, and limit to just things you have certain permissions on.

Back-end UX is going to be kept pretty limited. Specific applications can build it out themselves if they need something beyond the basic scaffold.

The toolkit will need UX for adding/removing/re-ordering members of a parent. It will probably also provide a batch ingest/edit function of some kind, because that does seem to be a common need, and is one my institution needs.

We’ll need to do a bit more local staff user/requirements analysis to see what the minimum we need is (info from others with similar use cases welcome too), and decide which additional parts the toolkit should provide or just make clear how you’d provide them yourself. We aspire to provide high-quality staff UX for the simple targetted use cases.


Flexible and sane permissions/authorization system is very challenging, many efforts have run aground on it, and complete analyses of requirements can be very complex.

But we’re going to try to create a system anyway.  It will be based on an ACL model. Each ACL entry will relate an object (collection, work, file; “object”), a user or group (“subject”), and an operation (read, write, etc). They’ll be represented as ordinary normalized schema db objects (three-way join).

In addition to user or group as subject, we’ll have special subjects for “all logged in users”, and “public” (don’t even need to be logged in).  These will probably still be just ACL entry rows.

We’ll have a built-in list of hiearchical operations (hieararchical meaning if you have a higher one, it implies all the lower ones).  Which will likely be, in order from least powerful to most powerful:

  • list (see in results and lists)
  • read (see show page)
  • download (download assets)
  • add_member (eg if Collection is object, gives you ability to add things to it)
  • edit (edit metadata but _not_ permissions)
  • own (edit permissions)

The built-in controllers (and end-user-facing discovery) will out of the box do the right things with these permissions, but these will also be developer-editable, you will be able to add additional operations to the hieararchical list, as well as use additional operations that are not in a hieararchical list but are just stand-alone.  You’d have to edit (ie, probably create your own) controllers/views to make use of your new permissions — again we plan all of this as scaffolding which can be modified by writing code using the tools we give you.

While there will be an out of the box input UI element allowing you to edit these permissions directly, it’s expected that many apps will instead set them as part of workflows or other events, within UI  constrained to less than the full flexibility of the ACL system. A developer would do this just by writing code to set things ‘automatically’ in controllers and/or writing a more limited UI component. This system is a low-level acl architecture, it should support higher-level application-specific architectures on top.

APIs need to support fetching from the db with arbitrary permission limits (custom AR scopes); fetching from Solr with arbitrary permission limits (not neccesarily using blacklight_access_controls, we may likely create a new thing that works more the way we’d like, perhaps using solr joins, perhaps not); as well as checking can? on in-memory objects (possibly built on access_granted, which seems simpler and more performant than cancancan while still achieving what we need)

There will also be a superuser or admin class of user that can do everything.

The trickiest thing we do want to include as a requirement is a way for objects to inherit permissions from another object. This needs to something set on the inheriting object, and generally be live,  updating automaticaly if the inherited-from object’s permissions change. This is pretty tricky to get right, with decent performance on both writes and reads. Not sure if we’ll do it in a way that inherited-from permissions are “cached” on inheriting object (and need to be updated on writes), or instead just a persisted “pointer” to the inherited-from object (which needs to be followed on access-checking reads).  This is really hard to figure out how to do simply and performant, but also, we think, a key domain requirement, with previous workarounds that have led to lots of code complexity, so it’s important to figure it out from the start and not try to shoehorn it in later.

(Inherited permissions are even harder taking account of the fact that an object may want to inherit from another object, where that other object itself inherits. (file inherits permissions from work inherits permissions from collection?))

End-user (front-end) UI: Blacklight

We will use Blacklight (and Solr) for end-user discovery interface. But we will try to keep it as loosely coupled as possible, in case individual implementations using the toolkit would rather use something else (or nothing at all), or in the case the toolkit later chooses to support something else.

We’re going to try to make the staff/back-end UX actually not use Solr/Blacklight at all. Instead searching can be supported by postgres full-text search. This means you won’t get facets in the out of the box back-end UX (but can have limits in the UI). It also means if you have a very different front-end desired, you won’t need to run Solr at all for back-end functionality.

We won’t generally be automatically customizing the Blacklight UI, implementers will do that themselves using ordinary Blacklight techniques without mediation of toolkit-specific abstractions.

But there’s one big exception.  While the results list is of course based on solr results, data shown in views (both results and what you get when you click on a result) will not be based on solr — the app will take the IDs returned in the Solr response, and re-fetch them from the rdbms, even for results lists.

This may sound odd, but is actually how the popular generic Rails Solr support gem sunspot works, so there’s some precedent. I think it will allow the software architecture and developer’s mental model to be much simpler, with less duplication or parallel solr vs rdbms implementation — and the performance hit of the extra db query are minimal, especially in the context of legacy samvera performance. This approach lets you deal with n+1 issues purely in terms of rdbms and not duplicatedly on the solr side — with the same techniques whether you are on a results list page or an individual show page. It also lets you index into solr only what you need for query results, and not try to also put enough in solr for efficient display — using solr for what it’s best at, and simplifying and focusing your solr indexing decisions.

BL will be customized/overridden just enough to have the controller do this extra fetch, and use views based on our AR models, while providing customization hooks for local apps to customize views on per-object basis.

Indexing to Solr

Indexing will use hooks into ActiveRecord model life-cycles adapted from sunspot. (Sunspot itself is way over-featured/heavyweight compared to what we need, and is looking for new maintainers, so we won’t be using it directly. But it’s a mature Rails/Solr integration solution that has had a lot of hours put into it, so we will be looking to it for ideas and in some cases code to copy).

Indexing will be based on traject. Some additional architecture in traject 3.0 (I am the principal traject developer)  will make it easier to integrate here, but may still need a few pieces of architecture (like a “reader” based on ActiveRecord objects, and some transformation tools based on ruby objects as source records).  Basing on traject should make it straightforward to have really performant bulk (re-)indexing routines, as well as the ordinary model-lifecycle-event indexing triggers.  You’ll be able to do simple indexing configuration in the model “properties registration”, or more complex stuff in standalone indexer objects.

You will of course easily be able to turn off indexing entirely if you aren’t using blacklight/solr, or, still using our AR-lifecycle-hooks, replace the indexer code with something entirely custom, say, for a non-solr indexing back-end.  You’ll probably turn it off simply by setting the indexer class to nil.

Indexing should be configurable to happen asynchronously or synchronously. (Synchronous is required you need it to be immediately reflected on the editor’s next page view;  which is one reason we’re trying to keep solr out of our back-end staff interfaces, because async makes things so much easier to do performantly). Ideally it should also be set up in a ‘batched’ way so multiple solr doc changes that happen on one save can be sent to solr in one request, but we may not achieve that in initial release, although we’ll keep in mind ways we might use traject APIs to achieve it.

To the extent we use Solr dynamic fields, we’ll try to use the ones already defined in default Solr schema. It will also be trivial to simply specify your custom solr field names. As much as possible, we’ll avoid need for any custom solr schemas.

Preservation-related features

Our current sufia-based app has only basic what we could consider preservation-related features, so the baseline minimum for an MVP 1.0 of the toolkit is likewise basic.

Mainly “fixity audit“. It is easy and documented to have shrine store checksum signatures (or with S3 direct upload and/or client-side calculation of checksums!). For checksums that can be calculated in a streaming fashion, we can even use shrine’s streaming API to make validating signatures much more efficient. We will support storing and checking a handful of checksums. The toolkit will support logging these checks and some UI visualization of them, based on the work I did to fix the feature i activefedora, and some of our local features. And tasks for bulk fixity checking, which can possibly be done an order of magnitude faster in hyrax. Where it makes sense, some work can be off-loaded to S3.

As far as backup copiesour own current (Sufia/fedora 4-in-postgres) system is pretty much limited to bog-standard postgresql backups, and ordinary file system/S3 backups of digital bytestream assets. The initial toolkit release will not support anything more here.

Down the road

There are some features that will not be included in the initial MVP toolkit release (or our initial local app based on it), but for which we made architectural choices to try to facilitate down the line.

While fedora has some support for versioning, with some UI for it in hyrax, I think it’s got some oddities, and is relatively little used in the samvera community (We don’t really use it at all). So it’s not targetted as an initial requirement. At a later point, we could use the standard Rails paper_trail gem for actual metadata versioning (not sure if fedora-sufia/hyrax supported that or not).  By using standard ActiveRecord, we can use paper_trail, which has many many developer-hours put into it. It’s still not without some rough edges, which shows how challenging this feature is. Particularly around associations (which aren’t always handled great in fedora/hyrax versioning either, but the way the data is modelled in fedora/af/hyrax makes some things easier here, while other things harder).  One reason we chose schemaless attr_json-based modelling is to limit the number of associations/relationships involved. For actual bytestream/asset versioning, it seems likely what is built into S3 is sufficient.

It would be great to have import/export based on (at least theoretically) implementation-independent formats, which can be used as a preservation copy. Eventually I would like to see round-trippable (export, and then import in our same software to restore) BagIt serialization.  Which could be used for preservation backup. This is pretty straightforward using a ‘just write ruby’ approach, but has some challenging parts around associations among other things. (Note theoretically implementation-independent doesn’t mean any existing alternate implementation necessarily exists that can import them, which is practically true for most of our community’s current attempts here; but it still has preservation value to try to make it easier to write such an alternate implementation). 

Other things not included (at least initially)

Let’s emphasize again, there will be no built-in workflow support. The goal is instead to provide an architecture that doesn’t get in your way when you want to write your own workflow implementation.

There will also be no “notification”/”activity feed” support in initial toolkit.

OAI-PMH is something our local app needs. It may or may not be included in the initial toolkit though, vs being a purely local implementation. Theoretically blacklight_oai_provider could be used (by a local implementation), but it may make some assumptions about the way you use your BL app that the typical toolkit app is unlikely to meet. The lower-level ruby-oai gem is also a possibility.

Embargoes/leases are probably not in the initial implementation, simply because our own app does not use them. If the toolkit comes to support them, I think they should be based on expressed boundary dates automatically enforced at read-time, rather than requiring a process to check leases/embargoes and set other access control information accordingly.

Our local app uses a custom locally-written JS pan-and-zoom paging viewer, instead of the popular UniversalViewer. The initial toolkit may not have built in support for either, but should have clear “developer stories” about how you’d add them.

At one point I had considered trying to use cells as a basis for the front-end, as I really like it’s architecture and think it would provide for easier customizability of initially shared resources — but I ultimately decided it was too new/unproven/new-to-me, and kind of violated our “when in doubt stick to rails” approach, and would raise the complexity (and difficulty of success) of the toolkit too much.

Similarly, I considered using a fancy modern Javascript view system for at least some parts of the UI, but decided against it on similar grounds. There’s just too much to figure out about best practice patterns in this context, at least from my experience, it would raise the difficulty of success too much.

In general, I don’t have a good handle on how we’re going to use the modern JS we will need in a Rails environment. Not sure if using the new Rails webpacker is the way to go, it might be, but I don’t have a good handle on it. The initial release may have a less than optimal javascript architecture.  

Analysis and Evaluation

If we are right that there’s a value proposition in a smaller, Rails-aligned, shared codebase (so we can make it really solid), and if we successfully figure out the right design of such a developer’s toolkit and pull off it’s implementation… then the proposition is that we’ll have a platform/toolkit for developing digital collections/repository applications very efficiently, throughout the entire application lifetime from your initial product launch, including operational infrastructure, and through continued maintenance and enhancement to meet evolving needs.

And further, while there are other approaches in our community that are in progress trying to reach this same goal, as discussed in the last post,  the proposition is that we can get to a mature, polished, efficient-developer-cost toolkit quicker starting over along these lines, compared to those other approaches.

But that’s a lot of ifs, and this is a potentially expensive project proposal. It’s hard to estimate, and more work needs to be done, but at this point I’d estimate 12-18 months to us having a 1.0 release of toolkit and an in-production app based on it.

How can one evaluate the chances of success? This post and the previous one tries to provide enough information and an argument that it is plausible. But ultimately, there’s no way around having experienced developers/engineers make a judgement call — like with all technical decisions, although this is a particularly weighty one. I could say that I personally have some demonstrated experience making developer tools which have proven to be low TCO over a long time period (bento_search, traject), but this is a much more ambitious plan, and it’s success is not guaranteed.  It’s maybe a bit of a “moon shot”, but when you want to go to the moon, sometimes it’s worth a gamble.

I think there’s no way for you, dear reader, to evaluate whether this is a worthwhile thing to possibly participate in, except the judgement of experienced developers/engineers. If your organization is making choices of technical platforms without basing them on understanding of local business/user needs, followed by significant decision-making weight given to the technical judgements of experienced engineers…. I would say you aren’t maximizing your chances of success in engaging in software development, and (I would argue) you are probably already doing software developoment if you are doing samvera-based apps.

There isn’t much to say about the “upside” other than that — these whole two articles have been an investigation of potential upside — but we can say a lot more about the risks and downsides. They are somewhat different for my own institution, starting with a sufia 7 app, and considering initiating this plan — than they might be for another institution considering coming on later in another context. And there are short-, medium-, and long-term categories of risk. I will try to deliniate some of those risks and costs.

Our Institutional Calculus

We have a sufia 7.4 app (on Rails 5.0, with sufia not supporting more recent Rails), and have to do something.  That effects our cost/benefit/risk calculation.

We could try to upgrade/migrate to Hyrax 2.2.0. This would definitely take less time than writing this new toolkit, but I think could still easily take several months. At the end of it, we’re still on the fedora/hyrax architecture that we’ve found so difficult to work with, so while we have more supported dependencies, we haven’t necessarily reduced our TCO or agility to produce new features. We could be hoping on hyrax eventually being valkyrie — this is more likely if we contribute development effort to get there. How much is hard to predict, as is, in my opinion, how much we’re going to like where we get when we get there.  And architectural work on hyrax to resolve some of the other challenges we’ve had with it goes beyond just getting it on valkyrie.

We could try rewriting our app based on valkyrie, plum->figgy style. This actually isn’t too different than the proposal here, we’re still rewriting an app from scratch. Using valkyrie instead of just active_record (and the attr_json gem which is already fairly polished) — it’s not clear to me this will make the development any easier. On the one hand, possibly we’d be able to share more code than just valkyrie with other valkyrie-using institutions — but it’s not entirely clear at this point how much or how mature/polished we can make it. On the other hand, developing on fairly young valkyrie instead of mature well-understood ActiveRecord will, I think, create additional cost, and, in my judgement, additional ongoing cost making the architecture more expensive to work with. If we’re not actually all that excited about ending up on valkyrie (having no desire to be able to switch to fedora), it’s not clear what the benefit here would be.

One surprising rarely-mentioned possibility: We could put in work to get sufia working on Rails 5.2 or upcoming 6.0, and do an unexpected additional sufia release. It’s not entirely clear how much work this would take, but it may be the cheapest option (still a couple several months?) — at the end, though, while we’ve solved our problem of being on a maintained Rails version, we’re still stuck with relatively unsupported/unmaintained software, using a stack we’re not happy with, with such complexity that it will probably continue to “degrade”.  This is sort of the decision to do the minimal amount of immediate work possible, avoid making a decision, but just push it down the road further.

If this proposed development toolkit plan works, we’ll be in a great spot. But it requires some significant time when we are spending development time on things that do not help our current production app, to get to a point where increased efficiency lets us catch up and ‘pay for’ the time we spent. And it involves some non-trivial risk.

I think for my local institution, if we want to, and believe we have the resources/context to, take the risks to be innovators and “thought leaders” here, it’s worth taking the bet on trying to develop this new architecture. If we discover it’s not working, we try something else — you can only get the greatest benefit by taking some risk. But if we want to play it safe and don’t think we can afford (politically, budgetarily) taking a risk, it may not make sense.

Short-term Risks: Failure to launch

It’s possible we simply could fail to finish this project. It could become apparent at some point in the development process that we will not achieve our goals, or that it’s going to take much longer than we hoped (like foreseeing ending up barely towards it after a year of effort).

This could be exacerbated if institutional needs require us to reduce the amount of time we spend on building out the architecture, or minimize the amount of time we can spend investing in something whose returns may be a year or more away.

It could possibly be addressed by trying to reduce the scope of the re-usable toolkit to just focus on getting our app launched (which one could say is the approach Princeton and PSU are taking), but there’s still some risk that as we try, we find it’s just not going to work.

Medium-Term Risks: Failure to Catch On

It makes sense to try to build a re-usable toolkit, rather than just our app, both as a service to the community, and for enlightened-self-interest reasons, to get (eventually) more developers/institutions working on it, and building a community of mutual-support around it.

We could get to the “finish line” of a deployed app, but we could find that other institutions/developers do not actually find it easy to learn or work with. It may not be as flexible as we aspire to for more generalized use cases, may not be applicable to as many other institutions’ business requirements, limiting the potential community.  As we’re intentionally prioritizing high-quality modular tools over features, our tools need to be successful at lowering cost of building software with them, or the project is not a success.

ActiveRecord may end up not being a good foundation for our needs — in proposing to use Single-Table Inheritance, we’re already using a feature that is sometimes considered a bit off the beaten path of Rails. ActiveRecord could end up getting in our way more than anticipated, and thus increasing development costs.  We think we can make a toolkit which a fairly beginner developer can use, but we may fail and end up with one that is confusing to beginners.

We could succesfully deploy our app based on the toolkit, but if we end up being the only institution using the toolkit, the cost/benefit proposition changes somewhat.

Succession Planning

Related to size of community is succession planning — if all your developers leave, and you need to bring new developers on to an existing project, how hard will that be?

There has been an assumption that by “doing what everyone else is doing”, you have a community of people who could step in. However, I think experience has shown this hope was very significantly inflated.

Taking over a ‘legacy’ project is always tough in software dev, always takes a somewhat experienced developer, and it’s a real issue to pay attention to in any software development effort.

The number of experienced-with-samvera developers who can easily take on a legacy project from another institution is fairly small, and most are centered in a few well-paying institutions. Last time we posted a developer position, we got no applicants with previous samvera experience at all, and almost no applicants with previous ruby or Rails experience.  Having to bring up a new developer with ruby, Rails, and samvera is not a low barrier. Both Esmé Cowles (“there are many more Rails developers than Samvera developers“) and Steven Anderson formerly of BPL (Have you tried showing someone else all that is involved in doing a good Hydra Head? …If either Eben or I left, it would take half a year for someone to become even “competent” in our stack.”spoke to the challenges of getting a developer previously familiar with Rails up to speed on existing samvera.

Regardless of whether a community were to develop around this toolkit, I actually feel pretty confident that succession planning is no worse and probably better for this approach than for all the other approaches being investigated discussed in the last post. This is one risk I think is actually not very high.

But one thing you get with samvera is also a community of people supporting your learning. Can we remain part of the samvera community of mutual-support doing a somewhat different thing? I hope so, and I think we’re all about to find out one way or another, because the community is trying different approaches with or without this one. But it’s not guaranteed.

Long-Term Risks: Failure to Support

Our proposition is that by creating simpler/smaller surface area toolkit, we can design it well enough that it can remain very stable and backwards compatible API. While I have some success there with bento_search and traject, we could fail, in a way we only realize years down the road.

One of the ways we propose to keep our toolkit simple is by relying on existing software. If (eg) shrine were to stop being maintained by it’s existing developer (or have releases with significant backward incompats), we’d be in some trouble, at the best significantly increasing cost of maintenance of the toolkit. Same with some of our other dependencies. While attr_json tries to use public and likely to remain stable Rails api, if Rails changes sufficiently to make it hard to keep attr_json working, that could also significantly increase development effort.  All of these things might only be discovered down the line.

On the other hand, the efficiencies of the toolkit could be enough to allow more of our institutional or community development time to be contributed back to general Rails dependencies, to help them stay maintained. The initial plan for instance involves contributing back a shrine plugin.

These long-term risks are to some extent common to any development project, and other approaches probably have similar levels of this kind of risk.

Snatching success from the jaws of failure?

Even if the project does not reach the level of success desired, in any of the ways outlined above, it might still provide ideas, patterns, approaches, that could influence other approaches in samvera community. In the best case, it could produce re-usable modular components that themselves could even be re-used in other approaches (valkyrie-based and/or hyrax? I’m not certain how plausible that is).

This can apply to “sharing with ourselves” too, if we decide to change approach in the middle or at any point, we may be able to re-use some of what we’ve already done anyway. (I think the shrine-based approach is a particular contender here; shrine itself doesn’t even require or depend on Rails).

Because we aim to produce reusable modular composable components with loose coupling, based on the commonality of Rails, it may increase the likelyhood of some code-sharing. On the other hand, if other approaches aren’t using ActiveRecord, both they and us may find ourselves more coupled to our persistence layer API than we’d like — it can be hard to avoid persistence-layer-approach coupling.

Interested? What’s next?

While the final portion of this post was investigating risks and possible disaster scenarios, I actually do feel positive about this approach. While there is no guarantee of success, I think it has the best chance of getting us to a place of minimized engineering costs (compared to alternatives) within a 1-2 year timeframe.

But we have not yet committed to this approach here at my institution. We will be aiming to decide our directions in the next month or so.

Interest from other institutions could effect our decision-making, and of course collaboration would be welcome if we do embark on this plan and we’re interested in identifying potential collaborators as soon as they reveal themselves.

Initially, collaboration probably woudln’t take the form of actually committing code. As Fred Brooks argues in The Mythical Man-Month adding more developers to a project doesn’t always result in shorter time-lines, and this is more true the more architectural design work is involved. And we’ve got a bunch right now. Initially collaboration would probably take the form of expressing interest, reviewing documentation or progress, maybe code review, and most importantly trying out the code as it is produced. Feedback on if it does look like something that would help you with your use cases, and that you’re still interested in using. But certainly eventually code collaboration opportunities as well, especially filling out certain use cases that you have but we may not.

So let us know if you find this plan exciting?

Additionally, if anyone has any ideas about grant opportunities, I guess it goes without saying that that could be useful.  Theoretically grant funding would be especially useful for relatively high-risk but high-potential-reward projects, they are the ones most likely to not be done without external support. The low-risk stuff, you’re going to do anyway!  I’m not sure granters in our sectors think that way, but be sure to let me know if you know of some who might.

I also really welcome anyone challenging or pushing back on anything in here, please feel free, in comments here, slack, email, whatever. From discussion and debate we spiral to a higher understanding.