What “Just standard Rails” means to the University of Alberta libraries

I recently had a chance to speak with the development team at the University of Alberta about their development of their jupiter digital repository app (live, github).

UAlberta had a sufia 6 app in production that was a pretty stock “institutional repository holding PDFs. Around Fall 2015, they started trying to “catch up” to sufia 7 with PCDM etc. — to get features they believed would make it easier to incorporate more ‘digital collections’ content, and to just avoid stale non-maintained dependencies.

In Summer 2017, after having spent almost two years trying to get on sufia 7, with mounting frustrations and still seeming far from the finish line — and after having hired a few non-library-archives-museum-experienced but experienced Rails developers — the University of Alberta libraries development team decided on a radical new approach. They decided it wasn’t clear what Sufia was giving them to justify the difficulty they were having with it. They started over, trying to keep things as close to “ordinary Rails” as possible.

At that time, Fedora still was an institutional requirement.  So they didn’t toss out all of the samvera stack. They decided that they’d chop off the trunk as close to the bottom as they could while still getting tools for working with fedora, and to them that meant a hydra-works dependency, but few other hyrax dependencies.  They basically started the app over.

Within about 6 months of that effort (Early spring 2018) with approximately two full-time developers, they were live with their app (jupiter repo), and have been happy with it so far. But they also still haven’t gotten to the originally planned content beyond the IR-type PDFs, the scanned monographs, newspapers, etc. And have had some developer turnover. (Hey, they’re currently hiring y’all).

The jupiter app implementation

My understanding of how their app works is based on just an hour conversation, plus a few hours spent looking at their source code and internal docs — I may get some things wrong!

Jupiter seems to be to be a pretty simple app, a fairly basic idea of an “institutional repository”.  Most of the items are single PDFs, without children/members.  The software does support some items being composed of a list of files — but no “child works”.  The metadata seems relatively simple; some repeatable strings, but no nested/compound objects (ie, an attribute whose values are multi-property objects themselves). While there is some self-deposit, there is no complicated workflow, basically just an edit form.

The permissions model is relatively simple. Matt Barnett, a lead developer for much of this process (who was there for our conversation, but left the team soon after that) told me that originally some internal stakeholders had been pushing for a more complex permissions model. But knowing what a tar-pit trying to implement ACLs could be, Matt pushed back, and they ultimately implemented a simple model: There are owners who can edit the things they own, and admins who can edit everything, and that’s about it.  By virtue of their campus SSO system, they got “shared accounts” for free, so people could log into a shared account when they needed to share edit privs.

They had been using hydra-deratives for their simple derivative needs (maybe just a single thumbnail for each PDF?), but when ActiveStorage, part of Rails, was released, they began switching the code to that (may or may not be merged into master/deployed in repo yet as this gets published).

Fedora is still there, modeled with hydra-works.  The indexing to solr is whatever was built into hydra-works. They just wrote their own straightforward forms with simple_form.  They also do a lot of CSV-based ingest, which they just wrote code for, like even sufia users would I think.

They use UUID primary keys.

Their app does index to solr — using the general ActiveFedora indexing methods, I think, solrizer and all.  You can see that their indexer is pretty stock, it mostly just calls “super”.

All of their objects exist as ActiveRecord “draft” objects while they are being edited, through more or less ordinary Rails mechanisms. When they have multi-valued fields, they use postgres json arrays, rather than actual normalized schema (which would suggest a different table). I’m not sure what they need to do to get this to work with forms and controller updates. These active record objects seem to use something custom for collection memberships, rather than actual active record associations. So in these regards it’s not quite a totally ordinary activerecord modelling.

The objects have a life in activerecord, but are mirrored to fedora at certain life cycle points — I believe this is also what gets them into solr (using samvera/active-fedora solr indexing code).  The public-facing front-end is based entirely on data from solr — but not using Blacklight, simply writing Rails code to issue queries and handle responses to Solr (with Rsolr I think).

A brief overview of their architecture, by Matt before he left, focusing especially on the persistence stuff (since that’s the least “rails”-y and most custom stuff), can be found in their public repo, here.   Warning, it is pretty negative about samvera/sufia/active_fedora, gird yourself. You can see there they have done a couple custom local things to make the ActiveFedora objects and classes to some extent mimic ActiveRecord, to make using them in Rails easier, trying to encapsulate the fedora-specific stuff inside API boundaries. While at a high level this is what ActiveFedora’s goal is — their implementation is simpler, smaller, less powerful and custom-fit to what they need. We can just say they’re happier with how their local implementation turned out. They also explicitly wrote it to support potential future elimination of fedora altogether.

Matt said if he had to do it over, he might have pushed harder on stripping fedora out too, and just having everything in postgres. And that is something the team still plans to look at seriously for the future.

So what does “just a rails app” mean?  And how do you deal with increased complexity of your requirements?

The most useful thing for me in the conversation was that Matt pushed back on my outline of a potential plan, saying I was still undertaking too much abstraction.

The U Alberta team suggested that I should keep it even simpler, with less DRY abstraction (and thus less tools that could be shared between institutions/apps), and more just building your app for what you need right now.  You can see some of this push-back, specifically in the context of what U Alberta needs, in another document he wrote before he left Alberta in the jupiter repo, on notes for future expansion. It is really worth reading,  to see an argument from even more extreme simplicity, from a developer experienced with Rails but not “infected” with “how libraries do things”   But beware, it’s not shy about our community shibboleths.

We developers (and we library folks) are really drawn the abstraction, generalization, and shared tools that meet as many needs as possible.  It can sometimes lead us astray. It is actually very common advice in software engineering to stick to what you  actually need today, for your app you are developing (you know your business/user needs and which are the highest priorities to business value, right?).  “Do the simplest thing that could possibly work”, “You aren’t gonna need it.” It keeps us honest.

However, I also think it’s possible to code yourself into a corner this way, where your app was fine for exactly what you needed then, but when you need one more thing… you can find you need to re-write large parts of it to accommodate.  In some ways this is my experience with current samvera stack, early fundamental architectural decisions pen us in when we have new use cases. That kind of problem stays smaller when you avoid  harder-to-change shared code, but I don’t it goes away entirely. Trying to play for the future always entails some “YAGNI” risk, but the more domain knowledge and experience you have… maybe you can do better at seeing where you are going and planning for it?

Just some of the specific component plans Matt was skeptical of…

attr_json vs. Just Plain ActiveRecord schemas

The jupiter app already has an activerecord implementation which isn’t strictly “ordinary” activerecord, in the sense they serialize multi-valued/repeatable fields to json arrays,  rather than needing a separate table for each attribute as an actual normalized schema would require. Plus the logic I don’t entirely understand but think might not be ordinary AR associations around collection and “community” membership.

So this already gets you away from the strict “ordinary Rails” path — I’m not sure how the JSON array fields are handled by form submission, or querying if you wanted to do querying (it’s possible all their querying is solr-based, which is familiar to samvera-land, and also unusual for “ordinary rails”).

At my institution, we already have the need for complex repeatable data–a simple example would be repeatable “inscription” notations, each of which has the text of the inscription and the location in the book.  So not just an array of strings, but perhaps an array of hashes.  Weiwei Shi (Digital Initiatives Applications Librarian) suggested in a follow-up message, “We can use the JSON data type to support a more complex data structure as needed” — that is, if I understand it, they are contemplating actual postgres representation somewhat similar to what I am with attr_json, if they end up needing complex json. Matt’s second document tries to draw a line between how they are doing things in “more-or-less completely standard Rails models” and the way I was proposing to do things — I’m not sure I actually see such a great distinction, the representations in postgres to me seem pretty similar, neither of which is standard Active Record patterns.

They do have each attribute in a separate column, whereas I was contemplating putting them all in a single json column. Their approach does have advantages for avoiding update race conditions (or needing optimistic locking to avoid them).  I perhaps should consider that, as an extra feature to attr_json. Although either way you get columns you can’t necessarily use ordinary ActiveRecord querying or form-based update with.

Everyone seems to doubt that attr_json is really going to work, ha. The skepticism towards newly invented non-trivial dependencies is justified, but I can point out attr_json is a lot fewer lines of code than ActiveFedora, or even Valkyrie —  I think it is a risk, but it’s scoped carefully and implemented robustly, and I can’t figure out any other way I’m confident would be simpler to actually meet our modeling needs — once you start doing this complex json stuff, I think you’ll find that it doesn’t behave like “ordinary rails” — for forms/updates, validations, etc. — and rather than hack it out on a case by case basis, it makes a lot of sense to me to solve the problem with something like attr_json, encapsulating the “not really like ordinary ActiveRecord” stuff as much as possible.

The other option of course would be an actual normalized schema, with one table per attribute. For our “inscriptions” that table might have two columns (text and location), for a simple repeatable alternate title it might only have one. It’s going to be a mess to try to prevent n+1 queries and keep db access performant.  I am encouraged I’m not on an insane track by the fact that even the U Alberta is using JSON serializations in postgres, not actually ordinary normalized data — I think as your data gets more complex (not just array of primitives, but need serialization as arrays of hashes), you’re really going to want something like attr_json.  But maybe I’m wrong.

And for better or worse, I have trouble giving up entirely on the idea of some shared tools to make things easier for others in the community too — because it’s fun and rewarding, and why should we all constantly re-invent the wheel? But it’s good to be reminded of the dangers that lie in that direction.


I’m not sure if Matt mentioned this specifically, but I realize I have added a lot of non “basic ActiveRecord” complexity to the data modelling my plan in order to support the PCDM-ish association modeling, where a work has “members” and the members can be either works of themselves (which can have multiple members) or single file objects, and they all need to be kept in order.

U Alberta’s app doesn’t have that. A work can have a list of files, the end.

At my institution I actually spent a while trying to convince stakeholders that we didn’t need that either, but it was the one thing I could make no headway on — eventually they convinced me we did, to accomplish our actual business goals.

If you need this, I can’t figure out any way to get there in an “ActiveRecord-first”-ish direction, except either single-table-inheritance or polymorphic associations.  Both of which are some of the oddest and least reliable corners of ActiveRecord. Of the two, I think STI is probably least weird and most likely to do more of standard use cases minimizing number of db queries. (Valkryie’s approach is somewhat similar in how it uses the DB to single-table inheritance, but without actually using that AR feature).


Matt thought that shrine might do more than ActiveStorage now, but history shows things built into Rails will probably expand and get better. (Yes, but it’s unclear to me how to make audio or video “variants” or derivatives with ActiveStorage, which my place of work predicts to need very shortly. If we are really ruthless about only what we need right now, are we going to have to just rewrite it as soon as we need another thing? There are no easy answers, “YAGNI” is simpler when it’s all about software you are writing yourself and not dependencies… but there are grey areas too).

But I’m not certain about this, after trying to help shrine developers enhance the versions/derivatives functionality to better support some flexibility we need as to storage locations and point-in-time of creation. The answer may just be trying to write an app which adds on locally to do exactly what it needs (whether in terms of shrine  or ActiveStorage), without trying to make a shareable general purpose tool?


Matt was very suspicious of using Blacklight at all, he found that it was quite easy for them to just write the UI they needed based on Solr responses. And their app certainly is as good as many sufia/hyrax apps (it even has an actual search-within-the-collection feature on collection pages, which I think our sufia 7 app didn’t, although I think latest hyrax does).

But remember my inability to entirely give up on the idea of a shareable toolkit? I really would like something that serves as “scaffolding” that gives you an app out of the box with basic file ingest, metadata edit, and search out of the box. And Blacklight is a way to do this. And I think my plan to segregate Blacklight from the rest of the app (it should be a dependency you can switch out) by immediately fetching records from postgres corresponding to solr search results — may be able to keep Blacklight from “infecting” the rest of the app with Blacklight-isms, as Matt was worried it could.

How simple is simple?

It was useful to have Matt call my bluff to some extent: What I have been hypothetically proposing isn’t really “just plain rails”.  But it’s a lot closer than current samvera, or even valkyrie.

I think some of the valkyrites think valkyrie’s departures from “ordinary Rails” are a a positive, that they can use different patterns to do better than Rails…  which is not a unique idea to them…  but I think is a bit hubristic, to think you can invent something better (and easier to onboard new developers with?) than Rails. (I also wonder if the valkyrites, if freed from the need to support fedora too, would simply use hanami?)

The same charges of hubris can be brought to my initial sketch of plans though — it was useful to be challenged from the “left” of “you’re still not simple enough” by Matt. I am so used to thinking about my in-formation plans as a/the simple alternative to, well, samvera or even valkyrie… it was a refreshing and needed corrective to be talking to Matt who thought my plans were still too much abstraction, not as simple as possible, not sticking close enough to implementing only what was needed for my business needs. On the one hand, it makes me worried he’s right; on the other, it makes me more comfortable to be in a nice middle ground of moderation with people advocating things on both sides or me, both heavier-weight and lighter-weight, sharing more code with the LAM digital collections community on one side, and sharing basically none on the other.

Really, “just plain rails” or “just plain [any code]” is to some extent a mirage, or an aspiration. We’re always writing code when we build a Rails app.  We’re always using some dependencies. While there can be a false economy in trying to share all your code in hopes of minimizing the amount of code that has to be written in aggregate (it often doesn’t work out that way because building good re-usable abstractions is hard) — there can also of course be a false economy in never using anyone elses dependency, and “not invented here” syndrome.  And if you’re writing it yourself — it’s writing abstraction layers that are potentially giving you not-worth-it complexity, whether you keep them in the app or make them into a gem. But abstraction layers are also what allow us to do complex things that we can still comprehend as humans — when it works. 

Software is still a craft. As Alberta needs to add additional features, with their aspirations to add a lot more digital-collections-like content — it’s going to take software craftsmanship to figure out how to keep it simple.  What I like about U Alberta’s approach is they realize this.  They realize they are an internal development shop, and need to let developers do what developers do — rather than have non-technical stakeholders making technical decisions for non-technical reasons.  (At one point someone said: After having been ‘burned’ before, they are very suspicious of using common/shared software, vs. just writing their app — which is part of their skepticism towards attr_json —  I think they’re not wrong).

One thing letting an internal development shop excel entails is figuring out how to recruit and retain a solid development team with limited budget, which is one reason Alberta is trying to be so ruthless about keeping it simple and “standard”.  One phrase I heard repeated was “industry-standard onboarding”, which I think also implies needing to be accessible to relatively “junior” new hires, which requires keeping your stack simple and standard. (That is, traditional-samvera or valkyrie-using institutions do not necessarily have any less challenge here and may have more, as for instance Steven Anderson of BPL argued)

(But I wonder if on-boarding a new developer to an existing project that has a very small dev team is going to be challenging across the industry!  I am not convinced that “Where the Rails community has a diversity of opinions on an approach, we should prefer the approach espoused by the Rails core team” (from a Matt/Alberta manifestoalways and necessarily leads to the simplest code or easiest to on-board new developers with. sometimes you can build a monster in the pursuit of not doing something novel…. the irony, right? But it’s always worth considering the trade-offs).

I definitely might end up re-orienting.  For instance, Matt reminded me of something I knew but tried to forget even when writing out my notes for a possible plan: A generalized permissions/ACL system is a craggy shore that many ships have crashed upon. Should I just write for my app the permissions we need instead? After doing some more business analysis to figure out what they are?  Perhaps. More broadly, if we end up trying to implement this “toolkit” and I’m running into troubles and worrying our reach exceeded our grasp — retreat to just the app good enough for what we need right now is always a valid escape hatch.

U Alberta’s story, where they’ve been working on this app with a very different approach for over a year, and so far are happy —  is another good data point reminding us that dissatisfaction with the samvera stack is not new, especially institutions that have developers with wider Rails experience have been suspicious of the value propositions of fedora and samvera for some time.  And that there are a variety of approaches being tried. We all need community to bounce our ideas off of and get feedback, especially those of us who operate in 2-4 person development shops need more than we may get internally. I’m so glad they were willing to spend some time talking to me.  And I highly encourage reading all of Matt/U Alberta’s somewhat iconoclastic analysis docs, as one way of considering other perspectives.  I’m not sure if I can find the time, but I’d kind of like to “onboard” myself into their codebase, and understand how it works better as one example.


Thanks to the whole U Alberta team, and especially Peter Binkley, Weiwei Shi, and Matt Barnett, for spending time explaining what they were up to to me. Thanks to Peter and Weiwei for reviewing this post for any terrible errors.  All remaining mistakes and wrong opinions are my own.

“Whatever Happened to the Semantic Web?”

I’ve been enjoying some of the computing history articles, and especially internet history articles, on twobithistory.org. But this one hits especially close to home I think, “Whatever Happened to the Semantic Web?”

The problem, in Swartz’ view, was the “formalizing mindset of mathematics and the institutional structure of academics” that the “semantic Webheads” brought to bear on the challenge. In forums like the World Wide Web Consortium (W3C), a huge amount of effort and discussion went into creating standards before there were any applications out there to standardize. And the standards that emerged from these “Talmudic debates” were so abstract that few of them ever saw widespread adoption. The few that did, like XML, were “uniformly scourges on the planet, offenses against hardworking programmers that have pushed out sensible formats (like JSON) in favor of overly-complicated hairballs with no basis in reality.” The Semantic Web might have thrived if, like the original web, its standards were eagerly adopted by everyone. But that never happened because—as has been discussedon this blog before—the putative benefits of something like XML are not easy to sell to a programmer when the alternatives are both entirely sufficient and much easier to understand…


The long effort to build the Semantic Web has been said to consist of four phases.7 The first phase, which lasted from 2001 to 2005, was the golden age of Semantic Web activity. Between 2001 and 2005, the W3C issued a slew of new standards laying out the foundational technologies of the Semantic future.

The most important of these was the Resource Description Framework (RDF). …

In 2006, Tim Berners-Lee posted a short article in which he argued that the existing work on Semantic Web standards needed to be supplemented by a concerted effort to make semantic data available on the web…  Berners-Lee’s article launched the second phase of the Semantic Web’s development, where the focus shifted from setting standards and building toy examples to creating and popularizing large RDF datasets. Perhaps the most successful of these datasets was DBpedia, a giant repository of RDF triplets extracted from Wikipedia articles….

…The third phase of the Semantic Web’s development involved adapting the W3C’s standards to fit the actual practices and preferences of web developers. By 2008, JSON had begun its meteoric rise to popularity…. issued a draft specification of JSON-LD in 2010. For the next few years, JSON-LD and an updated RDF specification would be the primary focus of Semantic Web work at the W3C….

….Today, work on the Semantic Web seems to have petered out. The W3C still does some work on the Semantic Web under the heading of “Data Activity,” which might charitably be called the fourth phase of the Semantic Web project. But it’s telling that the most recent “Data Activity” project is a study of what the W3C must do to improve its standardization process.13 Even the W3C now appears to recognize that few of its Semantic Web standards have been widely adopted and that simpler standards would have been more successful. The attitude at the W3C seems to be one of retrenchment and introspection, perhaps in the hope of being better prepared when the Semantic Web looks promising again….


And so the Semantic Web, as colorfully described by one blogger, is “as dead as last year’s roadkill.”14 At least, the version of the Semantic Web originally proposed by Tim Berners-Lee, which once seemed to be the imminent future of the web, is unlikely to emerge soon. That said, many of the technologies and ideas that were developed amid the push to create the Semantic Web have been repurposed and live on in various applications….

…So the problems that confronted the Semantic Web were more numerous and profound than just “XML sucks.” All the same, it’s hard to believe that the Semantic Web is truly dead and gone. Some of the particular technologies that the W3C dreamed up in the early 2000s may not have a future, but the decentralized vision of the web that Tim Berners-Lee and his follow researchers described in Scientific American is too compelling to simply disappear. Imagine a web where, rather than filling out the same tedious form every time you register for a service, you were somehow able to authorize services to get that information from your own website. Imagine a Facebook that keeps your list of friends, hosted on your own website, up-to-date, rather than vice-versa. Basically, the Semantic Web was going to be a web where everyone gets to have their own personal REST API, whether they know the first thing about computers or not. Conceived of that way, it’s easy to see why the Semantic Web hasn’t yet been realized. There are so many engineering and security issues to sort out between here and there. But it’s also easy to see why the dream of the Semantic Web seduced so many people.

Whatever Happened to the Semantic Web?



Oh my, I just realized he cited MY famous blog on linked data in a note. I did not realize that until I actually went and looked at all the footnotes. He cites me for the comment “as dead as last year’s road kill”, but I knew I wouldn’t say something like that! And I did not. I was citing a comment on HackerNews, which I properly quoted, cited, and linked to! It is not something I said or my opinion… exactly.  (since corrected).

The HackerNews comments on this article are… interesting.

Notes on study of shrine implementation

Developing software that is both simple and very flexible/composable is hard, especially in shared dependencies. Flexiblity and composability often lead to very abstract, hard to understand architecture. An architecture custom-fitted for particular use cases/domains has an easier time of remaining simple with few moving parts. I think this is a fundamental tension in software architecture.

shrine is a “File Attachment toolkit for Ruby applications”, developed with explicit goals of being more flexible than some of what came before. True to form, it’s internal architecture can be a bit confusing.

I want to work with shrine, and develop some new functionality based on it, related to versions/derivatives (hopefully for submission to shrine core), requiring some ‘under the hood’ work. When I want to understand some new complicated architecture (say, some part of Rails), one thing I do is trace through it with a debugger (while going back and forth with documentation and code-reading), and write down notes with a sort of “deep dive” tour through a particular code path. So that’s what I’ve done here, with shrine 2.12.0. It may or may not be useful to anyone else, part of the use for me is in writing it; but when I’ve done this before for other software others have found it useful, so I’ll publish it in case it is (and so I can keep finding it again later to refer to it myself, which I plan to do).

Some architectural overview

shrine uses a plugin system based on module mix-in overrides (basically, inheritance),  which is not my favorite form of extension (many others would agree). Most built-in shrine func is implemented as plugins, to support flexible configuration. This mixin-overridden-methods architecture can lead to some pretty tightly coupled and inter-dependent code, even in ostensibly independent plugins, and I think it has sometimes here.  Still, shrine has succeeded in already being more flexible than anything that’s come before (definitely including ActiveStorage). This is just part of the challenge of this kind of software development, I don’t think anyone else starting over is gonna get to a better overall place, I still think shrine is the best thing to work with at present if you need maximal flexibility in handling your uploaded assets.

Shrine has a design document that explains the different objects involved. I still found it hard to internalize a mental model, even with this document. After playing with shrine for a while, here’s my own current re-stating of some of the primary objects involved in shrine (hopefully my re-statement doesn’t have too many errors!).

An uploader (also called a “shrine” object, as the base class is just Shrine) is a  stateless object that knows how to take an IO stream and persist to some back-end.   You generally write a custom uploader class for your app, because a specific uploader is what has specifics about any validationtransformationmetadata extraction, etc, in ingesting a file. An uploader is totally  stateless though (or rather immutable, it may have some config state set on initialize) — it’s sort of a pipeline for going from an IO object to a persisted file.  When you write a custom uploader, it isn’t hard-coded to a particular persistent back-end, rather a specific storage object is injected into an individual uploader instance at runtime.

A shrine attacher is the object that has state for the file. An attacher knows about the model object the file is attached to (a specific attacher instance is associated with a specific model instance).  An attacher has two uploaders injected into it — one for the temporary cache storage and one for the permanent store storage. These are expected to be the same class of uploader, just with different storages injected.  An attacher has ORM plugins that handle actual persistance to the db, as well as tracking changes, and just everything that needs to be done regarding the state of a particular file attachment.

In a typical model, you can get access to the attacher instance for an asset called avatar from a method called avatar_attacher. The avatar method itself is essentially delegated through the attacher too. The attacher is the thing managing access and mutation of the attached files for the model.  If you ask for avatar_attacher.store or avatar_attacher.cache, you get back an uploader object corresponding to that form of storage — to be used to process and persist files to either of those storages.

How do those methods avatar and avatar_attacher wind up in the model?  A ruby module is mixed in to the model with those methods. Shrine calls this mix-in module an “attachment”. When you do include MyUploader::Attachment.new(:name_of_column) in your model, that’s returning an attachment module and mixing it into your model.  I find “attachment” not the most clear name for this, especially since shrine documentation also calls an individual file/bytestream an “attachment” sometimes, but there it is.

And finally, there’s the simple UploadedFile, which is simply a model object representing an uploaded file! It can let you get various information about the uploaded file, or access it (via stream, downloaded file, or url).  An UploadedFile is more or less immutable. It’s what you get returned to you from the (eg) avatar method itself.  An UploadedFile can be round-trip serialized to json — the json that is persisted in your model _data column. So an UploadedFile is basically the deserialized model representation of what’s in your _data column.

It’s important to remember that shrine uses a two-step file persistence approach. There is a temporary cache storage location that has files that may not pass validation and may not yet have been actually saved to a model (or may never be).  The file can be re-displayed to a user in a validation error when it’s in “cache” for instance. Then when the file is actually succesfully permanently persisted attached to a model, it’s in a different storage location, called the store.

Tracing what happens internally when you attach a file to an ActiveRecord model using shrine

Most of this will be relevant regardless of ActiveRecord, but I focused on an ActiveRecord implementation. My demonstration app used to step through uses a bog-standard Shrine uploader, with no plugins (but :activerecord).

class StandardUploader < Shrine
  plugin :activerecord

Just to keep things consistent, we attach to a model on the “standard_data” column, with accessor called “standard”.

  include StandardUploader::Attachment.new(:standard)

What is shrine doing under the hood, what are the different parts, when we assign a new file to the model?  We’ll first do model.standard = File.open("something"), and then model.save.

First model.standard = File.open("something")

The #standard= is provided by the attachment module mix-in, and it calls  asset_attacher.assign(io_object)

If it’s NOT a string, assign first does: `uploaded_file = [attacher.]cache!(value, action: :cache)` (What’s up with ‘not a string’? A string is assumed to be serialized json from a form representing an already existing file. The assign method assumes it’s either an IO object or serialized JSON from a form; there are other methods than `assign` to directly set an UploadedFile or what have you).

The cache! method calls uploaded_file = cache.upload(io)cache points to an instance of our StandardUploader configured to point at the configured ‘cache’ (temporary) storage, so we’re calling upload on an uploader.

[cache uploader]#upload calls processed to run the IO through any uploader-specific processing that is active on the “cache” stage.

Then it calls #store on itself, the uploader assigned as `cache`. “Uploads the file and returns an instance of Shrine::UploadedFile. By default the location of the file is automatically generated by #generate_location, but you can pass in `:location` to upload to a specific location. [ie path, the actual container storage location is fixed though]”  The implementation is via an indirection through #_store, which:

1.  calls get_metadata on itself (an uploader), which for a new IO object calls extract_metadata, which is overridden by custom metadata plugins. So metadata is normally assigned at the cache/assignment phase. This is perhaps so the metadata can be used in validation?  Not sure if there’s a way to make metadata be in the background, and/or be as part of the promotion step (when copying cache to store on save) instead. There’s some examples suggesting they are relevant here, but I don’t really understand them.

2. Calls #put on itself, the uploader. put by default does nothing but call #copy on the uploader, which actually calls #upload on the actual storage object itself (say a Shrine::Storage::FileSystem), to send the file to that storage adapter — in this case for the configured cache storage, since we started from cache on the attacher. (Some plugins may override put to do more than just call copy). 

3. Converts into a shrine UploadedFile object representing the persisted file, and returns that.

So at this point, after calling attacher.cache!, your file has been persisted to the temporary “cache” storage. attacher.cache! purely deals with the stateless uploader and persisting the file; next is making sure that is recorded in your model _data attribute.

[attacher].assign then does ‘[attacher.]set(uploaded_file)’, where uploaded_file is what was returned from the previous cache! call. set first stores the existing value (which could be nil or an an UploadedFile) in the attacher instance variable @old, (in part so it can be deleted from storage on model persistence, since it’s been replaced).  And then calls _set to convert the UploadedFile to a hash, and write it to the _data model attribute — so it’s there ready for persistence if/when the model is saved.

So after assignment (model.standard = File.open("whatever")), the file is persisted in your “cache” storage. The in-memory model has asset_data that points to it. But nothing about that is persisted to your model’s persistence/ORM.  If the model previously had a different file attached, it’s still there in the store storage.

Let’s see how persistence of the new file happens, by tracing the ActiveRecord ORM plugin specifically, when you call model.save.  First note the active_record plugin makes sure shrine’s validations get used by the model, so if they fail, ActiveRecord’s save is normally going to get a validation failure, and not go further. If we made it past there:

In an active_record before_save, it calls attacher.save if and only if the attacher is changed?, meaning has set the @old ivar of previous value (could be nil previous value, but the ivar is set). However, the default/core implementation of save doesn’t actually do anything — this seems mainly here as a place for shrine plugins to hook into actually “before_save”, in an ORM-agnostic way.  (Might have been less confusing to call it before_save, I dunno).  The file is not moved to the permanent storage (and the old file deleted from permanet storage) until after the model has been succesfully persisted.

Then ActiveRecord’s own save has happened — the file data representing the new file persisted in temporary cache has now been persisted to the database.

Then in an active_record after_commit, finalize is called on the attacher. finalize is only called if  @old  is set — so only if the attached file was changed, basically.

The [attacher.]finalize method itself immediately returns if there is no “@old” instance variable set. (So the check with changed? in the hook is actually redundant, even if you call finalize every time, it’ll just return. Unless plugins change this).

Then finalize calls [attacher.]replace. Which — if the @old instance variable is not nil (in which it’s an UploadedFile object), and the object was in the cache storage (it must be in store storage; checked simply by checking the storage_key in the data hash) deletes the old value. “replace” in this case actually means “delete old value” — it doesn’t do anything with the new value, whether the new value is in cache or store. (not to be confused with a different #replace method on UploadedFile, which actually only deals with uploading a new file. These are actually each two halves of what I’d think of as “replacement”, and perhaps would have best had entirely different names — especially cause they both sound similar to the different “swap” method). 

The finalize method removes the @old ivar from the attacher, so the attacher no longer thinks it has an un-persisted change. (would this maybe be safer AFTER the next step?)

finalize calls `_promote(action: :store) if cached?` — that is, if the current UploadedFile exists, and is associated with the cache store.   [attacher.]#_promote just immediately calls promote —  both of these methods can take an uploaded_file argument, but here they are not given one, and default to the current UploadedFile in this attacher, via get

[attacher.]promote does a `stored_file = store!(uploaded_file, **options)`.  Remember the `cache!` method above? `store!` is just the same, but on the uploader configured as `store` storage instead of `cache` storage — except this time we’re passing in an UploadedFile instead of some not-yet-imported io object. Metadata extraction isn’t performed a second time, because, get_metadata has special behavior for UploadedFile input, to just copy existing metadata instead of re-extracting it.

At this point, the file has been copied/moved to the ‘store’ storage — but another copy of the file may still exist in cache​ storage (in some cases where the cache and store storages are compatible, the file really was moved rather than copied though), and no state changes have been made at all to the model, either in-memory or persisted, to point to this new file in permanent storage.

So to deal with both those things, [attacher].promote calls [attacher.]swap, which is commented as “Calls #update, overriden in ORM plugins, and returns true if the attachment was successfully updated.” In fact, the over-ridden attacher.update in the activerecord plugin just calls super, and then saves the AR model with validate:false. (I am not a fan of the thing going around my validations, wonder what that’s about).

Default update(uploaded_file) just calls _set(uploaded_file).

_set pretty much just converts the UploadedFile to it’s serializable json, and then calls write.

write just sets the model attribute to the serializable data (it’s still not persisted, until it gets to the ORM-specific update, where as a last line the model with new data is persisted).

so I think attacher.swap actually just takes the UploadedFile, serializes it to the _data column in the model, and saves/persists the model. Not sure why this is called swap. I think it might be more clear as “update” — oops, but we already have an update, which is by default all that swap calls. I’m not sure the different intent between swap and update, when you should use one vs the other.  (This is maybe one place to intervene to try to use some kind of optimistic or pessimistic locking in some cases)

If swap returns a falsey value (meaning it failed), then promote will go and delete the file persisted to the store storage, to try and keep it from hanging around if  it wasn’t persisted to model.  I don’t totally understand in what cases swap will return a falsey value though. I guess the backgrounding plugin will make it return nil if it thinks the persisted data has changed in db (or the model has been deleted), so a promotion can’t be done.

overview cheatsheet

pseudo-code-ish chart of call stack of interesting methods, not real code

model.avatar=(io)   =>  avatar_attacher.assign(io)

↳ uploaded_file = avatar_attacher.cache!(io)

↳  avatar_attacher.cache.upload(io) => processes including extracting metadata and persists to storage, by calling avatar_attacher.cache.store(io)

↳ io = uploader.processed(io)

↳ io = uploader.store(io) => via uploader._store(io)

↳ get_metadata

↳ uploader.put(io) => actually file persists to storage

returns an UploadedFile

↳ avatar_attacher.set(uploaded_file)

↳ stores previous value in attacher ivar “@old”, puts serialized UploadedFile in-memory avatar_data attribute


an activerecord before_save triggers avatar_attacher.save iff attacher.changed? (has an @old ivar). Core attacher.save doesn’t do anything, but some plugins hook in.

activerecord does the save, and commit.

an active_record after_commit triggers avatar_attacher.finalize iff attacher.changed?

↳ attacher._promote/promote iff  attacher.changed?

↳ stored_file = avatar_attacher.store!( UploadedFile in-memory )

↳ see above at cache! — extra metadata, does other processing/transformation, persists file to store storage, updates in-memory UploadedFile and serialization.

 ↳ attacher.swap(newly persisted UploadedFile)

↳ attacher.update(newly persisted UploadedFile) => just calls _set(uploaded_file), which properly serializes it to in-memory data, and then in an activerecord plugin override, persists to db with activerecord.

Some notes

On method names/semantics

“Naming” things is often called (half-jokingly half-serious) one of the hardest problems in computer science, and that is truer the more abstract you get. Sometimes there just aren’t enough English words to go around, or words that correctly convey the meaning. In this architecture, I think both the replace methods probably should have been named something else to avoid confusion, as neither one does what I’d think of as a “replace” operation.

In general, if one needs to interact with some of these methods directly (rather than just through the existing plugins), either to develop a new plugin or to call some behavior directly without a plugin being involved — it’s not always clear to me which method to use. When I should use swap vs update , which in the base implementation kind of do the same thing, but which different plugins may change in different ways? I don’t understand the intended semantics, and the names aren’t helping me. (promote is similar, but with an UploadedFile which hasn’t yet been processed/persisted? Swap/update takes an UploadedFile which has already been persisted, for updating in model).

It is worth noting that all of these will both change the referenced attached file on a model and persist the whole model to the db. If you just want to set a new attached file in the in-memory model without persisting, you’d use “attacher.set(uploaded_file)” — which requires an UploadedFile object, not just an IO. Also if you call set multiple times without saving, only the penultimate one is in the @old variable — I’m not sure if that can lead to some persisted files not being properly deleted and being orphaned?

Shrine plugins do their thing by overriding methods in the core shrine — often the methods outlined above. Some particularly central/complicated plugins to look at are backgrounding and versions (although we’re hoping to change/replace “versions”) — they are very few lines of code, but so abstract I found it hard to wrap my head around.  I found that the understanding of what unadorned base shrine does above was necessary  to truly understand what these plugins were doing.

Are there ways to orphan attached files in shrine?  That is, a file still stored in a storage somewhere, but no longer referenced in a model?  For starters the “cache” storage is kind of designed to have orphaned files, and needs to have old files cleaned out periodically, like a “tmp” directory. While there is a plugin designed to try to clean up some files in “cache”, they can’t possibly catch everything — like a file in “cache” that was associated with a model that was never saved at all (perhaps cause of validation error) — so I personally wouldn’t bother with it, just assume you need to sweep cache, like the docs suggest you do.

Are there other ways for files to end up orphaned in shrine, including in the “store” storage? If an exception is raised at just the wrong time?  I’m not sure, but I’d like to investigate more. An orphaned file is gonna be really hard to discover and ever delete, I think.


Proposed Rails-based digital collections developer’s toolkit

In my last post, I explained my read of the lay of the samvera land, and why I’m interested in pursuing an alternate approach. We haven’t committed to the path I will outline, but are considering it.

First we should say that I am coming from the assumption of an institution that does want to do local development, to get full control over the application and be able to custom-fit it to their needs. Having a local development team requires resources and perhaps some organizational cultural changes to properly involve technical decision-makers in technical decisions. But it’s my own opinion that developing with hydra/samvera (or anything based on Rails) has always required a local development team; if you don’t want one, you would be better off looking at a more off-the-shelf, perhaps proprietary product.

So, reacting to experiences with what have seemed to me to be overly-complex architectures, one question to start from is — can a development team “just build a Rails app”?  Maybe, but I think some common patterns and needs in digital collections/repositories can be identified, to be turned into shared dependencies, to better organize the initial app and hopefully provide efficiencies for more apps over time.

If the design goal of an architecture of re-usable abstractions is to optimize between maximized code re-use (avoid re-inventing the wheel), and minimized architectural complexity (number of abstractions, complexity of abstractions, general number of novel mental-models) — I believe we can make a go of it by sticking as closely to Rails as possible, trying to add things carefully that have clear benefit for our domain.

One possibly counter-intuitive proposition here is that, rather than trying to share as much code as possible and minimize code in local apps, we can make end-to-end development more efficient and less costly by putting less code into shared dependencies, so we can spend more time making those shared modular components as well-designed for re-use and composability as possible.  Fighting with integration into a complex architecture (that includes many features your app doesn’t need) can easily erase any hypothetical gains from shared code.

So the proposal here is neither to provide a complete working app or “solution bundle” (which might be hyrax approach), nor to write a completely custom app with no domain-specific parts to be shared with the community (which might be what Steven Anderson was at one point suggesting for BPL, or what the “bespoke” Valkyrie-based apps lean towards at present), but instead to write a local app in tandem with some relatively small and tight shareable, re-usable, composable, modular components.  (This is similar to the original “hydra head” approach;  we think we can do this more succesfully as a result of what we’ve learned, and technical choices to keep things much more simpler, recognizing that a local development team will need to do development).

The hypothesis or proposition is that this can get us to an efficient cost-effective architecture for building a digital collections site ultimately quicker than trying to get there from iterating on existing code. Because we can try to be ruthless about minimizing abstractions and code complexity from the start, with the better knowledge of our domain we now have — avoiding legacy code that arguably already has some unneccessary complexity baked in at a fundamental level.

The trade-off, is that by starting from a “clean slate”, we have a long period of developing architecture with no immediate pay-off in adding features to our production app.  There is a hump of lowered productivity before reaching — if we are successful — the goal of overall increased productivity.  This is a real cost, and a real risk that our approach may not be as beneficial as we think — you have to take some risks to achieve great benefits.

This post will lay out a proposed plan in three parts:

  1. Some principles/goals, illustrating further what we mean by a Rails-based developer’s toolkit for our domain
  2. A high-level outline/blueprint of the architectural components I currently anticipate.
  3. Some analysis of the benefits and especially risks and potential reasons for failure of this plan.

If you are interested in this plan in any way, please do get in touch. 

Principles and Goals: What do we intend by “A Rails-based developer’s toolkit”?

The target audience is developers. Developers will use this tool to put together a digital collections/repository app.   It will not produce “shrinkwrap” software for non developers, this is tools for developers.

⇒ Realistically probably minimally requiring a team of 2-4 technical staff (likely including a devops/sysadminy role). I suspect this is what is generally required for a successful hyrax or other samvera project already/too.  By acknowledging there will be need for local development team, we can avoid the complexity of trying to make “declarative-config-only” layers.

⇒ “developers” doesn’t necessarily mean expert ruby/rails developers. We can try to target relatively beginner Rails devs, including those for whom this is their first Rails or ruby project. As with any endeavor, the more experience and skill you have, the more efficient and high-quality work you can do, but I am optimistic we can make the bar fairly low for getting started.

We try to add tools to make “just building a Rails app” easier for digital collections/repository domain, supplying components based on common needs in this domain.  But where possible, we do things “The Rails Way”, what is most natural or typical or common for Rails apps, providing enhancements to it rather than replacements.

⇒ Things people often do on top of Rails, such as “form objects” should be options that are ideally no harder to choose to use than with any other Rails app, but not required or assumed.

⇒ There’s already a lot to learn in just learning how to effectively build a performant, secure, and reliable app on Rails (or a web app in general). We will try to minimize extra abstractions and architectures.

⇒ Ideally making it feasible to add developers (or hire project-based consultancies) experienced in Rails but not our industry/domain to the project, and have them get up to speed relatively quickly (joining a legacy app is rarely easy even for experienced devs). And the path for beginner library developers learning to use the stack will, as much as possible, be learning Rails.

⇒ It’s not that we’re suggesting everything in Rails is designed perfectly or is always easy to work with or learn.  Rather we’re suggesting these kind of flexible APIs can be a hard design problem, and Rails is stable, mature, polished, well-understood, performance-tuned, with an ecosystem of tutorials and third-party code. Trying to reinvent “better” alternatives has significant development costs and will take time and iterations to reach Rails’ maturity, that are best avoided where possible.

This will be a library or toolkit of components, not a fully integrated “solution bundle” or framework. 

⇒ “You call a library, but a framework calls you“.  Rather than providing API to hook into and customize small parts of a large integrated stack, the library/component approach is to try and provide re-usable modular blocks (legos?) that you can put together to build what you (the developer) want.

⇒ In many cases, you’ll customize from the top down instead of from the bottom up. If you want to customize a view, you might override the entire view  and then re-build it, possibly using the same components the original was built with but composed and configured in different ways.  Instead of trying to figure out a way to ‘override’ a part inside without touching anything else.

⇒ Rather than be a forwards-compatibility nightmare, we think targetting this mode of work can actually reduce your local maintenance burden and upgrade paths, by having simpler more stable code in the shared dependencies. In our community experience, the alternate approach has not in practice seemed to result in better forwards-compatibility.

⇒ Out of the box without writing code, we do aim to give you something that runs, this isn’t just components dropped on the floor — but it’ll probably be a “proof of concept”, maybe not even an “MVP” you could deploy in production at all. Think Rails scaffolding, but specifically for our domain.  Many parts may need to be customized or added (and some parts will certainly need to be) to arrive at the app well-suited for your business and user needs, but scaffolding can be easier to get started with than a blank slate.

=> Again, we are assuming an institution interested in a local development team that can customize to your particular business and user needs. If you don’t want to do this, you might be better served by a more “shrinkwrap” ready to go (but less flexible/customizable) solution — that, IMO, is likely not historical or future samvera software either, or any Rails-based approach. It is likely a paid-development-and-support product, perhaps proprietary and/or hosted.

(Incidentally, this is in some ways a reversal of some things I myself thought some years ago when I participated in developing early versions of Blacklight. Then I thought any  generated code was a bad thing, because it could go out of date. And any copy-and-paste of code was a bad thing, for similar — the goal was to have the local app have as little code in it as possible.  I have come to realize while that goal seems appealing, achieving it succesfully is very hard, and a less successful attempt at it can actually be more costly, and it may counter-intuitively make sense to try to minimize the code in the shared dependency instead.

We prioritize solid modular building blocks over features.  There are very diverse needs in our domain. Rather than try to understand let alone build-in all features needed accross our community, we try to give development teams the tools to lower costs when developing what they have discovered about their local requirements and priorities. We will prioritize only the most common needs, and be hesitant of added complexity for less common needs — or needs we haven’t yet figured out how to meet with re-usable modular components.

⇒ This doesn’t get us out of understanding domain needs,  we still need to figure out the common needs in our domain that can be addressed by our tools, and if we’re really wrong we won’t successfully produce a toolkit that lowers development cost.

⇒ But by trying to focus on the most common needs where we can provide modular, composable tools — we have less code, and thus more resources per abstraction to try to design them successfully to be re-composable to your needs, flexible beyond what could have been planned in explicitly. Which takes skill and time, and is feasibly only by ruthlessly focusing on simplicity.  We will do our best to design our tools so it’s clear how you can use them to build additional things you need (a workflow engine?) on top. 

⇒ The number of “built-in” features and components and higher-level abstractions will, if the project is successful, probably grow over time. But you can’t build a good skyscraper without a good foundation, the base-level components need to be solidly designed for re-use, in order to build higher-level on top.

We are targeting a digital collections use case much like our organization’s. Staff-only editing, not self-deposit. Relatively small number of different pages.  There should be no problem scaling number of users or number of documents right from the start, but we are not scaling complexity of requirements, or user interfaces. Our needs are outwardly pretty simple (although the devil is in the details) — customize our metadata schemas and derivatives, let staff ingest and enter metadata, let users find and view and download items, in an app that works well with search engines and on the web in general, with modern best-practice performance characteristics. We are targeting other organizations with similar needs. 

⇒ While we aspire to provide a solid enough foundation you could build more complex requirements on top of this, the toolkit would be of course be a proportionally smaller aid in that situation. While we think the toolkit can grow to handle more complex use cases eventually, and we’re keeping that horizon in view, but it’s not the priority.

⇒ We think this simpler digital collections approach — in addition to conveniently matching our local needs — is often close to a strict subset of more complex needs. If you have complicated workflow needs (whether involving staff or patron work), you probably need most of our simple use case target plus more. Starting here can get us something that works for we think a lot of the community and can be initial steps for more.

⇒ The toolkit is primarily focused on providing tools to aid in the ingesting and management of digital assets and metadata.  I think this is what people have come to samvera for, and is in some sense the “hard part”. While end-user discovery is very important — and the toolkit will provide basic discovery — I think it’s in some ways an easier problem, with more well-understood practices and technique for development — especially if we can provide for it to be done in terms of ActiveRecord models.

We will think of developer and operational use cases from the start. We don’t want to have to be the experts in every end-user use case in the domain, but we want to try to be the experts in  developer and operational use cases in building apps in this domain.  If a design choice makes deployment or operations harder, it may not be a good choice. The users for this toolkit are developers, who use it to meet end-user needs.

⇒ We will plan for performance from the start (avoiding n+1 queries, providing clear paths to use standard Rails caching techniques, etc.).

⇒ We will plan for a clear “story” about deployment architecture from the start —  we will plan for cloud deployment from the start (such things as not requiring any persistent file systems–putting everything on eg S3–and not making assumptions about how many machines different services are divided upon) — but not multi-tenancy (too complicated). We will consider from the start ease of use and efficiency throughout your product life cycle.

⇒ With less code to document and maintain  in the toolkit, we hope it becomes more feasible to maintain good non-stale documentation, and components with very good long-term stability and compatibility — reducing total cost over time of both developing and maintaining not only the toolkit itself, but counter-intuitively an app based on the toolkit as well. In practice, less shared code can be more efficient use of both community and local developer resources than more.   

Outline of Architectural Components

Modelling/Persistence: json_attr-based

Use attr_json (which I wrote) to store object metadata in a schemaless fashion as serialized json. Supported nested/compound objects, with rails-style form support, dirty-tracking, and other features meant to let you use as you would an ordinary ActiveRecord model.  attr_json is at the heart of this plan.

Modelling will be based on PCDM (Collections; Objects; Files), but we may not feel the need to stick strictly to PCDM. For instance, we may make work->child relationship 1-to-many instead of n-to-m, if it significantly eases a robust implementation. Or allow an internal object representing a ‘file’ to be an element in the ‘member’ relation.

We may likely put all three core model types in one table using Rails Single Table Inheritance — that significantly eases ActiveRecord association modelling, if you want a Work/Object’s children to be both ordered, and polymorphically include other Work/Object’s or Files (again not strictly PCDM). attr_json addresses some of the ordinary disadvantages of STI, since varying data can be in the shared json column. (I believe valkyrie activerecord takes similar one-table approach, although it does not use AR associations — I think AR associations are a powerful tool I’m loathe to give up).

We will plan from the start to prioritize efficient rdbms querying over ontological purity. Dealing with avoiding n+1 (or worse) when doing expected things like displaying a list of objects/works with thumbnails, or displaying all a work’s members with thumbnails. In some cases we may resort to recursive CTEs; some solution will be built into the tool, this isn’t something to make developer-users figure out for themselves each implementation.

While attr_json doesn’t technically require postgres, we will likely require postgres for use of toolkit, so when providing out of the box solutions to common needs (including efficiency/performance), we can use features with database-specific elements like jsonb, recursive CTEs, and postgres full text search.

We may provide a six-alphanumeric-primary-key approach similar to sufia/hyrax/samvera, possibly implemented with postgres functions. One way or another, we need to migrate our existing data and keep URLs and possibly internal IDs consistent.


Basic CRUD functionality, and possibly not a whole lot more, will be provided by controllers that look an awful lot like the out-of-the-box Rails scaffolding controllers.

Additional hooks will likely be included for customizing certain things (authorization, strong params) — but the goal is to keep the controller simple (and unchanging-over-time) enough that a developer-user wanting to customize will be safe to simply copy-paste the entire controller and modify it appropriately. Any complex or domain-specific functionality in the toolkit will be in service/helper objects that can be re-used in a local controller too, rather than the controller itself.

Config: chamber

Where our toolkit code needs to allow deployment/institution-specific configuration, I’m leaning toward using chamber. The built-in Rails stuff is kind of weird, and has been changing a lot with every Rails version (because it’s kind of weird and they keep trying to make it less so).  But chamber is suitably flexible that it should be able to meet almost any consuming institution’s needs. We may provide some extensions to chamber.

Asset management: shrine

In addition to metadata modelling and persistence, handling bytestreams/digital assets is the other fundamental core part of our domain. Both originals (often preservation copies) and derivatives.

Shrine is a “file attachment toolkit” for ruby, with Rails integration included. Shrine was motivated by lack of flexibility in some additional file attachment solutions. The result is something that is accurately called a toolkit — while it can be a bit harder to get started with in a fresh Rails app, it provides composable components that can be rearranged to meet domain needs, which makes it very well-suited for our toolkit, where flexible asset-handling is important to reducing total cost of development. While Rails has introduced ActiveStorage as a built-in solution, we don’t think it’s flexible enough for our asset-centered domain.

Shrine already handles storing files in a back-end agnostic way (we will focus on S3 for production and optionally local file system for dev, but it should be feasible for you choose other shrine adapters with minimized changes to other code needed). Shrine already handles streaming, full-file, and URL access APIs regardless of back-end. Shrine already has some architecture built out for ultra-flexible derivatives, storing checksums, etc. (Reflection on what derivatives are already there should make it easier to build out, say, a UI for menu of downloads based on what derivatives you have created and tagged as a download).

The current Shrine derivatives architecture (which it calls “versions”) doesn’t quite match my idea of requirements: I want to be able to optionally store derivatives in  a different backend than originals (eg different S3 bucket), or even different buckets for different derivatives. The shrine author has given me some advice on how to achieve that, and it also aligns in some ways with existing discussion on desired rewrite of the shrine “versions” plugin. I will likely spend significant time on developing (and documenting) a new shrine derivatives/versions plugin meeting our needs, and hopefully to be contributed back to shrine possibly as new standard plugin.

Additionally, while shrine itself prioritizes solid fairly low-level API (to achieve it’s composability/flexibility goals), I think our toolkit needs some slightly higher-level API, probably also developed as a shrine plugin, that lets you define derivatives in a more declarative way, making it easy and more DRY to do things like create/re-create/delete named derivatives for already existing objects, or declaratively specify foreground/background/on-the-fly processing.

The toolkit will likely provide it’s own recommended combination(s) of shrine config, as ‘scaffolding’, either as toolkit-specific shrine plugins, generated code, or both. It will be easy to change much of this config locally.

A metadata properties configuration system

In some current samvera platforms, when you add a new metadata property, there can be a dozen+ different files — in some cases depending on local architecture decisions — you need to edit to mention the new property. This is not only annoying, but error-prone to try to keep changes consistent across all those files.

I think this is a big enough barrier to developer ease of use, common enough to almost all uses, that it deserves an architectural solution.

We will provide an extensible properties configuration system that lets you configure multiple things about the property in the original model. It might look something like this:

class Article < Work
  our_tookit_property "title" do
     attr_json :string, array: true
     rdf_predicate "http://purl.org/dc/terms/"

     indexing do
       to_field "title_text", first_only

     edit_form do
       position: 0
       simple_form wrapper: :small

     include_in_results_list: true
     include_in_detail_page: true

The biggest potential problem with such a “centralized” property configuration system is — what if you need different configuration in different situations? Say one form field type in one place and a different in another, or even multiple indexers for different contexts.

This system will be from the start designed to support that too — what is in the model could be considered defaults, but all components using these values from properties configuration will take the values in a clearly documented format such that a developer can alternately choose to pass in alternate values than what was in the model property registration.

Edit forms: simple_form

simple_form is a Rails form toolkit. We can use it to make forms ‘just work’ (with field-level error reporting etc), including for some of our custom toolkit features (nested models), in a composable and overrideable way.  We will follow in the footsteps of hyrax and many other software in using it, probably set up for Bootstrap 4.

We will provide a custom form builder that can use (in some cases automatically, from properties definitions, see above) a suite of custom inputs for features built into the toolkit or common to our domain. Including nested models, repeatable elements (perhaps using cocoon, which attr_json is already compatible with), vocabulary auto-complete (probably based on questioning_authority, but built out to save vocabularies and IDs in nested models, instead of just saving values; we may need to send some PRs to qa if it’s APIs aren’t suitable).

We’ll also provide custom simple_form inputs for file upload (supporting direct-to-s3 uploads in a shrine-compatible way; hopefully supporting browse_everything), and permissions editing (see below).

We will probably provide a wrapper form for all the inputs that can use information from the property registrations (see above) to specify order of inputs.

We’ll provide a custom simple_form input for setting relationships that lets you search for core model objects (collections, works, files) with an autocomplete-style UI, and assign them to associations.

If you want to customize this beyond the simple things you can do, you’ll just use your own form view where you compose the simple form inputs yourself. The built-in form can be considered scaffolding to quickly get started, although it may be sufficient for simple use cases.

Staff (back-end) UX

Staff UX needs go beyond just forms — for one thing you need a way to find/get things to click on to get their edit forms. We will provide some limited back-end UX: ability to list/search/sort collections/works/files, and limit to just things you have certain permissions on.

Back-end UX is going to be kept pretty limited. Specific applications can build it out themselves if they need something beyond the basic scaffold.

The toolkit will need UX for adding/removing/re-ordering members of a parent. It will probably also provide a batch ingest/edit function of some kind, because that does seem to be a common need, and is one my institution needs.

We’ll need to do a bit more local staff user/requirements analysis to see what the minimum we need is (info from others with similar use cases welcome too), and decide which additional parts the toolkit should provide or just make clear how you’d provide them yourself. We aspire to provide high-quality staff UX for the simple targetted use cases.


Flexible and sane permissions/authorization system is very challenging, many efforts have run aground on it, and complete analyses of requirements can be very complex.

But we’re going to try to create a system anyway.  It will be based on an ACL model. Each ACL entry will relate an object (collection, work, file; “object”), a user or group (“subject”), and an operation (read, write, etc). They’ll be represented as ordinary normalized schema db objects (three-way join).

In addition to user or group as subject, we’ll have special subjects for “all logged in users”, and “public” (don’t even need to be logged in).  These will probably still be just ACL entry rows.

We’ll have a built-in list of hiearchical operations (hieararchical meaning if you have a higher one, it implies all the lower ones).  Which will likely be, in order from least powerful to most powerful:

  • list (see in results and lists)
  • read (see show page)
  • download (download assets)
  • add_member (eg if Collection is object, gives you ability to add things to it)
  • edit (edit metadata but _not_ permissions)
  • own (edit permissions)

The built-in controllers (and end-user-facing discovery) will out of the box do the right things with these permissions, but these will also be developer-editable, you will be able to add additional operations to the hieararchical list, as well as use additional operations that are not in a hieararchical list but are just stand-alone.  You’d have to edit (ie, probably create your own) controllers/views to make use of your new permissions — again we plan all of this as scaffolding which can be modified by writing code using the tools we give you.

While there will be an out of the box input UI element allowing you to edit these permissions directly, it’s expected that many apps will instead set them as part of workflows or other events, within UI  constrained to less than the full flexibility of the ACL system. A developer would do this just by writing code to set things ‘automatically’ in controllers and/or writing a more limited UI component. This system is a low-level acl architecture, it should support higher-level application-specific architectures on top.

APIs need to support fetching from the db with arbitrary permission limits (custom AR scopes); fetching from Solr with arbitrary permission limits (not neccesarily using blacklight_access_controls, we may likely create a new thing that works more the way we’d like, perhaps using solr joins, perhaps not); as well as checking can? on in-memory objects (possibly built on access_granted, which seems simpler and more performant than cancancan while still achieving what we need)

There will also be a superuser or admin class of user that can do everything.

The trickiest thing we do want to include as a requirement is a way for objects to inherit permissions from another object. This needs to something set on the inheriting object, and generally be live,  updating automaticaly if the inherited-from object’s permissions change. This is pretty tricky to get right, with decent performance on both writes and reads. Not sure if we’ll do it in a way that inherited-from permissions are “cached” on inheriting object (and need to be updated on writes), or instead just a persisted “pointer” to the inherited-from object (which needs to be followed on access-checking reads).  This is really hard to figure out how to do simply and performant, but also, we think, a key domain requirement, with previous workarounds that have led to lots of code complexity, so it’s important to figure it out from the start and not try to shoehorn it in later.

(Inherited permissions are even harder taking account of the fact that an object may want to inherit from another object, where that other object itself inherits. (file inherits permissions from work inherits permissions from collection?))

End-user (front-end) UI: Blacklight

We will use Blacklight (and Solr) for end-user discovery interface. But we will try to keep it as loosely coupled as possible, in case individual implementations using the toolkit would rather use something else (or nothing at all), or in the case the toolkit later chooses to support something else.

We’re going to try to make the staff/back-end UX actually not use Solr/Blacklight at all. Instead searching can be supported by postgres full-text search. This means you won’t get facets in the out of the box back-end UX (but can have limits in the UI). It also means if you have a very different front-end desired, you won’t need to run Solr at all for back-end functionality.

We won’t generally be automatically customizing the Blacklight UI, implementers will do that themselves using ordinary Blacklight techniques without mediation of toolkit-specific abstractions.

But there’s one big exception.  While the results list is of course based on solr results, data shown in views (both results and what you get when you click on a result) will not be based on solr — the app will take the IDs returned in the Solr response, and re-fetch them from the rdbms, even for results lists.

This may sound odd, but is actually how the popular generic Rails Solr support gem sunspot works, so there’s some precedent. I think it will allow the software architecture and developer’s mental model to be much simpler, with less duplication or parallel solr vs rdbms implementation — and the performance hit of the extra db query are minimal, especially in the context of legacy samvera performance. This approach lets you deal with n+1 issues purely in terms of rdbms and not duplicatedly on the solr side — with the same techniques whether you are on a results list page or an individual show page. It also lets you index into solr only what you need for query results, and not try to also put enough in solr for efficient display — using solr for what it’s best at, and simplifying and focusing your solr indexing decisions.

BL will be customized/overridden just enough to have the controller do this extra fetch, and use views based on our AR models, while providing customization hooks for local apps to customize views on per-object basis.

Indexing to Solr

Indexing will use hooks into ActiveRecord model life-cycles adapted from sunspot. (Sunspot itself is way over-featured/heavyweight compared to what we need, and is looking for new maintainers, so we won’t be using it directly. But it’s a mature Rails/Solr integration solution that has had a lot of hours put into it, so we will be looking to it for ideas and in some cases code to copy).

Indexing will be based on traject. Some additional architecture in traject 3.0 (I am the principal traject developer)  will make it easier to integrate here, but may still need a few pieces of architecture (like a “reader” based on ActiveRecord objects, and some transformation tools based on ruby objects as source records).  Basing on traject should make it straightforward to have really performant bulk (re-)indexing routines, as well as the ordinary model-lifecycle-event indexing triggers.  You’ll be able to do simple indexing configuration in the model “properties registration”, or more complex stuff in standalone indexer objects.

You will of course easily be able to turn off indexing entirely if you aren’t using blacklight/solr, or, still using our AR-lifecycle-hooks, replace the indexer code with something entirely custom, say, for a non-solr indexing back-end.  You’ll probably turn it off simply by setting the indexer class to nil.

Indexing should be configurable to happen asynchronously or synchronously. (Synchronous is required you need it to be immediately reflected on the editor’s next page view;  which is one reason we’re trying to keep solr out of our back-end staff interfaces, because async makes things so much easier to do performantly). Ideally it should also be set up in a ‘batched’ way so multiple solr doc changes that happen on one save can be sent to solr in one request, but we may not achieve that in initial release, although we’ll keep in mind ways we might use traject APIs to achieve it.

To the extent we use Solr dynamic fields, we’ll try to use the ones already defined in default Solr schema. It will also be trivial to simply specify your custom solr field names. As much as possible, we’ll avoid need for any custom solr schemas.

Preservation-related features

Our current sufia-based app has only basic what we could consider preservation-related features, so the baseline minimum for an MVP 1.0 of the toolkit is likewise basic.

Mainly “fixity audit“. It is easy and documented to have shrine store checksum signatures (or with S3 direct upload and/or client-side calculation of checksums!). For checksums that can be calculated in a streaming fashion, we can even use shrine’s streaming API to make validating signatures much more efficient. We will support storing and checking a handful of checksums. The toolkit will support logging these checks and some UI visualization of them, based on the work I did to fix the feature i activefedora, and some of our local features. And tasks for bulk fixity checking, which can possibly be done an order of magnitude faster in hyrax. Where it makes sense, some work can be off-loaded to S3.

As far as backup copiesour own current (Sufia/fedora 4-in-postgres) system is pretty much limited to bog-standard postgresql backups, and ordinary file system/S3 backups of digital bytestream assets. The initial toolkit release will not support anything more here.

Down the road

There are some features that will not be included in the initial MVP toolkit release (or our initial local app based on it), but for which we made architectural choices to try to facilitate down the line.

While fedora has some support for versioning, with some UI for it in hyrax, I think it’s got some oddities, and is relatively little used in the samvera community (We don’t really use it at all). So it’s not targetted as an initial requirement. At a later point, we could use the standard Rails paper_trail gem for actual metadata versioning (not sure if fedora-sufia/hyrax supported that or not).  By using standard ActiveRecord, we can use paper_trail, which has many many developer-hours put into it. It’s still not without some rough edges, which shows how challenging this feature is. Particularly around associations (which aren’t always handled great in fedora/hyrax versioning either, but the way the data is modelled in fedora/af/hyrax makes some things easier here, while other things harder).  One reason we chose schemaless attr_json-based modelling is to limit the number of associations/relationships involved. For actual bytestream/asset versioning, it seems likely what is built into S3 is sufficient.

It would be great to have import/export based on (at least theoretically) implementation-independent formats, which can be used as a preservation copy. Eventually I would like to see round-trippable (export, and then import in our same software to restore) BagIt serialization.  Which could be used for preservation backup. This is pretty straightforward using a ‘just write ruby’ approach, but has some challenging parts around associations among other things. (Note theoretically implementation-independent doesn’t mean any existing alternate implementation necessarily exists that can import them, which is practically true for most of our community’s current attempts here; but it still has preservation value to try to make it easier to write such an alternate implementation). 

Other things not included (at least initially)

Let’s emphasize again, there will be no built-in workflow support. The goal is instead to provide an architecture that doesn’t get in your way when you want to write your own workflow implementation.

There will also be no “notification”/”activity feed” support in initial toolkit.

OAI-PMH is something our local app needs. It may or may not be included in the initial toolkit though, vs being a purely local implementation. Theoretically blacklight_oai_provider could be used (by a local implementation), but it may make some assumptions about the way you use your BL app that the typical toolkit app is unlikely to meet. The lower-level ruby-oai gem is also a possibility.

Embargoes/leases are probably not in the initial implementation, simply because our own app does not use them. If the toolkit comes to support them, I think they should be based on expressed boundary dates automatically enforced at read-time, rather than requiring a process to check leases/embargoes and set other access control information accordingly.

Our local app uses a custom locally-written JS pan-and-zoom paging viewer, instead of the popular UniversalViewer. The initial toolkit may not have built in support for either, but should have clear “developer stories” about how you’d add them.

At one point I had considered trying to use cells as a basis for the front-end, as I really like it’s architecture and think it would provide for easier customizability of initially shared resources — but I ultimately decided it was too new/unproven/new-to-me, and kind of violated our “when in doubt stick to rails” approach, and would raise the complexity (and difficulty of success) of the toolkit too much.

Similarly, I considered using a fancy modern Javascript view system for at least some parts of the UI, but decided against it on similar grounds. There’s just too much to figure out about best practice patterns in this context, at least from my experience, it would raise the difficulty of success too much.

In general, I don’t have a good handle on how we’re going to use the modern JS we will need in a Rails environment. Not sure if using the new Rails webpacker is the way to go, it might be, but I don’t have a good handle on it. The initial release may have a less than optimal javascript architecture.  

Analysis and Evaluation

If we are right that there’s a value proposition in a smaller, Rails-aligned, shared codebase (so we can make it really solid), and if we successfully figure out the right design of such a developer’s toolkit and pull off it’s implementation… then the proposition is that we’ll have a platform/toolkit for developing digital collections/repository applications very efficiently, throughout the entire application lifetime from your initial product launch, including operational infrastructure, and through continued maintenance and enhancement to meet evolving needs.

And further, while there are other approaches in our community that are in progress trying to reach this same goal, as discussed in the last post,  the proposition is that we can get to a mature, polished, efficient-developer-cost toolkit quicker starting over along these lines, compared to those other approaches.

But that’s a lot of ifs, and this is a potentially expensive project proposal. It’s hard to estimate, and more work needs to be done, but at this point I’d estimate 12-18 months to us having a 1.0 release of toolkit and an in-production app based on it.

How can one evaluate the chances of success? This post and the previous one tries to provide enough information and an argument that it is plausible. But ultimately, there’s no way around having experienced developers/engineers make a judgement call — like with all technical decisions, although this is a particularly weighty one. I could say that I personally have some demonstrated experience making developer tools which have proven to be low TCO over a long time period (bento_search, traject), but this is a much more ambitious plan, and it’s success is not guaranteed.  It’s maybe a bit of a “moon shot”, but when you want to go to the moon, sometimes it’s worth a gamble.

I think there’s no way for you, dear reader, to evaluate whether this is a worthwhile thing to possibly participate in, except the judgement of experienced developers/engineers. If your organization is making choices of technical platforms without basing them on understanding of local business/user needs, followed by significant decision-making weight given to the technical judgements of experienced engineers…. I would say you aren’t maximizing your chances of success in engaging in software development, and (I would argue) you are probably already doing software developoment if you are doing samvera-based apps.

There isn’t much to say about the “upside” other than that — these whole two articles have been an investigation of potential upside — but we can say a lot more about the risks and downsides. They are somewhat different for my own institution, starting with a sufia 7 app, and considering initiating this plan — than they might be for another institution considering coming on later in another context. And there are short-, medium-, and long-term categories of risk. I will try to deliniate some of those risks and costs.

Our Institutional Calculus

We have a sufia 7.4 app (on Rails 5.0, with sufia not supporting more recent Rails), and have to do something.  That effects our cost/benefit/risk calculation.

We could try to upgrade/migrate to Hyrax 2.2.0. This would definitely take less time than writing this new toolkit, but I think could still easily take several months. At the end of it, we’re still on the fedora/hyrax architecture that we’ve found so difficult to work with, so while we have more supported dependencies, we haven’t necessarily reduced our TCO or agility to produce new features. We could be hoping on hyrax eventually being valkyrie — this is more likely if we contribute development effort to get there. How much is hard to predict, as is, in my opinion, how much we’re going to like where we get when we get there.  And architectural work on hyrax to resolve some of the other challenges we’ve had with it goes beyond just getting it on valkyrie.

We could try rewriting our app based on valkyrie, plum->figgy style. This actually isn’t too different than the proposal here, we’re still rewriting an app from scratch. Using valkyrie instead of just active_record (and the attr_json gem which is already fairly polished) — it’s not clear to me this will make the development any easier. On the one hand, possibly we’d be able to share more code than just valkyrie with other valkyrie-using institutions — but it’s not entirely clear at this point how much or how mature/polished we can make it. On the other hand, developing on fairly young valkyrie instead of mature well-understood ActiveRecord will, I think, create additional cost, and, in my judgement, additional ongoing cost making the architecture more expensive to work with. If we’re not actually all that excited about ending up on valkyrie (having no desire to be able to switch to fedora), it’s not clear what the benefit here would be.

One surprising rarely-mentioned possibility: We could put in work to get sufia working on Rails 5.2 or upcoming 6.0, and do an unexpected additional sufia release. It’s not entirely clear how much work this would take, but it may be the cheapest option (still a couple several months?) — at the end, though, while we’ve solved our problem of being on a maintained Rails version, we’re still stuck with relatively unsupported/unmaintained software, using a stack we’re not happy with, with such complexity that it will probably continue to “degrade”.  This is sort of the decision to do the minimal amount of immediate work possible, avoid making a decision, but just push it down the road further.

If this proposed development toolkit plan works, we’ll be in a great spot. But it requires some significant time when we are spending development time on things that do not help our current production app, to get to a point where increased efficiency lets us catch up and ‘pay for’ the time we spent. And it involves some non-trivial risk.

I think for my local institution, if we want to, and believe we have the resources/context to, take the risks to be innovators and “thought leaders” here, it’s worth taking the bet on trying to develop this new architecture. If we discover it’s not working, we try something else — you can only get the greatest benefit by taking some risk. But if we want to play it safe and don’t think we can afford (politically, budgetarily) taking a risk, it may not make sense.

Short-term Risks: Failure to launch

It’s possible we simply could fail to finish this project. It could become apparent at some point in the development process that we will not achieve our goals, or that it’s going to take much longer than we hoped (like foreseeing ending up barely towards it after a year of effort).

This could be exacerbated if institutional needs require us to reduce the amount of time we spend on building out the architecture, or minimize the amount of time we can spend investing in something whose returns may be a year or more away.

It could possibly be addressed by trying to reduce the scope of the re-usable toolkit to just focus on getting our app launched (which one could say is the approach Princeton and PSU are taking), but there’s still some risk that as we try, we find it’s just not going to work.

Medium-Term Risks: Failure to Catch On

It makes sense to try to build a re-usable toolkit, rather than just our app, both as a service to the community, and for enlightened-self-interest reasons, to get (eventually) more developers/institutions working on it, and building a community of mutual-support around it.

We could get to the “finish line” of a deployed app, but we could find that other institutions/developers do not actually find it easy to learn or work with. It may not be as flexible as we aspire to for more generalized use cases, may not be applicable to as many other institutions’ business requirements, limiting the potential community.  As we’re intentionally prioritizing high-quality modular tools over features, our tools need to be successful at lowering cost of building software with them, or the project is not a success.

ActiveRecord may end up not being a good foundation for our needs — in proposing to use Single-Table Inheritance, we’re already using a feature that is sometimes considered a bit off the beaten path of Rails. ActiveRecord could end up getting in our way more than anticipated, and thus increasing development costs.  We think we can make a toolkit which a fairly beginner developer can use, but we may fail and end up with one that is confusing to beginners.

We could succesfully deploy our app based on the toolkit, but if we end up being the only institution using the toolkit, the cost/benefit proposition changes somewhat.

Succession Planning

Related to size of community is succession planning — if all your developers leave, and you need to bring new developers on to an existing project, how hard will that be?

There has been an assumption that by “doing what everyone else is doing”, you have a community of people who could step in. However, I think experience has shown this hope was very significantly inflated.

Taking over a ‘legacy’ project is always tough in software dev, always takes a somewhat experienced developer, and it’s a real issue to pay attention to in any software development effort.

The number of experienced-with-samvera developers who can easily take on a legacy project from another institution is fairly small, and most are centered in a few well-paying institutions. Last time we posted a developer position, we got no applicants with previous samvera experience at all, and almost no applicants with previous ruby or Rails experience.  Having to bring up a new developer with ruby, Rails, and samvera is not a low barrier. Both Esmé Cowles (“there are many more Rails developers than Samvera developers“) and Steven Anderson formerly of BPL (Have you tried showing someone else all that is involved in doing a good Hydra Head? …If either Eben or I left, it would take half a year for someone to become even “competent” in our stack.”spoke to the challenges of getting a developer previously familiar with Rails up to speed on existing samvera.

Regardless of whether a community were to develop around this toolkit, I actually feel pretty confident that succession planning is no worse and probably better for this approach than for all the other approaches being investigated discussed in the last post. This is one risk I think is actually not very high.

But one thing you get with samvera is also a community of people supporting your learning. Can we remain part of the samvera community of mutual-support doing a somewhat different thing? I hope so, and I think we’re all about to find out one way or another, because the community is trying different approaches with or without this one. But it’s not guaranteed.

Long-Term Risks: Failure to Support

Our proposition is that by creating simpler/smaller surface area toolkit, we can design it well enough that it can remain very stable and backwards compatible API. While I have some success there with bento_search and traject, we could fail, in a way we only realize years down the road.

One of the ways we propose to keep our toolkit simple is by relying on existing software. If (eg) shrine were to stop being maintained by it’s existing developer (or have releases with significant backward incompats), we’d be in some trouble, at the best significantly increasing cost of maintenance of the toolkit. Same with some of our other dependencies. While attr_json tries to use public and likely to remain stable Rails api, if Rails changes sufficiently to make it hard to keep attr_json working, that could also significantly increase development effort.  All of these things might only be discovered down the line.

On the other hand, the efficiencies of the toolkit could be enough to allow more of our institutional or community development time to be contributed back to general Rails dependencies, to help them stay maintained. The initial plan for instance involves contributing back a shrine plugin.

These long-term risks are to some extent common to any development project, and other approaches probably have similar levels of this kind of risk.

Snatching success from the jaws of failure?

Even if the project does not reach the level of success desired, in any of the ways outlined above, it might still provide ideas, patterns, approaches, that could influence other approaches in samvera community. In the best case, it could produce re-usable modular components that themselves could even be re-used in other approaches (valkyrie-based and/or hyrax? I’m not certain how plausible that is).

This can apply to “sharing with ourselves” too, if we decide to change approach in the middle or at any point, we may be able to re-use some of what we’ve already done anyway. (I think the shrine-based approach is a particular contender here; shrine itself doesn’t even require or depend on Rails).

Because we aim to produce reusable modular composable components with loose coupling, based on the commonality of Rails, it may increase the likelyhood of some code-sharing. On the other hand, if other approaches aren’t using ActiveRecord, both they and us may find ourselves more coupled to our persistence layer API than we’d like — it can be hard to avoid persistence-layer-approach coupling.

Interested? What’s next?

While the final portion of this post was investigating risks and possible disaster scenarios, I actually do feel positive about this approach. While there is no guarantee of success, I think it has the best chance of getting us to a place of minimized engineering costs (compared to alternatives) within a 1-2 year timeframe.

But we have not yet committed to this approach here at my institution. We will be aiming to decide our directions in the next month or so.

Interest from other institutions could effect our decision-making, and of course collaboration would be welcome if we do embark on this plan and we’re interested in identifying potential collaborators as soon as they reveal themselves.

Initially, collaboration probably woudln’t take the form of actually committing code. As Fred Brooks argues in The Mythical Man-Month adding more developers to a project doesn’t always result in shorter time-lines, and this is more true the more architectural design work is involved. And we’ve got a bunch right now. Initially collaboration would probably take the form of expressing interest, reviewing documentation or progress, maybe code review, and most importantly trying out the code as it is produced. Feedback on if it does look like something that would help you with your use cases, and that you’re still interested in using. But certainly eventually code collaboration opportunities as well, especially filling out certain use cases that you have but we may not.

So let us know if you find this plan exciting?

Additionally, if anyone has any ideas about grant opportunities, I guess it goes without saying that that could be useful.  Theoretically grant funding would be especially useful for relatively high-risk but high-potential-reward projects, they are the ones most likely to not be done without external support. The low-risk stuff, you’re going to do anyway!  I’m not sure granters in our sectors think that way, but be sure to let me know if you know of some who might.

I also really welcome anyone challenging or pushing back on anything in here, please feel free, in comments here, slack, email, whatever. From discussion and debate we spiral to a higher understanding.

On the present and future of samvera technical architectures

Here where I work, we have a digital collections app (live; source) based on sufia 7.4. This is not sustainable for the long-term, as the community’s development efforts have largely moved from sufia to its replacement hyrax, and the latest version of Rails sufia runs on is 5.0, which will eventually be end-of-lifed. (exact schedule unknown).

Upgrading/migrating to hyrax would be the ‘obvious’ path, but it would take some significant work; we aren’t super happy with the sufia/hyrax architecture; and this turns out to be a time of some transition in the samvera community.

In figuring out what’s going on and identifying and evaluating available options, I’ve had to do quite a bit of research.  So I wanted to share my evaluation and analysis with the community, to hopefully help others understand the lay of the land — but also to explain why we are considering some new approaches. As I’ve been doing this, I have begun to develop opinions on how to move forward here, and I’m leaning towards a novel approach not based on existing community consensuses — I’ve done my best to present information objectively and identify the parts that are my own judgements/evaluations, but I’ll be up front about my present bias/leanings so you can judge for yourself or be cautious.

Also, while there has been recent exciting work on changing and improving governance processes and structures in Samvera, this post will focus only on the software products and technical architectures in the samvera community, “the stack”.

The Challenging Near Past/Present

I think it’s important to be clear about some of the challenges of the current software stack, to understand why people are going in some different directions, and how to evaluate those directions.

The current situation is that many, probably not all, but more than a few, people and teams working with sufia/hyrax and the samvera stack have found it very challenging in a variety of ways.  Here are some I know about, many from personal experience, that you may have seen me address in past blog posts too.

Performance can be a significant problem, at several different parts of the stack. Some I have encountered:

⇒ Saving a Work can take a 10 or more seconds in our app. Perhaps only an inconvenience for a single work, but can add up to be a real problem in higher-order functions that save multiple works, bulk ingests, or test suites. (also increases the cost of logic that saves multiple times where one time could conceivably have worked, as I have encountered in the stack).

⇒ So far in our attempts to make a feature to let you change a fileset into a child work (delete fileset, create work at same order in members list, with come copied attributes over), the operation can take five minutes to complete. We are in the midst of quite a bit of developer work to try to figure out what’s going on and if we can improve it. This feature is taking several weeks to develop because of underlying stack complexity.

⇒ Our app with stock sufia code had several “n+1 query” problems, where on display a separate Solr query was being done for each item displayed (on results page, or child items on a work detail page), making response time unacceptably slow. When using ActiveRecord this has well-understood and easy fixes, but with this stack it took some somewhat complex local hacking to fix.

⇒ Re-indexing (to solr) our corpus consisting of ~6400 GenericWorks and ~18500 FileSets can take from 3 hours to 9+ hours, depending on nature of indexing, and even after some extensive optimization work. Comparing the 1.25/second best case to industry standards, it doesn’t look good.  For instance, indexing MARC to Solr using traject, people routinely get from a couple hundred to 1000+ records/s.

Trying to customize or add features to a sufia/hyrax app can be quite complicated, some find they are spending as much or more time trying to figure out how to get it to integrate with shared stack code (without creating large forwards-compat problems on upgrades) as they spend on the actual ‘business logic’.

⇒ This isn’t really about adding for more features to be built-in/configurable to Sufia/Hyrax. No matter how much is, our use cases vary enough that people will always want to be changing things in local ways or adding custom local features, and sufia/hyrax and the rest of the stack has always meant to support this.

Some organizations have tried but had problems attracting or retaining Rails developers (with Rails experience but without library/samvera experience).  These developers can find the samvera stack unnecessarily complex considering the problems it solves.

The cost of keeping your app up to date with new versions of stack dependencies can be great enough that many institutions wind up staying on old versions of shared dependencies.  My attempts at analyzing this appear to show a pretty big spread among sufia/hyrax and other dependency versions in repos “in the wild”.  (Here where I am, we are on sufia 7.4 — after valiantly trying to stay up to date, we decided we had to stick there to meet our launch deadlines).

ActiveFedora was intended to be a kind of port of ActiveRecord, with close to api-compatible modelling/persistence layer (not including querying).  But ActiveRecord is an incredibly complicated stack with literally years of developer time put into it, and is constantly evolving itself. What we’ve ended up with in AF has been found by many to be unreliable, with unpredictable performance characteristics and general behavior, very difficult to figure out how to use ‘correctly’, with very complex architecture hard to debug.

Parts of the stack, especially in sufia/hyrax, often seem not as mature as expected; there are bugs in features one thought were long-standing in the app; there isn’t necessarily clear and accurate shared understanding about what things are present in the code already, and what things need more work, or are likely to have lots of edge case bugs. This may be because of the several times there have been major refactorings to the sufia/hyrax codebase (fedora 3 to 4; an institutional repo focused app to more general; sufia to hyrax; etc). (It should be noted that the documentation working group is, working on building out better recorded shared understanding of features).

When thinking about this, I often go back to Richard Schneeman’s post on “polish” in software:

I’ve previously called these types of moments papercuts. They’re not life threatening and may not even be mission critical but they are much more painful than they should be. Often these issues force you to stop what you’re doing and either investigate the root cause of the rogue behavior or at bare minimum abandon your thought process and try something new.

When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent.

My experience  building an app to meet local needs using the samvera stack has often been at the other end of this continuum — near constant “papercuts”, sharp edges, frustrations, and “yak-shaving” investigations of the root causes of some unexpected behavior. My experience is that the software often does not do what I expect, or behave consistently.

I think sometimes when I discuss these issues, non-engineers think I’m just talking about programmers’ personal experience/emotions, that the code isn’t “fun” to work with. Now, I do think the affective result on your programmers’ day-to-day matters, how your programmers feel — burn-out is a real thing — but certainly no more than the pleasantness and efficacy of day-to-day work for all other non-programmer staff too; and we don’t expect it all to be “fun”, that’s why it’s a job.

But the reason this matters to your organization isn’t primarily because of how it makes programmers feel. It’s because all of the foregoing significantly increases the cost of launching and maintaining your software. Organizations find it takes much longer, or many more engineers, than expected to get to first launch. Adding what even the engineers might have expected would be a fairly simple feature can take order(s) of magnitude more time than expected. Bugs can appear which are enormously time-consuming to track down and fix, if they can feasibly be fixed at all by your engineers. In addition to cost/schedule, this can also simply affect your chances and levels of successfully meeting your business needs, both in initial launch and ongoing maintenance and development.

And when making technical choices, that’s what matters to an organization above all else — meeting business needs as efficiently and cost-effectively as possible (including staff-time and number of staff; staff is the biggest costs for most of us).  And to many, it wasn’t clear that current directions were getting them there.  Building and maintaining a samvera-stack based app that met local business needs well has seemed to some very expensive.

These are not observations unique to me, there has been a growing recognition of these challenges in the samvera development community. It has led to new samvera processes working to improve the situation gradually and steadily (for instance, the “Component Maintenance Working Group”, the Hyrax maintenance working group and the “Road Map Interest Group”); but has also led others to think it’s time to explore new architectural approaches and more drastic software changes.

Valkyrie: A new approach

Princeton University Libraries had an app called plum supporting their digital collections. It was:

  • A hydra app based on curation_concerns and some fairly old hydra dependency versions (not sufia/hyrax).
  • Staff-only editing/workflow. No self-deposit.
  • Used for metadata/asset management (with fedora 4 back-end), had no public interface of it’s own — (meta)data was vended to other public-facing app(s).

As outlined in two blog posts on a PUL Systems blog, they ran into some pretty severe performance problems. They spent significant development effort  improving performance, both locally and in PR’s back to hyrax and the stack.

In a presentation at Samvera Virtual Connect 2018, Esmé Cowles (presentation begins at 40:00 in video) said Princeton’s eventual direction (valkyrie) was motivated “not just becuase of performance problems, but because while we were working on those problems, we were constantly butting up against the complexity of the stack… That complexity was impeding us doing what we wanted to do to work on performance.”

While frustration with performance or legibility of the inherited architecture was not new to either Princeton or others, Princeton reached a point where they decided they had basically no choice but to take a departure from the inherited architecture, if they wanted to achieve their business goals, that the “inherited” stack was simply not tenable for their needs. Additionally, as the performance problems were centered on Fedora (as well as the ActiveFedora architecture), they decided the best path  was to move away from Fedora as the persistent store, and towards using the postgres rdbms.

We could imagine responding to that by writing either a bespoke local app or a shared toolkit/framework simply based on postgres. But Princeton really prioritized not separating from the samvera community, and based on that, decided instead to build a persistence abstraction that would allow the developer to switch between multiple back-ends (mainly targeting fedora or postgres, both likely in concert with solr), using the same class/method-level APIs for both.

That is what valkyrie is. It is just a modeling/persistence layer.  As far as what it does, valkyrie could be roughly compared to ActiveFedora or ActiveRecord.  It is not a “solution bundle”. It pretty much only addresses API(s) for modelling metadata and saving those models, whether to fedora, to postgres, or to other hypothetical future back-ends.  The rest of the business logic in a digital collections or institutional repository application would come from somewhere other than valkyrie, whether shared gems or local code.

Princeton proposed an official hydra/samvera working group to work on valkyrie, and got significant interest from other developers active in samvera community. valkyrie became a samvera community project, and as I write this is housed in the samvera-labs grouping.

Valkyrie uses a “Repository/Data Mapper” architecture that is different in some ways from Rails’ ActiveRecord design, and seems to be inspired by Hanami’s repository/data mapper implementation.  Valkyrie also uses some of the dry-rb libraries that are also used by hanami.   Valkyrie also requires the use of the reform form object library, generally in the form of the ChangeSet reform sub-class specialization.

In building out the main modelling and persistence abstraction to meet planned use cases, other particular-to-valkyrie abstractions were required, like ChangeSets (I don’t entirely understand them, but I think someone building an app based on valkyrie is going to have to) , and others that may normally stay “below the hood” like OptimisticLockToken.

Valkryie is not fundamentally based on linked data/RDF, its models are not defined based on linked data. The valkyrie fedora metadata adapter requires a mapping from model attributes to RDF predicates so it can be serialized to fedora; other external RDF serializations would require similar.

valkyrie “bespoke” apps

Princeton is live with figgy, their plum-replacement app based on valkyrie. figgy kind of served as a ‘demonstration/proof-of-concept app’ throughout valkyrie development, and still serves that role to some extent, as I believe the only valkyrie-based app in production, and the one by the same group of developers most central to valkyrie development.

Figgy is a rewrite of plum to meet same basic usage parameters. It is not technically a git fork/branch of plum, but some business logic was ported from plum.

Figgy does not use a samvera “solution bundle” (such as hyrax). It uses only a a few existing samvera-community dependencies as component building blocks where it makes sense (mainly hydra-editor and hydra-derivatives, see their Gemfile.lock). Existing pre-valkyrie components that can be used with a valkyrie-based app will generally be de-coupled enough that they can also be easily swapped out if the need ever arises. (Personally, my experience with hydra-derivatives for my own local needs would not lead me to follow their lead in using hydra-derivatives! But perhaps porting hydra-derivatives using code from plum to figgy made sense as a starting point).

Figgy then has a lot of local/bespoke architecture custom-fitted for it’s business needs, sometimes based on existing general Rails dependencies. One major example is custom local workflow logic/architecture based on the popular aasm (“acts as state machine”) gem.  It also vends changes to the other apps serving as front-ends using an RabbitMQ based eventing system, also more-or-less locally designed.

The other known valkyrie app in development is Penn State Library’s cho. I know less about cho, but my understanding is that it is not yet in production, and takes some very original/innovative architectural approaches — it is largely based on ingesting and editing via CSVs (rather than interactive web-based GUIs), including being able to dynamically define metadata schemas based on CSV.  Cho seems to use few existing samvera components on top of valkyrie, perhaps even fewer than figgy; mainly hydra-characterization.

Where is valkryie at now

Valkyrie has been under development for around 2 years, and has probably hundreds of developer-hours of work. As I write this a 1.2.0 version has an imminent release.  While valkyrie is already being used in production by princeton’s figgy, features that some might expect, need, or want for generalized use are still being developed on an ongoing basis. The 1.2.0 release (as I write this still in pre-release) adds some significant features, including: The ability to store single-values (rather than arrays of values) in properties; Optimistic locking; and Guaranteed persistently-ordered values (the first value in a list stays the first value in the list).

To some extent, as is common for open source, features are added to valkyrie when apps using valkyrie need them and the developers of those apps spend the time to add them to valkyrie.  While the valkyrie team is thinking to the future and trying to generalize for others, right now it’s primarily the needs of figgy and cho driving prioritization.  For instance, an Issue to suggest providing a generalized solution in valkyrie to “n+1 query” problems (a problem pretty central to my experience and concerns, as discussed above, but maybe not initially figgy or cho) was recently created, after it came up in figgy development.

If you need something that is conceptually part of modelling/persistence layer but isn’t really built into valkyrie, you often still have an option to add it, which often involves going “under the hood” and adding custom logic to the valkyrie adapters or custom queries.  So you may have to reckon with architectural components/complexity  ‘under the hood’ to meet such needs; and likely also means that you’d have to re-implement if you switched storage layers (from fedora to postgres or vice versa).

For instance, at present if you wanted values to be indexed to solr as a numeric type instead of string/text (so it could be sorted or range-facetted in solr), Trey Pendragon told me “you’d need to add a custom indexer to the solr adapter.” One should probably be cautious of assuming what features or use-case-supports are or aren’t already built out in valkyrie (like any other relatively complex dependency still reaching for maturity).

You can watch what things are being considered and prioritized for future valkyrie development in the open valkyrie waffle board.

Milestones in valkyrie and figgy history

Some personal analysis and evaluation — Valkyrie

Princeton and others investing in Valkyrie began from the requirement of being able to provide a stable consistent API on top of underlying data that could be stored either in Fedora or Postgres.

If you start from that requirement, the Valkyrie architecture is exactly where you are reasonably going to end up, this is an appropriate way of approaching that requirement (which typical Rails apps are not capable of fulfilling).

However (in my own opinion/experience/evaluation, like everything in this section), there is a significant cost to building the abstractions to make that possible. Every abstraction has a cost: in implementation, ongoing maintenance, and cognitive burden and ongoing work of developers using the abstraction and integrating it with others.  Building successful (efficient, polished, good TCO) abstractions as shared code between apps with diverse needs can be especially challenging.

Valkyrie is a fairly significant abstraction.  Its development necessarily involves significant work to figure out the right APIs and create implementations for features that, if you were simply assuming an rdbms (or even postgres specifically) and using ActiveRecord might just already be there. In addition to the basic mechanics of persistence, also: ordered values; optimistic locking; associations, joins and eager-loading to handle n+1 queries.  Or Rails recommended “Russian-Doll Caching” with automatic touching of parents.  In ActiveRecord, not just already there, but very mature with well-understood community knowledge about strengths, weaknesses, work-arounds, and best-practice usage patterns. Whereas all of these things, if they are to be used, need to be designed and implemented (and iterated to maturity and polish) in valkyrie — and with the increased challenge of making them work well for each of the persistence back-ends valkyrie intends to support.

Whether these costs are worth it depends on how important or beneficial the foundational requirement is, as well as how well the abstractions meet developer use cases. In part, this is hard to be absolutely sure about in advance — both the ultimate benefits and the ultimate costs can to some extent only be estimated/predicted and not known with certainty in advance of the actual development and community use.

Will valkyrie actually result in shared codebases/dependencies between postgres-using and fedora-using applications in our community?  At this point, there are not many examples already existing, it’s more a design goal for the future. I think it’s hard to know to what extent this prediction of the future will pan out.

How one evaluates the value proposition of the valkyrie effort also depends on the value one places on the base goal/requirement of supporting both fedora and postgres as persistence back-ends. It may be worth asking in what circumstances does fedora actually make sense, and how widespread are these circumstances?  I believe few (none?) of the current developers/institutions investing in Valkyrie are actually planning on using fedora, or missing it.   The requirement to support the possibility of back-end agnosticism may be less about the technical needs of anyone investing in valkyrie, and more about the social/political situation in our community, which has always used fedora before, and where simply moving to a non-fedora solution seemed originally too big a jump to be comprehensible as staying within the community.

⇒ (While there was some initial hope that the performance problems could be significantly improved while still using fedora by using valkyrie with, say, a non-active-fedora-based means of accessing fedora — so far only relatively minor improvements have been seen via this route, not enough to resolve the performance issues that led to valkyrie. It’s possible future implementations of the fedora APIs, whether from the fcrepo implementation or other, will do differently; predicting the future is always a gamble).

The valkyrie enthusiasts have been wisely careful not to express any judgement over the commitments of other institutions to fedora (we each have different business needs) — however, many of us beyond valkyrie have been increasingly questioning what value fedora brings us at what costs for some time, and I think it’s worth considering in exactly what conditions using fedora actually makes sense, and how common these conditions are.

If the eventual result is that most/all codebases using Valkyrie are using postgres rather than fedora — and I think that’s a real possibility — that is a significant cost we paid in development to make other things possible, and a significant ongoing cost we’ll continue to bear in maintaining and developing against the abstractions that were designed for that. (Or in a subsequent costly switch to not using them).

Esmé suggests that another benefit of valkyrie can be in hiring/retaining/onboarding developers, especially Rails developers from outside our development community, and that “following the patterns those developers know makes it easier to hire Rails developers and have them be productive and happy, (instead of being frustrated by ActiveFedora and Fedora more broadly).”

I share that concern and goal, but it is not totally clear to me how much valkyrie achieves  there — choosing to write to the Valkyrie API instead of ActiveRecord arguably takes us substantially outside of patterns that Rails developers know. While it seems safe to believe it will result in some level of improvement over previous stack,  when I look at figgy code I am not that confident in predicting to what extent a figgy-style app will be legible to the typical Rails developer, or escape what we could call the “weird custom community architecture” problem.

For myself, it’s not clear that the costs of developing (and developing against) the valkyrie abstraction will bear benefits justifying it. Might there be other ways to meet our individual as well as shared business/domain needs more efficiently in total-cost-of-development-and-ownership?  Might there be other ways for different teams to try different things while staying part of a community of practice?  Is the valkyrie approach actually necessary or sufficient for allowing people using different back-ends (or other architectures) to share other domain logic?

It is hard to answer these questions with surety, they rely on estimations and predictions of future events and comparing hypothetical counter-factuals. But based on an experience of challenges from complex and newer/less-mature architectures, I’m interested in trying to be ruthless about minimizing the number and complexity of abstractions/architectures, trying to find the simplest architecture possible to optimize our efficiency/productivity. “as simple as possible, but no simpler.” A significant abstraction layer to make possible both fedora and postgres does not excite me, when that’s not a requirement I think important for our local business needs.

However, one thing that is worth commenting is that I am actually totally happy with the valkyrites demonstrating the viability and sense of writing a “bespoke” app (which can still be based, where possible, on shared components), instead of trying to use a pre-built application/”solution bundle” that you customize according to it’s customization points.  Providing the latter in a high-quality way, mature, efficiency-increasing way is hard — especially when the developer community has diverse needs — and I personally suspect that a much wider swath of business cases will be better-served by the ‘component’ approach than has often been recognized in our community.

I suspect that using the hydra/samvera stack has almost always required local programming expertise, it has never (for most institutions) provided a “shrinkwrap” install-and-go experience. I appreciate the “bespoke” valkyrie apps openly trying to demonstrate that at least in some cases an explicit and acknowledged component-based put-it-together-yourself approach may be more productive (as Esmé in particular has pointed out).

The two current real-world valkyrie demonstration apps actually differ from what I see as the recent  “consensus path” in two significant ways:  valkyrie persistence layer and in explicitly embracing an “assemble-components” approach instead of a “customize-pre-built-solution” approach.

A Hyrax based on Valkyrie?

Okay, so we talked about valkyrie, and “bespoke” apps using valkyrie — what about the idea of refactoring/enhancing Hyrax to be based on valkyrie?

It is my impression that those initiating valkyrie, via a samvera working group, hoped this would the ultimate outcome, believing it was important for keeping those with “bespoke” valkyrie-based apps connected to and participating in the wider community — as well as a contribution the valkyrie effort could make to institutions wanting to stay on hyrax but with persistence layer flexibility.

As the valkyrie working group work wrapped up, even before the “final report” was released actually, there seemed to be general community consensus on this, and I believe a community decision was made to commit to this, although I’m not certain.

Work to switch hyrax over to valkyrie was begun, and significant development-hours were put into it. At one point it was believed there would be a hyrax version 3.0 based on valkyrie released around May 2018.

However, that phase of effort didn’t reach the goal-line (a release of valkyrie based on hyrax, or even a merge into master) before work mostly halted. I believe the valkyrie branch in the hyrax repo has the product of that work — last commit there is from March 6, 2018. I think it’s very hard to estimate how much work was remaining on that branch to get to a release (most of us have experienced the phenomenon where the “last 5%” can become more like half of total development effort).   Some of the developers who were primarily involved in that work seem, at least for the moment, no longer spending as much development time on hyrax generally; and as other hyrax development has continued, that branch would need to be reconciled with current master.

Since that work, Tom Johnson (@no-reply) has taken over as formal “technical lead” of hyrax, meaning technical architect in this case.

I asked on slack about the current thinking on future Hyrax and valkryie. Tom provided some info on his current plans and thinking in messages in the #hyrax channel of the samvera slack, dated August 13 2018 12:22PM and 12:34PM (eastern). (Alas, we may have no slack archives).

– Moving away from `ActiveFedora` and toward a backend-agnostic persistence technology is viewed as on the critical path for Hyrax’s success

– The community’s capacity to maintain `ActiveFedora` is quickly diminishing, in part because the software is challenging to maintain and in part because the development personnel best equipped to maintain it have shifted to other projects (including but not limited to Valkyrie)

– Valkyrie is the presumptive replacement; this is the case largely because of key community members succeeding at delivering (and generally being happy developing) applications based on it.

– We are committed to making this transition without it looking like a stop-the-world-and-rewrite-the application affair for existing adopters.

That is (this interpretation/re-wording also based on further discussion in slack channel and PMs), some kind of work to make Hyrax have a backend-agnostic persistence layer is in the plans, and it is presumed this will involve Valkyrie.

But it will likely not involve simply refactoring Hyrax to use valkyrie instead of ActiveFedora, which was that original valkryie branch approach. Tom is committed to making future Hyrax releases less disruptive existing adopters, and that original approach would be the kind of “stop the world” rewrite involving significant backwards-incompatibilities that has been disruptive in the past.  It probably will involve re-using/porting/copy-pasting code (as well as ideas) in that existing  valkyrie branch, but probably will not be based on that branch in the repo.

Instead, there will probably (these are Tom’s current thoughts not official plans) be a first step to create an architecture within Hyrax that “that is open to Valkyrie, but ships using active fedora by default”.  Then a period of “getting an advanced guard trying to build apps based on this [which] can and should provide a lot of useful information about how platform support needs to work.”  Then later, “a transition to valkyrie-by-default and removing AF would then be based on what we learn and demand[s] from adopters.”

Tom plans to share some of these road-map-recommendations on this at Samvera Connect in October, at which point some of this will presumably start becoming somewhat more formalized and official as plans.

I think it’s very hard to predict calendar timelines for all this. If you were waiting for the end-point, a hyrax version that just uses valkyrie (and allows postgres as a backend thusly) out-of-the-box, supported/documented/tested… I personally would predict it could be 1-2 years, but others may have much more optimistic estimates; one answer is just that it’s very difficult to predict at this point, depending on so much including what developers/institutions are interested in contributing to what extent.  We can be sure it won’t be May 2018.  :)

Note well the current Valkyrie fedora adapter does not store things in fedora in a way compatible with current hyrax/sufia modelling. Either a new adapter (with some challenges) needs to be created, or there would have to be data migration.

Some personal analysis and evaluation

I totally understand and support @no-reply’s priority to keep Hyrax stable, with changes being iterative and backwards-compatible, no “stop the world” changes — this is one of the biggest challenges facing Hyrax users I think, and it makes sense to prioritize it.

And I understand how that leads to his current thinking on how to approach valkyrie — by providing the architecture where valkyrie can be optionally switched in as a simultaneous alternative to what’s already there, which for at least a time remains there.

But this leads to a kind of ironic/counter-intuitive outcome.  Valkryie is already an abstraction layer for swappable persistence back-ends.  For reasons that are indeed sensible in overall hyrax context, we’ve arrived at a proposal to add more architecture (more abstraction) on top, to valkryie itself swappable in or out (at the point you swap it in, you can then use it to swap actual back-ends). An persistence abstraction API to let us use another persistence abstraction API beneath it.

Abstraction has costs, including to legibility of the codebase.  One might wonder if you’re going to put in this extra hyrax-specific persistence-swappability architecture anyway, does it even make sense to swap to valkyrie as the happy path supported option, or should you swap directly to postgres instead and skip valkyrie?  But there might be various reasons it really does make sense — but, it’s got a cost.

So in evaluating hyrax-on-valkyrie, I think we start out with all the pros and cons outlined in the valkyrie analysis section above.

On top of that we have pro’s and con’s of hyrax itself. How you’ll evaluate those depends on your experience with or otherwise evaluation of hyrax generally. There will be significant advantages for people who have found that hyrax has features they need, and using them via hyrax (including any customization) has worked out well and seemed like an efficient path compared to alternatives — and who want to switch to a postgres-based persistence store.

I have not had a great experience with sufia. I’ve found it very challenging to deal with the existing architecture when implementing the customizations and features we need. When I’ve tried to look at what has changed in hyrax I don’t expect significant improvements for my business cases. On the other hand, there has been code added  which increase architectural complexity for me without giving me features relevant to my needs (adminsets, nested collections).   Of course hyrax will continue to improve — especially under Tom’s excellent technical leadership, which I have a lot of faith in.  But the community’s historic value on new features over architectural rehabilitation comes from structural pressures that will have be resisted or changed. And even within the realm of architecture rehab, an investment in hyrax-on-valkyrie — while it might be a totally reasonable and appropriate priority — is development hours not spent on improving the other parts of hyrax architecture that have gotten in my way and lowered our efficiency (raised TCO) of running sufia, and which may have to temporarily increase architectural complexity/number of abstractions.

I am concerned that hyrax may have painted itself into a corner where it could be quite a while until the problems with fundamental architectural aspects of hyrax that I have run into become better; a while until the app’s architecture becomes more legible with the minimal amount of abstraction/architecture needed for its goals, instead of more complex with more layers of abstraction as a bridge to get there.  Doing this in a way that minimizes upgrade pain will make it take even longer/more effort, but not doing that is not desirable/feasible either, I believe Tom is making the right decision to prioritize minimizing upgrade/backwards-incompat pain in hyrax.

But my experiences with sufia have not been positive enough to excite me about trying to upgrade my app to present hyrax, or about a hyrax based on valkyrie or postgres but otherwise largely similar backwards/compat with current hyrax release. If you take out the persistence parts that are proposed to change, and the business logic components where I have had a lot of trouble using them to meet my local needs — I’m not sure how much of hyrax is left. From my experience, I am not enthused about investing lots more in hyrax (whether that’s contributing to the shared codebase, or work on upgrading-or-rewriting our app from sufia 7.4 to a recent hyrax version and continuing to maintain it). I’d be more excited about trying to find a more efficient way to invest development time that could ultimately, get us to a happy place quicker — both in terms of our local app, and shareable components.

What if there’s another way? (my “fantasy plan”)

Let’s say valkyrie (and apps and architecture built from it) starts from the basic non-negotiable requirement: Allow code using fedora or postgres as a persistence back-end to use the same persistence APIs; and then adds on some subsidiary goals, including sticking closer to common Rails patterns where possible.

What if instead we started from the basic requirement: Stick as close to standard Rails patterns as possible, with as few and as simple additional abstractions as we can; as simple as we can while still not requiring re-invention of the wheel in digital collections use cases?

How would we do this, what would it look like? Like valkyrie, we’ll start from modelling/persistence.

We could consider really just putting all our metadata in a standard normalized database schema. But that’s going to result in some really complex and challenging to work with rdbms schemas, for the kinds of metadata schemas we use; for instance, with frequent repeatable fields, and apps that need to handle multiple “types” of objects in the same app.

Let’s rule that out.  What’s a next step up in complexity, still straying as little from standard Rails as possible, with as few and as simple new abstractions as possible? Is there a way where we still use ActiveRecord, but we aren’t required to create normalized rdbms schemas for our complex/various/evolving metadata schemas?

Recently some rdbms have developed features to allow “schemaless” use by storing json in a column. Really, you could always have used rdbms this way by serializing complex structured data to text columns, but the additional features, especially in postgres, make this even more attractive.  (Although keep in mind that our legacy fedora-based architecture basically stores complex data as a blob without indexing or querying capabilities either; although this is part of what makes it challenging to work with).

While ActiveRecord has basic support for storing arbitrary json-able hashes in MySQL or postgres json columns, the individual data elements aren’t really “first-class” objects and are missing many standard AR modelling/persistence features.

I developed the attr_json gem as an experiment to see if I could add more seamless support for many of the standard AR model features, for individual attributes serialized to json(b), sticking as close to how AR normally works as possible. attr_json supports typing, complex/nested objects, standard Rails-style form support, dirty-tracking, and some limited postgres-jsonb query support. This allows you to use many standard Rails patterns and approaches with individual attributes serialized to json in the rdbms.

attr_json has received some limited attention from the wider rails community. A handful or rails developers have communicated with me in github issues, one or two are already using it in production, and it has 32 ‘stars‘ and 5 watchers on github, almost all apparently from developers not from the LAM/samvera community.  While I’d like even more attention and collaboration, this is still encouraging, and all reviews so far have been very positive.

What if we were to try to build out a developer’s toolkit for creating digital collections/repository applications, beginning from attr_json (in ActiveRecord and probably requiring postgres) as the central modelling/persistence layer, but not stopping there, trying to launch with an expanded toolkit addressing other app and business needs too?

This is what I’ve been calling my “fantasy plan”.  I think it could provide a way to build digital collections/repo apps with a better developer experience and overall lower TCO  (both in building out the toolkit and building apps based on it) then other options. Of course, success isn’t guaranteed and it also has risks. This is not a plan I or my institution are currently committed to at this point, but we are considering it.

In addition to modelling/persistence, the next most core area of functionality in our domain, I’d suggest, is handling bytestreams/digital assets. Both originals  and derivatives. My fantasy plan developer’s toolkit would be based on shrine for this area — possibly with custom shrine plugins.  shrine’s goal itself is as a toolkit for file/attachment handling, giving you components and primitives you can assemble into exactly what you need, leads me to judge it well-suited for use when flexibility around how to handle bytestreams/assets (including storage platforms) is so core to our domain requirements.

I have more ideas about building out this “developer’s toolkit”, and analysis of the potential benefits and risks of this approach, but I’ll save them for a follow-up post focusing on this possible plan specifically. 

But is this Samvera? The spreading out of the community

I think we are at a moment where, like it or not, different institutions are trying different things.

Even just within the new “based on valkyrie” approach (which people are valiantly trying to make a community consensus), we have both “bespoke” apps and the potential future possibility of “solution bundles”.

There is experimentation and divergent planning going on apart from this too.

Christina Harlow of Stanford recently presented at ELAG in Prague on Stanford’s current planning to re-architect their digital collection/repository system(s) in a project called TACO. (slides; video; See 8:35 in video for some brief evaluation of hyrax for their needs).  If I understand the current plans (and I may not!) they are planning an architecture which is substantially written in Go (not rails or even ruby); which does not involve Fedora; which is not based on RDF/linked data at the basic persistence level; and I think which is not planned to involve samvera shared code, although it may involve Blacklight.   Stanford clearly has a much more complex environment  than many of us, requiring a fairly complex architecture to keep it more sane than it had become — although they were running into some of the same problems of architectural legibility and performance discussed above, just at a larger scale (“scale” more in terms of diverse and complex business requirements and collections than necessarily scale of documents or users/use). [update September 5 2018, more info/documentation on Stanford’s approach is being made available here.]

In 2016 at Hydra Connect, Steven Anderson, then of Boston Public Library, gave an 8-minute lightning talk presentation called “I love you fedora, but it’s over”, about their plans to move to a non-fedora non-samvera stick-close-to-Rails kind of architecture. (slides; video).  He mentioned experiencing some of the same problems with architectural legibility and performance with the existing stack that we’ve discussed previously, and arrived at a plan similar in some ways to my “fantasy plan” above. So there have been rumblings on this for a while now — I hadn’t seen this presentation until now, but feel a real affinity with it.  Steven left BPL shortly after this talk though, and Eben English (who is still at BPL) tells me the plans basically stalled out at that point. BPL is currently still using their previously existing app based on active-fedora 8.0 and fedora 3.8. (no sufia), and is awaiting some additional hiring before determining future plans.

In one sense, the samvera community has for years been less homogenous than our aspirations or self-images. Actual samvera-based apps in production have become very spread out amongst various versions of various samvera gems seen as consensus path at various times in samvera history: just hydra-head and active-fedora, curation_concerns, sufia, hyrax, etc., all at various recent and historical versions (and both fedora 3 and fedora 4). (Plus other branches of effort with varying levels of code-sharing and varying possible futures, like Avalon and Hyku).

There does seem to be a recent increase in heterogeneity of plans though. What does this mean for Samvera? Samvera (née hydra) has always been described as a community, not a users’ group.  (Perhaps a community of practice?).   We are a community of people working on similar problems; sharing knowledge; developing and sharing techniques; developing shared understandings of options and patterns available, and best practices; and looking for opportunities to collaborate on development and use of shared software.

To be sure, if people go in different technical/software directions, it makes this more challenging, but it doesn’t make it impossible, we don’t all need to be using the same software to be such a community (and even just all using Rails is actually significant opportunity for code-sharing).  One of the things I missed most in my year outside of library world in a more for-profit world — was the way that here in non-profit library-archive-museum-land, our peers are collaborators not competitors.  And I think most of the individuals and institutions involved in the community don’t want to lose this, and want to find ways to maintain our community ties and benefits even if our software becomes more heterogenous. I think we can do it. We are a community, not a users’ group.

In some ways, I think the increase in software diversity in our community indicates some good things going on. Some institutions are realizing that the current stack wasn’t working well for them, and going back to “first principles” of technical decision-making — in being clear about their local business needs/requirements, and analyzing the path most likely to meet those well and efficiently. And diverse investigations of approaches will give our community more information and knowledge.

Personally, I think samvera community efforts have been hampered by people/institutions making technical plans influenced in part by what they think other people want, what they think “everyone else” is doing, or occasionally even where grant money is available.  The “self-interest” in “enlightened self-interest” sometimes got the short-shrift.  (To be clear, this is just one factor among many. No matter what creating a shared codebase in this kind of domain is hard and comes with no guarantees of success).  Institutions going back to their local business needs/requirements and using local technical expertise to try diverse approaches can strengthen our community with more knowledge and experience and options, compared to an attempt at a monoculture.

And also to be clear, we couldn’t be here without what has gone before. That many found the “consensus” stack wasn’t meeting their needs does not mean the community was a failure. None of these new approaches would be possible without all that’s been learned — by individuals, institutions, and the community — about our individual and shared use cases, requirements, approaches, options, dead-ends, patterns, etc. We were literally figuring out the domain together, and we will continue to do so. Plus what we’ve all learned individually, institutionally, and as a community about software architecture and design in those years. Plus the additional software tools that have come to exist giving us radically new options (the first hydra release was prior to Rails 3.0!!)

It does mean that we’re in a time where those with the capacity to do so have to actually go back to those first principles of 1) evaluating our local business needs 2) using technical expertise to evaluate the best way to meet them (short and long term), taking into account 3) opportunities for collaboration, which can mutually benefit all collaborators.   To the extent that there are institutions that have this capacity, but where decision making on choice of software platforms is not being led by people with the technical expertise to make technical decisions, and decisions are being made on other than technical grounds…  it is not serving us, and in the best case this new situation will force the issue, and we’ll all come out stronger.  In any event, these are exciting times, and I think we have some opportunities for real advancement in the maturity of the software we use and provide.

Feedback welcome

I may have gotten some things wrong; my subjective evaluations and analyses can be disagreed with. Discussion and feedback is very welcome: As comments here, as blog responses, in Slack, wherever you like is good with me.

I am also of course interested in connecting with developers or institutions who may be interested in the “Rails-first” developer’s toolkit approach we are considering, which I’ll go into more about in a subsequent follow-up post.

Thanks for early comments from Eddie Rubeiz, Dan Sanford, Anna Headley, Trey Pendragon, and Tom Johnson. All errors and all opinions are solely my own. 

on the dangers of having too MUCH time/resources available in development

When you have a deadline for a release, and limited development resources to get there, you are forced to be ruthless about features you develop, and identifying the business value of each of them. You have to think about “Minimum Viable Product” (with “viable” being a real part of that phrase).

And you have to ruthlessly think about the “business value” (whether that’s in terms of meeting end-user needs, or your organizations needs), and make sure you are focusing on the features that provide maximum value most efficiently. That is, trying to minimize developer time and other expense while maximizing value.

(These are not usually totally quantifiable things, at least in advance, so this isn’t actually just an equation. Developer time may be quantifiable in retrospect, but is only an estimate before the fact. “Value” may be quantifiable if you are a for-profit concern who can measure value by revenue, but often again not until after the fact. Although there are ways to A/B testing and such to do minimal investment to discover if more investment is justified. I’m more used to working in non-commercial environments where value isn’t really easily quantifiable at all — but it’s still there).

You have to do this because you have an enormous “wishlist” of things various stakeholders want, and you know you can’t possibly accomplish them all and get your product out by the deadline — whether that’s a hard deadline, or just in the sense of getting the product out ever, or not waiting long past when you could have gotten it out, in pursuit of accomplishing everything you possibly could.

Well, maybe you don’t have to, but it’s generally more clear to everyone involved that you ought to, when you are worried about getting the product out the door soon enough to meet deadlines or stakeholder expectations, when there’s actually some stress involved and you aren’t totally confident this will be easy to do.  Its more clear that the only way to keep from going crazy and meet your goals is to apply a bit of ruthlessness to evaluating possible things you could do in terms of business value.

(If you are using any kind of agile, scrum, or buzzword-less process/division of labor that involves a “product owner” role, then it’s the product owner’s responsibility and authority to be making these decisions. About what work to choose to do next, in order to maximize value. In consultation with all sorts of stakeholders, as well as the rest of the development team (the “product owner” is part of the development team too), of course).

But okay, let’s say you met that deadline, your product is out, and let’s say that it was quite successful, by whatever means you measure success, quantifiable or not. (Again my experience is mostly in non-commercial environments where success is not as obviously or easily quantifiable as just “revenue”, but there’s still some kind of success, even if it’s just the judgements of internal and external stakeholders and end-users — if there wasn’t such a thing as success vs less success, why were you even doing it in the first place? :) ).

Now the pressure is off — you made it to the finish line!  You still maybe have a giant list of things some stakeholders thought of as possible things to do, or things they would like to do, or someone’s pet thing, or whatever. Or maybe you don’t, and the possible things to develop just come up popcorn-style.  Either way you’re no longer so worried about it, maybe it seems like you have all the time in the world.

But, I suggest, the continued success of your product/project still relies on bringing that ruthlessness to evaluating business value, even when it’s less obvious that there are external constraints forcing it. The continued success of your product, and you and your teams’ sanity.

If you lose your focus on that, the product starts going all discombobulated. You can end up doing things that actually hurt business value. You can have stakeholders wondering why their pet feature wasn’t implemented, but someone else’s was, and turning it into a political battle, and realizing too late that you can’t really explain/justify the choices made.  You can end up spending weeks on something you thought would only take a couple days, when really there was no reason to be doing it in the first place (or you would have quickly realized after a few days when the true scope became apparent) if you had been ruthless about business value.  You can spend a lot of time “rearranging deck chairs,” which ultimately can be bad for the morale of both the development team and other stakeholders.

I actually hate working under time-pressure and resource-stress (who doesn’t?), but in fact the time-pressure and resource-stress of trying to get across the finish line provided some useful focus that resulted in a better product and better morale. (Towards the end of writing this, I found a blog post titled “Can we truly be agile in maintenance projects” which suggests this may not be a unique thought).

But I think the solution to keeping a focus on business value without these stressors is probably the same as it was with them:  The agile practice of managing/grooming your “backlog” and “sprint planning”. 

We actually have one benefit in an “already crossed the first finish line”,  “maintenance and continued development” scenario:  There is less pressure to un-agilely plot out everything for the next 6+ months (ie, to get across that first finish line). It should actually be a bit easier to do things “agilely”.  (One of the insights of agile development, I think, is that all that really matters to the development team is what are we doing right now.  This isn’t an absolute, sometimes you gotta know what’s coming down the line, or make long-term plans. But the end (or rather the beginning) of the day, one human can really only be doing one thing at a time, and what ultimately matters is choosing what that is).

If you keep that backlog management practice even in a “maintenance” phase, then it becomes obvious that you can’t do it without maintaining a focus on business value.

Personally, I don’t necessarily think it’s important to always have your entire backlog ordered (as some scrum “manuals” will insist). What’s important is that when you plan what you’re going to work on now (ie, the next “sprint”, whether that’s a day, a week, or a month), the things you put in there have been evaluated for business value (with the product owner having ultimate responsibility and authority for that). Which requires having enough of your backlog prioritized/ordered enough that you don’t need to always be looking at the entire backlog and re-litigating it before every sprint (cause that’s going to drive you crazy and suck all your time), but what practices you need to make that happen can be contextual.

If you keep a focus on “what are we doing now/next”, and keep your agile practice of determining this with a “ruthless” focus on maximizing business value (which requires being able to articulate the business value of what you choose; which requires continual pursuit of the contextual knowledge — user-testing; internal mission and strategy; internal ‘politics’) — even in the absence of high-stress deadlines and resource crunches — I think you can increase your likelihood of continuing to produce a great product,  being able to explain/justify to stakeholders why you did X and not Y, and keeping the development team sane and their work rewarding.

BrowseEverything in Sufia, and refactoring the ingest flow

[With diagram of some Sufia ingest classes]

So, our staff that ingests files into our Sufia 7.4-based repository regularly needs to ingest dozens of 100MB+ TIFFs. For our purposes here, we’re considering uploading a bunch of “children” (in our case usually page images) of a single “work”, through the work edit page.

Trying to upload so much data through the browser ends up being very not great — even with the fancy JS immediately-upload-with-progress-bar code in Sufia. Takes an awful long time (hours; in part cause browsers’ 3-connections-per-host limit is a bottleneck compared to how much network bandwidth you could get), need to leave your browser open the whole time, and it actually locks up your browser from interacting with our site in any other tabs (see again 3-connections-per-host limit).

The solution would seem to be getting the files on some network-accessible storage, and having the app grab them right from there. browse_everything was already included in sufia, so we decided to try that. (Certainly another solution would be having workflow put the files on some network-accessible storage to begin with, but there were Reasons).

After a bunch of back-and-forth’s, for local reasons we decided to use AWS S3. And a little windows doohickey that gives windows users a “folder” they can drop things into, that will be automatically uploaded to S3. They’ve got to wait until the upload is complete before the things are available in the repo UI. (But it goes way faster than upload through browser, doesn’t lock up your browser, you don’t even need to leave your browser open, or your computer on at all, as the windows script is actually running on a local network server).  When they do ask the sufia app to ingest, the sufia app (running on EC2) can get the files from S3 surprisingly quickly — in-region AWS network is pretty darn fast.

Browse_everything doesn’t actually work in stock Sufia 7.4

The first barrier is, it turns out browse_everything doesn’t actually work in Sufia 7.4, the feature was broken.

(Normally when I do these things, I try to see what’s been fixed/changed in hyrax: To see if we can backport hyrax fixes;  to get a sense of what ‘extra’ work we’re doing by still being in sufia; and to report to you all. But in this case, I ended up just getting overwhelmed and couldn’t keep track. I believe browse_everything “works” in Hyrax, but may still have problems/bugs, not sure, read on.)

ScholarSphere had already made browse-everything work with their sufia 7.x, by patching various parts of sufia, as I found out from asking in Slack and getting helpful help from PSU folks, so that could serve as a model.  The trick was _finding_ the patches in the scholarsphere source code, but it was super helpful to not have to re-invent the wheel when I did. Sometimes after finding a problem in my app, I’d have a better sense of which files to look at in ScholarSphere for relevant patches.

Browse-everything S3 Plugin

Aside from b-e integration on the sufia side, the S3 plugin for browse-everything also had some problems.  The name of the file(s) you choose in the b-e selector didn’t show up in the sufia edit screen after you selected it, because the S3 b-e adapter wasn’t sending it. I know some people have told me they’re using b-e with S3 in hyrax (the successor to sufia) — I’m not sure how this is working. But what I did is just copy-and-paste the S3 adapter to write a custom local one, and tell b-e to use that.

The custom local one includes a fix for the file name thing (PR’d to browse-everything), and also makes the generated S3 public links have a configurable expires_in (PR’d to browse-everything) — which I think you really want for S3 use with b-e, to keep them from timing out before the bg jobs get to them.

Both of those PR’s have been merged to b-e, but not included in a release yet. It’s been a while since a b-e release (As I write this latest b-e is 0.15.1 in Dec 2017; also can we talk about why 0.15.1 isn’t just re-released as 1.0 since it’s being used in prod all over the place?).  Another fix in b-e which isn’t in prod yet, is a fix for directories with periods in them, which I didn’t notice until after we had gone live with our implementation, and then back-ported in as a separate PR.

Instead of back-porting this stuff in as patches, one could consider using b-e off github ‘master’. I really really don’t like having dependencies to particular un-released git trees in production. But the other blocker for that course of action is that browse-everything master currently has what I consider a major UX regression.  So back-port patches it is, as I get increasingly despondent about how hard it’s gonna be to ever upgrade-migrate our sufia 7.4 app to (some version of) hyrax.

The ole temp file problem

Another problem is that the sufia ImportUrlJob creates some files as ruby Tempfiles, which means the file on disk can/will be deleted by Tempfile code whenever it’s reference gets garbage collected. But those files were expected to stay around for other code, potentially background jobs, to have to process.  But bg jobs are in entirely different ruby processes, they aren’t keeping a reference to the TempFile keeping it from being deleted.

In some cases the other things expecting the file are able to re-download it from fedora if it’s not there (via the WorkingDirectory class), which is a performance issue maybe, but works. But in other cases, they just 500.

I’m not sure why that wasn’t a problem all along for us, maybe the S3 ingest changed timing to make it so? It’s also possible it still wasn’t a problem, I just mistakenly thought it was causing the problems I was having, but I noticed the problem code-reading trying to figure out the mysterious problems we were having, so I went ahead and fixed it it into our custom ImportUrlJob.

Interestingly, while the exact problem I had already been fixed in Hyrax —  a subsequent code-change in Hyrax re-introduced a similar TempFile problem in another way, then fixed again by mbklein. That fix is only in Hyrax 2.1.0.

But then the whole Sufia/Hyrax ingest architecture…

At some point I had browse-everything basically working, but… if you tried to ingest say 100 files via S3, you would have to wait a long time for your browser to get a response back. In some cases timing out.

Why? Because while a bunch of things related to ingest are done in background jobs, the code in sufia tried to create all the FileSet objects and attach them to the Work in  Sufia::CreateWithRemoteFilesActor, which ends up called in the foreground, during the request-response loop.  (I believe this is the same in Hyrax, not positive). (This is not how “local”/”uploaded” files are handled).

And this is a very slow thing to do in Sufia. Whether that’s becuase of Fedora, ActiveFedora, the usage patterns of ActiveFedora in sufia/hyrax… I think it’s combo of all of them. The code paths being used sometimes do slow things things once-per-new file that really could be done just once for the work. But even fixing that, it still ain’t really speedy.

At this point (or maybe after a day or two of unsuccessfully hacking things, I forget), I took a step back, and spent a day or two getting a handle on the complete graph of classes involved in this ingest process, and diagraming it.


You may download XML you can import into draw.io to edit, if you’d like to repurpose for your own uses, for updating for Hyrax, local mods, whatever.  

This has changed somewhat in Hyrax, but I think many parts are still substantially the same.

A few thoughts.

If I’m counting right, we have nine classes/objects involved in: Creating some new “child” objects, attaching an uploaded file to each one (setting a bit of metadata based on original file name), and then attaching the “child” objects to a parent (copying a bit of metadata from parent). (This is before any characterization or derivatives).

This seems like a lot. If we were using ActiveRecord and some AR file attachment library (CarrierWave, or I like the looks of shrine) this might literally be less than 9 lines of code.

Understanding why it ended up this way might require some historical research. My sense is that: A) The operations being done are so slow (again, whether due to Fedora, AF, or Sufia architecture) that things had to be broken up into multiple jobs that might not have to be otherwise. B) A lot of stuff was added by people really not wanting to touch what was already there (cause they didn’t understand it, or cause it was hard to get a grasp on what backwards incompat issues might arise from touching it), so new classes were added on top to accomodate new use cases even if a “greenfield” implementation might result in a simpler object graph (and less code duplication, more DRY).

But okay, it’s what we got in Sufia. Another observation though is that the way ‘local’ files (ie “uploaded” files, via HTTP, to a dir the web app can access) and ‘remote’ files (browse-everything) are handled is not particularly parallel/consistent, the work is divided up between classes in pretty different ways for the two paths. I suspect this may be due to “B” above.

And if you get into the implementations of various classes involved, there seems to be some things being done _multiple times_ accross different classes, the same things. Which doesn’t help when the things are very slow (if they involve saving a Work).  Again I suspect (B) above.

So, okay, at this point I hubristically thought, okay, let’s just rewrite some parts of this to make more sense, at least to my view of what makes sense. (What was in Hyrax did not seem to me to be substantially different in the ways relevant here). Partially cause I felt it would be really hard to figure out and fix the remaining bugs or problems in the current code, which I found confusing, and it’s lack of parallelism between local/remote file handling meant a problem could be fixed in one of those paths and not in the other which did things very differently.

Some of my first attempts involved not having a class that created all the new “filesets” and attached them to the parent work.  If we could just have a job for each new file, that created a fileset for that file and attached it to the work, we’d be fitting into the ActiveJob architecture better — where you ideally want a bunch of fairly small and quick and ideally idempotent jobs, not one long-running job doing a lot of things.

The problem I ran into there, is that every time you add a member to a ‘Work’ in the Sufia/Fedora architecture, you actually need to save that Work, and do so by updating a single array of “all the members”.  So if a bunch of jobs are running concurrently trying to add members to the same Work at once, they’re going to step on each others toes. Sufia does have a “locking” mechanism in place (using redlock), so they shouldn’t actually overwrite each others data. But if they each have to wait in line for the lock, the concurrency benefits are significantly reduced — and it still woudln’t really be playing well with ActiveJob architecture, which does’t expect jobs to be just sitting there waiting for a lock blocking the workers.  Additionally, in dev, i was sometimes getting some of these jobs timing out trying to get the lock (which may have been due to using SQLite3 in dev, and not an issue if I was using pg, which I’ve since switched to in dev to match prod).

After a few days of confusion and banging my head against the wall here, I returned to something more like stock sufia where there is one mega-job that creates and associates all the filesets. But it does it in some different ways than stock sufia, in a couple places having to use “internal” Sufia API — with the goal of _avoiding_ doing slow/expensive things multiple times (save the work once with all new filesets added as members, instead of once for each member as stock code did), and getting the per-file jobs queued as soon as possible under the constraints.

I also somewhat reduced the number of different bg jobs. There was at least one place in stock code where a bg job existed only to decide which of two other possible bg jobs it really wanted to invoke, and then perform_later on them. I had my version of a couple jobs do a perform_now instead — I wanted to re-use the logic locked in the two ActiveJob workers being dispatched, but there was no reason to have a job that existed only for milliseconds whose purpose was only to queue up another job, it could call that existing logic synchronously instead.

I also refactored to try to make “uploaded” (local) vs “remote” file ingest much more consistently parallel — IMO it makes it easier to get the code right, with less code, and easier to wrap your head around.

Here’s a diagram of where my architecture ended up:




Did it work?

So I began thinking we had a solution to our staff UX problem that would take “a couple days” to implement, because it was “already a Sufia feature” to use browse-everything from S3.

In fact, it took me 4-5 weeks+ (doing some other parts of my job in those weeks, but with this as the main focus).  Here’s the PR to our local app.

It involved several other fixes and improvements that aren’t mentioned in this report.

We found several bugs in our implementation — or in sufia/cc — both before we actually merged and after we merged (even though we thought we had tested all the use cases extensively, there were some we hadn’t until we got to real world data — like the periods-in-directory-names b-e bug).

In general, I ran into something I’ve run into before — not only does sufia has lots of parts, but they are often implicitly tightly-coupled, assuming that other parts are doing things in a certain way, where if the other things change that certain way, it breaks the first things, with none of these assumptions documented (or probably intentional or even conscious from the code writers).

Another thing I think happens, is that sometimes there can be bugs in ActiveFedora, but the particular way the current (eg) Sufia implementation is implemented doesn’t hit them, but you change the code in certain ways that probably ought to be fine, and now they hit bugs that were actually always there, but nobody noticed since the shared implementation didn’t hit them.

Some time after we deployed the new feature, we ran into a bug that I eventually traced to an ActiveFedora bug (one I totally  don’t understand myself), which had already been fixed and available in AF 11.5.2 (thanks so much to Tom Johnson for, months ago, backporting the fix to AF 11.x, not just in 12.x).  We had been running ActiveFedora 11.1.6. After some dependency hell of getting a consistent dependency tree with AF 11.5.2, it seems to have fixed the problem without breaking anything else or requiring any other code changes (AF appears to have not actually introduced backwards incommpats between these minor version releases, which is awesome).

But what’s a mystery to me (well, along with what the heck is up with that bug, which I don’t understand at all in the AF source), is why we didn’t encounter this bug before, why were the functions working just fine with AF 11.1.6 until recently? It’s a mystery, but my wild guess is that the changes to order and timing of how things are done in my ingest refactor made us hit an AF bug that the previous stock Sufia usage had not.

I can’t hide it cause I showed you the PR, I did not write automated tests for the new ingest functionality. Which in retrospect was a mistake. Partially I’m not great at writing tests; partially because when I started it was so experimental and seemed like it could be a small intervention, but also implementation kept changing so having to keep changing tests could have been a slowdown. But also partially cause I found it overwhelming to figure out how to write tests here, it honestly gave me anxiety to think about it.  There are so many fairly tightly coupled moving parts, that all had to change, in a coordinated fashion, and many of them were ActiveJob workers.

Really there’s probably no way around that but writing some top-level integration tests, but those are so slow in sufia, and difficult to write sometimes too. (Also we have a bunch of different paths that probably all need testing; one of our bugs ended up being with when someone had chosen a ‘format’ option in the ‘batch create’ screen; something I hadn’t been thinking to test manually and wouldn’t have thought to test automated-ly either. Likewise the directory-containing-a-period bug. And the more separate paths to test, the more tests, and when you’re doing it in integration tests… your suite gets so so slow.  But we do plan to add at least some happy path integration tests, we’ve already got a unit of work written out and prioritized for soonish. Cause I don’t want this to keep breaking if we change code again, without being caught by tests.

So… did it work?  Well, our staff users can ingest from S3 now, and seems to have successfully made their workflow much more efficient, productive, and less frustrating, so I guess I’d say yes!

What does this say about still being on Sufia and upgrade paths?

As reported above, I did run into a fair number of bugs in the stack that would be have been fixed if we had been on Hyrax already.  Whenever this happens, it rationally makes me wonder “Is it an inefficient use of our developer time that we’re still on Sufia dealing with these, should we have invested developer time in upgrading to Hyrax already?”

Until roughly March 2018, that wouldn’t have really been an option, wasn’t even a question. At earlier point in the two-three-ish year implementation process (mostly before I even worked here), we had been really good at keeping our app up to date with new dependency releases. Which is why we are on Sufia 7.4 at least.

But at some point we realized that getting off that treadmill was the only way we were going to hit our externally-imposed deadlines for going live. And I think we were right there. But okay, since March, it’s more of an open book at the moment — and we know we can’t stay on Sufia 7.4.0 forever. (It doesn’t work on Rails 5.2 for one, and Rails before 5.2 will be EOL’d before too long).  So okay the question/option returns.

I did spend 4-5 weeks on implementing this in our sufia app. I loosely and roughly and wild-guessedly “estimate” that upgrading from our Sufia 7.4 app all the way to Hyrax 2.1 would take a lot longer than 4-5 weeks. (2, 3, 4 time as long?)

But of course this isn’t the only time I’ve had to fight with bugs that would have been fixed in Hyrax, it adds up.

But contrarily, quite a few of these bugs or other architecture issues corrected here are not fixed in Hyrax yet either. And a couple are fixed in Hyrax 2.1.0, but weren’t in 2.0.0, which was where Hyrax was when I started this.  And probably some new bugs too. Even if we had already been on Hyrax before I started looking at “ingest from S3”, it would not have been the “couple day” implementation I naively assumed. It would have been somewhere in between that and the 4-5 week+ implementation, not really sure where.

Then there’s the fact that even if we migrate/upgrade to Hyrax 2.1 now… there’s another big backwards-incompatible set of changes slated to come down the line for a future Hyrax version already, to be based on “valkyrie” instead.

So… I’m not really sure. And we remain not really sure what’s going to become of this Sufia 7.4 app that can’t just stay on Sufia 7.4 forever. We could do the ‘expected’ thing and upgrade to hyrax 2.1 now, and then upgrade again when/if future-valkyrie-hyrax comes out. (We could also invest time helping to finish future-valkyrie-hyrax). Or we could actually contribute code towards a future (unexpected!) Sufia release (7.5 or 8 or whatever) that works on Rails 5.2 — not totally sure how hard that would be.

Or we could basically rewrite the app (copying much of the business logic of course, easier in business logic we managed to write in ways less coupled to sufia) — either based on valkyrie-without-sufia (as some institutions have already done for new apps, I’m not sure if anyone has ported a sufia or hyrax app there yet; it would essentially be an app rewrite to do so) or…. not.  If it would be essentially an app rewrite to go to valkyrie-without-hyrax anyway (and unclear at this point how close to an app rewrite to go to a not-yet-finished future hyrax-with-valkyrie)…

We have been doing some R&D development into what an alternate digital collections/repo architecture could look like, not necessarily based on Valkyrie — my attr_json gem is part of that, although doesn’t demonstrate a commitment to actually use that gem in the future here at MPOW, we’re just exploring different things.

Deep-dive into hydra-derivatives

(Actually first wrote this in November, five months ago, getting it published now…)

In our sufia 7.4 digital repository, we wanted to add some more derivative thumbnails and download JPGs from our large TIFF originals: 3-4 sizes of JPG to download, and 3 total sizes of thumbnail for the three sizes in our customized design, with each of them having a 2x version for srcset too. But we also wanted to change some of the ways the derivatives-creation code worked in our infrastructure.

1. Derivatives creation is already in a bg ActiveJob, but we wanted to run it on a different server than the rails app server. While the built-in job was capable of this, downloading the original from fedora, in our experience,in at least some circumstances, it left behind that temporary download instead of removing it when done. Which caused problems especially if you had to do bulk derivatives creation of already uploaded items.

  • Derivative-creating bg jobs ought not to be fighting over CPU/RAM with our Rails server, and also ought to be able to be on a server separately properly sized and scaled for the amount of work to be done.

2. We wanted to store derivatives on AWS S3

  • All our stuff is deployed on AWS, storing on S3 is over the long-term cheaper than storing on an Elastic Block Storage ‘local disk’.
  • If you ever wanted to horizontally scale your rails server “local disk” storage (when delivered through a rails controller as sufia 7 does it) requires some complexity, probably a shared file system, which can be expensive and/or unreliable on AWS.
  • If we instead deliver directly from S3 to browsers, we take that load off the Rails server, which doesn’t need it. (This does make auth more challenging, we decided to punt on it for now, with the same justification and possible future directions as we discussed for DZI tiles).
  • S3 is just a storage solution that makes sense for a whole bunch of JPGs and other assets you are going to deliver over the web, it’s what it’s for.

3. Ideally, it would be great to tweak the TIFF->JPG generation parameters a bit. The JPGs should preferably be progressive JPGs, for instance, they weren’t out of stock codebase. The parameters might vary somewhat between JPGs intended as thumbnails and on-screen display, vs JPGs intended as downloads. The thumb ones should ideally use some pretty aggressive parameters to reduce size, such as removing embedded color profiles. (We ended up using vips instead of imagemagick).

4. Derivatives creation seemed pretty slow, it would be nice to speed it up a bit, if there were opportunities discovered to do so. This was especially inconvenient if you had to generate or re-generate one or more derivatives for all objects already existing in the repo. But could also be an issue even with routine operation, when ingesting many new files at once.

I started with a sort of “deep-dive” into seeing what Sufia (via hydra-derivatives) were doing already. I was looking for possible places to intervene, and also to see what it was doing, so if I ended up reimplementing any of it I could duplicate anything that seemed important.  I ultimately decided that I would need to customize or override so many parts of the existing stack, it made sense to just replace most of it locally. I’ll lead you through both those processes, and end with some (much briefer than usual) thoughts.

Deep-dive into Hydra Derivatives

We are using Sufia 7.4, and CurationConcerns 1.7.8. Some of this has changed in Hyrax, but I believe the basic architecture is largely similar. I’ll try to make a note of parts I know have changed in Hyrax. (links to hyrax code will be to master at the time I write this, links to Sufia and CC will be to the versions we are using).


We’ll start at the top with the CurationConcerns CreateDerivativesJob. (Or similar version in Hyrax).  See my previous post for an overview of how/when this job gets scheduled.  Turns out the execution of a CreateDerivativesJob is hard-coded into the CharacterizeJob, you can’t choose to have it run a different job or none at all. (Same in hyrax).

The first thing this does is acquire a file path to the original asset file, with `CurationConcerns::WorkingDirectory.find_or_retrieve(file_id, file_set.id, filepath)`. CurationConcerns::WorkingDirectory (or see in hyrax) checks to see if the file is already there in an expected place inside CurationConcerns.working_directory, and if not copies it to the working directory from a fedora fetch,  using a Hydra::PCDM::File object.

Because it’s using Hydra::PCDM::File object #content API, it fetches the entire fedora file into memory, before writing it to the CurationConcerns.working_directory.  For big files, this uses a lot of RAM temporarily, but more distressing to me is the additional latency, to first fetch the thing into RAM and then stream RAM to disk, instead of streaming right to disk. While the CurationConcerns::WorkingDirectory code seems to have been written originally to try to stream, with a copy_stream_to_working_directory method in terms of streams, the current implementation just turns a full in-memory string into a StringIO instead.  The hyrax implementation is the same. 

Back to the CreateDerivativesJob, we now have a filename to a copy of the original asset in the ‘working directory’.  I don’t see any logic here to clean up that copy, so perhaps this is the source of the ‘temporary file buildup’ my team has sometimes seen.  I’m not sure why we only sometimes see it, or if there are other parts of the stack meant to clean this up later in some cases. I’m not sure if the contract of `CurationConcerns::WorkingDirectory#find_or_retrieve` is to always return a temporary file that the caller is meant to clean up when done, if it’s always safe to assume the filename returned can be deleted by caller; or if instead future actors are meant to use it and/or clean it up.

The CreateDerivativesJob does an acquire_lock_for: I think this is probably left over from when derivatives were actually stored in fedora, now that they are not, this seems superflous (and possibly expensive, not sure). And indeed it’s gone from the hyrax version, so that’s probably true.

Later, the CreateDerivativesJob reindexes the fileset object (first doing a file_set.reload, I think that’s from fedora, not solr?), and in some cases it’s parent.   This is a potentially expensive operation — which matters especially if you’re, say, trying to reindex all derivatives. Why does it need a reindex? Well, sufia/hyrax objects in Solr index have a relative URL to thumbnails in a `thumbnail_path_ss` field (a design our app no longer uses).  But thumbnail paths in sufia/hyrax are consistently predictable from file_set_id, of the form /downloads/#{file_set_id}?file=thumbnail.  Maybe the reindex dates from before this is true? Or maybe it’s just meant to register “yes, a thumbnail is there now”, so the front-end can tell the difference between missing and absent thumb?  (I’d rather just keep that out of the index and handle thumbs not present at expected URLs with some JS. )

I tried removing the index update from my locally overridden CreateDerivativesJob, and discovered one reason it is there. In normal operation, this is the only time a parent work gets reindexed after a fileset is added to it that will be marked it’s representative fileset. And it needs to get reindexed to have the representative_id and such.  I added it to AddFileToFileSet instead, where it belongs. Phew!

So anyway,  how are the derivatives actually created?  Just by calling file_set.create_derivatives(filename). Note the actual local (working directory) method on the model object doesn’t seem quite right for this, you might want different derivatives in different contexts for the same model, but it works. Hyrax is making the same call.  Hyrax introduces a DerivativeService class not present in Sufia/CC , which I believe is meant to support easier customization.


FileSet#create_derivatives is defined in a module that gets mixed into your FileSet class. It branches on the mime type of your original, running different (hard-coded) classes from the hydra-derivatives gem depending on type.  For images, that’s:

 outputs: [{ label: :thumbnail, 
             format: 'jpg', 
             size: '200x150>', 
             url: derivative_url('thumbnail') }])

You can see it passes in the local filepath again, as well as some various options in an outputs keyword arg — including a specified url of the to-be-created derivative — as a single hash inside an array for some reason. derivative_url uses a derivative_path_factory, to get a path (on local FS?), and change it into a file: url — so this is really more of a path than a URL, it’s apparently not actually the eventual end-user-facing URL, but just instructions for where to write the file. The derivative_path_factory is a DerivativePath, which uses CurationConcerns.config.derivatives_path, to decide where to put it — it seems like there’s a baked-in assumption (passed through several layers) that  destination will  be on a local filesystem on the machine running the job.

Hyrax actually changes this somewhat — the relevant create_derivatives method seems to moved to the FileSetDerivativeService — it works largely the same, although the different code to run for each mime-type branch has been moved to separate methods, perhaps to make it easier to override. I’m not quite sure how/where FileSet#create_derivatives is defined (Hyrax CreateDerivativesJob still calls it), as the Hyrax::FileSet::Derivatives module doesn’t seem to mix it in anymore. But FileSet#create_derivatives presumably calls #create_derivatives for the FileSetDerivativeService somehow.  Since I was mainly focusing on our code using Sufia/CC, I left the train here. The Hyrax version does have a cleanup_derivatives method as a before_destroy presumably on the FileSet itself, which is about cleaning up derivatives is a fileset is deleted (did the sufia version not do that at all?) Hyrax seems to still be using the same logic from hydra_derivatives to actually do derivatives creation.

Since i was mostly interested with images, I’m going to specifically dive in only to the  Hydra::Derivatives::ImageDerivatives code.  Both Hyrax and Sufia use this. Our Sufia 7.4 app is using hydra-derivatives 3.2.1. At the time of this writing, hydra-derivatives latest release is 3.3.2, and hyrax does require 3.3.x, so a different minor version than what I’m using.

Hydra::Derivatives::ImageDerivatives and cooperators

If we look at Hydra::Derivatives::ImageDerivatives (same in master and 3.2.1) — there isn’t much there. It sets a self.processor_class to Processors::Image, inherits from Runner, and does something to set a format: png as a default argument.

The superclass Hydra::Derivatives::Runner has some business logic for being a derivative processor. It has a class-wide output_file_service defaulting to whatever is configured as Hydra::Derivatives.output_file_service.  And a class-wide source_file_service defaulting to Hydra::Derivatives.source_file_service.  It fetches the original using the the source file service. For each arg hash passed in (now we understand why that argument was an array of hashes), it just sends it to the configured processor class, along with the output_file_service:  The processor_class seems to be responsible for using the passed-in  output_file_service to actually write output.  While it also passes in the source_file_service, this seems to be ignored:  The source file itself has already been fetched and had it’s local file system path passed in directly, and I did not find anything using the passed-in source_file_service.  (this logic seems the same between 3.2.1 and current master).

In my Sufia app, Hydra::Derivatives.output_file_service is CurationConcerns::PersistDerivatives — which basically just writes it to local file system, again using a derivative_path_factory set to DerivativePath.  The derivative_path_factory PersistDerivatives probably has to match the one up in FileSet#create_derivatives — I guess if you changed the derivative_path_factory in your FileSet, or probably bad things would happen?  And Hydra::Derivatives.source_file_service is CurationConcerns::LocalFileService which does nothing but open the local file path passed in, and return a File object. Hyrax has pretty much the same PersistDerivatives and LocalFileService services, I would guess they are also the defaults, although haven’t checked.

I’d guess this architecture was designed with the intention that if you wanted to get a source file from somewhere other than local file system, you’d set a custom  source_file_service.   But even though Sufia and Hyrax do get a source file from somewhere else, they don’t customize the source_file_service, they fetch from fedora a layer up and then just pass in a local file that can be handled by the LocalFileService.

Okay, but what about actually creating derivatives?

So okay, the actual derivative generation though, recall, was handled by the processor_class dependency, hard-coded to Processors::Image.

Hydra::Derivatives::Processors::Image I think is the same in hydra-derivatives 3.2.1 and current master. It uses MiniMagick to do it’s work. It will possibly change the format of the image. And possibly set (or change?) it’s quality (which mostly only effects JPGs I think, maybe PNGs too). Then it will run a layer flatten operation the image.  And resize it.  Recall that #create_derivatives actually passed in an imagemagick-compatible argument for desired size, size: '200x150>', so create_derivatives is actually assuming that the Hydra::Derivatives::ImageDerivatives.create will be imagemagick-based, or understand imagemagick-type size specifications, there’s some coupling here.

MiniMagick actually does it’s work by shelling  out to command-line imagemagick (or optionally graphicsmagick, which is more or less API-compatible with imagemagick). A line in the MiniMagick README makes me concerned about how many times MiniMagick is writing temporary files:

MiniMagick::Image.open makes a copy of the image, and further methods modify that copy (the original stays untouched). We then resize the image, and write it to a file. The writing part is necessary because the copy is just temporary, it gets garbage collected when we lose reference to the image.

I’m not sure if that would apply to the flatten command too. Or even the format and quality directives?  If the way MiniMagick is being used, files are written/read multiple times, that would definitely be an opportunity for performance improvements, because these days touching the file system is one of the slowest things one can do. ImageMagick/GraphicsMagick/other-similar are definitely capable of doing all of these operations without interim temporary file system writes in between each, I’m not certain if Hydra::Derivatives::Processors::Image use of MiniMagick is doing so.

It’s not clear to me how to change what operations Hydra::Derivatives::Processors::Image​ does — let’s say you want to strip extra metadata for a smaller thumb as for instance Google suggests, how would you do that? I guess you’d write your own class to use as a processor_class. It could sub-class Hydra::Derivatives::Processors::Image or not (really no need for a sub-class I don’t think, what it’s doing is pretty straightforward).  How would you set your custom processor to be used?  I guess you’d have to override the line in Hydra::Derivatives::ImageDerivatives Or perhaps you should you instead provide your own class to replace Hydra::Derivatives::ImageDerivatives, and have that used instead? Which in Sufia would probably be by overriding FileSet#create_derivatives to call your custom class.   Or in Hyrax, there’s that newer Hyrax::DerivativeService stuff, perhaps you’d change your local FileSet to use a different DerivativeService, which seems at least more straightforward (alas I’m not on Hyrax). If you did this, I’m not sure if it would be recommended for you to re-use pieces of the existing architecture as components (and in what way), or just write the whole thing from scratch.

Some Brief Analysis and Decision-making

So I actually wanted to change nearly every part of the default pipeline here in our app.

Reading: I want to continue reading from fedora, being sure to stream it from fedora to local file system as a working copy.

Cleanup: I want to make sure to clean up the temporary working copy when you’re done with it, which I know in at least some cases was not being done in our out of the box code. Maybe to leave it around for future ‘actor’ steps? In our actual app, downloading from one EC2 to another on the same local AWS network is very speedy, I’d rather just be safe and clean it up even if it means it might get downloaded again.

Transformation:  I want to have different image transformation options. Stripping metadata, interlaced JPGs, setting color profiles. Maybe different parameters for images to be used as in-browser thumbs vs downloadable files. (See advice about thumb parameters from  Google’s, or vips). Maybe using a non-ImageMagick processor (we ended up with vips).

Output: I want to write to S3, because it makes sense to store assets like this there, especially but not only if you’re deploying on AWS already like we are.  Of course, you’d have to change the front-end to find the thumbs (and/or downloads) at a separate URL still, more on that later.

So, there are many parts I wanted to customize. And for nearly all of them, it was unclear to me the ‘right’/intended/best way to to customize in the current architecture. I figured, okay then, I’m just going to completely replace CreateDerivativesJob with my own implementation.

The good news is that worked out pretty fine — the only place this is coupled to the rest of sufia at all, is in sufia knowing what URLs to link to for thumbs (which I suspect many people have customized already, for instance to use an IIIF server for thumbs instead of creating them statically, as the default and my new implementation both do). So in one sense that is an architectural success!


Sandi Metz has written about the consequences of “the wrong abstraction”, sometimes paraphrased as “the wrong abstraction is worse than no abstraction.”

hydra-derivatives, and parts of sufia/hyrax that use it, have a pretty complex cooperating object graph, with many cooperating objects and several inheritance hierarchies.  Presumably this was done intending to support flexibility, customization, and maintainability, that’s why you do such things.

Ironically, adding more cooperating objects (that is, abstractions), can paradoxically inhibit flexibility, customizability, or maintainability — if you don’t get it quite right. With more code, there’s more for developers to understand, and it can be easy to get overwhelmed and not be able to figure out the right place to intervene for a change  (especially in the absence of docs). And changes and improvements to the codebase can require changes across many different accidentally-coupled objects in concert, raising the cost of improvements, especially when crossing gem boundaries too.

If the lines between objects, and the places objects interface with each other, aren’t drawn quite right to support needed use cases, you may sometimes have to customize or override or change things in multiple places now (because you have more places) to do what seems like one thing.

Some of this may be at play in hydra_derivatives and sufia/hyrax’s use of them.  And I think some of it comes from people adding additional layers of abstraction to try to compensate for problems in the existing ones, instead of changing the existing ones (Why does one do this? For backwards compat reasons? Because they don’t understand the existing ones enough to touch them? Organizational boundaries? Quicker development?)

It would be interesting to do a survey see how often hooks in hydra_derivatives that seem to have been put there for customization have actually been used, or what people are doing instead/in addition for the customization they need.

Getting architecture right (the right abstractions) is not easy, and takes more than just good intentions. It probably takes pretty good understanding of the domain and expected developer usage scenarios; careful design of object graphs and interfaces to support those scenarios; documentation of such to guide future users and developers. Maybe ideally starting some working individual examples in local ‘bespoke’ codebases that are only then abstracted/generalized to a shared codebase (which takes time).  And with all that, some luck and skill and experience too.

The number of different cooperating objects you have involved should probably be proportional to how much thinking and research you’ve done about usage scenarios to support and how the APIs will support them — when in doubt keep it simpler and less granular.

What We Did

This article previous to here, I wrote about 5 months ago. Then I sat it on it until now… for some reason the whole thing just filled me with a sort of psychic exhaustion, can’t totally explain it. So looking back to code I wrote a while ago, I can try to give you a very brief overview of our code.

Here’s the PR, which involves quite a bit of code, as well as building on top of some existing custom local architecture.

We completely override the CreateDerivativesJob#perform method, to just call our own “service” class to create derivatives (extracted into a service object instead of being inline in the job!)– if our Env variables are configured to use our new-fangled store-things-on-s3 functionality.  Otherwise we call super — but try to clean up the temporary working files that the built-in code was leaving lying around to fill up our file system.

Our derivatives-creating service is relatively straightforward.  Creating a bunch of derivatives and storing them in S3 is not something particularly challenging.

We made it harder for ourself by trying to support derivatives stored on S3 or in local file system, based on config — partially because it’s convenient to not have to use S3 in dev and test, and partially thinking about generalizing to share with the community.

Also, there needs to be a way for front-end code to get urls to derivatives of course, and really this should be tied into the derivatives creation, something hydra-derivatives appears to lack.  And in our case, we also need to add our derivatives meant to be offered as downloads to our ‘downloads’ menu, including in our custom image viewer. So there’s a lot of code related to that, including some refactoring of our custom image viewer.

One neat thing we did is (at least when using S3, as we do in production) deliver our downloads with a content-disposition header specifying a more human-friendly filename, including the first few words of the title.

Generalizing? Upstream? Future?

I knew from the start that what I had wasn’t quite good enough to generalize for upstream or other shareable dependency.  In fact, in the months since I implemented it, it hasn’t worked out great even for me, additional use cases I had didn’t fit neatly into it, my architecture has ended up overly complex and confusing.

Abstracting/generalizing to share really requires even more care and consideration to get the right architecture, compared to having something that works well enough for your app. In part, because refactoring something only used by one app is a lot less costly than with a shared dependency.

Initially, some months ago, even knowing what I had was not quite good enough to generalize, I thought I had figured out enough and thought about enough to be able to spend more time to come up with something that would be a good generalized shareable dependency.  This would only be worth spending time on if there seemed a good chance others would want to use it of course.

I even had a break-out session at Samvera Connect to discuss it, and others who showed up agreed that the current hydra-derivatives API was really not right (including at least one who was involved in writing it originally), and that a new try was due.

And then I just… lost steam to do it.  In part overwhelmed by community things; the process of doing a samvera working group, the uncertainty of knowing whether anyone would really switch from hydra-derivatives to use a new thing, of whether it could become the thing in hyrax (with hyrax valkyrie refactor already going on, how does this effect it?), etc.

And in part, I just realized…. the basic challenge here is coming up with the right API and architecture to a) allow choice of back-end storage (S3, local file system, etc), with b) URL generation, and ideally API for both streaming bytes from the storage location and downloading the whole thing, regardless of back-end storage. This is the harder part architecturally then just actually creating the derivatives. And… nothing about this is particularly unique to the domain of digital collections/repositories, isn’t there something already existing we could just use?

My current best bet is shrine.  It already handles those basic things above with a really nice very flexible decoupled architecture.  It’s a bit more confusing to use than, say, carrierwave (or the newer built-into-Rails ActiveStorage), but that’s because it’s a more flexible decoupled-components API, which is probably worth it so we can do exactly what we want with it, build it into our own frameworks. (More flexibility is always more complexity; I think ActiveStorage currently lacks the flexibility we need for our communities use cases).   Although it works great with Rails and ActiveRecord, it doesn’t even depend on Rails or ActiveRecord (the author prefers hanami I think), so quite possibly could work with ActiveFedora too.

But then the community (maybe? probably?) seems to be… at least in part… moving away from ActiveFedora too. Could you integrate shrine, to support derivatives, with valkyrie in a back-end independent way? I’m sure you could, I have no idea how the best way would be to do so, how much work it would be, the overall cost/benefit, or still if anyone would use it if you did.

So I’m not sure I’m going to be looking at shrine myself in a valkyrie context. (Although I think the very unsuitable hydra-derivatives is the only relevant shared dependency anyone is currently using with valkyrie, and presumably what hyrax 3 will still be using, and I still think it’s not really… right).

But I am going to be looking at shrine more — I’ve already started talking to the shrine author about what I see as my (and my understanding of our communities) needs for features for derivatives (which shrine currently calls “versions”), and I think I’m going to try to do some R&D on a new shrine plugin that meets my/our needs better. I’m not sure I’ll end up wanting to try to integrate it with valkyrie and/or hyrax, or with some new approaches I’ve been thinking on and doing some R&D on, which I hope to share more about in the medium-term future.

Another round of citation features in a sufia app

I reported before on our implementation of an RIS export feature in our sufia 7.4 app.

Since then, we’ve actually nearly completely changed our implementation. Why? Well, it started with us moving on to our next goal: on-page human-readable citation. This was something our user analysis had determined portions of our audience/users wanted.

Turns out that what seemed “good enough” metadata for an RIS export (meeting or exceeding user expectations; users were used to citation exports not being that great, and having to hand-edit them themselves) seemed not at all good enough when actually placed on the page as a human-readable citation (in Chicago format).

We ended up first converting our internal metadata to citeproc-json format/schema. Then using that intermediate metadata as a source for our RIS export, as well as for conversion to human-readable citation with citeproc-ruby.  The conversion/production happens at display-time, from data in our Solr index, which required us to add some data to the Solr index that wasn’t previously there.

On metadata and citations

Turns out getting the right machine-interprable metadata for a really correct citation is pretty tricky.

It occurs to me that if citations is a serious use case, you should probably consider it when designing your metadata schema in the first place, to make sure you have everything you need in machine-readable/interprable format. (As unrealistic as this suggestion sounds for many actual projects in our sector). Otherwise can find you simply don’t have what you need for a reasonable citation.

We ended up adding a few metadata fields, including a “source” field for items in our digital collection that are excerpts from works (which are not in our collection), and need the container work identified in the citation.

In other cases, an excerpt is an independent work in our repo, but also has a ‘child’ relationship to a parent, that is it’s container for purposes of citation. But in yet other cases, there’s a work with a ‘parent’ work that is for organizational/arrangement purposes only, and is not a container for purposes of citation — but our metadata leaves the software no way to know which is which. (In this case we just treat them all like containers for purposes of citation, and tolerate the occasional not-really-correct-ness, as the “incorrect” citations still unambiguously identify the thing cited).

We also implemented a bunch of heuristics to convert various “just string” fields to parsed metadata. For instance our author (or publisher) names, while from FAST and other library vocabularies, are just in our system as plain single strings. The system doesn’t even record the original authority identifier. (I think this is typical for a sufia/hyrax app, while they use the qa gem to load terms, if the gem supplies identifiers from the original vocabulary, they aren’t recorded).

So, the name `Stayner, Heinrich, -1548` needs to be displayed in some parts of the citation (first author for instance) as Stayner, Heinrich, but in other parts (second author or publisher) as Heinrich Stayner, and in no case includes the dates in the citation, so we gotta try parsing it.  Which is harder than you’d think with all the stuff that can go into an AACR2-style name heading (question marks or the word “approximately”, or sometimes the word “active”, other idiosyncracies).  And then a corporate name like an imaginary design firm Jones, Smith, Garcia is never actually Garcia Jones, Smith or something like that.

Then there’s turning our dates from a custom schema into something that fits what a citation expects.

Our heuristics get good enough — in fact, I think our automatically-generated human readable citations end up as good or better as anything else I’ve seen automatically generated on the web, including from major publishers–but they are definitely far from perfect, and have lots of errors in many edge cases. Hopefully all errors that don’t change or confuse about the thing cited, which of course is the point.

CSL, CSL-json, and ruby-citeproc

CSL, the Citation Style Language, is a system for automatically generating human-readable citations according to XML stylesheets for various citation formats/styles.

While I believe CSL originally came out of zotero, some code has been extracted (and is open source like zotero itself), and the standard itself as an independent standard. Whether via the code or the schema/standard implemented in other and various code open source and not, it has been adopted by other software packages too (like Mendeley, which is not open source).

One part of CSL is a json format (defined with a json schema) to represent an individual “work to be cited”.  This also originally came from Zotero, and doesn’t seem to totally have a universal name yet, or a ton of documentation.  The schema in the repo is called “csl-data.json,” but I’ve also seen this format referred to as just “csl-json”, as well as “citeproc-json” (with or without the hyphens).  It also has even more adoption beyond zotero — it is one of the standard formats that CrossRef (and other DOI resolvers?) can return.  The common IANA/MIME “Content-Type” is `application/vnd.citationstyles.csl+json`, but historically another (incorrect?) form has sometimes been used, `application/citeproc+json`. Some of the names/content type(s) might confuse you into thinking this is a JSON representation of a CSL style (describing a citation format/style like “Chicago” or “MLA”), but it’s not, it’s a format of metadata about a particular “work to be cited”.  I kind of like to call it “csl-data-json” (after the schema URL) to avoid confusion.

Even apart from JSON serialization, this is a useful schema in that it separates out fields one will actually need to generate a citation (including machine-readable individual sub-elements for parts of a name or date).  It’s best available documentation, in addition to the JSON schema itself, seems to be this document written for the original Javascript implementation and not entirely applicable to generic implementations.

There is, amazingly, a ruby CSL processor in the citeproc-ruby gem.  Not only can it take input in csl-json and format it as an individual citation in a desired style, but, as a standard CSL processor, it can also format a complete bibliography and footnotes in the context of a complete document (where some citation styles call for appropriate ibid use in the context of multiple citations, etc).  I was only interested in formatting an individual citation though.

Initially, I wasn’t completely sure the citeproc-ruby gem would work out for me, for performance or other reasons. But I still decided to split processing into two steps: translating our internal metadata into a csl-json compatible format, and then formatting a human readable citation. This two step process just makes sense for manageable code, trying to avoid an unholy mess of nested if-elsifs all jumbled together. And gives you clear separation if you need to generate in multiple human-readable styles, or change your mind about what style(s) to generate. The csl-json schema is great for an intermediate format even if you are going to format as human-readable by non-CSL means, as it’s been road-tested and proven as having the right elements you need to generate a citation.

However, I did end up using citeproc-ruby in the end.  @inkshuk it’s author was amazingly helpful and giving in my questions on the GH issues. Initially it looked like there were some extreme performance problems, but using alternate citeproc-ruby API to avoid re-loading/parsing XML style documents from disk every time (with one PR by me to make this work for locale XML style docs too) avoided those.

Citeproc-ruby can’t yet handle formatting of date ranges in a citation (inkshuk has started on the first steps to an implementation in response to my filed issue).  So when I have a date range in a work-to-be-cited, I just format it myself in my own ruby code, and include it in the csl-data-json as a date “literal”.

CSL is amazing, and using a CSL processor handles all sorts of weird idiosyncratic edge cases for you. (One example, if a title already includes double-quotes, but is to be double-quoted in the citation, it changes the internal double quotes to single quotes for you. There are so many of these, that you’re not going to think of initially yourself in a custom hobbled-together unholy mess of if-elsif statement implementation).

Also, while I didn’t do it, you could hypothetically customize some of the existing styles in CSL XML if you need to for local context needs. I believe citeproc-ruby even gives you a way to override parts of an existing style in ruby code.

The particular and peculiar challenges of sufia/hyrax/samvera

There are two main, er, idiosyncracies of the sufia/hyrax/samvera architecture that provided additional challenges. One: the difficulty of efficiently determining the parent work of a work-in-hand, and (in sufia but not hyrax) the collection(s) that contain a work. Two: The split architecture between Solr index data (used at display-time), and fedora data (used at index time), and the need to write code very differently to get data in each of these sources/times.

Initially, I was worried about citeproc-ruby performance. So started out having our sufia app generate the human-readable citation at index time, and store it as text/html in the Solr index, so at display time it would just have to be retrieved and inserted on the page. Really, even if only takes 10ms to format a citation, wouldn’t it be better to not add 10ms to the page delivery time? (Granted, 10ms may be nothing to many slow sufia/hyrax apps).

However, to generate access to citations in our context, we need access to both the container collection (for archival arrangement/location when an archival item), and the parent work, for “container” for citation purposes. These are very slow to get out of fedora. (Changed/improved for fetching parent collections but not parent works in hyrax; we’re still sufia). Like, with our data and infrastructure, it was taking multiple seconds to get the answer from fedora to “what are the parent work(s) for this item-in-hand” (even trying to use the fedora API feature that seemed suited for this, whose name I now forget).  While one can accommodate more slowness at index-time than display-time, several-seconds-per-item was outside our tolerance — when re-indexing our ~20K item collection already can take many hours on an empty solr index.

So you want to get that info from the Solr index instead of fedora, but trying to access the Solr index in the indexing operation leads you to all sorts of problems when generating an initial index, with whether there’s already enough in the index to answer your question you need to index the item-in-hand. We want our indexing operation to always be usable starting from an empty index, for fault recovery purposes among others.  And even ignoring this issue, I found that the sufia ‘actor stack’ info actually led to the right info not being in the Solr index at the right time for a particular item-in-hand-to-index when changing the parent or collection membership for item(s).

Stopping myself as I got into trying to debug the actor stack yet again, I decided to switch to a pure display-time approach.  Just generate the citation on-demand, from the solr index.  At this point I already had a map-metadata-to-csl-json implementation based on doing it at index-time with info from fedora.  I had actually forgotten when I wrote that that I wasn’t leaving my options open to switch to display-time — so I had to rewrite the thing to retrieve the slightly different info in slightly different ways from the Solr index at display time using a sufia “show presenter”.

Also had to add some things to our Solr index so they could be used at display time — we were including in our solr index only the dates-of-work as strings we wanted to display to user on our pages, but the citation metadata transformer needed all our original structured metadata so it could determine how best to convert them (differently) to dates for inclusion in citation. (I stored our original data objects serialized to json, and then have the presenter “re-hydrate” them to our original ruby model objects without touching fedora).

Premature Abstraction

In our original implementation, I tried to provide a sort of generic “serialize to RIS”  base class, thinking it would make our code more readable, and potentially be of general use.

However, even originally it didn’t end up working quite as well as I’d hoped (needed custom logic more often than using the “built in” automatic mappings in the base class), and in fact this new implementation abandons it entirely. Instead, it first maps to CSL-json schema/format, and then the RIS serializer mostly just extracts the needed fields from there. (We wanted to take advantage of our improved citation data for on-screen human-readable to improve the RIS export too, of course).

No harm, no foul in our local codebase. You learn more about your requirements and you learn more about how particular architectural solutions work out, and you change your mind about implementation decisions and change them. This is a normal thing.

But if I had jumped to, say, add my “RIS Serializer base” abstraction to some shared codebase (say the hyrax gem, or even some kind of samvera-citations gem), it probably would have ended up not as generally useful as I thought at the time (it’s not even a good match for our needs/use case, it turns out!).  And it’s much harder to change your mind about an abstraction in a shared codebase, that many people may be relying upon, and can’t be changed without backwards incompatability problems. (That in a local codebase aren’t nearly as problematic, you just change all your code in your repo and commit it and you’re done, no need to worry about versioning or coordinating the work of various developers using the shared code).

It’s good to remember to be even more cautious with abstractions in shared code in general.  Ideally, abstractions in shared code (ie, a gem) should be based on a good understanding of the domain from some experience, and have been proven in one (or better more) individual app(s) over some amount of time, before being enshrined into a shared codebase. The first abstraction that seems to be working well for you in a particular codebase may not stand the test of time and diverse requirements/use cases, and “the wrong abstraction can be worse than no abstraction at all”—and the wrong abstraction can be very expensive and painful to undo in a gem/shared codebase.

Our implementation

You can see the Pull Request here.  (It’s possible there were some subsequent bug fixes postdating the PR).

We have a class called CitableAttributes, which takes a display-time ‘work show presenter’ (which as above has been customized to have access to some original component models), and formats it into data compatible with csl-data-json (retrievable via individual public accessors), as well as an actual JSON document that is csl-data-json.

Our RISSerializer uses a CitableAttributes object to extract individual metadata fields, and put them in the right place in an RIS document. It also needs it’s own logic for some things that aren’t quite the same in RIS and csl-data-json (different ‘type’ vocabulary, no ability to describe dates ranges machine-readably).  We wanted to take advantage of all the logic we had for transforming the metadata to something applicable to citations, to improve the RIS exports too.

Oh, one more interesting thing. We decided for photographs of “realia” (largely from our Museum‘s collection), it was more appropriate and useful to cite them as photographs (taken by us, dated the date of the photo), rather than try to cite “realia” itself, which most citation styles aren’t really set up to do, and some here thought was inappropriate for these objects as seen in our website anyhow. So we have some custom logic to determine when an item in our collection is such, and cite appropriately using some clever OO polymorphism. This logic now carries over to the RIS export, hooray.

And a simple Rails helper just uses a CitableAttributes to get a csl-data-json, and then feeds it to citeproc-ruby objects to convert to the human-readable Chicago-style citation we want on screen.

There are definitely still a variety of idiosyncratic edge cases it gets not quite right, from weird punctuation to semantics. But I believe it’s still actually one of the best on-screen automatically-generated human-readable citation implementations around!

Some live diverse examples:

brief intermezzo of library fanfic

Screenshot 2018-03-27 00.53.02


Team Alpha, enter through loading dock, proceed via stairway 3 to level B, and secure Recent Arrivals Fiction. Team Omega from main entrance, vault over the turnstiles, and take Circulation. If there’s any resistance, try not to hurt anyone, but as always, the books come first — but I don’t think there will be, we have some friends on the inside.

Comrades, once we’re in, we’re holding it and not leaving. We can run this library better than those bastards ever did, and read all the authors we want — for the rest of our lives! Shortened though those lives may now be, we know they will be more fulfilling than even centuries upon centuries with only one author.

Okay, synchronize your watches, on 3, 2, 1, mark.