Blacklight 7: current_user or other request context in SearchBuilder solr query builder

In Blacklight, the “SearchBuilder” is an object responsible for creating a Solr query. A template is generated into your app for customization, and you can write a kind of “plugin” to customize how the query is generated.

You might need some “request context” to do this. One common example is the current_user, for various kinds of access control. For instance, to hide certain objects from being returned by a Solr query depending on the user’s permissions, or perhaps to keep certain Solr fields from being searched (in qf or pf params) unless a user is authorized to see/search them.

The way you can do this changed between Blacklight 6 and Blacklight 7. The way to do it in Blacklight 7.1 is relatively straightforward, but I’m not sure if it’s documented, so I’ll explain it here. (Anyone wanting to try to update the blacklight-access_controls or hydra-access-controls gems to work with Blacklight 7 will need to know this).

I was going to start by describing how this worked in Blacklight 6… but I realized I didn’t understand it, and got lost figuring it out. So we’ll skip that. But I believe that in BL 6, controllers interacted directly with a SearchBuilder. I can also say that the way a SearchBuilder got “context” like a current_user in BL6 and previous was a bit ad hoc and messy, without a clear API, and had evolved over time in a kind of “legacy” way.

Blacklight 7 introduces a new abstraction, the somewhat generically named “search service”, normally an instance of Blacklight::SearchService. (I don’t think this is mentioned in the BL 7 Release Notes, but is a somewhat significant architectural change that can break things trying to hook into BL).

Now, controllers don’t interact with the SearchBuilder, but with a “search service”, which itself instantiates and uses a SearchBuilder “under the hood”. In Blacklight 7.0, there was no good way to get “context” to the SearchBuilder, but 7.1.0.alpha has a feature that’s pretty easy to use.

In your CatalogController, define a search_service_context method which returns a hash of whatever context you need available:

class CatalogController < ApplicationController
  include Blacklight::Catalog

  def search_service_context
    { current_user: current_user }
  end

# ...
end

OK, now the Blacklight code will automatically add that to the "search service" context. But how does your SearchBuilder get it?

Turns out, in Blacklight 7, the somewhat confusingly named scope attribute in a SearchBuilder will hold the acting SearchService instance, so in a search builder or mix-in to a search_builder…

def some_search_builder_method
  if scope.context[:current_user]
    # we have a current_user!
  end
end

And that’s pretty much it.

I believe in BL 7, the scope attribute in a SearchBuilder will always be a “search service”, perhaps it would make sense to alias it as “search_service”. To avoid the somewhat ugly scope.context[:current_user], you could put a method in your SearchBuilder that covers that as current_user, but that would introduce some coupling between that method existing in SearchBuilder, and a SearchBuilder extension that needs to use it, so I didn’t go that route.
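
If you did want that convenience anyway, a minimal sketch might look something like this (illustrative only; this is exactly the coupling I decided to avoid):

class SearchBuilder < Blacklight::SearchBuilder
  include Blacklight::Solr::SearchBuilderBehavior

  # In BL 7, `scope` is the acting SearchService, whose context hash holds
  # whatever your controller's search_service_context returned.
  def current_user
    scope.context[:current_user]
  end
end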

For a PR in our local app that supplies a very simple local SearchBuilder extension, puts it into use, and makes the current_user available in a context, see this PR. 

Solr Indexing in Kithe

So you may recall the kithe toolkit we are building in concert with our new digital collections app, which I introduced here.

I have completed some Solr Indexing support in kithe. It’s just about indexing, getting your data into Solr. It doesn’t assume Blacklight, but should work fine with Blacklight; there isn’t currently any support in kithe for what you do to provide UX for your Solr index.  You can look at the kithe guide documentation for the indexing features for a walk-through.

The kithe indexing support is based on ActiveRecord callbacks, in particular the after_commit callback. While callbacks get a bad rap, I think they are appropriate here, and note that both the popular sunspot gem (Solr/Rails integration, currently looking for new maintainers) and the popular searchkick gem (ElasticSearch/Rails integration) base their indexing synchronization on AR callbacks too. (There are various ways in kithe’s feature to turn off automatic callbacks temporarily or permanently in your code, like there are in those other two gems too). I spent some time looking at API’s, features, and implementation of the indexing-related functionality in sunspot, and searchkick, as well as other “prior art”, before/while developing kithe’s support.
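
To make the basic mechanism concrete, here’s a generic sketch of after_commit-based index synchronization in plain Rails terms — not kithe’s actual API (the SolrIndexUpdater helper is hypothetical), just the underlying idea:

class Work < ApplicationRecord
  # Only touch Solr after the DB transaction has actually committed
  after_commit :update_solr_index, on: [:create, :update]
  after_commit :remove_from_solr_index, on: :destroy

  def update_solr_index
    SolrIndexUpdater.call(self)   # hypothetical helper that writes this record to Solr
  end

  def remove_from_solr_index
    SolrIndexUpdater.delete(id)   # hypothetical helper that deletes it from Solr
  end
end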

The kithe indexing support is also based on traject for defining your mappings.
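
For flavor, a traject mapping over model objects can look something like this — a hypothetical sketch using traject’s block form of to_field, not necessarily kithe’s exact API (see the kithe guide for that):

class WorkIndexer < Traject::Indexer
  configure do
    to_field "model_name_ssi" do |record, accumulator|
      accumulator << record.class.name
    end

    to_field "title_tsim" do |record, accumulator|
      accumulator << record.title
    end

    to_field "description_tsim" do |record, accumulator|
      accumulator.concat(Array(record.description))
    end
  end
end

(The field names above are hypothetical Solr dynamic-field-style names.)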

I am very happy with how it turned out, I think the implementation and public API both ended up pretty decent. (I am often reminded of the quote of uncertain attribution “I didn’t have time to write a short letter, so I wrote a long one instead” — it can take a lot of work to make nice concise code).

The kithe indexing support is independent of any other kithe features and doesn’t depend on them. I think it might be worth looking at for anyone writing an app whose persistence is based on ActiveRecord. (If you’re using something ActiveModel-like but not ActiveRecord, it probably doesn’t have after_commit callbacks, but if it has after_save callbacks, we could make the kithe feature optionally use those instead; sunspot and searchkick can both do that).

Again, here’s the kithe documentation giving a tour of the indexing features. 

Note on traject

The part of the architecture I’m least happy with is traject, actually.

Traject was written for a different use case — command-line executed high-volume bulk/batch indexing from file serializations. And it was built for that basic domain and context at the time, with some YAGNI thoughts.

So why try to use it for a different case of event-based few-or-one object sync’ing, integrated into an app?  Well, hopefully it was not just because I already had traject and was the maintainer (‘when all you have is a hammer’), although that’s a risk. Partially because traject’s mapping DSL/API has proven to work well for many existing users. And it did at least lead me to a nice architecture where the indexing code is separate and fairly decoupled from the ActiveRecord model.

And the Traject SolrJsonWriter already had nice batching functionality (and thread-safety, although I didn’t end up using that in the current kithe architecture), which made it convenient to implement batching features in a de-coupled way (just send to a writer that’s batching; the other code doesn’t need to know about it, except for maybe flushing at the end).
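
To give a sense of that decoupling, the batching lives entirely in writer configuration — something like this sketch, where the settings are traject’s but the URL and batch size are just examples:

indexer = Traject::Indexer.new(
  "writer_class_name"       => "Traject::SolrJsonWriter",
  "solr.url"                => "http://localhost:8983/solr/collection1", # hypothetical core
  "solr_writer.batch_size"  => 100,  # up to 100 docs per Solr HTTP update request
  "solr_writer.thread_pool" => 1
)

Code that feeds documents to the writer doesn’t know or care that they’re being batched; it just remembers to flush/close the writer when it’s done.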

And, well, maybe I just wanted to try it. And I think it worked out pretty well, although there are some oddities in there due to traject’s current basic architectural decisions. (Like, instantiating a Traject “Indexer” can be slow, so we use a global singleton in the kithe architecture, which is weird.)  I have some ideas for possible refactors of traject (some backwards compat some not) that would make it seem more polished for this kind of use case, but in the meantime, it really does work out fine.

Note on times to index, legacy sufia app vs our kithe-based app

Our collection, currently in a sufia app, is relatively small. We have about 7,000 Works (some of which are “child works”), 23,000 “FileSets” (which in kithe we call “Assets”), and 50 Collections.

In our existing Sufia-based app, it takes about 6 hours to reindex to Solr on an empty index.

  • Except actually, on an empty index it might take two re-index operations, because of the way sufia indexing is reliant on getting things out of the index to figure out the proper way to index a thing at hand. (We spent a lot of work trying to reorganize the indexing to not require an index to index, but I’m not sure if we succeeded, and may ironically have made performance issues with fedora worse with the new patterns?) So maybe 12 hours.
  • Except that 6 hours is just a guess from memory. I tried to do a bulk reindex-everything in our sufia app to reconfirm it — but we can’t actually currently do a bulk reindex at all, because it triggers an HTTP timeout from Fedora taking too long to respond to some API request.
    • If we upgraded to ActiveFedora 12, we could increase the timeout that ActiveFedora is willing to wait for a fedora response for. If we upgraded to ActiveFedora 12.1, it would include this PR, which I believe is intended to eliminate those super long fedora responses. I don’t think it would significantly change our end-to-end indexing time, the bulk of it is not in those initial very long fedora API calls. But I could be wrong. And not sure how realistic it is to upgrade our sufia app to AF 12 anyway.
    • To be fair, if we already had an existing index, but needed to reindex our actual works/collections/filesets because of a Solr config change, we had another routine which could do so in only ~25 minutes.

In our new app, we can run our complete reindexing routine in currently… 30 seconds. (That’s about 300 records/second throughput — only indexing Works and Collections. In past versions as I was building out the indexing I was getting up to 1000 records/second, but I haven’t taken time to investigate what changed, cause 30s is still just fine).

In our sufia app we are backing up our on-disk Solr indexes, because we didn’t want to risk the downtime it would take to rebuild (possibly including fighting with the code to get it to reindex). In addition to just being more bytes to sling, this leads to ongoing developer time on such things as “did we back up the solr data files in a consistent state? Sync’d with our postgres backup?”, and “turns out we just noticed that an error in the backup routine meant the backup actually wasn’t happening.” (As anyone who deals with backups of any sort knows, that can be A Thing).

In the new system, we can just… not do that.  We know we can easily and quickly regenerate the Solr index whenever, from the data in postgres. (And if we upgrade to a new Solr version that requires an index rebuild, no need to figure out how to do so without downtime in a complicated way).

Why is the new system so much faster? I’ve identified three areas I believe are likely, but haven’t actually tried to do much profiling to determine which of these (if any?) are the predominant factors, so couldn’t say.

  1. Getting things out of fedora (at least under sufia’s usage patterns) is slow. Getting things out of postgres is fast.
  2. We are now only indexing what we need to support search.
    • The only things that show up in our search results are Works and Collections, so that’s all we’re indexing. (Sufia indexes not only FileSets too, but also some ancillary objects such as one or two kinds of permission objects, and possibly a variety of other things I’m not familiar with. Sufia is trying to put pretty much everything that’s in fedora into Solr. For Reasons, mainly that it’s hard to query your things in Fedora with Fedora).
    • And we are only indexing the fields we actually need in Solr for those objects. Sufia tries to index a more or less round-trippable representation to Solr, with every property in its own stored solr field, etc. We aren’t doing that anymore. We could put all text in one “text” field, if we didn’t want to boost some higher than others. So we only index to as many fields as need different boosts, plus fields for facets, etc. Only what we need to support the Solr functionality we want.
      • If you want to render your results from only Solr stored fields (as sufia/hyrax do, and blacklight kind of wants you to) you’d also need those stored fields, sufficiently independently addressable to render what you want (or perhaps just in one big serialized JSON?). We are hoping to not use solr stored fields for rendering at all, but even if we end up with Solr stored fields for rendering, it will be just enough that we need for rendering. (For instance, some people using Blacklight are using solr stored fields for the “index”/search results/hits page, but not for the individual record ‘show’ page).
  3. The indexing routines in the new app send updates to Solr in an efficient way, both batching record updates into fewer Solr HTTP update requests, and not sending synchronous Solr “hard commits” at all. (The bulk reindex, like the after_commit indexing, currently sends a softCommit per update request, although this could be configured differently; see the sketch just below.)
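
To illustrate what that last point means at the Solr HTTP level, here’s a rough sketch (hypothetical URL and document) of a batched update request that asks for a soft commit rather than a hard one:

require 'net/http'
require 'json'

# softCommit makes new docs searchable soon, without the expense of a hard commit
uri  = URI("http://localhost:8983/solr/collection1/update?softCommit=true") # hypothetical core
docs = [{ "id" => "work-1", "title_tsim" => ["Typewriter manual"] }]         # hypothetical doc(s)

Net::HTTP.post(uri, JSON.generate(docs), "Content-Type" => "application/json")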

So

Check out the kithe guide on indexing support! Maybe you want to use kithe, maybe you’re writing an ActiveRecord-based app and want to consider kithe’s solr indexing support in isolation, or maybe you just want to look at it for API and implementation ideas in your own thing(s).

What “search engine” to use in a digital collections Rails app?

Traditional samvera apps have Blacklight, and its Solr index, very tightly integrated into many parts of persistence and discovery functionality, including management interfaces.

In rewriting our digital collections app, we have the opportunity to make other choices. Which of course is both a blessing and a curse, who wants choices?

One thing I know I don’t want is as tight and coupled an integration to Solr as a typical sufia or hyrax app.

We should be able to at least find persisted model items by id (or iterate through all of them), make some automated changes (say correcting a typo), and persist them to storage — without a Solr index existing at all. To the extent a Solr (or other text-search-engine) index exists at all, discrepancies between what’s in the index and what’s in our “real” store should not cause any usual mutation-and-persistence APIs to fail (either with an error or with a wrong side-effect outcome).

Really, I want a back-end interface that can do most if not all things a staff user needs to do in managing the repo, without any Solr index existing at all. Just plain postgres ‘like’ search may sometimes be enough; when it’s not, pg’s full text indexing features likely are. These sorts of features are not as powerful as a ‘text search engine’ product like lucene or Solr — they aren’t going to do the complicated stemming that Solr does, or probably features like “phrase boosting”. They can give you filters/limits, but not facets in the Solr sense (telling you what terms are present in a given already restricted search, with term counts).
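
For a concrete (if simplistic) sketch of what I mean, assuming a works table with a title column:

# Case-insensitive substring match, no search engine involved:
Work.where("title ILIKE ?", "%typewriter%")

# Postgres full-text matching: gets you stemming per the 'english' config,
# but not Solr-style relevance tuning, phrase boosting, or facets:
Work.where(
  "to_tsvector('english', title) @@ plainto_tsquery('english', ?)",
  "typewriters"
)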

So we almost certainly still want Solr or a similar search engine providing user-facing front-end discovery, for this powerful search experience. We just want it sitting loosely on top of our app, not tightly integrated into every persistence and retrieval operation like it ends up in a typical sufia or hyrax app.

And part of this, for me, is I only want to have to index in Solr (or similar) what is necessary for discovery/retrieval features, for search. This is how Solr works best. I don’t want to have to store a complete representation of the model instance in Solr, with every field addressable from a Solr result. So, that means, even in the implementation of the front-end UX search experience, I want display of the results to be from my ordinary ActiveRecord model objects (even on the ‘index’ search results page, and certainly on the ‘item detail’ page). This is in fact how sunspot works — after solr returns a hit list, take the db pk’s (and model names) from the solr response, and then just fetch the results again from the database, in nice efficient SQL, using pre-loading (via SQL joins), etc. This is how one attempt at elasticsearch-rails integration works too.
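
The basic pattern is roughly this (a hand-rolled sketch of the idea, not sunspot’s actual code; field and association names are hypothetical):

solr_docs = solr_response["response"]["docs"]            # docs from your Solr client
ids       = solr_docs.map { |doc| doc["model_pk_ssi"] }  # hypothetical stored-pk field

# One SQL query, with whatever eager-loading the view needs
works_by_id = Work.where(id: ids)
                  .includes(:leaf_representative, :collections)  # hypothetical associations
                  .index_by(&:id)

# Preserve Solr's relevance ordering
ordered_hits = ids.map { |id| works_by_id[id] }.compact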

Yeah, it’s doing an “extra” fetch from the db, when it theoretically could have gotten everything it needed to display from Solr.  But properly optimized fetches from the db to display one page of search results are pretty quick, certainly faster than what was going on with ActiveFedora in our sufia app anyway, and the developer pain (and subsequent bugs) that can come from trying to duplicate everything in Solr just aren’t worth trying to optimize away the db fetch. There’s a reason popular Solr or ElasticSearch integrations with Rails do the “extra” fetch.

OK, I know what I want (and what I don’t), but what am I going to do? There’s still some choices, curses!

1. Blacklight, intervened in to return actual ActiveRecord models to views?

Blacklight was originally written for “library catalog” use cases where you might not be indexing from a local rdbms at all, you might be indexing from a third party data API, and you might not have representations in a local rdbms, solr is all you’ve got. So it was written to find everything it needs to display the results found in the Solr response.

But can we intervene in Blacklight to instead take the Solr responses, use them to get model names and pks to then fetch from a local rdbms instead? Like sunspot does?

This was my initial plan, and at first I thought I could do it easily. In fact, years ago, when writing a Blacklight catalog app, I had to do something in some ways analogous. We wanted our library catalog to show live checked in/out status for things returned by Solr. But we had no good way to get this into our Solr index in realtime. So, okay, we can’t provide a filter/facet by this value without it in the index, but can we still display it with realtime accuracy?

We could! We wanted to hook into the solr search results process in Blacklight, take the unique IDs from the solr response, and use them to make API calls to our ILS to figure out item status. (checked out, etc). And we wanted to do this in one bulk query (with all the IDs that are on the page of results), not one request per search hit, which would have been a performance problem. (We won’t talk about how our ILS didn’t actually have an API; let’s just say we gave it one).

So I had already done that, and thought the hook points and functions were pretty similar (just look up ‘extra info’ differently, this time the ‘extra info’ is an actual ActiveRecord model associated with each hit, instead of item status info). So I figured I could do it again!

The Blacklight method I had overridden to do that (back in maybe Blacklight 2.x days) was the search_results method, called by the Catalog#index action among other places. Overriding this covered every place Blacklight got ‘results’, so we could make sure to hook in to get ‘extra stuff’ on every results fetch. It returned the @response itself, so we could hook into it to enhance the SolrDocuments returned to have extra stuff! Or hey, it’s a controller method, we can even have it set other iVars if we want. A nice clean intervention.

But alas! While I had used this in many-years-ago Blacklight, and it made it all the way to Blacklight 6… this method no longer exists in Blacklight 7, and there’s no great obvious place to override to do similar. It looks like it actually went away in a BL commit a couple years ago, but that commit didn’t make it into a BL release until 7.0. The BL 7 def index action method… doesn’t really have any clear point to intervene to do similar.

Maybe I could do something over in the ‘search_service.search_result’ method, I guess in a custom search_service. It’s not a controller method, so it couldn’t provide its own iVar, but it could still modify the @response to enhance the SolrDocuments in it. There are some more layers of architecture to deal with (and possibly break with future BL releases), and I haven’t yet really figured out what the search_service is and where it comes from! But it could be done.

Or I could try to get the search_results cover method back into BL. (Not sure how amenable BL would be to such a PR).

Or I could… override even more? The whole catalog index method? Don’t even use the blacklight Catalog controller at all but write my own? Both of those, my intuition based on experience with BL says, there be dragons.

2. Use Solr, but not Blacklight

So as I contemplated the danger of overriding big pieces of BL, I thought, ok, wait, why am I using BL at all, actually?

A couple senior developers at a couple institutions I talked to (I won’t reveal their names in case I’m accidentally misrepresenting them, and to not bring down the heat of bucking the consensus on them) said they were considering just writing ruby code to interact with Solr. (We’re talking about search/discovery here; indexing — getting the data into Solr in the first place — is another topic.) They said, gee, what we need in our UI for Solr results just isn’t that complicated, we think maybe it’s not actually that hard to just write the code to do it, maybe easier than fighting with BL, which in the words of one developer has a tendency to “infect” your application, making everything more complicated when you try doing things the “Blacklight way”.

And it’s true we spend a lot of time overriding or configuring Blacklight to turn off features we didn’t think worked right, or we just don’t want in our UX. (Sufia/hyrax have traditionally tried to override Blacklight to make the ‘saved searches’ feature go away for instance). And there’s more features we just don’t use.

Could we just write our own code for issuing queries to Solr, and displaying facets and hits from the results? Maybe. Off-hand, I can think of a couple things we get from BL that are non-trivial:

  1. The “back to search” link on the item detail page. Supplying a state-ful UX on top of state-less HTTP web app (while leaving the URLs clean) is a pain.  The Blacklight implementation has gone through various iterations and probably still has some weirdness, but my stakeholders have told me this feature is important.
  2. The date range facet with little mini-histogram, provided by blacklight_range_limit. This feature is also implemented kind of crazily (I should know, I wrote the first version, although I’m not currently really the maintainer) — if all you want is a date range limit where you enter a start and end year, without the little bar-graph-ish histogram, that’s actually easy, and I think some people are using blacklight_range_limit when that’s all they want, and could be doing it a lot simpler. But the histogram, with the nice calculated (for the particular result set!) human-friendly range endpoints (it’ll be 10s or 5s or whatever, at the right order of magnitude for your current faceted results!), is kind of a pain, and it just works with blacklight_range_limit (although I don’t actually know if blacklight_range_limit works for BL7, it may not).

Probably a few more things I’m not thinking of that I’d run into.

On the plus side, I wouldn’t have to fight with Blacklight to turn off the things I don’t want, or to get it to have the retrieval behavior I want, fetching hits from my actual rdbms for display.
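
For concreteness, issuing a faceted query directly with RSolr really isn’t much code — a sketch with a hypothetical core URL and field names:

require 'rsolr'

solr = RSolr.connect(url: "http://localhost:8983/solr/collection1")  # hypothetical core

response = solr.get("select", params: {
  q: "annotated typewriter manual",
  defType: "edismax",
  qf: "title_tsim^5 description_tsim",               # hypothetical fields
  facet: true,
  "facet.field" => ["genre_ssim", "subject_ssim"],
  rows: 20
})

docs = response["response"]["docs"]
# facet_fields come back as a flat [term, count, term, count, ...] array
genre_counts = response["facet_counts"]["facet_fields"]["genre_ssim"]

The query itself isn’t the hard part — the hard parts are UX features like the two listed above.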

Hmm.

(While I keep looking at sunspot for ideas — it is/was somewhat popular, so must have at least gotten some developer APIs right for some use cases involving rdbms data searched in Solr — it’s got some awfully complicated implementation, is assuming certain “basic search” use cases as the golden path, definitely has some things I’d have to fight with, and has a “Looking for maintainers” line on its README, so I’m not really considering actually using it).

3. Should we use ElasticSearch?

Hypothetically, I think Blacklight was abstracted at one point to support ElasticSearch. I’m not totally sure how that went, if anyone is using BL with ES in production or whatever.

But if I wanted to use ElasticSearch, I think I probably wouldn’t try to use it with BL, but as an extension of the “2” part. If I’m going to be writing it myself anyway, might we want to use ElasticSearch instead?

ElasticSearch, like Solr, is an HTTP-api search engine app built on top of lucene. In some ways, I think Solr is a victim of being first. It’s got a lot more… legacy. And different kinds of deployments it supports. (SolrCloud or not cloud? Managed schema or not? What?) Solr can sometimes seem to me like it gives you a billion ways to do whatever you want to do, but you’ve got to figure out which one to do (and whatever you choose may break some other feature). Whereas ElasticSearch just seems to be more straightforward. Or maybe that’s just that it seems to have better, clearer documentation. It just seems less overwhelming, and I theoretically am familiar with Solr from years of use (but I always learned just enough to get by).

For whatever reasons, ElasticSearch seems to have possibly overtaken Solr in popularity, seems to be easier to pay someone else for as a cloud-hosted PaaS instance at an affordable price, and seems to just generally be considered a bit easier to get started with.

I’ve talked to some other people in samvera space who are hypothetically considering ElasticSearch too, if they could do whatever they wanted (although I know of nobody actually moving forward with plans).

ElasticSearch at least historically didn’t have all the features and flexibility of Solr, but it’s caught up a lot. Might it have everything we actually need for this digital collections app?

I’m not sure. Historically ES had problems with facets, and while it seems to have caught up a lot… they don’t work quite like Solr’s, and looking around to see if they do everything I need, it seems like there are a couple different ES features that approach Solr’s “facets”, and I’m not totally sure either does what I actually need (ordinary Solr facets: exact term counts, sorted by most-represented-term-first, within a result set).

It might! But really ES’s unfamiliarity is the biggest barrier. I’d have to figure out how to do things with slightly different featureset, and sometimes might find myself up against a brick wall, and am not sure I’d know that for sure until I’m in the middle of it. I have a pretty good sense of what Solr can do at this point, I know what I’m getting into.

(ES also maybe exposes different semantics around lucene ‘commits’? If you need synchronous “realtime” commits immediately visible on next query, I think maybe you can get that from ES, but I’m not 100% confident, it’s definitely not ES’s “happy path”. Historically samvera apps have believed they needed this; I’m not sure I do if I successfully have search engine functionality resting more lightly on the app. But I’m not sure I don’t).

So what will I do?

I’m actually not sure, I’m a bit stumped.

I think going to ElasticSearch is probably too much for me right now, there’s too many balls in the air in rewriting this app to add in search engine software I’m not familiar with that may not have quite the same featureset.

But between using BL and doing it myself… I’m not sure, both offer some risks and possible rewards.

The fact that I can’t use the override-point in BL I was planning to, cause it’s gone from BL 7, annoys me and pushes me a bit more to consider a DIY approach. But I’m not sure if I’m going to regret that later. I might start out trying it and seeing where it gets me… or I might just figure out how to hack the rdbms-retrieval pattern I want into BL, even if it’s not pretty. I know I want to write my display logic in terms of my ActiveRecord models, with full access to ActiveRecord eager-loading to load any associated records I need (ala sunspot), instead of trying to jam it all into a Solr record in a denormalized fashion. Being able to get out of that by escaping from sufia/hyrax was one of the main attractions of doing so!

Our progress on new digital collections app, and introducing kithe

In September, I wrote a post on a “Proposed Rails-based digital collections developer’s toolkit”

What has happened since then?

Yes, we decided to go ahead with a rewrite of our digital collections app, with the new app not based on Hyrax or Valkyrie, but on a persistence layer based on ActiveRecord (making use of postgres-specific features where appropriate), and exposing ActiveRecord models to the app as a whole.

No, we are not going forward with trying to make that entire “toolkit”, with all the components mentioned there.

But Yes, unlike Alberta, we are taking some functionality and putting it in a gem that can be shared between institutions and applications. That gem is kithe. It includes some sharable modeling/persistence code, like Valkyrie (but with a very different approach than Valkyrie), but also includes some additional fundamental components too.

Scaling back the ambition—and abstraction—a bit

The total architecture outlined in my original post was starting to feel overwhelming to me. After all, we also need to actually produce and launch an app for ourselves, on a “reasonable” timeline, with fairly high chance of success.  I left my conversation with U Alberta (which was quite useful, thank you to the Alberta team!), concerned about potential over-reach and over-abstraction. Abstraction always has a cost and building shared components is harder and more time-consuming than building a custom app.

But, then, also informed by my discussion with Alberta, I realized we basically just had to build a Rails app, and this is something I knew how to do, and we could, as we progressed, jettison anything that didn’t seem actually beneficial for that goal or feasible at the moment. And, also after discussion with a supportive local team, my anxiety about the project went down quite a bit — we can do this.

Even when writing the original proposal, I knew that some elements might be traps. Building a generalized ACL permissions system in an rdbms-based web app… many have tried, many have fallen. :)  Generalized controllers are hard, because they are a piece very tightly tied to your particular app’s UI flows, which will vary.

So we’ve scaled back from trying to provide a toolkit which can also be “scaffolding” for a complete starter app.  The goals of the original thought-experiment proposal — a toolkit which provides  pieces developers put together when building their own app — are better approached, for now, by scaling back and providing fewer shared tools, which we can make really solid.

After all, building shared code is always harder than building code for your app. You have more use cases to figure out and meet, and crucially, shared code is harder to change because it’s (potentially) got cross-institutional dependents, which you have to not break. For the code I am putting into kithe, I’m trying to make it solidly constructed and well-polished. In purely local code,  I’m more willing to do something experimental and hacky — it’s easy enough (comparatively!) to change local app code later.  As with all software, get something out there that works, iterating, using what you learn. (It’s just that this is a lot harder to do with shared dependencies without pain!)

So, on October 1st, we decided to embark on this project. We’re willing to show you our fairly informal sketch of a work plan, if you’d like to look.

Introducing kithe

But we’re not just building a local app, we are also trying to create some shareable components. While the costs and risks of shared code and abstractions are real,  I ultimately decided that “just Rails” would not get us to the most maintainable code after all. (And of course nothing is really just Rails, you are always writing code and using non-Rails dependencies; it’s a matter of degree, how much your app seems like a “typical” Rails app to developers).

It’s just too hard to model the data we ourselves already needed (including nested/compound/repeated models) in “just” ActiveRecord, especially in a way that lets you work with it sanely as “just” ActiveRecord, and is still performant. (So we use attr_json, which I also developed, for a No-SQLy approach without giving up rdbms or ActiveRecord benefits including real foreign-key-based associations). And in another example, ActiveStorage was not flexible/powerful enough for our file-handling needs (which are of course at the core of our domain!), and I wasn’t enthused about CarrierWave either — it makes sense to me to make some solid high-quality components/abstractions for some of our fundamental business/domain concerns, while being aware of the risks/costs.

So I’ve put into kithe the components I thought seemed appropriate on several considerations:

  • Most valuable to our local development effort
  • Handling the “trickiest” problems, most useful to share
  • Handling common problems, most likely to be shareable; and it’s hard to build a suite of things that work together without some modelling/persistence assumptions, so got to start there.
  • I had enough understanding of the use-cases (local and community) that I thought I could, if I took a reasonable amount of extra time, produce something well-polished, with a good developer experience, and a relatively stable API.

That already includes the following — maybe not 1.0-production-ready, but used in our own in-progress app, and released (well-tested and well-documented) in kithe:

  • A modeling and persistence layer tightly coupled to ActiveRecord, with some postgres-specific features, and recommending use of attr_json, for convenient “NoSQL”-like modelling of your unique business data (in common with existing samvera and valkyrie solutions, you don’t need to build out a normalized rdbms schema for your data). With models that are samvera/PCDM-ish (also like other community solutions).
    • Including pretty slick handling of “representatives”, dealing with the performance issues in figuring out representative to display with constant query time (using some pg-specific SQL to look up and set “leaf” representative on save).
    • Including UUIDs as actual DB pk/fks, but also a friendlier_id feature for shorter public URL identifiers, with logic to automatically create such if you wish.
  • A nice helper for building Rails forms with repeatable complex embedded values. Compare to the relevant parts of hydra-editor, but (I think) lighter and more flexible.
  • A flexible file-handling architecture based on shrine — meaning transparent cloud-storage support out of the box.
    • Along with a new derivatives architecture, which seems to me to have the right level of abstraction and affordances to provide a “polished” experience.
    • All file-handling support based on assuming expensive things happen in the background, and “direct upload” from browser pre-form-submit (possibly to cloud storage)

It will eventually include some solr/blacklight support, including a traject-based indexing setup, and I would like to develop an intervention in blacklight so after solr results are returned, it immediately fetches the “hit” records from ActiveRecord (with specified eager-loading), so you can write your view code in terms of your actual AR models, and not need to duplicate data to solr and logic for dealing with it. This latter is taken from the design of sunspot.

But before we get there, we’re going to spend a little bit of time on purely local features, including export/import routines (to get our data into the new app, with some solid testing/auditing to be confident we have it all), and some locally bespoke workflow support (I think workflow is something that works best written as just plain Rails).

We do have an application deployed as demo/staging, with a basic more-than-just-MVP-but-not-done-yet back-end management interface (note: it does not use Solr/Blacklight at all which I consider a feature), but not yet any non-logged-in end-user search front-end. If you’d like a guest login to see it, just ask.

Technical Evaluation So Far

We’ve decided to tie our code to Rails and ActiveRecord. Unlike Valkyrie, which provides a data-mapper/repository pattern abstraction, kithe expects the dependent code to use ActiveRecord APIs (along with some standard models and modelling enhancements kithe gives you).

This means, unlike Valkyrie, our solution is not “persistence-layer agnostic”. Our app, and any potential kithe apps, are tied to Rails/ActiveRecord, and can’t use fedora or other persistence mechanisms. We didn’t have much need/interest in that, we’re happy tying our application logic and storage to ActiveRecord/postgres, and perhaps later focusing on regularly exporting our data to be stored for preservation purposes in another format, perhaps in OCFL.

It’s worth noting that the data-mapper/repository pattern itself, along the lines valkyrie uses, is favored by some people for reasons other than persistence-swapability. In the Rails and ruby web community at large, there is a contingent that thinks the data-mapper/repository pattern is better than what Rails gives you, and gives you a better architecture for maintainable code. Many in this contingent are big on hanami and the dry-rb suite. (I have never been fully persuaded by this contingent).

And to be sure, in building out our approach over the last 4 months, I sometimes ran right into the architectural issues with Rails “model-based” architecture and some of what it encourages like dreaded callbacks.  But often these were hypothetical problems, “What if someone wanted to do X,” rather than something I actually needed/wanted to do now. Take a breath, return to agility and “build our app”.

And a Rails/ActiveRecord-focused approach has huge advantages too. ActiveRecord associations and eager-loading support are very mature and powerful tools that, when exposed to the app as an API, give you time-tested ways to build your app flexibly and performantly (at least for the architectures our community is used to, where avoiding n+1 queries still sometimes seems like an unsolved problem!). You have a whole Rails ecosystem to rely on, which kithe-dependent apps can just use, making whatever choices they want (use reform or not?) as with most any Rails app, without having to work out as many novel approaches or APIs. (To be sure, kithe still provides some constraints and choices and novelty — it’s a question of degree).

Trying to build up an alternative based on data-mapper/repository, whether in hanami or valkyrie, I think you have a lot of work to do to be competitive with Rails’ mature solutions, sometimes reproducing features already in ActiveRecord or its ecosystem. And it’s not just work that’s “time implementing”, it’s work figuring out the right APIs and patterns. Hanami, for instance, is probably still not as mature as Rails, or as easy to use for a newcomer.

By not having to spend time re-inventing things that Rails already has solutions for, I could spend time on our actual (digital collections) domain-specific components that I wasn’t happy with existing solutions for. Like spending time on creating shareable file handling and derivatives solutions that seem to me to be well-polished, and able to be used for flexible use-cases without feeling like you’re fighting the system or being surprised by it. Components that hopefully can be re-used by other apps too.

I think schneem’s thoughts on “polish” are crucial reading when thinking about the true costs of shared abstractions in our community.  There is a cost to additional abstractions: in initial implementation, ongoing maintenance, developer on-boarding, and just figuring out the right architectures and APIs to provide that polish. Sometimes these costs are worthwhile in delivered benefits, of course.

I’d consider our kithe-based approach to be somewhere in between U Alberta’s approach and valkyrie, in the dimension of “how closely do we stick to, and tie ourselves to, ‘standard’ Rails”.

We are building our own app, not trying to use a shared app or “solution bundle” like Hyrax. I would suggest we share that aspect with both the U Alberta approach and the several institutions building valkyrie-not-hyrax apps. But if you’ve had good experiences with the over-time maintenance costs of Hyrax — if you have a use case/context where Hyrax has worked well for you — then that’s great, and there’s never anything wrong with doing what has worked for you.

Overall, 4 months in, while some things have taken longer to implement than I expected, and some unexpected design challenges have been encountered — I’m still happy with the approach we are taking.

If you are considering a based-on-valkyrie-no-hyrax approach, I think you might be in a good position to consider a kithe approach too.

How do we evaluate success?

Locally,

We want to have a replacement app launched in about a year.

I think we’re basically on target, although we might not hit it on the nose, I feel confident at this point that we’re going to succeed with a solid app, in around that timeline. (knock on wood).

When we were considering alternate approaches before committing to this one, we of course tried to compare how long this would take to various other approaches. This is very hard to predict, because you are trying to compare multiple hypotheticals, but we had to make some ballpark guesses (others may have other estimates).

Is this more or less time than it would have taken to migrate our sufia app to current hyrax? I think it’s probably taking more time to do it this new way, but I think migrating our sufia app to current hyrax (with all its custom functionality for current features) would not have been easy or quick — and we weren’t sure current hyrax was a place we wanted to end up.

Is it going to take more or less time than it would have taken to write an app on valkyrie, including any work we might contribute to valkyrie for features we needed? It’s always hard to guess these things, but I’d guess in the same ballpark, although I’m optimistic the “kithe” approach can lead to developer time-savings in the long-run.

(Of course, we hope if someone else wants to follow our path, they can re-use what’s now worked out in kithe to go quicker).

We want it to be an app whose long-term maintenance and continued development costs are good

In our sufia-based app, we found it could be difficult and time-consuming to add some of the features we needed. We also spent a lot of time trying to performance-tune to acceptable levels (and we weren’t alone), or figure out and work towards a manageable and cost-efficient cloud deployment architecture.

I am absolutely confident that our “kithe” approach will give us something with a lower TCO (“total cost of ownership”) than we had with sufia.

Will it be a lower TCO than if we were on present hyrax (ignoring how to get there), with the custom features we needed? I think so — current hyrax isn’t different enough from the sufia we are used to — but again this is necessarily a guess, and others may disagree. In the end, technical staff just has to make their best predictions based on experience (individual and community). Hyrax probably will continue to improve under @no-reply’s steady leadership, but I think we have to make our decisions on what’s there now, and that potential rosy future also requires continued contribution by the community (like us) if it is to come to fruition, which is real time to be included in TCO too. I’m still feeling good about the “write our own app” approach vs “solution bundle”.

Will we get a lower TCO than if we had a non-hyrax valkyrie-based app? Even harder to say. Valkyrie has more abstractions and layers that have real ongoing maintenance costs (that someone has to do), but there’s an argument that those layers will lower your TCO over the long-term. I’m not totally persuaded by that argument myself, and when in doubt am inclined to choose the less-new-abstraction path, but it’s hard to predict the future.

One thing worth noting is the main thing that forced our hand in doing something with our existing sufia-based app is that it was stuck on an old version of Rails that will soon be out-of-support, and we thought it would have been time-consuming to update, one way or another. (When Rails 6.0 is released, probably in the next few months, Rails maintenance policy says nothing before 5.2 will be supported.) Encouragingly, both kithe and its attr_json dependency (also by me) are testing green on Rails 6.0 beta releases — and, I was gratified to see, didn’t take any code changes to do so, they just passed. (Valkyrie 1.x requires Rails 5.1, but a soon-to-be-released 2.0 is planned to work fine up to Rails 6; latest hyrax requires Rails 5.1 as well, but the hyrax team would like to add 5.2 and 6 soon).

We want easier on-boarding of new devs for succession planning

All developers will leave eventually (which is one reason I think if you are doing any local development, a one-developer team is a bad idea — you are guaranteeing that at some point 100% of your dev team will leave at once).

We want it to be easier to on-board new developers. We share U Alberta’s goal that what we could call a “typical Rails developer” should be able to come on and maintain and enhance the app.

Are we there? Well, while our local app is relatively simple rails code (albeit using kithe API’s), the implementation of  kithe and attr_json, which a dev may have to delve into, can get a bit funky, and didn’t turn out quite as simple as I would have liked.

But when I get a bit nervous about this, I reassure myself remembering that:

  • a) Our existing sufia-based app is definitely high-barrier for new devs (an experience not unique to us), I think we can definitely beat that.
    • Also worth pointing out that when we last posted a position, we got no qualified applicants with samvera, or even Rails, experience. We did make a great hire though, someone who knew back-end web dev and knew how to learn new tools; it’s that kind of person that we ideally need our codebase to be accessible to, and the sufia-based one was not.
  • b) Recruiting and on-boarding new devs is always a challenge for any small dev shop, especially if your salaries are not seen as competitive.  It’s just part of the risk and challenge you accept when doing local development as a small shop on any platform. (Whether that is the right choice is out of scope for this post!)

I think our code is going to end up more accessible to actually-existing newly onboarded devs  than a customized hyrax-based solution would be. More than Valkyrie? I do think so myself, I think we have fewer layers of “specialty” stuff than valkyrie, but it’s certainly hard to be sure, and everyone must judge for themselves.

I do think any competent Rails consultancy (without previous LAM/samvera expertise) could be hired to deal with our kithe-based app no problem; I can’t really say if that would be true of a Valkyrie-based app (it might be); I do not personally have confidence it would be true of a hyrax-based app at this point, but others may have other opinions (or experience?).

Evaluating success with the community?

Ideally, we’d of course love it if some other institutions eventually developed with the kithe toolkit, with the potential for sharing future maintenance of it.

Even if that doesn’t happen, I don’t think we’re in a terrible place. It’s worth noting that there has been some non-LAM-community Rails dev interest in attr_json, and occasional PRs; I wouldn’t say it’s in a confidently sustainable place if I left, but I also think it’s code someone else could step into and figure out. It’s just not that many lines of code, it’s well-tested and well-documented, and I’ve tried to be careful with its design — but take a look and decide for yourself! I cannot emphasize enough my belief that if you are doing local development at all (and I think any samvera-based app has always been such), you should have local technical experts doing evaluation before committing to a platform — hyrax, valkyrie, kithe, entirely homegrown, whatever.

Even if no-one else develops with kithe itself, we’d consider it a success if some of the ideas from kithe influence the larger samvera and digital collections/repository communities. You are welcome to copy-paste-modify code that looks useful (It’s MIT licensed, have at it!). And even just take API ideas or architectural concepts from our efforts, if they seem useful.

We do take seriously participating in and giving back to the larger community, and think trying a different approach, so we and others can see how it goes, is part of that. Along with taking the extra time to do it in public and write things up, like this. And we also want to maintain our mutually-beneficial ties to samvera and LAM technologist communities; even if we are using different architectures, we still have lots of use-cases and opportunities for sharing both knowledge and code in common.

Take a look?

If you are considering development of a non-Hyrax valkyrie-based app, and have the development team to support that — I believe you have the development team to support a kithe-based approach too.

I would be quite happy if anyone took a look, and happy to hear feedback and have conversations, regardless of whether you end up using the actual kithe code or not. Kithe is not 1.0, but there’s definitely enough there to check it out and get a sense of what developing with it might be like, and whether it seems technically sound to you. And I’ve taken some time to write some good “guide” overview docs, both for potential “onboarding” of future devs here, and to share with you all.

We have a staging server for our in-development app based on kithe; if you’d like a guest login so you can check it out, just ask and I can share one with you.

Our local app should also probably be pretty easy for you to get installed (with dependencies) from a git checkout, so you can just run it and see how it goes. See: https://github.com/sciencehistory/scihist_digicoll/

Hope to hear from you!

What “Just standard Rails” means to the University of Alberta libraries

I recently had a chance to speak with the development team at the University of Alberta about their development of their jupiter digital repository app (live, github).

UAlberta had a sufia 6 app in production that was a pretty stock “institutional repository” holding PDFs. Around Fall 2015, they started trying to “catch up” to sufia 7 with PCDM etc. — to get features they believed would make it easier to incorporate more ‘digital collections’ content, and to just avoid stale non-maintained dependencies.

In Summer 2017, after having spent almost two years trying to get on sufia 7, with mounting frustrations and still seeming far from the finish line — and after having hired a few non-library-archives-museum-experienced but experienced Rails developers — the University of Alberta libraries development team decided on a radical new approach. They decided it wasn’t clear what Sufia was giving them to justify the difficulty they were having with it. They started over, trying to keep things as close to “ordinary Rails” as possible.

At that time, Fedora still was an institutional requirement.  So they didn’t toss out all of the samvera stack. They decided that they’d chop off the trunk as close to the bottom as they could while still getting tools for working with fedora, and to them that meant a hydra-works dependency, but few other hyrax dependencies.  They basically started the app over.

Within about 6 months of that effort (Early spring 2018) with approximately two full-time developers, they were live with their app (jupiter repo), and have been happy with it so far. But they also still haven’t gotten to the originally planned content beyond the IR-type PDFs, the scanned monographs, newspapers, etc. And have had some developer turnover. (Hey, they’re currently hiring y’all).

The jupiter app implementation

My understanding of how their app works is based on just an hour conversation, plus a few hours spent looking at their source code and internal docs — I may get some things wrong!

Jupiter seems to me to be a pretty simple app, a fairly basic idea of an “institutional repository”. Most of the items are single PDFs, without children/members. The software does support some items being composed of a list of files — but no “child works”. The metadata seems relatively simple; some repeatable strings, but no nested/compound objects (i.e., an attribute whose values are multi-property objects themselves). While there is some self-deposit, there is no complicated workflow, basically just an edit form.

The permissions model is relatively simple. Matt Barnett, a lead developer for much of this process (who was there for our conversation, but left the team soon after that) told me that originally some internal stakeholders had been pushing for a more complex permissions model. But knowing what a tar-pit trying to implement ACLs could be, Matt pushed back, and they ultimately implemented a simple model: There are owners who can edit the things they own, and admins who can edit everything, and that’s about it.  By virtue of their campus SSO system, they got “shared accounts” for free, so people could log into a shared account when they needed to share edit privs.

They had been using hydra-derivatives for their simple derivative needs (maybe just a single thumbnail for each PDF?), but when ActiveStorage, part of Rails, was released, they began switching the code to that (it may or may not be merged into master/deployed in the repo yet as this gets published).

Fedora is still there, modeled with hydra-works.  The indexing to solr is whatever was built into hydra-works. They just wrote their own straightforward forms with simple_form.  They also do a lot of CSV-based ingest, which they just wrote code for, like even sufia users would I think.

They use UUID primary keys.

Their app does index to solr — using the general ActiveFedora indexing methods, I think, solrizer and all.  You can see that their indexer is pretty stock, it mostly just calls “super”.

All of their objects exist as ActiveRecord “draft” objects while they are being edited, through more or less ordinary Rails mechanisms. When they have multi-valued fields, they use postgres json arrays, rather than actual normalized schema (which would suggest a different table). I’m not sure what they need to do to get this to work with forms and controller updates. These active record objects seem to use something custom for collection memberships, rather than actual active record associations. So in these regards it’s not quite a totally ordinary activerecord modelling.

The objects have a life in activerecord, but are mirrored to fedora at certain life cycle points — I believe this is also what gets them into solr (using samvera/active-fedora solr indexing code).  The public-facing front-end is based entirely on data from solr — but not using Blacklight, simply writing Rails code to issue queries and handle responses to Solr (with Rsolr I think).

A brief overview of their architecture, by Matt before he left, focusing especially on the persistence stuff (since that’s the least “rails”-y and most custom stuff), can be found in their public repo, here.   Warning, it is pretty negative about samvera/sufia/active_fedora, gird yourself. You can see there they have done a couple custom local things to make the ActiveFedora objects and classes to some extent mimic ActiveRecord, to make using them in Rails easier, trying to encapsulate the fedora-specific stuff inside API boundaries. While at a high level this is what ActiveFedora’s goal is — their implementation is simpler, smaller, less powerful and custom-fit to what they need. We can just say they’re happier with how their local implementation turned out. They also explicitly wrote it to support potential future elimination of fedora altogether.

Matt said if he had to do it over, he might have pushed harder on stripping fedora out too, and just having everything in postgres. And that is something the team still plans to look at seriously for the future.

So what does “just a rails app” mean?  And how do you deal with increased complexity of your requirements?

The most useful thing for me in the conversation was that Matt pushed back on my outline of a potential plan, saying I was still undertaking too much abstraction.

The U Alberta team suggested that I should keep it even simpler, with less DRY abstraction (and thus fewer tools that could be shared between institutions/apps), and more just building your app for what you need right now. You can see some of this push-back, specifically in the context of what U Alberta needs, in another document he wrote before he left Alberta in the jupiter repo, on notes for future expansion. It is really worth reading, to see an argument for even more extreme simplicity, from a developer experienced with Rails but not “infected” with “how libraries do things”. But beware, it’s not shy about our community shibboleths.

We developers (and we library folks) are really drawn to the abstraction, generalization, and shared tools that meet as many needs as possible. It can sometimes lead us astray. It is actually very common advice in software engineering to stick to what you actually need today, for the app you are developing (you know your business/user needs and which are the highest priorities to business value, right?). “Do the simplest thing that could possibly work”, “You aren’t gonna need it.” It keeps us honest.

However, I also think it’s possible to code yourself into a corner this way, where your app was fine for exactly what you needed then, but when you need one more thing… you can find you need to re-write large parts of it to accommodate. In some ways this is my experience with the current samvera stack: early fundamental architectural decisions pen us in when we have new use cases. That kind of problem stays smaller when you avoid harder-to-change shared code, but I don’t think it goes away entirely. Trying to play for the future always entails some “YAGNI” risk, but the more domain knowledge and experience you have… maybe you can do better at seeing where you are going and planning for it?

Just some of the specific component plans Matt was skeptical of…

attr_json vs. Just Plain ActiveRecord schemas

The jupiter app already has an ActiveRecord implementation which isn’t strictly “ordinary” ActiveRecord, in the sense that they serialize multi-valued/repeatable fields to json arrays, rather than needing a separate table for each attribute as an actual normalized schema would require. Plus there’s the logic around collection and “community” membership, which I don’t entirely understand but think might not use ordinary AR associations.

So this already gets you away from the strict “ordinary Rails” path — I’m not sure how the JSON array fields are handled by form submission, or how querying would work if you wanted to do querying (it’s possible all their querying is solr-based, which is familiar in samvera-land, and also unusual for “ordinary rails”).
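
For illustration, here is roughly what that style of modeling looks like in a plain Rails app. This is just the general shape, not jupiter’s actual code; table and column names are hypothetical, and the containment-operator query at the end is one way you might query a jsonb array directly in postgres.

# migration: a jsonb column holding an array of strings (hypothetical names)
class AddAlternativeTitleToItems < ActiveRecord::Migration[5.2]
  def change
    add_column :items, :alternative_title, :jsonb, default: []
  end
end

class Item < ApplicationRecord
  # Rails casts the jsonb column to/from a plain Ruby array automatically
end

item = Item.create!(alternative_title: ["Moby Dick", "The Whale"])
item.alternative_title # => ["Moby Dick", "The Whale"]

# querying means dropping down to postgres jsonb operators, e.g. containment:
Item.where("alternative_title @> ?", ["The Whale"].to_json)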

At my institution, we already have the need for complex repeatable data: a simple example would be repeatable “inscription” notations, each of which has the text of the inscription and its location in the book. So not just an array of strings, but perhaps an array of hashes. Weiwei Shi (Digital Initiatives Applications Librarian) suggested in a follow-up message, “We can use the JSON data type to support a more complex data structure as needed” — that is, if I understand it, they are contemplating an actual postgres representation somewhat similar to what I am contemplating with attr_json, if they end up needing complex json. Matt’s second document tries to draw a line between how they are doing things in “more-or-less completely standard Rails models” and the way I was proposing to do things — I’m not sure I actually see such a great distinction; the representations in postgres seem pretty similar to me, and neither uses standard ActiveRecord patterns.

They do have each attribute in a separate column, whereas I was contemplating putting them all in a single json column. Their approach does have advantages for avoiding update race conditions (or needing optimistic locking to avoid them). I perhaps should consider that as an extra feature for attr_json. Although either way, you get columns you can’t necessarily use ordinary ActiveRecord querying or form-based update with.

Everyone seems to doubt that attr_json is really going to work, ha. The skepticism towards newly invented non-trivial dependencies is justified, but I can point out attr_json is a lot fewer lines of code than ActiveFedora, or even Valkyrie. I think it is a risk, but it’s scoped carefully and implemented robustly, and I can’t figure out any other way I’m confident would be simpler to actually meet our modeling needs. Once you start doing this complex json stuff, I think you’ll find that it doesn’t behave like “ordinary rails” — for forms/updates, validations, etc. — and rather than hack it out on a case-by-case basis, it makes a lot of sense to me to solve the problem with something like attr_json, encapsulating the “not really like ordinary ActiveRecord” stuff as much as possible.

The other option of course would be an actual normalized schema, with one table per attribute. For our “inscriptions” that table might have two columns (text and location); for a simple repeatable alternate title it might only have one. It’s going to be a mess to try to prevent n+1 queries and keep db access performant. I am encouraged that I’m not on an insane track by the fact that even the U Alberta team is using JSON serializations in postgres, not actually ordinary normalized data — I think as your data gets more complex (not just arrays of primitives, but things that need serialization as arrays of hashes), you’re really going to want something like attr_json. But maybe I’m wrong.
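
For contrast, a hypothetical sketch of the fully normalized version, with the model names purely illustrative. Every repeatable attribute becomes another table, another association, and another thing to remember to eager-load.

class Work < ApplicationRecord
  has_many :inscriptions
  has_many :alternative_titles
end

class Inscription < ApplicationRecord
  belongs_to :work
  # columns: text, location
end

class AlternativeTitle < ApplicationRecord
  belongs_to :work
  # columns: value
end

# every repeatable attribute you display needs eager-loading to avoid n+1 queries
works = Work.includes(:inscriptions, :alternative_titles).limit(20)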

And for better or worse, I have trouble giving up entirely on the idea of some shared tools to make things easier for others in the community too — because it’s fun and rewarding, and why should we all constantly re-invent the wheel? But it’s good to be reminded of the dangers that lie in that direction.

Associations

I’m not sure if Matt mentioned this specifically, but I realize I have added a lot of non-“basic ActiveRecord” complexity to the data modeling in my plan in order to support the PCDM-ish association modeling, where a work has “members” and the members can be either works themselves (which can have multiple members) or single file objects, and they all need to be kept in order.

U Alberta’s app doesn’t have that. A work can have a list of files, the end.

At my institution I actually spent a while trying to convince stakeholders that we didn’t need that either, but it was the one thing I could make no headway on — eventually they convinced me we did, to accomplish our actual business goals.

If you need this, I can’t figure out any way to get there in an “ActiveRecord-first”-ish direction, except either single-table inheritance or polymorphic associations, both of which are among the oddest and least reliable corners of ActiveRecord. Of the two, I think STI is probably the least weird and most likely to handle the standard use cases while minimizing the number of db queries. (Valkyrie’s approach is somewhat similar to single-table inheritance in how it uses the DB, but without actually using that AR feature).
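
To make that concrete, here is a hypothetical single-table-inheritance sketch (illustrative names, not a settled design): one members table with a type column for STI and an integer position column for ordering.

# table: members (id, type, parent_id, position, ...)
class Member < ApplicationRecord
  belongs_to :parent, class_name: "Work", optional: true
end

class Work < Member
  # a work's members can themselves be Works or Assets, kept in order
  has_many :members, -> { order(:position) }, foreign_key: :parent_id
end

class Asset < Member
  # represents a single ingested file; has no members of its own
end

Work.first.members # => ordered mix of Work and Asset instances, fetched in one query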

Shrine

Matt thought that shrine might do more than ActiveStorage now, but history shows things built into Rails will probably expand and get better. (Yes, but it’s unclear to me how to make audio or video “variants” or derivatives with ActiveStorage, which my place of work expects to need very shortly. If we are really ruthless about only what we need right now, are we going to have to just rewrite it as soon as we need another thing? There are no easy answers; “YAGNI” is simpler when it’s all about software you are writing yourself and not dependencies… but there are grey areas too.)

But I’m not certain about this, after trying to help shrine developers enhance the versions/derivatives functionality to better support some flexibility we need as to storage locations and point-in-time of creation. The answer may just be to write an app which adds on locally to do exactly what it needs (whether in terms of shrine or ActiveStorage), without trying to make a shareable general-purpose tool?

Blacklight

Matt was very suspicious of using Blacklight at all; he found that it was quite easy for them to just write the UI they needed based on Solr responses. And their app certainly is as good as many sufia/hyrax apps (it even has an actual search-within-the-collection feature on collection pages, which I think our sufia 7 app didn’t, although I think the latest hyrax does).

But remember my inability to entirely give up on the idea of a shareable toolkit? I really would like something that serves as “scaffolding” and gives you an app out of the box with basic file ingest, metadata editing, and search. And Blacklight is a way to do this. And I think my plan to segregate Blacklight from the rest of the app (it should be a dependency you can switch out), by immediately fetching records from postgres corresponding to solr search results, may be able to keep Blacklight from “infecting” the rest of the app with Blacklight-isms, as Matt was worried it could.
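
As a rough sketch of that segregation idea (model and column names here are hypothetical, not a committed design): take the ids from the solr documents a search returns, fetch the corresponding ActiveRecord models in one query preserving solr’s ordering, and hand only those models to the rest of the app.

# given solr_documents returned from a Blacklight search...
solr_ids = solr_documents.map(&:id)

# ...fetch the real records from postgres, preserving solr's result ordering
records_by_id = Work.where(friendlier_id: solr_ids).index_by(&:friendlier_id)
@works = solr_ids.map { |id| records_by_id[id] }.compact

# views and the rest of the app only ever see Work models, never solr/Blacklight objects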

How simple is simple?

It was useful to have Matt call my bluff to some extent: What I have been hypothetically proposing isn’t really “just plain rails”.  But it’s a lot closer than current samvera, or even valkyrie.

I think some of the valkyrites think valkyrie’s departures from “ordinary Rails” are a positive, that they can use different patterns to do better than Rails…  which is not an idea unique to them…  but I think it is a bit hubristic to think you can invent something better (and easier to onboard new developers with?) than Rails. (I also wonder if the valkyrites, if freed from the need to support fedora too, would simply use hanami?)

The same charge of hubris can be brought against my initial sketch of plans, though — it was useful to be challenged from the “left” of “you’re still not simple enough” by Matt. I am so used to thinking about my in-formation plans as a/the simple alternative to, well, samvera or even valkyrie… it was a refreshing and needed corrective to be talking to Matt, who thought my plans still involved too much abstraction, were not as simple as possible, and weren’t sticking close enough to implementing only what was needed for my business needs. On the one hand, it makes me worried he’s right; on the other, it makes me more comfortable to be in a nice middle ground of moderation, with people advocating things on both sides of me, both heavier-weight and lighter-weight: sharing more code with the LAM digital collections community on one side, and sharing basically none on the other.

Really, “just plain rails” or “just plain [any code]” is to some extent a mirage, or an aspiration. We’re always writing code when we build a Rails app.  We’re always using some dependencies. While there can be a false economy in trying to share all your code in hopes of minimizing the amount of code that has to be written in aggregate (it often doesn’t work out that way, because building good re-usable abstractions is hard) — there can also of course be a false economy in never using anyone else’s dependency, and “not invented here” syndrome. And if you’re writing it yourself, it’s the abstraction layers that can give you not-worth-it complexity, whether you keep them in the app or make them into a gem. But abstraction layers are also what allow us to do complex things that we can still comprehend as humans — when they work.

Software is still a craft. As Alberta needs to add additional features, with their aspirations to add a lot more digital-collections-like content — it’s going to take software craftsmanship to figure out how to keep it simple.  What I like about U Alberta’s approach is they realize this.  They realize they are an internal development shop, and need to let developers do what developers do — rather than have non-technical stakeholders making technical decisions for non-technical reasons.  (At one point someone said: After having been ‘burned’ before, they are very suspicious of using common/shared software, vs. just writing their app — which is part of their skepticism towards attr_json —  I think they’re not wrong).

One thing that letting an internal development shop excel entails is figuring out how to recruit and retain a solid development team on a limited budget, which is one reason Alberta is trying to be so ruthless about keeping it simple and “standard”.  One phrase I heard repeated was “industry-standard onboarding”, which I think also implies needing to be accessible to relatively “junior” new hires, which requires keeping your stack simple and standard. (That is, traditional-samvera or valkyrie-using institutions do not necessarily have any less challenge here and may have more, as for instance Steven Anderson of BPL argued.)

(But I wonder if on-boarding a new developer to an existing project that has a very small dev team is going to be challenging across the industry!  I am not convinced that “Where the Rails community has a diversity of opinions on an approach, we should prefer the approach espoused by the Rails core team” (from a Matt/Alberta manifesto) always and necessarily leads to the simplest code or the code easiest to on-board new developers with. Sometimes you can build a monster in the pursuit of not doing something novel… the irony, right? But it’s always worth considering the trade-offs.)

I definitely might end up re-orienting.  For instance, Matt reminded me of something I knew but tried to forget even when writing out my notes for a possible plan: a generalized permissions/ACL system is a craggy shore that many ships have crashed upon. Should I just write for my app the permissions we need instead? After doing some more business analysis to figure out what they are?  Perhaps. More broadly, if we end up trying to implement this “toolkit” and I’m running into trouble and worrying that our reach exceeded our grasp — retreating to just an app good enough for what we need right now is always a valid escape hatch.

U Alberta’s story, where they’ve been working on this app with a very different approach for over a year and so far are happy, is another good data point reminding us that dissatisfaction with the samvera stack is not new; institutions that have developers with wider Rails experience have especially been suspicious of the value propositions of fedora and samvera for some time. And that there are a variety of approaches being tried. We all need community to bounce our ideas off of and get feedback from; those of us who operate in 2-4 person development shops especially need more than we may get internally. I’m so glad they were willing to spend some time talking to me.  And I highly encourage reading all of Matt/U Alberta’s somewhat iconoclastic analysis docs, as one way of considering other perspectives.  I’m not sure if I can find the time, but I’d kind of like to “onboard” myself into their codebase, and understand how it works better as one example.

Thanks to the whole U Alberta team, and especially Peter Binkley, Weiwei Shi, and Matt Barnett, for spending time explaining to me what they were up to. Thanks to Peter and Weiwei for reviewing this post for any terrible errors.  All remaining mistakes and wrong opinions are my own.

On the present and future of samvera technical architectures

Here where I work, we have a digital collections app (live; source) based on sufia 7.4. This is not sustainable for the long term, as the community’s development efforts have largely moved from sufia to its replacement hyrax, and the latest version of Rails that sufia runs on is 5.0, which will eventually be end-of-lifed (exact schedule unknown).

Upgrading/migrating to hyrax would be the ‘obvious’ path, but it would take some significant work; we aren’t super happy with the sufia/hyrax architecture; and this turns out to be a time of some transition in the samvera community.

In figuring out what’s going on and identifying and evaluating available options, I’ve had to do quite a bit of research.  So I wanted to share my evaluation and analysis with the community, to hopefully help others understand the lay of the land — but also to explain why we are considering some new approaches. As I’ve been doing this, I have begun to develop opinions on how to move forward here, and I’m leaning towards a novel approach not based on existing community consensuses — I’ve done my best to present information objectively and identify the parts that are my own judgements/evaluations, but I’ll be up front about my present bias/leanings so you can judge for yourself or be cautious.

Also, while there has been recent exciting work on changing and improving governance processes and structures in Samvera, this post will focus only on the software products and technical architectures in the samvera community, “the stack”.

The Challenging Near Past/Present

I think it’s important to be clear about some of the challenges of the current software stack, to understand why people are going in some different directions, and how to evaluate those directions.

The current situation is that many, probably not all, but more than a few, people and teams working with sufia/hyrax and the samvera stack have found it very challenging in a variety of ways.  Here are some I know about, many from personal experience, that you may have seen me address in past blog posts too.

Performance can be a significant problem, at several different parts of the stack. Some I have encountered:

⇒ Saving a Work can take 10 or more seconds in our app. Perhaps only an inconvenience for a single work, but it can add up to be a real problem in higher-order functions that save multiple works, bulk ingests, or test suites. (It also increases the cost of logic that saves multiple times where one time could conceivably have worked, as I have encountered in the stack.)

⇒ So far in our attempts to make a feature to let you change a fileset into a child work (delete fileset, create work at the same position in the members list, with some attributes copied over), the operation can take five minutes to complete. We are in the midst of quite a bit of developer work to try to figure out what’s going on and whether we can improve it. This feature is taking several weeks to develop because of underlying stack complexity.

⇒ Our app with stock sufia code had several “n+1 query” problems, where on display a separate Solr query was being done for each item displayed (on a results page, or child items on a work detail page), making response time unacceptably slow. When using ActiveRecord this has well-understood and easy fixes (see the sketch after these examples), but with this stack it took some somewhat complex local hacking to fix.

⇒ Re-indexing (to solr) our corpus consisting of ~6400 GenericWorks and ~18500 FileSets can take from 3 hours to 9+ hours, depending on the nature of the indexing, even after some extensive optimization work. Comparing the 1.25 records/second best case to industry standards, it doesn’t look good.  For instance, indexing MARC to Solr using traject, people routinely get from a couple hundred to 1000+ records/s.
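
(For comparison, the standard ActiveRecord fix for that kind of n+1 on display is essentially a one-liner; model and association names here are illustrative.)

# without eager loading: one query per work to fetch its members (n+1)
Work.limit(20).each { |w| w.members.to_a }

# with eager loading: members are fetched in a single additional query
Work.includes(:members).limit(20).each { |w| w.members.to_a }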

Trying to customize or add features to a sufia/hyrax app can be quite complicated; some find they are spending as much or more time trying to figure out how to get it to integrate with shared stack code (without creating large forwards-compat problems on upgrades) as they spend on the actual ‘business logic’.

⇒ This isn’t really about asking for more features to be built-in/configurable in Sufia/Hyrax. No matter how much is built in, our use cases vary enough that people will always want to be changing things in local ways or adding custom local features, and sufia/hyrax and the rest of the stack have always been meant to support this.

Some organizations have tried but had problems attracting or retaining Rails developers (with Rails experience but without library/samvera experience).  These developers can find the samvera stack unnecessarily complex considering the problems it solves.

The cost of keeping your app up to date with new versions of stack dependencies can be great enough that many institutions wind up staying on old versions of shared dependencies.  My attempts at analyzing this appear to show a pretty big spread among sufia/hyrax and other dependency versions in repos “in the wild”.  (Here where I am, we are on sufia 7.4 — after valiantly trying to stay up to date, we decided we had to stick there to meet our launch deadlines).

ActiveFedora was intended to be a kind of port of ActiveRecord, with a close-to-api-compatible modelling/persistence layer (not including querying).  But ActiveRecord is an incredibly complicated stack with literally years of developer time put into it, and is constantly evolving itself. What we’ve ended up with in AF has been found by many to be unreliable, with unpredictable performance characteristics and general behavior, very difficult to figure out how to use ‘correctly’, and with a very complex architecture that is hard to debug.

Parts of the stack, especially in sufia/hyrax, often seem not as mature as expected; there are bugs in features one thought were long-standing in the app; there isn’t necessarily a clear and accurate shared understanding about what things are present in the code already, what things need more work, or what is likely to have lots of edge-case bugs. This may be because of the several major refactorings the sufia/hyrax codebase has been through (fedora 3 to 4; an institutional-repo-focused app to something more general; sufia to hyrax; etc). (It should be noted that the documentation working group is working on building out better recorded shared understanding of features.)

When thinking about this, I often go back to Richard Schneeman’s post on “polish” in software:

I’ve previously called these types of moments papercuts. They’re not life threatening and may not even be mission critical but they are much more painful than they should be. Often these issues force you to stop what you’re doing and either investigate the root cause of the rogue behavior or at bare minimum abandon your thought process and try something new.

When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent.

My experience  building an app to meet local needs using the samvera stack has often been at the other end of this continuum — near constant “papercuts”, sharp edges, frustrations, and “yak-shaving” investigations of the root causes of some unexpected behavior. My experience is that the software often does not do what I expect, or behave consistently.

I think sometimes when I discuss these issues, non-engineers think I’m just talking about programmers’ personal experience/emotions, that the code isn’t “fun” to work with. Now, I do think the affective result on your programmers’ day-to-day matters, how your programmers feel — burn-out is a real thing — but certainly no more than the pleasantness and efficacy of day-to-day work for all other non-programmer staff too; and we don’t expect it all to be “fun”, that’s why it’s a job.

But the reason this matters to your organization isn’t primarily because of how it makes programmers feel. It’s because all of the foregoing significantly increases the cost of launching and maintaining your software. Organizations find it takes much longer, or many more engineers, than expected to get to first launch. Adding what even the engineers might have expected would be a fairly simple feature can take order(s) of magnitude more time than expected. Bugs can appear which are enormously time-consuming to track down and fix, if they can feasibly be fixed at all by your engineers. In addition to cost/schedule, this can also simply affect your chances and levels of successfully meeting your business needs, both in initial launch and ongoing maintenance and development.

And when making technical choices, that’s what matters to an organization above all else — meeting business needs as efficiently and cost-effectively as possible (including staff time and number of staff; staff is the biggest cost for most of us).  And to many, it wasn’t clear that current directions were getting them there.  Building and maintaining a samvera-stack based app that met local business needs well has seemed to some very expensive.

These are not observations unique to me, there has been a growing recognition of these challenges in the samvera development community. It has led to new samvera processes working to improve the situation gradually and steadily (for instance, the “Component Maintenance Working Group”, the Hyrax maintenance working group and the “Road Map Interest Group”); but has also led others to think it’s time to explore new architectural approaches and more drastic software changes.

Valkyrie: A new approach

Princeton University Libraries had an app called plum supporting their digital collections. It was:

  • A hydra app based on curation_concerns and some fairly old hydra dependency versions (not sufia/hyrax).
  • Staff-only editing/workflow. No self-deposit.
  • Used for metadata/asset management (with fedora 4 back-end), had no public interface of its own — (meta)data was vended to other public-facing app(s).

As outlined in two blog posts on a PUL Systems blog, they ran into some pretty severe performance problems. They spent significant development effort improving performance, both locally and in PRs back to hyrax and the stack.

In a presentation at Samvera Virtual Connect 2018, Esmé Cowles (presentation begins at 40:00 in the video) said Princeton’s eventual direction (valkyrie) was motivated “not just because of performance problems, but because while we were working on those problems, we were constantly butting up against the complexity of the stack… That complexity was impeding us doing what we wanted to do to work on performance.”

While frustration with performance or legibility of the inherited architecture was not new to Princeton or others, Princeton reached a point where they decided they had basically no choice but to depart from the inherited architecture if they wanted to achieve their business goals; the “inherited” stack was simply not tenable for their needs. Additionally, as the performance problems were centered on Fedora (as well as the ActiveFedora architecture), they decided the best path was to move away from Fedora as the persistent store, and towards using the postgres rdbms.

We could imagine responding to that by writing either a bespoke local app or a shared toolkit/framework simply based on postgres. But Princeton really prioritized not separating from the samvera community, and based on that, decided instead to build a persistence abstraction that would allow the developer to switch between multiple back-ends (mainly targeting fedora or postgres, both likely in concert with solr), using the same class/method-level APIs for both.

That is what valkyrie is. It is just a modeling/persistence layer.  As far as what it does, valkyrie could be roughly compared to ActiveFedora or ActiveRecord.  It is not a “solution bundle”. It pretty much only addresses API(s) for modelling metadata and saving those models, whether to fedora, to postgres, or to other hypothetical future back-ends.  The rest of the business logic in a digital collections or institutional repository application would come from somewhere other than valkyrie, whether shared gems or local code.
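
To give a flavor of the API, here is a minimal sketch based on my reading of valkyrie’s README; treat the details as approximate, and note it assumes an adapter has been registered under the name :postgres in an initializer.

# define a model as a Valkyrie::Resource rather than an ActiveRecord model
class Book < Valkyrie::Resource
  attribute :title, Valkyrie::Types::Set
end

# an adapter (postgres, fedora, in-memory...) provides a persister and a query service
adapter = Valkyrie::MetadataAdapter.find(:postgres)

book = adapter.persister.save(resource: Book.new(title: ["Moby Dick"]))
adapter.query_service.find_by(id: book.id)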

Princeton proposed an official hydra/samvera working group to work on valkyrie, and got significant interest from other developers active in samvera community. valkyrie became a samvera community project, and as I write this is housed in the samvera-labs grouping.

Valkyrie uses a “Repository/Data Mapper” architecture that is different in some ways from Rails’ ActiveRecord design, and seems to be inspired by Hanami’s repository/data mapper implementation.  Valkyrie also uses some of the dry-rb libraries that hanami uses, and it requires the reform form object library, generally in the form of the ChangeSet reform sub-class specialization.

In building out the main modelling and persistence abstraction to meet planned use cases, other particular-to-valkyrie abstractions were required, like ChangeSets (I don’t entirely understand them, but I think someone building an app based on valkyrie is going to have to), and others that may normally stay “below the hood”, like OptimisticLockToken.

Valkyrie is not fundamentally based on linked data/RDF; its models are not defined based on linked data. The valkyrie fedora metadata adapter requires a mapping from model attributes to RDF predicates so models can be serialized to fedora; other external RDF serializations would require similar mappings.

valkyrie “bespoke” apps

Princeton is live with figgy, their plum-replacement app based on valkyrie. figgy kind of served as a ‘demonstration/proof-of-concept app’ throughout valkyrie development, and still serves that role to some extent, as I believe the only valkyrie-based app in production, and the one by the same group of developers most central to valkyrie development.

Figgy is a rewrite of plum to meet same basic usage parameters. It is not technically a git fork/branch of plum, but some business logic was ported from plum.

Figgy does not use a samvera “solution bundle” (such as hyrax). It uses only a few existing samvera-community dependencies as component building blocks where it makes sense (mainly hydra-editor and hydra-derivatives, see their Gemfile.lock). Existing pre-valkyrie components that can be used with a valkyrie-based app will generally be de-coupled enough that they can also be easily swapped out if the need ever arises. (Personally, my experience with hydra-derivatives for my own local needs would not lead me to follow their lead in using hydra-derivatives! But perhaps porting hydra-derivatives-using code from plum to figgy made sense as a starting point.)

Figgy then has a lot of local/bespoke architecture custom-fitted for its business needs, sometimes based on existing general Rails dependencies. One major example is custom local workflow logic/architecture based on the popular aasm (“acts as state machine”) gem.  It also vends changes to the other apps serving as front-ends using a RabbitMQ-based eventing system, also more-or-less locally designed.

The other known valkyrie app in development is Penn State Library’s cho. I know less about cho, but my understanding is that it is not yet in production, and takes some very original/innovative architectural approaches — it is largely based on ingesting and editing via CSVs (rather than interactive web-based GUIs), including being able to dynamically define metadata schemas based on CSV.  Cho seems to use few existing samvera components on top of valkyrie, perhaps even fewer than figgy; mainly hydra-characterization.

Where is valkyrie at now

Valkyrie has been under development for around 2 years, and probably has hundreds of developer-hours of work in it. As I write this, a 1.2.0 version is about to be released.  While valkyrie is already being used in production by princeton’s figgy, features that some might expect, need, or want for generalized use are still being developed on an ongoing basis. The 1.2.0 release (as I write this, still in pre-release) adds some significant features, including: the ability to store single values (rather than arrays of values) in properties; optimistic locking; and guaranteed persistently-ordered values (the first value in a list stays the first value in the list).

To some extent, as is common for open source, features are added to valkyrie when apps using valkyrie need them and the developers of those apps spend the time to add them to valkyrie.  While the valkyrie team is thinking to the future and trying to generalize for others, right now it’s primarily the needs of figgy and cho driving prioritization.  For instance, an Issue to suggest providing a generalized solution in valkyrie to “n+1 query” problems (a problem pretty central to my experience and concerns, as discussed above, but maybe not initially figgy or cho) was recently created, after it came up in figgy development.

If you need something that is conceptually part of the modelling/persistence layer but isn’t really built into valkyrie, you often still have an option to add it, which often involves going “under the hood” and adding custom logic to the valkyrie adapters, or custom queries.  So you may have to reckon with architectural components/complexity ‘under the hood’ to meet such needs; it likely also means that you’d have to re-implement if you switched storage layers (from fedora to postgres or vice versa).

For instance, at present if you wanted values to be indexed to solr as a numeric type instead of string/text (so it could be sorted or range-facetted in solr), Trey Pendragon told me “you’d need to add a custom indexer to the solr adapter.” One should probably be cautious of assuming what features or use-case-supports are or aren’t already built out in valkyrie (like any other relatively complex dependency still reaching for maturity).

You can watch what things are being considered and prioritized for future valkyrie development in the open valkyrie waffle board.

Milestones in valkyrie and figgy history

Some personal analysis and evaluation — Valkyrie

Princeton and others investing in Valkyrie began from the requirement of being able to provide a stable consistent API on top of underlying data that could be stored either in Fedora or Postgres.

If you start from that requirement, the Valkyrie architecture is exactly where you are reasonably going to end up; it is an appropriate way of approaching that requirement (which typical Rails apps are not capable of fulfilling).

However (in my own opinion/experience/evaluation, like everything in this section), there is a significant cost to building the abstractions to make that possible. Every abstraction has a cost: in implementation, ongoing maintenance, and cognitive burden and ongoing work of developers using the abstraction and integrating it with others.  Building successful (efficient, polished, good TCO) abstractions as shared code between apps with diverse needs can be especially challenging.

Valkyrie is a fairly significant abstraction.  Its development necessarily involves significant work to figure out the right APIs and create implementations for features that, if you were simply assuming an rdbms (or even postgres specifically) and using ActiveRecord, might just already be there. In addition to the basic mechanics of persistence, also: ordered values; optimistic locking; associations, joins and eager-loading to handle n+1 queries.  Or Rails’ recommended “Russian-Doll Caching” with automatic touching of parents.  In ActiveRecord, these are not just already there, but very mature, with well-understood community knowledge about strengths, weaknesses, work-arounds, and best-practice usage patterns. Whereas all of these things, if they are to be used, need to be designed and implemented (and iterated to maturity and polish) in valkyrie — and with the increased challenge of making them work well for each of the persistence back-ends valkyrie intends to support.

Whether these costs are worth it depends on how important or beneficial the foundational requirement is, as well as how well the abstractions meet developer use cases. In part, this is hard to be absolutely sure about in advance — both the ultimate benefits and the ultimate costs can to some extent only be estimated/predicted and not known with certainty in advance of the actual development and community use.

Will valkyrie actually result in shared codebases/dependencies between postgres-using and fedora-using applications in our community?  At this point, there are not many examples already existing, it’s more a design goal for the future. I think it’s hard to know to what extent this prediction of the future will pan out.

How one evaluates the value proposition of the valkyrie effort also depends on the value one places on the base goal/requirement of supporting both fedora and postgres as persistence back-ends. It may be worth asking in what circumstances does fedora actually make sense, and how widespread are these circumstances?  I believe few (none?) of the current developers/institutions investing in Valkyrie are actually planning on using fedora, or missing it.   The requirement to support the possibility of back-end agnosticism may be less about the technical needs of anyone investing in valkyrie, and more about the social/political situation in our community, which has always used fedora before, and where simply moving to a non-fedora solution seemed originally too big a jump to be comprehensible as staying within the community.

⇒ (While there was some initial hope that the performance problems could be significantly improved while still using fedora by using valkyrie with, say, a non-active-fedora-based means of accessing fedora — so far only relatively minor improvements have been seen via this route, not enough to resolve the performance issues that led to valkyrie. It’s possible future implementations of the fedora APIs, whether from the fcrepo implementation or other, will do differently; predicting the future is always a gamble).

The valkyrie enthusiasts have been wisely careful not to express any judgement over the commitments of other institutions to fedora (we each have different business needs) — however, many of us beyond valkyrie have been increasingly questioning what value fedora brings us at what costs for some time, and I think it’s worth considering in exactly what conditions using fedora actually makes sense, and how common these conditions are.

If the eventual result is that most/all codebases using Valkyrie are using postgres rather than fedora — and I think that’s a real possibility — that is a significant cost we paid in development to make other things possible, and a significant ongoing cost we’ll continue to bear in maintaining and developing against the abstractions that were designed for that. (Or in a subsequent costly switch to not using them).

Esmé suggests that another benefit of valkyrie can be in hiring/retaining/onboarding developers, especially Rails developers from outside our development community, and that “following the patterns those developers know makes it easier to hire Rails developers and have them be productive and happy, (instead of being frustrated by ActiveFedora and Fedora more broadly).”

I share that concern and goal, but it is not totally clear to me how much valkyrie achieves there — choosing to write to the Valkyrie API instead of ActiveRecord arguably takes us substantially outside of patterns that Rails developers know. While it seems safe to believe it will result in some level of improvement over the previous stack, when I look at figgy code I am not that confident in predicting to what extent a figgy-style app will be legible to the typical Rails developer, or escape what we could call the “weird custom community architecture” problem.

For myself, it’s not clear that the costs of developing (and developing against) the valkyrie abstraction will bear benefits justifying it. Might there be other ways to meet our individual as well as shared business/domain needs more efficiently in total-cost-of-development-and-ownership?  Might there be other ways for different teams to try different things while staying part of a community of practice?  Is the valkyrie approach actually necessary or sufficient for allowing people using different back-ends (or other architectures) to share other domain logic?

It is hard to answer these questions with surety, they rely on estimations and predictions of future events and comparing hypothetical counter-factuals. But based on an experience of challenges from complex and newer/less-mature architectures, I’m interested in trying to be ruthless about minimizing the number and complexity of abstractions/architectures, trying to find the simplest architecture possible to optimize our efficiency/productivity. “as simple as possible, but no simpler.” A significant abstraction layer to make possible both fedora and postgres does not excite me, when that’s not a requirement I think important for our local business needs.

However, one thing that is worth commenting on is that I am actually totally happy with the valkyrites demonstrating the viability and sense of writing a “bespoke” app (which can still be based, where possible, on shared components), instead of trying to use a pre-built application/”solution bundle” that you customize according to its customization points.  Providing the latter in a high-quality, mature, efficiency-increasing way is hard — especially when the developer community has diverse needs — and I personally suspect that a much wider swath of business cases will be better served by the ‘component’ approach than has often been recognized in our community.

I suspect that using the hydra/samvera stack has almost always required local programming expertise; it has never (for most institutions) provided a “shrinkwrap” install-and-go experience. I appreciate the “bespoke” valkyrie apps openly trying to demonstrate that, at least in some cases, an explicit and acknowledged component-based put-it-together-yourself approach may be more productive (as Esmé in particular has pointed out).

The two current real-world valkyrie demonstration apps actually differ from what I see as the recent “consensus path” in two significant ways: in the valkyrie persistence layer, and in explicitly embracing an “assemble-components” approach instead of a “customize-pre-built-solution” approach.

A Hyrax based on Valkyrie?

Okay, so we talked about valkyrie, and “bespoke” apps using valkyrie — what about the idea of refactoring/enhancing Hyrax to be based on valkyrie?

It is my impression that those initiating valkyrie, via a samvera working group, hoped this would be the ultimate outcome, believing it was important for keeping those with “bespoke” valkyrie-based apps connected to and participating in the wider community — as well as a contribution the valkyrie effort could make to institutions wanting to stay on hyrax but with persistence-layer flexibility.

As the valkyrie working group work wrapped up, even before the “final report” was released actually, there seemed to be general community consensus on this, and I believe a community decision was made to commit to this, although I’m not certain.

Work to switch hyrax over to valkyrie was begun, and significant development-hours were put into it. At one point it was believed there would be a hyrax version 3.0 based on valkyrie released around May 2018.

However, that phase of effort didn’t reach the goal line (a release of hyrax based on valkyrie, or even a merge into master) before work mostly halted. I believe the valkyrie branch in the hyrax repo has the product of that work — the last commit there is from March 6, 2018. I think it’s very hard to estimate how much work was remaining on that branch to get to a release (most of us have experienced the phenomenon where the “last 5%” can become more like half of total development effort).   Some of the developers who were primarily involved in that work seem, at least for the moment, to be spending less development time on hyrax generally; and as other hyrax development has continued, that branch would need to be reconciled with current master.

Since that work, Tom Johnson (@no-reply) has taken over as formal “technical lead” of hyrax, meaning technical architect in this case.

I asked on slack about the current thinking on future Hyrax and valkyrie. Tom provided some info on his current plans and thinking in messages in the #hyrax channel of the samvera slack, dated August 13 2018 12:22PM and 12:34PM (eastern). (Alas, we may have no slack archives).

– Moving away from `ActiveFedora` and toward a backend-agnostic persistence technology is viewed as on the critical path for Hyrax’s success

– The community’s capacity to maintain `ActiveFedora` is quickly diminishing, in part because the software is challenging to maintain and in part because the development personnel best equipped to maintain it have shifted to other projects (including but not limited to Valkyrie)

– Valkyrie is the presumptive replacement; this is the case largely because of key community members succeeding at delivering (and generally being happy developing) applications based on it.

– We are committed to making this transition without it looking like a stop-the-world-and-rewrite-the-application affair for existing adopters.

That is (this interpretation/re-wording also based on further discussion in slack channel and PMs), some kind of work to make Hyrax have a backend-agnostic persistence layer is in the plans, and it is presumed this will involve Valkyrie.

But it will likely not involve simply refactoring Hyrax to use valkyrie instead of ActiveFedora, which was the original valkyrie branch approach. Tom is committed to making future Hyrax releases less disruptive to existing adopters, and that original approach would be the kind of “stop the world” rewrite, involving significant backwards incompatibilities, that has been disruptive in the past.  It probably will involve re-using/porting/copy-pasting code (as well as ideas) from that existing valkyrie branch, but probably will not be based on that branch in the repo.

Instead, there will probably (these are Tom’s current thoughts, not official plans) be a first step to create an architecture within Hyrax “that is open to Valkyrie, but ships using active fedora by default”.  Then a period of “getting an advanced guard trying to build apps based on this [which] can and should provide a lot of useful information about how platform support needs to work.”  Then later, “a transition to valkyrie-by-default and removing AF would then be based on what we learn and demand[s] from adopters.”

Tom plans to share some of these road-map-recommendations on this at Samvera Connect in October, at which point some of this will presumably start becoming somewhat more formalized and official as plans.

I think it’s very hard to predict calendar timelines for all this. If you were waiting for the end-point, a hyrax version that just uses valkyrie (and allows postgres as a backend thusly) out-of-the-box, supported/documented/tested… I personally would predict it could be 1-2 years, but others may have much more optimistic estimates; one answer is just that it’s very difficult to predict at this point, depending on so much including what developers/institutions are interested in contributing to what extent.  We can be sure it won’t be May 2018.  :)

Note well the current Valkyrie fedora adapter does not store things in fedora in a way compatible with current hyrax/sufia modelling. Either a new adapter (with some challenges) needs to be created, or there would have to be data migration.

Some personal analysis and evaluation

I totally understand and support @no-reply’s priority to keep Hyrax stable, with changes being iterative and backwards-compatible, no “stop the world” changes — this is one of the biggest challenges facing Hyrax users I think, and it makes sense to prioritize it.

And I understand how that leads to his current thinking on how to approach valkyrie — by providing the architecture where valkyrie can be optionally switched in as a simultaneous alternative to what’s already there, which for at least a time remains there.

But this leads to a kind of ironic/counter-intuitive outcome.  Valkyrie is already an abstraction layer for swappable persistence back-ends.  For reasons that are indeed sensible in the overall hyrax context, we’ve arrived at a proposal to add more architecture (more abstraction) on top, to make valkyrie itself swappable in or out (at the point you swap it in, you can then use it to swap actual back-ends). A persistence abstraction API to let us use another persistence abstraction API beneath it.

Abstraction has costs, including to the legibility of the codebase.  One might wonder: if you’re going to put in this extra hyrax-specific persistence-swappability architecture anyway, does it even make sense to swap to valkyrie as the happy-path supported option, or should you swap directly to postgres instead and skip valkyrie?  There might be various reasons it really does make sense to go through valkyrie — but it’s got a cost.

So in evaluating hyrax-on-valkyrie, I think we start out with all the pros and cons outlined in the valkyrie analysis section above.

On top of that we have the pros and cons of hyrax itself. How you’ll evaluate those depends on your experience with, or otherwise your evaluation of, hyrax generally. There will be significant advantages for people who have found that hyrax has features they need, and using them via hyrax (including any customization) has worked out well and seemed like an efficient path compared to alternatives — and who want to switch to a postgres-based persistence store.

I have not had a great experience with sufia. I’ve found it very challenging to deal with the existing architecture when implementing the customizations and features we need. When I’ve tried to look at what has changed in hyrax, I don’t expect significant improvements for my business cases. On the other hand, there has been code added which increases architectural complexity for me without giving me features relevant to my needs (adminsets, nested collections).   Of course hyrax will continue to improve — especially under Tom’s excellent technical leadership, which I have a lot of faith in.  But the community’s historic valuing of new features over architectural rehabilitation comes from structural pressures that will have to be resisted or changed. And even within the realm of architecture rehab, an investment in hyrax-on-valkyrie — while it might be a totally reasonable and appropriate priority — is development hours not spent on improving the other parts of hyrax architecture that have gotten in my way and lowered our efficiency (raised TCO) in running sufia, and it may have to temporarily increase architectural complexity/number of abstractions.

I am concerned that hyrax may have painted itself into a corner where it could be quite a while until the problems with fundamental architectural aspects of hyrax that I have run into get better; a while until the app’s architecture becomes more legible, with the minimal amount of abstraction/architecture needed for its goals, instead of more complex, with more layers of abstraction as a bridge to get there.  Doing this in a way that minimizes upgrade pain will make it take even longer/more effort, but not doing that isn’t desirable/feasible either; I believe Tom is making the right decision to prioritize minimizing upgrade/backwards-incompat pain in hyrax.

But my experiences with sufia have not been positive enough to excite me about trying to upgrade my app to present-day hyrax, or about a hyrax based on valkyrie or postgres but otherwise largely similar to, and backwards-compatible with, the current hyrax release. If you take out the persistence parts that are proposed to change, and the business-logic components where I have had a lot of trouble using them to meet my local needs — I’m not sure how much of hyrax is left. From my experience, I am not enthused about investing lots more in hyrax (whether that’s contributing to the shared codebase, or work on upgrading-or-rewriting our app from sufia 7.4 to a recent hyrax version and continuing to maintain it). I’d be more excited about trying to find a more efficient way to invest development time that could ultimately get us to a happy place quicker — both in terms of our local app, and shareable components.

What if there’s another way? (my “fantasy plan”)

Let’s say valkyrie (and apps and architecture built from it) starts from the basic non-negotiable requirement: Allow code using fedora or postgres as a persistence back-end to use the same persistence APIs; and then adds on some subsidiary goals, including sticking closer to common Rails patterns where possible.

What if instead we started from the basic requirement: Stick as close to standard Rails patterns as possible, with as few and as simple additional abstractions as we can; as simple as we can while still not requiring re-invention of the wheel in digital collections use cases?

How would we do this, what would it look like? Like valkyrie, we’ll start from modelling/persistence.

We could consider really just putting all our metadata in a standard normalized database schema. But that’s going to result in some really complex and challenging-to-work-with rdbms schemas, for the kinds of metadata schemas we use; for instance, with frequently repeatable fields, and apps that need to handle multiple “types” of objects in the same app.

Let’s rule that out.  What’s a next step up in complexity, still straying as little from standard Rails as possible, with as few and as simple new abstractions as possible? Is there a way where we still use ActiveRecord, but we aren’t required to create normalized rdbms schemas for our complex/various/evolving metadata schemas?

Recently some rdbms’s have developed features to allow “schemaless” use by storing json in a column. Really, you could always have used an rdbms this way by serializing complex structured data to text columns, but the additional features, especially in postgres, make this even more attractive.  (Although keep in mind that our legacy fedora-based architecture basically stores complex data as a blob without indexing or querying capabilities either; that is part of what makes it challenging to work with.)

While ActiveRecord has basic support for storing arbitrary json-able hashes in MySQL or postgres json columns, the individual data elements aren’t really “first-class” objects and are missing many standard AR modelling/persistence features.

I developed the attr_json gem as an experiment to see if I could add more seamless support for many of the standard AR model features, for individual attributes serialized to json(b), sticking as close to how AR normally works as possible. attr_json supports typing, complex/nested objects, standard Rails-style form support, dirty-tracking, and some limited postgres-jsonb query support. This allows you to use many standard Rails patterns and approaches with individual attributes serialized to json in the rdbms.
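
A minimal sketch of the kind of model this enables; the attribute names are just examples, and everything below is stored in a single jsonb column on the works table (json_attributes by default).

class Inscription
  include AttrJson::Model
  attr_json :text, :string
  attr_json :location, :string
end

class Work < ApplicationRecord
  include AttrJson::Record
  include AttrJson::Record::QueryScopes

  attr_json :title, :string
  attr_json :alternative_title, :string, array: true
  attr_json :inscriptions, Inscription.to_type, array: true
end

work = Work.create!(
  title: "Adventures",
  inscriptions: [Inscription.new(text: "to my friend", location: "inside cover")]
)

# limited jsonb-based querying
Work.jsonb_contains(title: "Adventures").first == work # => true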

attr_json has received some limited attention from the wider rails community. A handful of Rails developers have communicated with me in github issues, one or two are already using it in production, and it has 32 ‘stars‘ and 5 watchers on github, almost all apparently from developers not from the LAM/samvera community.  While I’d like even more attention and collaboration, this is still encouraging, and all reviews so far have been very positive.

What if we were to try to build out a developer’s toolkit for creating digital collections/repository applications, beginning from attr_json (in ActiveRecord and probably requiring postgres) as the central modelling/persistence layer, but not stopping there, trying to launch with an expanded toolkit addressing other app and business needs too?

This is what I’ve been calling my “fantasy plan”.  I think it could provide a way to build digital collections/repo apps with a better developer experience and overall lower TCO (both in building out the toolkit and building apps based on it) than other options. Of course, success isn’t guaranteed and it also has risks. This is not a plan I or my institution are committed to at this point, but we are considering it.

In addition to modelling/persistence, the next most core area of functionality in our domain, I’d suggest, is handling bytestreams/digital assets. Both originals and derivatives. My fantasy-plan developer’s toolkit would be based on shrine for this area — possibly with custom shrine plugins.  shrine’s own goal is to be a toolkit for file/attachment handling, giving you components and primitives you can assemble into exactly what you need, which leads me to judge it well-suited for use when flexibility around how to handle bytestreams/assets (including storage platforms) is so core to our domain requirements.
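
A hypothetical sketch of what the shrine side might look like; the storage choices, class names, and column are illustrative only, and any custom plugins would layer on top of this.

require "shrine"
require "shrine/storage/file_system"

Shrine.storages = {
  cache: Shrine::Storage::FileSystem.new("public", prefix: "uploads/cache"),
  store: Shrine::Storage::FileSystem.new("public", prefix: "uploads/store")
}
Shrine.plugin :activerecord

class AssetUploader < Shrine
  # custom plugins for derivatives, fixity checksums, etc. would go here
end

class Asset < ApplicationRecord
  # expects a `file_data` text/jsonb column; adds #file, #file=, #file_url, etc.
  include AssetUploader::Attachment.new(:file)
end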

I have more ideas about building out this “developer’s toolkit”, and analysis of the potential benefits and risks of this approach, but I’ll save them for a follow-up post focusing on this possible plan specifically. 

But is this Samvera? The spreading out of the community

I think we are at a moment where, like it or not, different institutions are trying different things.

Even just within the new “based on valkyrie” approach (which people are valiantly trying to make a community consensus), we have both “bespoke” apps and the potential future possibility of “solution bundles”.

There is experimentation and divergent planning going on apart from this too.

Christina Harlow of Stanford recently presented at ELAG in Prague on Stanford’s current planning to re-architect their digital collection/repository system(s) in a project called TACO. (slides; video; See 8:35 in video for some brief evaluation of hyrax for their needs).  If I understand the current plans (and I may not!) they are planning an architecture which is substantially written in Go (not rails or even ruby); which does not involve Fedora; which is not based on RDF/linked data at the basic persistence level; and I think which is not planned to involve samvera shared code, although it may involve Blacklight.   Stanford clearly has a much more complex environment  than many of us, requiring a fairly complex architecture to keep it more sane than it had become — although they were running into some of the same problems of architectural legibility and performance discussed above, just at a larger scale (“scale” more in terms of diverse and complex business requirements and collections than necessarily scale of documents or users/use). [update September 5 2018, more info/documentation on Stanford’s approach is being made available here.]

In 2016 at Hydra Connect, Steven Anderson, then of Boston Public Library, gave an 8-minute lightning talk called “I love you fedora, but it’s over”, about their plans to move to a non-fedora, non-samvera, stick-close-to-Rails kind of architecture. (slides; video).  He mentioned experiencing some of the same problems with architectural legibility and performance with the existing stack that we’ve discussed previously, and arrived at a plan similar in some ways to my “fantasy plan” above. So there have been rumblings on this for a while now — I hadn’t seen this presentation until now, but feel a real affinity with it.  Steven left BPL shortly after this talk though, and Eben English (who is still at BPL) tells me the plans basically stalled out at that point. BPL is currently still using their previously existing app based on active-fedora 8.0 and fedora 3.8 (no sufia), and is awaiting some additional hiring before determining future plans.

In one sense, the samvera community has for years been less homogenous than our aspirations or self-images. Actual samvera-based apps in production have become very spread out amongst various versions of various samvera gems seen as consensus path at various times in samvera history: just hydra-head and active-fedora, curation_concerns, sufia, hyrax, etc., all at various recent and historical versions (and both fedora 3 and fedora 4). (Plus other branches of effort with varying levels of code-sharing and varying possible futures, like Avalon and Hyku).

There does seem to be a recent increase in heterogeneity of plans though. What does this mean for Samvera? Samvera (née hydra) has always been described as a community, not a users’ group.  (Perhaps a community of practice?).   We are a community of people working on similar problems; sharing knowledge; developing and sharing techniques; developing shared understandings of options and patterns available, and best practices; and looking for opportunities to collaborate on development and use of shared software.

To be sure, if people go in different technical/software directions, it makes this more challenging, but it doesn’t make it impossible; we don’t all need to be using the same software to be such a community (and even just all using Rails is actually a significant opportunity for code-sharing).  One of the things I missed most in my year outside of library world, in a more for-profit world, was the way that here in non-profit library-archive-museum-land, our peers are collaborators, not competitors.  And I think most of the individuals and institutions involved in the community don’t want to lose this, and want to find ways to maintain our community ties and benefits even if our software becomes more heterogenous. I think we can do it. We are a community, not a users’ group.

In some ways, I think the increase in software diversity in our community indicates some good things going on. Some institutions are realizing that the current stack wasn’t working well for them, and going back to “first principles” of technical decision-making — in being clear about their local business needs/requirements, and analyzing the path most likely to meet those well and efficiently. And diverse investigations of approaches will give our community more information and knowledge.

Personally, I think samvera community efforts have been hampered by people/institutions making technical plans influenced in part by what they think other people want, what they think “everyone else” is doing, or occasionally even where grant money is available.  The “self-interest” in “enlightened self-interest” sometimes got short shrift.  (To be clear, this is just one factor among many. No matter what, creating a shared codebase in this kind of domain is hard and comes with no guarantee of success.)  Institutions going back to their local business needs/requirements and using local technical expertise to try diverse approaches can strengthen our community with more knowledge and experience and options, compared to an attempt at a monoculture.

And also to be clear, we couldn’t be here without what has gone before. That many found the “consensus” stack wasn’t meeting their needs does not mean the community was a failure. None of these new approaches would be possible without all that’s been learned — by individuals, institutions, and the community — about our individual and shared use cases, requirements, approaches, options, dead-ends, patterns, etc. We were literally figuring out the domain together, and we will continue to do so. Plus what we’ve all learned individually, institutionally, and as a community about software architecture and design in those years. Plus the additional software tools that have come to exist giving us radically new options (the first hydra release was prior to Rails 3.0!!)

It does mean that we’re in a time where those with the capacity to do so have to actually go back to those first principles of 1) evaluating our local business needs 2) using technical expertise to evaluate the best way to meet them (short and long term), taking into account 3) opportunities for collaboration, which can mutually benefit all collaborators.   To the extent that there are institutions that have this capacity, but where decision making on choice of software platforms is not being led by people with the technical expertise to make technical decisions, and decisions are being made on other than technical grounds…  it is not serving us, and in the best case this new situation will force the issue, and we’ll all come out stronger.  In any event, these are exciting times, and I think we have some opportunities for real advancement in the maturity of the software we use and provide.

Feedback welcome

I may have gotten some things wrong; my subjective evaluations and analyses can be disagreed with. Discussion and feedback are very welcome: as comments here, as blog responses, in Slack, wherever you like is good with me.

I am also of course interested in connecting with developers or institutions who may be interested in the “Rails-first” developer’s toolkit approach we are considering, which I’ll go into more about in a subsequent follow-up post.


Thanks for early comments from Eddie Rubeiz, Dan Sanford, Anna Headley, Trey Pendragon, and Tom Johnson. All errors and all opinions are solely my own. 

BrowseEverything in Sufia, and refactoring the ingest flow

[With diagram of some Sufia ingest classes]

So, our staff that ingests files into our Sufia 7.4-based repository regularly needs to ingest dozens of 100MB+ TIFFs. For our purposes here, we’re considering uploading a bunch of “children” (in our case usually page images) of a single “work”, through the work edit page.

Trying to upload so much data through the browser ends up being very not great — even with the fancy JS immediately-upload-with-progress-bar code in Sufia. It takes an awfully long time (hours; in part because browsers’ 3-connections-per-host limit is a bottleneck compared to how much network bandwidth you could otherwise use), you need to leave your browser open the whole time, and it actually locks up your browser from interacting with our site in any other tabs (again, the 3-connections-per-host limit).

The solution would seem to be getting the files on some network-accessible storage, and having the app grab them right from there. browse_everything was already included in sufia, so we decided to try that. (Certainly another solution would be having workflow put the files on some network-accessible storage to begin with, but there were Reasons).

After a bunch of back-and-forths, for local reasons we decided to use AWS S3. And a little windows doohickey that gives windows users a “folder” they can drop things into, that will be automatically uploaded to S3. They’ve got to wait until the upload is complete before the things are available in the repo UI. (But it goes way faster than upload through browser, doesn’t lock up your browser, you don’t even need to leave your browser open, or your computer on at all, as the windows script is actually running on a local network server.)  When they do ask the sufia app to ingest, the sufia app (running on EC2) can get the files from S3 surprisingly quickly — in-region AWS network is pretty darn fast.

Browse_everything doesn’t actually work in stock Sufia 7.4

The first barrier is, it turns out browse_everything doesn’t actually work in Sufia 7.4, the feature was broken.

(Normally when I do these things, I try to see what’s been fixed/changed in hyrax: To see if we can backport hyrax fixes;  to get a sense of what ‘extra’ work we’re doing by still being in sufia; and to report to you all. But in this case, I ended up just getting overwhelmed and couldn’t keep track. I believe browse_everything “works” in Hyrax, but may still have problems/bugs, not sure, read on.)

ScholarSphere had already made browse-everything work with their sufia 7.x, by patching various parts of sufia, as I found out from asking in Slack and getting helpful guidance from PSU folks, so that could serve as a model.  The trick was _finding_ the patches in the scholarsphere source code, but it was super helpful to not have to re-invent the wheel when I did. Sometimes after finding a problem in my app, I’d have a better sense of which files to look at in ScholarSphere for relevant patches.

Browse-everything S3 Plugin

Aside from b-e integration on the sufia side, the S3 plugin for browse-everything also had some problems.  The name of the file(s) you choose in the b-e selector didn’t show up in the sufia edit screen after you selected it, because the S3 b-e adapter wasn’t sending it. I know some people have told me they’re using b-e with S3 in hyrax (the successor to sufia) — I’m not sure how this is working. But what I did is just copy-and-paste the S3 adapter to write a custom local one, and tell b-e to use that.

The custom local one includes a fix for the file name thing (PR’d to browse-everything), and also makes the generated S3 public links have a configurable expires_in (PR’d to browse-everything) — which I think you really want for S3 use with b-e, to keep them from timing out before the bg jobs get to them.

Both of those PRs have been merged to b-e, but not included in a release yet. It’s been a while since a b-e release (as I write this, the latest b-e is 0.15.1, from Dec 2017; also, can we talk about why 0.15.1 isn’t just re-released as 1.0, since it’s being used in prod all over the place?).  Another fix in b-e which isn’t in a release yet is a fix for directories with periods in them, which I didn’t notice until after we had gone live with our implementation, and then back-ported in as a separate PR.

Instead of back-porting this stuff in as patches, one could consider using b-e off github ‘master’. I really really don’t like having dependencies on particular unreleased git trees in production. But the other blocker for that course of action is that browse-everything master currently has what I consider a major UX regression.  So back-port patches it is, as I get increasingly despondent about how hard it’s gonna be to ever upgrade-migrate our sufia 7.4 app to (some version of) hyrax.

The ole temp file problem

Another problem is that the sufia ImportUrlJob creates some files as ruby Tempfiles, which means the file on disk can/will be deleted by Tempfile code whenever its reference gets garbage collected. But those files were expected to stay around for other code, potentially background jobs, to process later.  Those bg jobs run in entirely different ruby processes; they aren’t keeping a reference to the Tempfile that would keep it from being deleted.
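A minimal illustration of the underlying Ruby behavior (not Sufia’s actual code): a Tempfile’s backing file is removed when the object is finalized, so a path handed to a job in another process can’t be relied on to still exist. The stable-path copy and the job name below are hypothetical.

require "tempfile"
require "tmpdir"
require "fileutils"

# The Tempfile's file on disk is unlinked when the Tempfile object is
# garbage collected, so a path handed off to a background job running in
# another ruby process may already be gone when that job runs.
temp = Tempfile.new(["import", ".tif"])
temp.write("downloaded bytes")
temp.close

# If later steps (possibly other processes) need the file, copy it to a
# stable location that some later step is explicitly responsible for deleting.
stable_path = File.join(Dir.tmpdir, "import-#{Process.pid}.tif")
FileUtils.cp(temp.path, stable_path)

SomeLaterJob.perform_later(stable_path)  # hypothetical job name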

In some cases the other things expecting the file are able to re-download it from fedora if it’s not there (via the WorkingDirectory class), which is a performance issue maybe, but works. But in other cases, they just 500.

I’m not sure why that wasn’t a problem all along for us, maybe the S3 ingest changed timing to make it so? It’s also possible it still wasn’t a problem and I just mistakenly thought it was causing the problems I was having, but I noticed the problem while code-reading to try to figure out the mysterious problems we were having, so I went ahead and fixed it in our custom ImportUrlJob.

Interestingly, while the exact problem I hit had already been fixed in Hyrax, a subsequent code change in Hyrax re-introduced a similar Tempfile problem in another way, which was then fixed again by mbklein. That fix is only in Hyrax 2.1.0.

But then the whole Sufia/Hyrax ingest architecture…

At some point I had browse-everything basically working, but… if you tried to ingest say 100 files via S3, you would have to wait a long time for your browser to get a response back. In some cases timing out.

Why? Because while a bunch of things related to ingest are done in background jobs, the code in sufia tried to create all the FileSet objects and attach them to the Work in Sufia::CreateWithRemoteFilesActor, which ends up being called in the foreground, during the request-response loop.  (I believe this is the same in Hyrax, not positive.) (This is not how “local”/“uploaded” files are handled.)

And this is a very slow thing to do in Sufia. Whether that’s because of Fedora, ActiveFedora, the usage patterns of ActiveFedora in sufia/hyrax… I think it’s a combo of all of them. The code paths being used sometimes do slow things once-per-new-file that really could be done just once for the work. But even fixing that, it still ain’t really speedy.

At this point (or maybe after a day or two of unsuccessfully hacking things, I forget), I took a step back, and spent a day or two getting a handle on the complete graph of classes involved in this ingest process, and diagramming it.

[Diagram: sufia7.4_ingest_28Jun2018, graph of the classes involved in the stock Sufia 7.4 ingest process]

You may download XML you can import into draw.io to edit, if you’d like to repurpose for your own uses, for updating for Hyrax, local mods, whatever.  

This has changed somewhat in Hyrax, but I think many parts are still substantially the same.

A few thoughts.

If I’m counting right, we have nine classes/objects involved in: Creating some new “child” objects, attaching an uploaded file to each one (setting a bit of metadata based on original file name), and then attaching the “child” objects to a parent (copying a bit of metadata from parent). (This is before any characterization or derivatives).

This seems like a lot. If we were using ActiveRecord and some AR file attachment library (CarrierWave, or I like the looks of shrine) this might literally be fewer than 9 lines of code.
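For a sense of scale, here’s a rough sketch of what that might look like with ActiveRecord plus shrine; the Work and Member models and the AssetUploader are hypothetical, not anything in Sufia:

# Hypothetical ActiveRecord models; AssetUploader is a shrine uploader class.
class Work < ActiveRecord::Base
  has_many :members
end

class Member < ActiveRecord::Base
  belongs_to :work
  include AssetUploader::Attachment(:file)  # adds file=, file_url, etc.
end

# Attach each uploaded file as a new child, copying a bit of metadata.
uploaded_files.each do |io|                 # e.g. an array of File objects
  work.members.create!(title: File.basename(io.path), file: io)
end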

Understanding why it ended up this way might require some historical research. My sense is that: A) The operations being done are so slow (again, whether due to Fedora, AF, or Sufia architecture) that things had to be broken up into multiple jobs that might not have to be otherwise. B) A lot of stuff was added by people really not wanting to touch what was already there (because they didn’t understand it, or because it was hard to get a grasp on what backwards-incompat issues might arise from touching it), so new classes were added on top to accommodate new use cases even if a “greenfield” implementation might result in a simpler object graph (and less code duplication, more DRY).

But okay, it’s what we got in Sufia. Another observation though is that the way ‘local’ files (i.e. “uploaded” files, via HTTP, to a dir the web app can access) and ‘remote’ files (browse-everything) are handled is not particularly parallel/consistent; the work is divided up between classes in pretty different ways for the two paths. I suspect this may be due to “B” above.

And if you get into the implementations of the various classes involved, there seem to be some things being done _multiple times_ across different classes. Which doesn’t help when those things are very slow (if they involve saving a Work).  Again I suspect (B) above.

So, okay, at this point I hubristically thought, okay, let’s just rewrite some parts of this to make more sense, at least to my view of what makes sense. (What was in Hyrax did not seem to me to be substantially different in the ways relevant here.) Partially because I felt it would be really hard to figure out and fix the remaining bugs or problems in the current code, which I found confusing, and its lack of parallelism between local/remote file handling meant a problem could be fixed in one of those paths but not in the other, which did things very differently.

Some of my first attempts involved not having a class that created all the new “filesets” and attached them to the parent work.  If we could just have a job for each new file, that created a fileset for that file and attached it to the work, we’d be fitting into the ActiveJob architecture better — where you ideally want a bunch of fairly small and quick and ideally idempotent jobs, not one long-running job doing a lot of things.

The problem I ran into there, is that every time you add a member to a ‘Work’ in the Sufia/Fedora architecture, you actually need to save that Work, and do so by updating a single array of “all the members”.  So if a bunch of jobs are running concurrently trying to add members to the same Work at once, they’re going to step on each other’s toes. Sufia does have a “locking” mechanism in place (using redlock), so they shouldn’t actually overwrite each other’s data. But if they each have to wait in line for the lock, the concurrency benefits are significantly reduced — and it still wouldn’t really be playing well with the ActiveJob architecture, which doesn’t expect jobs to be just sitting there waiting for a lock, blocking the workers.  Additionally, in dev, I was sometimes getting some of these jobs timing out trying to get the lock (which may have been due to using SQLite3 in dev, and not an issue if I was using pg, which I’ve since switched to in dev to match prod).

After a few days of confusion and banging my head against the wall here, I returned to something more like stock sufia where there is one mega-job that creates and associates all the filesets. But it does it in some different ways than stock sufia, in a couple places having to use “internal” Sufia API — with the goal of _avoiding_ doing slow/expensive things multiple times (save the work once with all new filesets added as members, instead of once for each member as stock code did), and getting the per-file jobs queued as soon as possible under the constraints.
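To make the “save the work once” idea concrete, here’s a rough sketch of the shape of that mega-job; the class and job names are hypothetical, not our actual code or Sufia’s:

class AttachRemoteFilesJob < ActiveJob::Base
  def perform(work, remote_file_urls)
    # Create all the new filesets first.
    file_sets = remote_file_urls.map do |url|
      fs = FileSet.new(title: [File.basename(url)])
      fs.save!
      fs
    end

    # One save of the (slow-to-save) parent, instead of one save per new member.
    file_sets.each { |fs| work.ordered_members << fs }
    work.save!

    # Queue the per-file work (download, characterization, derivatives)
    # as small independent jobs, as soon as possible.
    file_sets.zip(remote_file_urls).each do |fs, url|
      IngestRemoteFileJob.perform_later(fs, url)  # hypothetical per-file job
    end
  end
end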

I also somewhat reduced the number of different bg jobs. There was at least one place in stock code where a bg job existed only to decide which of two other possible bg jobs it really wanted to invoke, and then perform_later on them. I had my version of a couple jobs do a perform_now instead — I wanted to re-use the logic locked in the two ActiveJob workers being dispatched, but there was no reason to have a job that existed only for milliseconds whose purpose was only to queue up another job; it could call that existing logic synchronously instead.
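In ActiveJob terms, the change is roughly this (job and method names hypothetical):

# Before: a dispatcher job whose only work is to queue one of two other jobs.
class DispatchIngestJob < ActiveJob::Base
  def perform(file_set, source)
    if source.remote?
      ImportRemoteFileJob.perform_later(file_set, source)
    else
      ImportLocalFileJob.perform_later(file_set, source)
    end
  end
end

# After: keep the two jobs (and their logic), but run the chosen one
# synchronously from the calling code, instead of queueing a middleman job.
job_class = source.remote? ? ImportRemoteFileJob : ImportLocalFileJob
job_class.perform_now(file_set, source)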

I also refactored to try to make “uploaded” (local) vs “remote” file ingest much more consistently parallel — IMO it makes it easier to get the code right, with less code, and easier to wrap your head around.

Here’s a diagram of where my architecture ended up:

[Diagram: sufia-7.4-scihist-custom-28Jun2018.png, our customized ingest architecture]

Did it work?

So I began thinking we had a solution to our staff UX problem that would take “a couple days” to implement, because it was “already a Sufia feature” to use browse-everything from S3.

In fact, it took me 4-5 weeks+ (doing some other parts of my job in those weeks, but with this as the main focus).  Here’s the PR to our local app.

It involved several other fixes and improvements that aren’t mentioned in this report.

We found several bugs in our implementation — or in sufia/cc — both before we actually merged and after we merged (even though we thought we had tested all the use cases extensively, there were some we hadn’t until we got to real world data — like the periods-in-directory-names b-e bug).

In general, I ran into something I’ve run into before — not only does sufia have lots of parts, but they are often implicitly tightly coupled: one part assumes other parts are doing things in a certain way, and if those other parts change that certain way, the first part breaks, with none of these assumptions documented (or probably even intentional or conscious on the part of the code writers).

Another thing I think happens is that sometimes there can be bugs in ActiveFedora, but the particular way the current (e.g.) Sufia code is written doesn’t hit them; then you change the code in ways that probably ought to be fine, and now you hit bugs that were actually always there, but nobody noticed because the shared implementation didn’t trigger them.

Some time after we deployed the new feature, we ran into a bug that I eventually traced to an ActiveFedora bug (one I totally don’t understand myself), which had already been fixed and available in AF 11.5.2 (thanks so much to Tom Johnson for, months ago, backporting the fix to AF 11.x, not just 12.x).  We had been running ActiveFedora 11.1.6. After some dependency hell of getting a consistent dependency tree with AF 11.5.2, it seems to have fixed the problem without breaking anything else or requiring any other code changes (AF appears to have not actually introduced backwards incompatibilities between these minor version releases, which is awesome).

But what’s a mystery to me (well, along with what the heck is up with that bug, which I don’t understand at all in the AF source), is why we didn’t encounter this bug before, why were the functions working just fine with AF 11.1.6 until recently? It’s a mystery, but my wild guess is that the changes to order and timing of how things are done in my ingest refactor made us hit an AF bug that the previous stock Sufia usage had not.

I can’t hide it because I showed you the PR: I did not write automated tests for the new ingest functionality. Which in retrospect was a mistake. Partially because I’m not great at writing tests; partially because when I started it was so experimental and seemed like it could be a small intervention, and the implementation kept changing, so having to keep changing tests could have been a slowdown. But also partially because I found it overwhelming to figure out how to write tests here; it honestly gave me anxiety to think about it.  There are so many fairly tightly coupled moving parts that all had to change, in a coordinated fashion, and many of them were ActiveJob workers.

Really there’s probably no way around that but writing some top-level integration tests, but those are so slow in sufia, and sometimes difficult to write too. (Also we have a bunch of different paths that probably all need testing; one of our bugs ended up being when someone had chosen a ‘format’ option in the ‘batch create’ screen, something I hadn’t been thinking to test manually and wouldn’t have thought to test in an automated way either. Likewise the directory-containing-a-period bug. And the more separate paths to test, the more tests, and when you’re doing it in integration tests… your suite gets so, so slow.)  But we do plan to add at least some happy-path integration tests; we’ve already got a unit of work written out and prioritized for soonish, because I don’t want this to keep breaking if we change code again, without being caught by tests.

So… did it work?  Well, our staff users can ingest from S3 now, and it seems to have successfully made their workflow much more efficient, productive, and less frustrating, so I guess I’d say yes!

What does this say about still being on Sufia and upgrade paths?

As reported above, I did run into a fair number of bugs in the stack that would have been fixed if we had been on Hyrax already.  Whenever this happens, it rationally makes me wonder “Is it an inefficient use of our developer time that we’re still on Sufia dealing with these, should we have invested developer time in upgrading to Hyrax already?”

Until roughly March 2018, that wouldn’t really have been an option; it wasn’t even a question. At an earlier point in the two-three-ish year implementation process (mostly before I even worked here), we had been really good at keeping our app up to date with new dependency releases. Which is why we are on Sufia 7.4 at least.

But at some point we realized that getting off that treadmill was the only way we were going to hit our externally-imposed deadlines for going live. And I think we were right there. But okay, since March, it’s more of an open book at the moment — and we know we can’t stay on Sufia 7.4.0 forever. (It doesn’t work on Rails 5.2 for one, and Rails before 5.2 will be EOL’d before too long).  So okay the question/option returns.

I did spend 4-5 weeks on implementing this in our sufia app. I loosely and roughly and wild-guessedly “estimate” that upgrading from our Sufia 7.4 app all the way to Hyrax 2.1 would take a lot longer than 4-5 weeks. (2, 3, 4 times as long?)

But of course this isn’t the only time I’ve had to fight with bugs that would have been fixed in Hyrax; it adds up.

But contrarily, quite a few of these bugs or other architecture issues corrected here are not fixed in Hyrax yet either. And a couple are fixed in Hyrax 2.1.0, but weren’t in 2.0.0, which was where Hyrax was when I started this.  And probably some new bugs too. Even if we had already been on Hyrax before I started looking at “ingest from S3”, it would not have been the “couple day” implementation I naively assumed. It would have been somewhere in between that and the 4-5 week+ implementation, not really sure where.

Then there’s the fact that even if we migrate/upgrade to Hyrax 2.1 now… there’s another big backwards-incompatible set of changes slated to come down the line for a future Hyrax version already, to be based on “valkyrie” instead.

So… I’m not really sure. And we remain not really sure what’s going to become of this Sufia 7.4 app that can’t just stay on Sufia 7.4 forever. We could do the ‘expected’ thing and upgrade to hyrax 2.1 now, and then upgrade again when/if future-valkyrie-hyrax comes out. (We could also invest time helping to finish future-valkyrie-hyrax). Or we could actually contribute code towards a future (unexpected!) Sufia release (7.5 or 8 or whatever) that works on Rails 5.2 — not totally sure how hard that would be.

Or we could basically rewrite the app (copying much of the business logic of course, easier in business logic we managed to write in ways less coupled to sufia) — either based on valkyrie-without-sufia (as some institutions have already done for new apps, I’m not sure if anyone has ported a sufia or hyrax app there yet; it would essentially be an app rewrite to do so) or…. not.  If it would be essentially an app rewrite to go to valkyrie-without-hyrax anyway (and unclear at this point how close to an app rewrite to go to a not-yet-finished future hyrax-with-valkyrie)…

We have been doing some R&D development into what an alternate digital collections/repo architecture could look like, not necessarily based on Valkyrie — my attr_json gem is part of that, although it doesn’t represent a commitment to actually use that gem in the future here at MPOW; we’re just exploring different things.

Deep-dive into hydra-derivatives

(Actually first wrote this in November, five months ago, getting it published now…)

In our sufia 7.4 digital repository, we wanted to add some more derivative thumbnails and download JPGs from our large TIFF originals: 3-4 sizes of JPG to download, and 3 total sizes of thumbnail for the three sizes in our customized design, with each of them having a 2x version for srcset too. But we also wanted to change some of the ways the derivatives-creation code worked in our infrastructure.

1. Derivatives creation is already in a bg ActiveJob, but we wanted to run it on a different server than the rails app server. While the built-in job was capable of this (downloading the original from fedora), in our experience, in at least some circumstances, it left behind that temporary download instead of removing it when done. Which caused problems especially if you had to do bulk derivatives creation of already uploaded items.

  • Derivative-creating bg jobs ought not to be fighting over CPU/RAM with our Rails server, and also ought to be able to run on a server that is separately and properly sized and scaled for the amount of work to be done.

2. We wanted to store derivatives on AWS S3

  • All our stuff is deployed on AWS, storing on S3 is over the long-term cheaper than storing on an Elastic Block Storage ‘local disk’.
  • If you ever wanted to horizontally scale your rails server, “local disk” storage (when delivered through a rails controller, as sufia 7 does it) requires some complexity, probably a shared file system, which can be expensive and/or unreliable on AWS.
  • If we instead deliver directly from S3 to browsers, we take that load off the Rails server, which doesn’t need it. (This does make auth more challenging, we decided to punt on it for now, with the same justification and possible future directions as we discussed for DZI tiles).
  • S3 is just a storage solution that makes sense for a whole bunch of JPGs and other assets you are going to deliver over the web, it’s what it’s for.

3. Ideally, it would be great to tweak the TIFF->JPG generation parameters a bit. The JPGs should preferably be progressive JPGs, for instance; they weren’t coming out of the stock codebase that way. The parameters might vary somewhat between JPGs intended as thumbnails and on-screen display, vs JPGs intended as downloads. The thumb ones should ideally use some pretty aggressive parameters to reduce size, such as removing embedded color profiles. (We ended up using vips instead of imagemagick).

4. Derivatives creation seemed pretty slow, it would be nice to speed it up a bit, if there were opportunities discovered to do so. This was especially inconvenient if you had to generate or re-generate one or more derivatives for all objects already existing in the repo. But could also be an issue even with routine operation, when ingesting many new files at once.

I started with a sort of “deep-dive” into seeing what Sufia (via hydra-derivatives) were doing already. I was looking for possible places to intervene, and also to see what it was doing, so if I ended up reimplementing any of it I could duplicate anything that seemed important.  I ultimately decided that I would need to customize or override so many parts of the existing stack, it made sense to just replace most of it locally. I’ll lead you through both those processes, and end with some (much briefer than usual) thoughts.

Deep-dive into Hydra Derivatives

We are using Sufia 7.4, and CurationConcerns 1.7.8. Some of this has changed in Hyrax, but I believe the basic architecture is largely similar. I’ll try to make a note of parts I know have changed in Hyrax. (links to hyrax code will be to master at the time I write this, links to Sufia and CC will be to the versions we are using).

CreateDerivativesJob

We’ll start at the top with the CurationConcerns CreateDerivativesJob. (Or similar version in Hyrax).  See my previous post for an overview of how/when this job gets scheduled.  Turns out the execution of a CreateDerivativesJob is hard-coded into the CharacterizeJob; you can’t choose to have it run a different job, or none at all. (Same in hyrax.)

The first thing this does is acquire a file path to the original asset file, with `CurationConcerns::WorkingDirectory.find_or_retrieve(file_id, file_set.id, filepath)`. CurationConcerns::WorkingDirectory (or see in hyrax) checks to see if the file is already there in an expected place inside CurationConcerns.working_directory, and if not copies it to the working directory from a fedora fetch,  using a Hydra::PCDM::File object.

Because it’s using the Hydra::PCDM::File object’s #content API, it fetches the entire fedora file into memory before writing it to the CurationConcerns.working_directory.  For big files this uses a lot of RAM temporarily, but more distressing to me is the additional latency of first fetching the thing into RAM and then streaming RAM to disk, instead of streaming right to disk. While the CurationConcerns::WorkingDirectory code seems to have been written originally to try to stream (it has a copy_stream_to_working_directory method written in terms of streams), the current implementation just turns a full in-memory string into a StringIO instead.  The hyrax implementation is the same.
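For comparison, a minimal sketch of streaming an HTTP response body straight to disk with Ruby’s standard library, never holding the whole file in memory (nothing Fedora-specific here, just the general pattern):

require "net/http"
require "uri"

def stream_to_file(url, dest_path)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      File.open(dest_path, "wb") do |file|
        # read_body yields the body in chunks as it arrives.
        response.read_body { |chunk| file.write(chunk) }
      end
    end
  end
end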

Back to the CreateDerivativesJob, we now have a filename to a copy of the original asset in the ‘working directory’.  I don’t see any logic here to clean up that copy, so perhaps this is the source of the ‘temporary file buildup’ my team has sometimes seen.  I’m not sure why we only sometimes see it, or if there are other parts of the stack meant to clean this up later in some cases. I’m not sure if the contract of `CurationConcerns::WorkingDirectory#find_or_retrieve` is to always return a temporary file that the caller is meant to clean up when done (i.e. whether it’s always safe to assume the filename returned can be deleted by the caller), or if instead future actors are meant to use it and/or clean it up.

The CreateDerivativesJob does an acquire_lock_for: I think this is probably left over from when derivatives were actually stored in fedora; now that they are not, this seems superfluous (and possibly expensive, not sure). And indeed it’s gone from the hyrax version, so that’s probably true.

Later, the CreateDerivativesJob reindexes the fileset object (first doing a file_set.reload, I think that’s from fedora, not solr?), and in some cases its parent.   This is a potentially expensive operation — which matters especially if you’re, say, trying to reindex all derivatives. Why does it need a reindex? Well, sufia/hyrax objects in the Solr index have a relative URL to thumbnails in a `thumbnail_path_ss` field (a design our app no longer uses).  But thumbnail paths in sufia/hyrax are consistently predictable from file_set_id, of the form /downloads/#{file_set_id}?file=thumbnail.  Maybe the reindex dates from before this was true? Or maybe it’s just meant to register “yes, a thumbnail is there now”, so the front-end can tell the difference between a present and an absent thumb?  (I’d rather just keep that out of the index and handle thumbs not present at expected URLs with some JS.)

I tried removing the index update from my locally overridden CreateDerivativesJob, and discovered one reason it is there. In normal operation, this is the only time a parent work gets reindexed after a fileset is added to it that will be marked as its representative fileset. And it needs to get reindexed to have the representative_id and such.  I added it to AddFileToFileSet instead, where it belongs. Phew!

So anyway, how are the derivatives actually created?  Just by calling file_set.create_derivatives(filename), passing in the local (working directory) filepath. A method on the model object doesn’t seem quite right for this to me (you might want different derivatives in different contexts for the same model), but it works. Hyrax is making the same call.  Hyrax introduces a DerivativeService class not present in Sufia/CC, which I believe is meant to support easier customization.

FileSet#create_derivatives

FileSet#create_derivatives is defined in a module that gets mixed into your FileSet class. It branches on the mime type of your original, running different (hard-coded) classes from the hydra-derivatives gem depending on type.  For images, that’s:

Hydra::Derivatives::ImageDerivatives.create(filename,
 outputs: [{ label: :thumbnail, 
             format: 'jpg', 
             size: '200x150>', 
             url: derivative_url('thumbnail') }])

You can see it passes in the local filepath again, as well as some various options in an outputs keyword arg — including a specified url of the to-be-created derivative — as a single hash inside an array for some reason. derivative_url uses a derivative_path_factory to get a path (on local FS?) and change it into a file: url — so this is really more of a path than a URL; it’s apparently not actually the eventual end-user-facing URL, but just instructions for where to write the file. The derivative_path_factory is a DerivativePath, which uses CurationConcerns.config.derivatives_path to decide where to put it — it seems like there’s a baked-in assumption (passed through several layers) that the destination will be on a local filesystem on the machine running the job.

Hyrax actually changes this somewhat — the relevant create_derivatives method seems to have moved to the FileSetDerivativeService — it works largely the same, although the different code to run for each mime-type branch has been moved to separate methods, perhaps to make it easier to override. I’m not quite sure how/where FileSet#create_derivatives is defined (Hyrax CreateDerivativesJob still calls it), as the Hyrax::FileSet::Derivatives module doesn’t seem to mix it in anymore. But FileSet#create_derivatives presumably calls #create_derivatives on the FileSetDerivativeService somehow.  Since I was mainly focusing on our code using Sufia/CC, I left the train here. The Hyrax version does have a cleanup_derivatives method as a before_destroy, presumably on the FileSet itself, which is about cleaning up derivatives if a fileset is deleted (did the sufia version not do that at all?). Hyrax seems to still be using the same logic from hydra_derivatives to actually do derivatives creation.

Since I was mostly interested in images, I’m going to specifically dive in only to the Hydra::Derivatives::ImageDerivatives code.  Both Hyrax and Sufia use this. Our Sufia 7.4 app is using hydra-derivatives 3.2.1. At the time of this writing, hydra-derivatives latest release is 3.3.2, and hyrax does require 3.3.x, so a different minor version than what I’m using.

Hydra::Derivatives::ImageDerivatives and cooperators

If we look at Hydra::Derivatives::ImageDerivatives (same in master and 3.2.1) — there isn’t much there. It sets a self.processor_class to Processors::Image, inherits from Runner, and does something to set a format: png as a default argument.

The superclass Hydra::Derivatives::Runner has some business logic for being a derivative processor. It has a class-wide output_file_service defaulting to whatever is configured as Hydra::Derivatives.output_file_service.  And a class-wide source_file_service defaulting to Hydra::Derivatives.source_file_service.  It fetches the original using the source file service. For each arg hash passed in (now we understand why that argument was an array of hashes), it just sends it to the configured processor class, along with the output_file_service:  The processor_class seems to be responsible for using the passed-in output_file_service to actually write output.  While it also passes in the source_file_service, this seems to be ignored:  the source file itself has already been fetched and had its local file system path passed in directly, and I did not find anything using the passed-in source_file_service.  (This logic seems the same between 3.2.1 and current master.)

In my Sufia app, Hydra::Derivatives.output_file_service is CurationConcerns::PersistDerivatives — which basically just writes it to the local file system, again using a derivative_path_factory set to DerivativePath.  The derivative_path_factory in PersistDerivatives probably has to match the one up in FileSet#create_derivatives; I’d guess that if you changed the derivative_path_factory in your FileSet but not here, bad things would happen.  And Hydra::Derivatives.source_file_service is CurationConcerns::LocalFileService, which does nothing but open the local file path passed in and return a File object. Hyrax has pretty much the same PersistDerivatives and LocalFileService services; I would guess they are also the defaults there, although I haven’t checked.

I’d guess this architecture was designed with the intention that if you wanted to get a source file from somewhere other than local file system, you’d set a custom  source_file_service.   But even though Sufia and Hyrax do get a source file from somewhere else, they don’t customize the source_file_service, they fetch from fedora a layer up and then just pass in a local file that can be handled by the LocalFileService.

Okay, but what about actually creating derivatives?

So okay, the actual derivative generation though, recall, was handled by the processor_class dependency, hard-coded to Processors::Image.

Hydra::Derivatives::Processors::Image I think is the same in hydra-derivatives 3.2.1 and current master. It uses MiniMagick to do its work. It will possibly change the format of the image. And possibly set (or change?) its quality (which mostly only affects JPGs I think, maybe PNGs too). Then it will run a layer flatten operation on the image.  And resize it.  Recall that #create_derivatives actually passed in an imagemagick-compatible argument for desired size, size: '200x150>', so create_derivatives is actually assuming that Hydra::Derivatives::ImageDerivatives.create will be imagemagick-based, or understand imagemagick-type size specifications; there’s some coupling here.

MiniMagick actually does its work by shelling out to command-line imagemagick (or optionally graphicsmagick, which is more or less API-compatible with imagemagick). A line in the MiniMagick README makes me concerned about how many times MiniMagick is writing temporary files:

MiniMagick::Image.open makes a copy of the image, and further methods modify that copy (the original stays untouched). We then resize the image, and write it to a file. The writing part is necessary because the copy is just temporary, it gets garbage collected when we lose reference to the image.

I’m not sure if that would apply to the flatten command too. Or even the format and quality directives?  If, the way MiniMagick is being used, files are written/read multiple times, that would definitely be an opportunity for performance improvement, because these days touching the file system is one of the slowest things one can do. ImageMagick/GraphicsMagick/other similar tools are definitely capable of doing all of these operations without interim temporary file-system writes in between each; I’m not certain if Hydra::Derivatives::Processors::Image’s use of MiniMagick is doing so.
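For what it’s worth, MiniMagick does offer a way to batch multiple operations into a single imagemagick invocation; a rough sketch of that general idea (not what hydra-derivatives actually does):

require "mini_magick"

image = MiniMagick::Image.open("original.tif")

# combine_options runs all of these as one command-line invocation,
# rather than one shell-out (and temp file) per operation.
image.combine_options do |cmd|
  cmd.flatten
  cmd.resize "200x150>"
  cmd.quality "85"
end

image.format "jpg"
image.write "thumbnail.jpg"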

It’s not clear to me how to change what operations Hydra::Derivatives::Processors::Image does — let’s say you want to strip extra metadata for a smaller thumb, as for instance Google suggests, how would you do that? I guess you’d write your own class to use as a processor_class. It could sub-class Hydra::Derivatives::Processors::Image or not (really no need for a sub-class I don’t think, what it’s doing is pretty straightforward).  How would you set your custom processor to be used?  I guess you’d have to override the line in Hydra::Derivatives::ImageDerivatives. Or perhaps you should instead provide your own class to replace Hydra::Derivatives::ImageDerivatives, and have that used instead? Which in Sufia would probably be by overriding FileSet#create_derivatives to call your custom class.   Or in Hyrax, there’s that newer Hyrax::DerivativeService stuff, perhaps you’d change your local FileSet to use a different DerivativeService, which seems at least more straightforward (alas I’m not on Hyrax). If you did this, I’m not sure if it would be recommended for you to re-use pieces of the existing architecture as components (and in what way), or just write the whole thing from scratch.
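If you went the “replace Hydra::Derivatives::ImageDerivatives” route, the shape might be something like the sketch below; the class names are hypothetical and I haven’t verified this against hydra-derivatives internals, so treat it as a guess at the pattern rather than a recipe:

# A hypothetical drop-in replacement for ImageDerivatives that points at a
# custom processor; as described above, the stock runner mostly just sets
# processor_class and inherits everything else from Runner.
class MyImageDerivatives < Hydra::Derivatives::Runner
  def self.processor_class
    MyImageProcessor
  end
end

# One option: subclass the built-in processor and change the transformation
# (strip metadata, different quality, etc). The exact methods to override
# would need checking against the gem's source.
class MyImageProcessor < Hydra::Derivatives::Processors::Image
  # ...
end

# Then, in Sufia, an overridden FileSet#create_derivatives would call:
#   MyImageDerivatives.create(filename, outputs: [...])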

Some Brief Analysis and Decision-making

So I actually wanted to change nearly every part of the default pipeline here in our app.

Reading: I want to continue reading from fedora, being sure to stream it from fedora to local file system as a working copy.

Cleanup: I want to make sure to clean up the temporary working copy when you’re done with it, which I know in at least some cases was not being done in our out of the box code. Maybe to leave it around for future ‘actor’ steps? In our actual app, downloading from one EC2 to another on the same local AWS network is very speedy, I’d rather just be safe and clean it up even if it means it might get downloaded again.

Transformation:  I want to have different image transformation options. Stripping metadata, interlaced JPGs, setting color profiles. Maybe different parameters for images to be used as in-browser thumbs vs downloadable files. (See advice about thumb parameters from Google, or vips). Maybe using a non-ImageMagick processor (we ended up with vips; a small vips sketch follows this list).

Output: I want to write to S3, because it makes sense to store assets like this there, especially but not only if you’re deploying on AWS already like we are.  Of course, you’d have to change the front-end to find the thumbs (and/or downloads) at a separate URL still, more on that later.
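As mentioned under “Transformation” above, here’s a rough sketch of the kind of JPG generation we ended up with, using the ruby-vips gem; the exact sizes and parameters are illustrative, not our production values:

require "vips"

# A thumbnail-sized progressive JPG, stripping embedded metadata and color
# profiles to keep the file small.
thumb = Vips::Image.thumbnail("original.tif", 400)
thumb.jpegsave("thumb.jpg", Q: 80, interlace: true, strip: true)

# A larger download-oriented JPG might keep more quality (and metadata).
download = Vips::Image.thumbnail("original.tif", 2400)
download.jpegsave("download.jpg", Q: 90, interlace: true)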

So, there are many parts I wanted to customize. And for nearly all of them, it was unclear to me the ‘right’/intended/best way to customize in the current architecture. I figured, okay then, I’m just going to completely replace CreateDerivativesJob with my own implementation.

The good news is that worked out pretty fine — the only place this is coupled to the rest of sufia at all, is in sufia knowing what URLs to link to for thumbs (which I suspect many people have customized already, for instance to use an IIIF server for thumbs instead of creating them statically, as the default and my new implementation both do). So in one sense that is an architectural success!

Irony?

Sandi Metz has written about the consequences of “the wrong abstraction”, sometimes paraphrased as “the wrong abstraction is worse than no abstraction.”

hydra-derivatives, and parts of sufia/hyrax that use it, have a pretty complex cooperating object graph, with many cooperating objects and several inheritance hierarchies.  Presumably this was done intending to support flexibility, customization, and maintainability, that’s why you do such things.

Ironically, adding more cooperating objects (that is, abstractions), can paradoxically inhibit flexibility, customizability, or maintainability — if you don’t get it quite right. With more code, there’s more for developers to understand, and it can be easy to get overwhelmed and not be able to figure out the right place to intervene for a change  (especially in the absence of docs). And changes and improvements to the codebase can require changes across many different accidentally-coupled objects in concert, raising the cost of improvements, especially when crossing gem boundaries too.

If the lines between objects, and the places objects interface with each other, aren’t drawn quite right to support needed use cases, you may sometimes have to customize or override or change things in multiple places now (because you have more places) to do what seems like one thing.

Some of this may be at play in hydra_derivatives and sufia/hyrax’s use of them.  And I think some of it comes from people adding additional layers of abstraction to try to compensate for problems in the existing ones, instead of changing the existing ones (Why does one do this? For backwards compat reasons? Because they don’t understand the existing ones enough to touch them? Organizational boundaries? Quicker development?)

It would be interesting to do a survey see how often hooks in hydra_derivatives that seem to have been put there for customization have actually been used, or what people are doing instead/in addition for the customization they need.

Getting architecture right (the right abstractions) is not easy, and takes more than just good intentions. It probably takes pretty good understanding of the domain and expected developer usage scenarios; careful design of object graphs and interfaces to support those scenarios; documentation of such to guide future users and developers. Maybe ideally starting some working individual examples in local ‘bespoke’ codebases that are only then abstracted/generalized to a shared codebase (which takes time).  And with all that, some luck and skill and experience too.

The number of different cooperating objects you have involved should probably be proportional to how much thinking and research you’ve done about usage scenarios to support and how the APIs will support them — when in doubt keep it simpler and less granular.

What We Did

This article previous to here, I wrote about 5 months ago. Then I sat on it until now… for some reason the whole thing just filled me with a sort of psychic exhaustion, can’t totally explain it. So looking back to code I wrote a while ago, I can try to give you a very brief overview of our code.

Here’s the PR, which involves quite a bit of code, as well as building on top of some existing custom local architecture.

We completely override the CreateDerivativesJob#perform method to just call our own “service” class to create derivatives (extracted into a service object instead of being inline in the job!), if our ENV variables are configured to use our new-fangled store-things-on-s3 functionality.  Otherwise we call super — but try to clean up the temporary working files that the built-in code was leaving lying around to fill up our file system.
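The shape of that override is roughly like the sketch below; this is simplified, the service class and ENV variable names are stand-ins, and the method signature should be treated as illustrative rather than our exact code:

module CreateDerivativesJobOverride
  def perform(file_set, file_id, filepath = nil)
    working_path = CurationConcerns::WorkingDirectory.find_or_retrieve(file_id, file_set.id, filepath)

    if ENV["DERIVATIVES_S3_BUCKET"]
      # Our own service object creates the derivatives and stores them on S3.
      S3DerivativesService.new(file_set, working_path).create_all
    else
      super  # stock sufia behavior
    end
  ensure
    # Either way, clean up the temporary working copy that the stock code
    # was leaving behind.
    File.unlink(working_path) if working_path && File.exist?(working_path)
  end
end

CreateDerivativesJob.prepend(CreateDerivativesJobOverride)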

Our derivatives-creating service is relatively straightforward.  Creating a bunch of derivatives and storing them in S3 is not something particularly challenging.

We made it harder for ourselves by trying to support derivatives stored on S3 or in the local file system, based on config — partially because it’s convenient to not have to use S3 in dev and test, and partially thinking about generalizing to share with the community.

Also, there needs to be a way for front-end code to get urls to derivatives of course, and really this should be tied into the derivatives creation, something hydra-derivatives appears to lack.  And in our case, we also need to add our derivatives meant to be offered as downloads to our ‘downloads’ menu, including in our custom image viewer. So there’s a lot of code related to that, including some refactoring of our custom image viewer.

One neat thing we did is (at least when using S3, as we do in production) deliver our downloads with a content-disposition header specifying a more human-friendly filename, including the first few words of the title.
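One way to do that kind of thing with the aws-sdk-s3 gem is a presigned URL with a response-content-disposition override; a rough sketch (the bucket, key, and filename are made up):

require "aws-sdk-s3"

s3 = Aws::S3::Resource.new(region: "us-east-1")
object = s3.bucket("my-derivatives-bucket")
           .object("derivatives/abc123/download_full.jpg")

# The generated URL tells S3 to serve the object with a human-friendly
# filename in the Content-Disposition header.
url = object.presigned_url(
  :get,
  expires_in: 3600,
  response_content_disposition: 'attachment; filename="portrait-of-a-chemist.jpg"'
)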

Generalizing? Upstream? Future?

I knew from the start that what I had wasn’t quite good enough to generalize for upstream or other shareable dependency.  In fact, in the months since I implemented it, it hasn’t worked out great even for me, additional use cases I had didn’t fit neatly into it, my architecture has ended up overly complex and confusing.

Abstracting/generalizing to share really requires even more care and consideration to get the right architecture, compared to having something that works well enough for your app. In part, because refactoring something only used by one app is a lot less costly than with a shared dependency.

Initially, some months ago, even knowing what I had was not quite good enough to generalize, I thought I had figured out enough and thought about enough to be able to spend more time to come up with something that would be a good generalized shareable dependency.  This would only be worth spending time on if there seemed a good chance others would want to use it of course.

I even had a break-out session at Samvera Connect to discuss it, and others who showed up agreed that the current hydra-derivatives API was really not right (including at least one who was involved in writing it originally), and that a new try was due.

And then I just… lost steam to do it.  In part overwhelmed by community things; the process of doing a samvera working group, the uncertainty of knowing whether anyone would really switch from hydra-derivatives to use a new thing, of whether it could become the thing in hyrax (with hyrax valkyrie refactor already going on, how does this effect it?), etc.

And in part, I just realized…. the basic challenge here is coming up with the right API and architecture to a) allow choice of back-end storage (S3, local file system, etc), with b) URL generation, and ideally an API for both streaming bytes from the storage location and downloading the whole thing, regardless of back-end storage. This is the harder part architecturally than just actually creating the derivatives. And… nothing about this is particularly unique to the domain of digital collections/repositories, isn’t there something already existing we could just use?

My current best bet is shrine.  It already handles those basic things above with a really nice, very flexible, decoupled architecture.  It’s a bit more confusing to use than, say, carrierwave (or the newer built-into-Rails ActiveStorage), but that’s because it’s a more flexible decoupled-components API, which is probably worth it so we can do exactly what we want with it and build it into our own frameworks. (More flexibility is always more complexity; I think ActiveStorage currently lacks the flexibility we need for our community’s use cases.)   Although it works great with Rails and ActiveRecord, it doesn’t even depend on Rails or ActiveRecord (the author prefers hanami I think), so quite possibly could work with ActiveFedora too.

But then the community (maybe? probably?) seems to be… at least in part… moving away from ActiveFedora too. Could you integrate shrine, to support derivatives, with valkyrie in a back-end independent way? I’m sure you could, I have no idea how the best way would be to do so, how much work it would be, the overall cost/benefit, or still if anyone would use it if you did.

So I’m not sure I’m going to be looking at shrine myself in a valkyrie context. (Although I think the very unsuitable hydra-derivatives is the only relevant shared dependency anyone is currently using with valkyrie, and presumably what hyrax 3 will still be using, and I still think it’s not really… right).

But I am going to be looking at shrine more — I’ve already started talking to the shrine author about what I see as my (and my understanding of our communities) needs for features for derivatives (which shrine currently calls “versions”), and I think I’m going to try to do some R&D on a new shrine plugin that meets my/our needs better. I’m not sure I’ll end up wanting to try to integrate it with valkyrie and/or hyrax, or with some new approaches I’ve been thinking on and doing some R&D on, which I hope to share more about in the medium-term future.

Performance on a many-membered Sufia/Hyrax show page

We still run Sufia 7.3, haven’t yet upgraded/migrated to hyrax, in our digital repository. (These are digital repository/digital library frameworks, for those who arrived here and are not familiar; you may not find the rest of the very long blog post very interesting. :))

We have a variety of ‘manuscript’/’scanned 2d text’ objects, where each page is a sufia/hyrax “member” of the parent (modeled based on PCDM).  Sufia was originally designed as a self-deposit institutional repository; I didn’t quite realize this until recently, but sufia/hyrax is now known to still have a variety of problems, especially performance-related ones, with works that have many members. But it mostly works out.

The default sufia/hyrax ‘show’ page displays a single list of all members on the show page, with no pagination. This is also where admins often find members to ‘edit’ or do other admin tasks on them.

For our current most-membered work, that’s 473 members, 196 of which are “child works” (each of which is only a single fileset–we use child works for individual “interesting” pages we’d like to describe more fully and have show up in search results independently).  In stock sufia 7.3 on our actual servers, it could take 4-6 seconds to load this page (just to get response from server, not including client-side time).  This is far from optimal (or even ‘acceptable’ in standard Rails-land), but… it works.

While I’m not happy with that performance, it was barely acceptable enough that before getting to worrying about that, our first priority was making the ‘show’ page look better to end-users.  Incorporating a ‘viewer’, launched by clicks on page thumbs, more options in a download menu, bigger images with an image-forward kind of design, etc. As we were mostly just changing sizes and layouts and adding a few more attributes and conditionals, I didn’t think this would affect performance much compared to the stock.

However, just as we were about to reach a deadline for a ‘soft’ mostly-internal release, we realized the show page times on that most-membered work had deteriorated drastically. To 12 seconds and up for a server response, no longer within the bounds of barely acceptable. (This shows why it’s good to have some performance monitoring on your app, like New Relic or Skylight, so you have a chance to notice performance degradation as a result of code changes as soon as it happens. Although we don’t actually have this at present.)

We thus embarked on a week+ of most of our team working together on performance profiling to figure out what was up and — I’m happy to say — fixing it, perhaps even getting slightly better perf than stock sufia in the end. Some of the things we found definitely apply to stock sufia and hyrax too, others may not; we haven’t spent the time to completely compare and contrast, but I’ll try to comment with my advice.

When I see a major perf degradation like this, my experience tells me it’s usually one thing that’s caused it. But that wasn’t really true in this case, we had to find and fix several issues. Here’s what we found, how we found it, and our local fixes:

N+1 Solr Queries

The N+1 query problem is one of the first and most basic performance problems many Rails devs learn about. Or really, many web devs (or those using SQL or similar stores) generally.

It’s when you are showing a parent and its children, and end up doing an individual db fetch for every child, one-per-child. Disastrous performance-wise; you need to find a way to do a single db fetch that gets everything you want instead.

So this was our first guess. And indeed we found that stock sufia/hyrax did do n+1 queries to Solr on a ‘show’ page, where n is the number of members/children.

If you were just fetching with ordinary ActiveRecord, the solution to this would be trivial, adding something like .includes(:members) to your ActiveRecord query.  But of course we aren’t, so the solution is a bit more involved, since we have to go through Solr, and actually traverse over at least one ‘join’ object in Solr too, because of how sufia/hyrax stores these things.

Fortunately Princeton University Library already had a local solution of their own, which folks in the always helpful samvera slack channel shared with us, and we implemented locally as well.

I’m not a huge fan of overriding that core member_presenters method, but it works and I can’t think of a better way to solve this.
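To give a rough sense of the shape of that fix (this is a sketch, not Princeton’s actual code, and it assumes stock sufia/hyrax indexing where the work’s Solr document lists its members in a member_ids_ssim field, and that ActiveFedora::SolrService is available):

member_ids = presenter.solr_document["member_ids_ssim"] || []

# One Solr query for all members, instead of one query per member.
member_docs = ActiveFedora::SolrService.query(
  "{!terms f=id}#{member_ids.join(',')}",
  rows: member_ids.size
).map { |hit| ::SolrDocument.new(hit) }

# An overridden member_presenters would then build presenters from member_docs
# (preserving member_ids order), instead of letting each presenter trigger its
# own Solr fetch.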

We went and implemented this without even doing any profiling first, because it was low-hanging fruit. And we were dismayed to see that while it did improve things measurably, performance was still disastrous.

Solrizer.solr_name turns out to be a performance bottleneck?(!)

I first assumed this was probably still making extra fetches to solr (or even fedora!); that’s my experience/intuition for the most likely perf problem. But I couldn’t find any of those.

Okay, now we had to do some actual profiling. I created a test work in my dev instance that had 200 fileset members. Fewer than our slowest work in production, but enough to find some bottlenecks, I hoped. The way I usually start is really clumsy and manual: deleting parts of my templates to see which deletions make things faster. I don’t know if this is really a technique I’d recommend, but it’s my habit.

This allowed me to identify that the biggest perf problem at this point was indeed not in fetching the member-presenters, but in rendering them. But as I deleted parts of the partial for rendering each member, I couldn’t find any one part that sped things up drastically; deleting any part just sped things up proportional to how much I deleted. Weird. Time for profiling with ruby-prof.

I wrapped the profiling just around the portion of the template I had already identified as the problem area. I like the RubyProf::GraphHtmlPrinter report from ruby-prof for this kind of work. (One of these days I’m going to experiment with GraphViz or something compatible, but haven’t yet.)
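In case it’s useful, the wrapping looked roughly like this (a sketch; the output filename and min_percent threshold are arbitrary choices, not our exact code):

require 'ruby-prof'

result = RubyProf.profile do
  # ... the portion of the view/partial rendering already identified
  #     as the problem area ...
end

File.open(Rails.root.join("tmp", "member_render_profile.html"), "w") do |file|
  RubyProf::GraphHtmlPrinter.new(result).print(file, min_percent: 1)
end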

Surprisingly, the top culprit for taking up time was — Solrizer.solr_name. (We use Solrizer 3.4.1; I don’t believe as of this date newer versions of solrizer or other dependencies would fix this).

It makes sense that Solrizer.solr_name is called a lot. It’s called basically every time you ask for any attribute from your Solr “show” presenter. I also saw it being called when generating an internal app link to a show page for a member, perhaps because that requires attributes. Anything you have set up to delegate …, to: :solr_document probably also ends up calling Solrizer.solr_name in the SolrDocument.

While I think this would be a problem even in stock Sufia/Hyrax, it explains why it could be more of a problem in our customization — we were displaying more attributes and links, something I didn’t expect would be a performance concern, especially since attributes for an already-fetched object ought to be quite cheap. It also explains why every part of my problem area seemed to contribute roughly equally to the perf problem: they were all displaying some attribute or link!

It makes sense to abstract the exact name of the Solr field (which is something like title_ssim), but I wouldn’t expect this call to be much more expensive than a hash lookup (which can usually be done thousands of times in 1ms).  Why is it so much slower? I didn’t get that far; instead I hackily patched Solrizer.solr_name to cache based on arguments, so all calls after the first with the same arguments would be just a hash lookup.
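The patch was along these lines (a simplified sketch, not the exact patch; it assumes solr_name is deterministic for a given set of arguments within a running process, so results can safely be memoized):

require 'solrizer'

module SolrNameMemoization
  def solr_name(*args)
    # Cache results keyed on the full argument list; after the first call for
    # a given set of arguments, this is just a hash lookup.
    @__solr_name_cache ||= {}
    @__solr_name_cache[args] ||= super
  end
end

Solrizer.singleton_class.prepend(SolrNameMemoization)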

I don’t think this would be a great upstream PR, it’s a workaround. Would be better to figure out why Solrizer.solr_name is so slow, but my initial brief forays there didn’t reveal much, and I had to return to our app.

Because while this did speed up my test case by a few hundred ms, my test case was still significantly slower compared to an older branch of our local app with better performance.

Using QuestioningAuthority gem in ways other than intended

We use the gem commonly referred to as “Questioning Authority”, but actually released as a gem called qa, for most of our controlled vocabularies, including “rights”.  We wanted to expand the display of “rights” information beyond just a label: we wanted a nice graphic and a user-facing shortened label a la rightstatements.org.

It seemed clever some months ago to just add this additional metadata to the licenses.yml file already being used by our qa-controlled vocabulary.  Can you then access it using the existing qa API?  Some reverse-engineering led me to using CurationConcerns::LicenseService.new.authority.find(identifier).

It worked great… except after taking care of Solrizer.solr_name, this was the next biggest timesink in our perf profile. Specifically it seemed to be calling slow YAML.load a lot. Was it reloading the YAML file from disk on every call? It was!  And we were displaying licensing info for every member.

I spent some time investigating the qa gem. Was there a way to add caching and PR it upstream? A way that would be usable in an API that would give me what I wanted here? I couldn’t quite come up with anything without pretty major changes.  The QA gem wasn’t really written for this use case; it is focused pretty laser-like on just providing auto-complete for terms, and I’ve found it difficult in the past to use it for anything else. Even in its intended use case, not caching the YAML is a performance mistake, but since it would usually be done only once per request it wouldn’t be disastrous.

I realized, heck, reading from a YAML file is not a complicated thing. I’m going to leave the data in licenses.yml to keep it DRY, but I’m just going to write my own cover logic to read the YAML in a perf-friendly way.
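The cover logic was roughly of this shape (a sketch; the class name is made up, and it assumes the conventional qa local-authority layout, a licenses.yml under config/authorities/ with a top-level "terms" array of hashes keyed by "id"):

require 'yaml'

class LicenseInfo
  # Load and index the YAML once at class-load time; lookups after that are
  # just in-memory hash reads.
  TERMS_BY_ID = YAML.load_file(
    Rails.root.join("config", "authorities", "licenses.yml")
  ).fetch("terms").index_by { |term| term["id"] }.freeze

  def self.find(identifier)
    TERMS_BY_ID[identifier]
  end
end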

That trimmed off a nice additional ~300ms out of 2-3 seconds for my test data, but the code was still significantly slower than our earlier branch of the local app.

[After I started drafting this post, Tom Johnson filed an issue on QA on the subject.]

Sufia::SufiaHelperBehavior#application_name is also slow

After taking care of that one, the next thing taking up the most time in our perf profile was, surprisingly, Sufia::SufiaHelperBehavior#application_name (I think Hyrax equivalent is here and similar).

We were calling that #application_name helper twice per member… just in a data-confirm attr on a delete link! `Deleting #{file_set} from #{application_name} is permanent. Click OK to delete this from #{application_name}, or Cancel to cancel this operation. ` 

If the original sufia code didn’t have this, or only had application_name once instead of twice, that could explain a perf regression in our local code, if application_name is slow. I’m not sure if it did or not, but this was the biggest bottleneck in our local code at this time either way.

Why is application_name so slow? This is another method I might expect would be fast enough to call thousands of times on a page, in the cost vicinity of a hash lookup. Is I18n.t just slow to begin with, such that you can’t call it 400 times on a page?  I doubt it, but it’s possible. What’s hiding in that super call, that is called on every invocation even if no default is needed?  Not sure.

At this point, several days into our team working on this, I bailed out and said, you know what, we don’t really need to tell them the application name in the delete confirm prompt.

Again, significant speed-up, but still significantly slower than our older faster branch.

Too Many Partials

I was somewhat cheered, several days in, to be into actual generic Rails issues, and not Samvera-stack-specific ones. Because after fixing the above, the next most expensive thing identifiable in our perf profile was a Rails ‘lookup_template’ kind of method. (Sorry, I didn’t keep notes or the report showing the exact method.)

As our HTML for displaying “a member on a show page” got somewhat more complex (with a popup menu for downloads and a popup for admin functions), to keep the code more readable we had extracted parts to other partials. So the main “show a member thumb” type partial was calling out to three other partials. So for 200 members, that meant 600 partial lookups.

Seeing that line in the profile report reminded me: oh yeah, partial lookup is really slow in Rails.  I remembered that from way back, and had sort of assumed they would have fixed it in Rails by now, but nope. In production configuration, templates are compiled, but every render partial: is still a live, slow lookup, which I think even needs to check the disk as part of resolving the partial (touching disk is expensive!).

This would be a great thing to fix in Rails, it inconveniences many people. Perhaps by applying some kind of lookup caching, perhaps similar to what Bootsnap does for $LOAD_PATH and require, but for template lookup paths. Or perhaps by enhancing the template compilation so the exact result of template lookups are compiled in and only need to be done on template compilation.  If either of these were easy to do, someone would probably have done them already (but maybe not).

In any event, the local solution is simple, if a bit painful for code legibility: remove those extra partials. The main “show a member” partial is invoked with render collection, so it only gets looked up once and is not a problem, but when it calls out to others, it’s one lookup per render, every time.  We inlined one of them, and turned two more into helper methods instead of partials.
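For instance, a tiny per-member partial can become a helper along these lines (illustrative names and markup, not our actual code); the helper builds the same markup the partial did, but without a per-render template lookup:

module MemberDisplayHelper
  # Replaces a `render "member_caption", presenter: presenter` style partial.
  def member_thumb_caption(presenter)
    content_tag(:div, presenter.title.first, class: "member-caption")
  end
end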

At this point, I had my 200-fileset test case performing as well as or better than our older-more-performant-branch, and I was convinced we had it!  But we deployed to staging, and it was still significantly slower than our more-performant-branch for our most-membered work. Doh! What was the difference? Ah right, our most-membered work has 200 child works, and my test case didn’t have child works.

Okay, new test case (it was kinda painful to figure out how to create a many-hundred-child-work test case in dev, and very slow with what I ended up with). And back to ruby-prof.

N+1 Solr queries again, for representative_presenter

Right before our internal/soft deadline, we had to at least temporarily bail out of using riiif for tiled image viewer and other derivatives too, for performance reasons.  (We ultimately ended up not using riiif, you can read about that too).

In the meantime, we added a feature switch to our app so we could have the riiif-using code in there, but turn it on and off.  So even though we weren’t really using riiif yet (or perf testing with riiif), there was some code in there preparing for riiif, that ended up being relevant to perf for works with child-works.

For riiif, we need to get a file_id to pass to riiif. And we also wanted the image height and width, so we could use lazysizes-aspectratio and the image would take up the proper space on the screen even while waiting for a slow riiif server to deliver it. (lazysizes for lazy image loading, and lazysizes-aspectratio, which can be used even without lazy loading, are highly recommended; they work great.)

We used polymorphism: for a fileset member, the height, width, and original_file_id were available directly on the Solr object fetched for the member. But for a child work, it delegated to representative_presenter to get them. And representative_presenter, of course, triggered a Solr fetch. Actually, it seemed to trigger three Solr fetches, so you could call this a 3n+1 query!

If we were fetching from ActiveRecord, the solution to this would possibly be as simple as adding something like .includes("members", "members.representative"). Although you’d have to deal with some polymorphism there in ways that are tricky for AR, so maybe that wouldn’t work out. But anyway, we aren’t.

At first I spent some time thinking through whether there was a way to bulk-eager-load these representatives for child works, similarly to what you might do with ActiveRecord. It was tricky, because the Solr data model is tricky, because of the polymorphism, and because Solr doesn’t make “joins” quite as straightforward as SQL does.  But then I figured: wait, use Solr like Solr.   In Solr it’s typical to “de-normalize” your data so the data you want is there when you need it.

I implemented code to index a representative_file_id, representative_width, and representative_height directly on a work in Solr. At first it seemed pretty straightforward.  Then we discovered it was missing some edge cases (a work that has as its representative a child work, which itself has nothing set as its representative?), and that there was an important omission: if a work has a child work as a representative, and that child work changes its representative (which now applies to the first work too), the first work needs to be reindexed to pick it up. So changes to one work need to trigger a reindex of another. After around 10 more frustrating dev hours, some tricky code (which reduces indexing performance, but that’s better than bad end-user performance), some very slow and obtuse specs, and a very weary brain, okay, got that taken care of too. (This commit may not be the last word; I think we had some more bugfixes after that.)
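The basic indexing move looks something like this (a much-simplified sketch that ignores the edge cases and cross-work reindexing just described; the Solr field names and the resolution logic are illustrative, not our exact code):

class WorkIndexer < CurationConcerns::WorkIndexer
  def generate_solr_document
    super.tap do |solr_doc|
      info = representative_image_info(object)
      next unless info

      solr_doc["representative_original_file_id_ss"] = info[:file_id]
      solr_doc["representative_width_is"]            = info[:width]
      solr_doc["representative_height_is"]           = info[:height]
    end
  end

  private

  # Hypothetical resolution logic: follow `representative` down to a FileSet,
  # one level deep. The real thing has to handle deeper nesting and missing
  # representatives, which is where the tricky parts (and the need to reindex
  # other works when a child's representative changes) come from.
  def representative_image_info(work)
    rep = work.try(:representative)
    rep = rep.try(:representative) unless rep.is_a?(FileSet)
    return nil unless rep.is_a?(FileSet) && rep.original_file

    {
      file_id: rep.original_file.id,
      width:   Array(rep.original_file.try(:width)).first,
      height:  Array(rep.original_file.try(:height)).first
    }
  end
end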

After a bulk reindex to get all these new values, our code is even a little bit faster than our older-better-performing-branch. And, while I haven’t spent the time to compare, I wouldn’t be shocked if it’s actually a bit faster than stock sufia.  It’s not fast, still 4-5s for our most-membered work, but back to ‘barely good enough for now’.

Future: Caching? Pagination?

My personal rules of thumb in Rails are that a response over 200ms is not ideal, over 500ms it’s time to start considering caching, and over 1s (uncached) I should really figure out why and make it faster even if there is caching.  Other Rails devs would probably consider my rules of thumb to already be profligate!

So 4s is still pretty slow. Very slow responses like this not only make the user wait, but load down your Rails server, filling up its processing queue and causing even worse problems under multi-user use. It’s not great.

Under a more standard Rails app, I’d definitely reach for caching immediately. View or HTTP caching is a pretty standard technique to make your Rails app as fast as possible, even when it doesn’t have pathological performance.

But with any HTML caching, the issue is making sure the cache for the parent page is refreshed when a child displayed on that page changes. The standard Rails approach is what they call ‘russian doll caching’, where the updated_at timestamp on the parent is touched whenever a child is updated:

class Product < ApplicationRecord
  has_many :games
end

class Game < ApplicationRecord
  belongs_to :product, touch: true
end

With touch set to true, any action which changes updated_at for a game record will also change it for the associated product, thereby expiring the cache.

ActiveFedora tries to be like ActiveRecord, but it does not support the “touch: true” option on associations used in the example for russian doll caching. It might be easy to simulate with an after_save hook or something, but updating records in Fedora is so slow. And worse, I think (?) there’s no way to atomically update just the updated_at in fedora; you’ve got to update the whole record, introducing concurrency problems. I think this could be a whole bunch of work.

jcoyne in slack suggested that instead of russian-doll-style touching of updated_at, you could assemble your cache key from the updated_at values of all children.  But I started to worry about child works: this might have to be recursive; if a child is a child work, you need to include all of its children as well. (And maybe the File children of every FileSet?  And how do fedora ‘versions’ affect this?)  It could start getting pretty tricky.  This is the kind of thing the russian-doll approach is meant to make easier, but it relies on quick and atomic touching of updated_at.
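A non-recursive version of that suggestion might look something like this (a sketch of the idea, not something we implemented; it assumes the standard ActiveFedora system_modified_dtsi timestamp field on the member Solr documents):

def work_show_cache_key(work_presenter, member_docs)
  # Newest modification timestamp among all (direct) members; child works,
  # File children of FileSets, fedora versions, etc. would complicate this.
  newest_member_timestamp = member_docs
    .map { |doc| doc["system_modified_dtsi"] }
    .compact
    .max

  [
    "work-show",
    work_presenter.id,
    work_presenter.solr_document["system_modified_dtsi"],
    newest_member_timestamp
  ].join("/")
end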

We’ll probably still explore caching at some point, but I suspect it will be much less straightforward to work reliably than if this were a standard rails/AR app. And the cache failure mode of showing end-users old not-updated data is, I know from experience, really confusing for everyone.

Alternately, or probably additionally, why are we displaying all 473 child images on the page at once in the first place?  Even in a standard Rails app, this might be hard to do performantly (although I’d just solve it with caching there if it was the UX I wanted, no problem). Mostly we’re doing it just because stock sufia did it and we got used to it. Admins use ctrl-f on the page to find a member they want to edit. I kind of like having thumbs for all pages right on the page, even if you have to scroll a lot to see them (we were already using lazysizes to lazy load the images only when scrolled to).  But some kind of pagination would probably be the logical next step, which we may get to eventually. One or more of:

  • Actual manual pagination. Would probably require a ‘search’ box on titles of members for admins, since they can’t use ctrl-f anymore.
  • Javascript-based “infinite scroll” (not really infinite) to load a batch at a time as user scrolls there.
  • Or, using similar techniques, actually load everything with JS immediately on page load, but a batch at a time.  Still going to use the same CPU on the server, but quicker initial page load, and splitting things into multiple requests is better for server health and capacity.

Even if we get to caching or some of these, I don’t think any of our work above is wasted — you don’t want to use these techniques to work around performance bottlenecks on the server; in my opinion you want to fix easily-fixable (once you find them!) performance bottlenecks or performance bugs on the server first, as we have done.

And another approach would be not rendering some or all of this HTML on the server at all, but switching to some kind of JS client-side rendering (React etc.). There are plusses and minuses to that approach, but it takes our team into kinds of development we are less familiar with; maybe we’ll experiment with it at some point.

Thoughts on the Hydra/Samvera stack

So. I find Sufia and the samvera stack quite challenging, expensive, and often frustrating to work with. Let’s get that out of the way. I know I’m not alone in this experience, even among experienced developers, although I couldn’t say if it’s universal.

I also enjoy and find it rewarding and valuable to think about why software is frustrating and time-consuming (expensive) to work with, what makes it this way, and how did it get this way, and (hardest of all), what can be done or done differently.

If you’re not into that sort of discussion, please feel free to drop out now. Myself, I think it’s an important discussion to have. Developing a successful collaborative open source shared codebase is hard, there are many things we (or anyone) haven’t figured out yet, and I think it can take some big-picture discussion and building of shared understanding to get better at it.

I’ve been thinking about how to have that discussion in as productive a way as possible. I haven’t totally figured it out — wanting to add this piece in but not sure how to do it kept me from publishing this blog post for a couple months after the preceding sections were finished — but I think it is probably beneficial to ground and tie the big picture discussion in specific examples — like the elements and story above. So I’m adding it on.

I also think it’s important to tell beginning developers working with Samvera, if you are feeling frustrated and confused, it’s probably not you, it’s the stack. If you are thinking you must not be very good at programming or assuming you will have similar experiences with any development project — don’t assume that, and try to get some experience in other non-samvera projects as well.

So, anyhow, this experience of dealing with performance problems on a sufia ‘show’ page makes me think of a couple bigger-picture topics:  1) The continuing cost of using a less established/bespoke data store layer (in this case Fedora/ActiveFedora/LDP) over something popular with many many developer hours already put into it like ActiveRecord, and 2) The idea of software “maturity”.

In this post, I’m actually going to mostly ignore the first, and focus on the second, “maturity”.

Software maturity: What is it, in general?

People talk about software being “mature” (or “immature”) a lot, but googling around I couldn’t actually find much in the way of a good working definition of what is meant by this. A lot of what you find googling is about the “Capability Maturity Model“. The CMM is about organizational processes rather than product, it came out of the context of defense department contractors (a very different context than collaborative open source), and I find its language somewhat bureaucratic.  It also has plenty of critique.  I think organizational process matters, and CMM may be useful to our context, but I haven’t figured out how to make use of CMM to speak about software maturity in the way I want to here, so I won’t speak of it again.

Other discussions I found also seemed to me kind of vague, hand-wavy, or self-referential, in ways I still didn’t know how to make use of to talk about what I wanted.

I actually found a random StackOverflow answer I happened across to be more useful than most; I found its focus on usage scenarios and shared understanding stimulating:

I would say, mature would add the following characteristic to a technology:

  1. People know how to use it, know its possibilities and limitations
  2. People know what the typical usage scenarios are, patterns, what are good usage scenarios for this technology so that it shows its best
  3. People have found out how to deal with limitations/bugs, there is a community knowledge and help out there
  4. The technology is trusted enough to be used not only by individuals but in productive commercial environment as well

In this way of thinking about it, mature software is software where there is shared understanding about what the software is for, what patterns of use it is best at and which are still more ‘unfinished’ and challenging; where you’re going to encounter those, and how to deal with them.  There’s no assumption that it does everything under the sun awesomely, but that there’s a shared understanding about what it does do awesomely.

I think the unspoken assumption here is that for the patterns of use the software is best at, it does a good job of them, meaning it handles the common use cases robustly with few bugs or surprises. (If it doesn’t even do a good job of those, that doesn’t seem to match what we’d want to call ‘maturity’ in software, right? A certain kind of ‘ready for use’; a certain assumption you are not working on an untested experiment in progress, but on something that does what it does well.).

For software meant as a tool for developing other software (any library or framework; I think sufia qualifies), the usage scenarios are at least as much about developers (what they will use the software for and how) as they are about the end-users those developers are ultimately developing software for.

Unclear understanding about use cases is perhaps a large part of what happened to me/us above. We thought sufia would support ‘manuscript’ use cases (which means many members per work if a page image is a member, which seems the most natural way to set it up) just fine. It appears to have the right functionality. Nothing in its README or other ‘marketing’ tells you otherwise. At the time we began our implementation, it may very well be that nobody else thought differently either.

At some point though, a year+ after the org began implementing the technology stack believing it was mature for our use case, and months after I started working on it myself, understanding began to build that this use case would have trouble in sufia/hyrax. We started realizing (and realizing that maybe other developers had already realized) that it wasn’t really ready for prime time with many-membered works, and would take lots of extra customization and workarounds to work out.

The understanding of what use cases the stack will work painlessly for, and how much pain you will have in what areas, can be something still being worked out in this community, and what understanding there is can be unevenly distributed, and hard to access for newcomers. The above description of software maturity as being about shared understanding of usage scenarios speaks to me; from this experience it makes sense to me that that is a big part of ‘software maturity’, and that the samvera stack still has challenges there.

While it’s not about ‘maturity’ directly, I also want to bring in some of what @schneems wrote about in a blog post about “polish” in software and how he tries to ensure it’s present in software he maintains.

Polish is what distinguishes good software from great software. When you use an app or code that clearly cares about the edge cases and how all the pieces work together, it feels right.…

…User frustration comes when things do not behave as you expect them to. You pull out your car key, stick it in the ignition, turn it…and nothing happens. While you might be upset that your car is dead (again), you’re also frustrated that what you predicted would happen didn’t. As humans we build up stories to simplify our lives, we don’t need to know the complex set of steps in a car’s ignition system so instead, “the key starts the car” is what we’ve come to expect. Software is no different. People develop mental models, for instance, “the port configuration in the file should win” and when it doesn’t happen or worse happens inconsistently it’s painful.

I’ve previously called these types of moments papercuts. They’re not life threatening and may not even be mission critical but they are much more painful than they should be. Often these issues force you to stop what you’re doing and either investigate the root cause of the rogue behavior or at bare minimum abandon your thought process and try something new.

When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent…

…In many ways I want my software to be boring. I want it to harbor few surprises. I want to feel like I understand and connect with it at a deep level and that I’m not constantly being caught off guard by frustrating, time stealing, papercuts.

This kind of “polish” isn’t the same thing as maturity — schneems even suggests that most software may not live up to his standards of “polish”.

However, this kind of polish is a continuum.  On the dark opposite end, we’d have hypothetical software where working with it is about near-constant surprises, constantly “being caught off guard by frustrating, time-stealing papercuts”, software where users (including developer-users of tools) have trouble developing consistent mental models, perhaps because the software is not very consistent in its behavior or architecture, with lots of edge cases and pieces working together unexpectedly or roughly.

I think our idea of “maturity” in software does depend on being somewhere along this continuum toward the “polished” end. If we combine that with the idea about shared understanding of usage scenarios and maturity, we get something reasonable. Mature software has shared understanding about what usage scenarios it’s best at, generally accomplishing those usage scenarios painlessly and well. At least in those usage scenarios it is “polished”, people can develop mental models that let them correctly know what to expect, with frustrating “papercuts” few and far between.

Mature software also generally maintains backwards compatibility, with backwards breaking changes coming infrequently and in a well-managed way — but I think that’s a signal or effect of the software being mature, rather than a cause.  You could take software low on the “maturity” scale, and simply stop development on it, and thereby have a high degree of backwards compat in the future, but that doesn’t make it mature. You can’t force maturity by focusing on backwards compatibility, it’s a product of maturity.

So, Sufia and Samvera?

When trying to figure out how mature software is, we are used to taking certain signals as a sort of proxy evidence for it.  There are about 4 years between the release of sufia 1.0 (April 2013) and Sufia 7.3 (March 2017; beyond this point the community’s attention turned from Sufia to Hyrax, which combined Sufia and CurationConcerns). Much of sufia is of course built upon components that are even older: ActiveFedora 1.0 was Feb 2009, and the hydra gem was first released in Jan 2010. This software stack has been under development for 7+ years, and is used by several dozen institutions.

Normally, one might take these as signs predicting a certain level of maturity in the software. But my experience has been that it was not as mature as one might expect from this history or adoption rate.

From the usage scenario/shared understanding bucket, I have not found as high a degree as I might have expected of easily accessible shared understanding of “know how to use it, know its possibilities and limitations,” “know what the typical usage scenarios are, patterns, what are good usage scenarios for this technology so that it shows its best.”  Some people have this understanding to some extent, but this knowledge is not always very clear to newcomers or outsiders — and not what they may have expected. As in this blog post, things I may assume are standard usage scenarios that will work smoothly may not be. Features I or my team assumed were long-standing, reliable, and finished sometimes are not.

On the “polish” front, I honestly do feel like I am regularly “being caught off guard by frustrating, time stealing, papercuts,” and finding inconsistent and unparallel architecture and behavior that makes it hard to predict how easy or successful it will be to implement something in sufia; past experience is no guarantee of future results, because similar parts often work very differently. It often feels to me like we are working on something at a more proof-of-concept or experimental level of maturity, where you should expect to run into issues frequently.

To be fair, I am using sufia 7, which has been superseded by hyrax (1.0 released May 2017, first 2.0 beta released Sep 2017, no 2.0 final release yet), which in some cases may limit me to older versions of other samvera stack dependencies too. Some of these rough edges may have been filed off in hyrax 1/2; one would expect/hope that every release is more mature than the last. But even with Sufia 7, which is based on technology with 4-7 years of development history and adopted by dozens of institutions, one might have expected more maturity. Hyrax 1.0 was only released a few months ago, after all.  My impression/understanding is that hyrax 1.0 intentionally makes few architectural changes from sufia (although it may include some more bugfixes), and that the upcoming hyrax 2.0 is intended to have more improvements, but most of the difficult architectural elements I run into in sufia 7 seem to be mostly the same when I look at the hyrax master repo. My impression is that hyrax 2.0 (not quite released) certainly has improvements, but does not make huge maturity strides.

Does this mean you should not use sufia/hyrax/samvera? Certainly not (and if you’re reading this, you’ve probably already committed to it, at least for now), but it means this is something you should take into account when evaluating whether to use it, what you will do with it, and how much time it will take to implement and maintain.  I certainly don’t have anything universally ‘better’ to recommend for a digital repository implementation, open source or commercial. But I was very frustrated by assuming/expecting a level of maturity that I then personally did not find to be delivered.  I think many organizations are also surprised to find sufia/hyrax/samvera implementation to be more time-consuming (which also means “expensive”; staff time is expensive) than expected, including by finding that features they had assumed were done and ready need more work than expected in their app; this is more of a problem for some organizations than others.  But I think it pays to take this into account when making plans and timelines.   Again, if you (individually or as an institution) are having more trouble setting up sufia/hyrax/samvera than you expected, it’s probably not just you.

Why and what next?

So why are sufia and other parts of the samvera stack at a fairly low level of software maturity (for those who agree that they are)?  Honestly, I’m not sure. What can be done to get things more mature and reliable and efficient (low TCO)?  I know even less.  I do not think it’s because any of the developers involved (including myself!) have anything but the best intentions and true commitment, or because they are “bad developers.” That’s not it.

Just some brainstorms about what might play into sufia/samvera’s maturity level. Other developers may disagree with some of these guesses, either because I misunderstand some things, or just due to different evaluations.

  • Digital repositories are just a very difficult or groundbreaking domain, and it just necessarily would take this number of years/developer-hours to get to this level of maturity. (I don’t personally subscribe to this really, but it could be)

 

  • Fedora and RDF are both (at least relatively) immature technologies themselves, that lack the established software infrastructure and best practices of more mature technologies (at the other extreme, SQL/rdbms, technology that is many decades old), and building something with these at the heart is going to be more challenging, time-consuming, and harder to get ‘right’.

 

  • I had gotten the feeling, from working with the code and off-hand comments from developers who had been around longer, that Sufia had actually taken a significant move backwards in maturity at some point in the past. At first I thought this was about the transition from fedora/fcrepo 3 to 4. But from talking to @mjgiarlo (thanks buddy!), I now believe it wasn’t so much about that as about some significant rewriting that happened between Sufia 6 and 7 to: take sufia from an app focused on a self-deposit institutional repository with individual files, to a more generalized app involving ‘works’ with ‘members’ (based on the newly created PCDM model); to use data in Fedora that would be compatible with other apps like Islandora (a goal that has not been achieved and looks to me increasingly unrealistic); and to explode into many more smaller-purpose, hypothetically decoupled component dependencies that could be recombined into different apps (an approach that, based on outcomes, was later reversed in some ways in Hyrax).
    • This took a very significant number of developer hours, literally over a year or two. These were hours that were not spent on making the existing stack more mature.
    • But so much was rewritten and reorganized that I think it may have actually been a step backward in maturity (both in terms of usage scenarios and polish), not only for the new usage scenarios, but even for what used to be the core usage scenario.
    • So much was re-written, and expected usage scenarios changed so much, that it was almost like creating an entirely new app (including entirely new parts of the dependency stack), so the ‘clock’ in judging how long Sufia (and some but not all other parts of the current dependency stack) has had to become mature really starts with Sufia 7 (first released 2016), rather than sufia 1.0.
    • But it wasn’t really a complete rewrite, “legacy” code still exists, some logic in the stack to this day is still based on assumptions about the old architecture that have become incorrect, leading to more inconsistency, and less robustness — less maturity.
    • The success of this process in terms of maturity and ‘total cost of ownership’ is, I think… mixed at best. And I think some developers are still dealing with some burnout as fallout from the effort.

 

  • Both sufia and the evolving stack as a whole have tried to do a lot of things and fit a lot of usage scenarios. Our reach may have exceeded our grasp. If an institution comes with a new usage scenario (for end-users, or for how they want to use the codebase), whether they come with a PR or just a desire, the community very rarely says no, and almost always then tries to make the codebase accommodate it, perhaps in retrospect without sufficient regard for the cost of added complexity. This comes out of a community-minded and helpful motivation to say ‘yes’. But it can lead to a lack of clarity about the usage scenarios the stack excels at, or even to a lack of any usage scenarios that are very polished, in the face of ever-expanding ambition. Yes, this happens in the context of limited developer resources, but increased software complexity also has costs that can’t be handled easily, or sometimes at all, simply by adding developers (see The Mythical Man-Month).

 

  • Related, I think: sufia/samvera developers have often aspired to make software that can be used and installed by institutions without Rails developers, without having to write much or any code. This has not really been accomplished, or if it has, only in the sense that you need samvera developer(s) who are or become proficient in our bespoke stack, instead of just Rails developers. (Our small institution found we needed 1-2 developers plus 1 devops.)  While motivated by the best intentions — to reduce Total Cost of Ownership for small institutions — the added complexity in pursuit of this ambitious and still unrealized goal may have ironically led to less maturity and increased TCO for institutions of all sizes.

 

  • I think most successfully mature open source software probably have one (or a small team of) lead developer/architect(s) providing vision as to the usage scenarios that are in or out, and to a consistent architecture to accomplish them. And with the authority and willingness to sometimes say ‘no’ when they think code might be taking the project in the wrong direction on the maturity axis. Samvera, due to some combination of practical resource limitations and ideology, has often not.

 

  • ActiveRecord is enormously complex software which took many, many developer-hours to get to its current level of success and maturity. (I actually like AR okay myself.)  The thought that its API could be copied and reimplemented as ActiveFedora, with far fewer developer-hour resources, without encountering a substantial and perhaps insurmountable “maturity gap” — may in retrospect have been mistaken. (See above about how basing an app on Fedora has challenges to achieving maturity.)

 

What to do next, or different, or instead?  I’m not sure!  On the plus side, we have a great community of committed and passionate developers, and institutions interested in cooperating to help each other.

I think improvements start with acknowledging the current level of maturity, collectively and in a public way that reaches non-developer stakeholders, decision-makers, and funders too.

We should be intentional about being transparent about the level of maturity and challenge the stack presents, resisting any urge to “market” samvera or deemphasize the challenges, which would be a disservice to people evaluating or making plans based on the stack, and to the existing community too. We don’t all have to agree about this either; I know some developers and institutions have analysis similar to mine here (but surely with some differences), others may not. But we have to be transparent and public about our experiences, to all layers of our community as well as external to it. We all have to see clearly what is, in order to make decisions about what to do next.

Personally, I think we need to be much more modest about our goals and the usage scenarios (both developer and end-user) we can support. This is not necessarily something that will be welcome to decision-makers and funders, who have reasons to want to always add on more instead.  But this is why we need to be transparent about where we truly currently are, so decision-makers can operate based on an accurate understanding of our current challenges and problems as well as our successes.