Proposed Rails-based digital collections developer’s toolkit

In my last post, I explained my read of the lay of the samvera land, and why I’m interested in pursuing an alternate approach. We haven’t committed to the path I will outline, but are considering it.

First we should say that I am coming from the assumption of an institution that does want to do local development, to get full control over the application and be able to custom-fit it to their needs. Having a local development team requires resources and perhaps some organizational cultural changes to properly involve technical decision-makers in technical decisions. But it’s my own opinion that developing with hydra/samvera (or anything based on Rails) has always required a local development team; if you don’t want one, you would be better off looking at a more off-the-shelf, perhaps proprietary product.

So, reacting to experiences with what have seemed to me to be overly-complex architectures, one question to start from is — can a development team “just build a Rails app”?  Maybe, but I think some common patterns and needs in digital collections/repositories can be identified, to be turned into shared dependencies, to better organize the initial app and hopefully provide efficiencies for more apps over time.

If the design goal of an architecture of re-usable abstractions is to optimize between maximized code re-use (avoid re-inventing the wheel), and minimized architectural complexity (number of abstractions, complexity of abstractions, general number of novel mental-models) — I believe we can make a go of it by sticking as closely to Rails as possible, trying to add things carefully that have clear benefit for our domain.

One possibly counter-intuitive proposition here is that, rather than trying to share as much code as possible and minimize code in local apps, we can make end-to-end development more efficient and less costly by putting less code into shared dependencies, so we can spend more time making those shared modular components as well-designed for re-use and composability as possible.  Fighting with integration into a complex architecture (that includes many features your app doesn’t need) can easily erase any hypothetical gains from shared code.

So the proposal here is neither to provide a complete working app or “solution bundle” (which might be the hyrax approach), nor to write a completely custom app with no domain-specific parts to be shared with the community (which might be what Steven Anderson was at one point suggesting for BPL, or what the “bespoke” Valkyrie-based apps lean towards at present), but instead to write a local app in tandem with some relatively small and tight shareable, re-usable, composable, modular components. (This is similar to the original “hydra head” approach; we think we can do it more successfully as a result of what we’ve learned, and technical choices to keep things much simpler, recognizing that a local development team will need to do development).

The hypothesis or proposition is that this can get us to an efficient, cost-effective architecture for building a digital collections site ultimately quicker than trying to get there by iterating on existing code, because we can try to be ruthless about minimizing abstractions and code complexity from the start, with the better knowledge of our domain we now have — avoiding legacy code that arguably already has some unnecessary complexity baked in at a fundamental level.

The trade-off is that by starting from a “clean slate”, we have a long period of developing architecture with no immediate pay-off in adding features to our production app. There is a hump of lowered productivity before reaching — if we are successful — the goal of overall increased productivity. This is a real cost, and a real risk that our approach may not be as beneficial as we think — you have to take some risks to achieve great benefits.

This post will lay out a proposed plan in three parts:

  1. Some principles/goals, illustrating further what we mean by a Rails-based developer’s toolkit for our domain
  2. A high-level outline/blueprint of the architectural components I currently anticipate.
  3. Some analysis of the benefits and especially risks and potential reasons for failure of this plan.

If you are interested in this plan in any way, please do get in touch. 

Principles and Goals: What do we intend by “A Rails-based developer’s toolkit”?

The target audience is developers. Developers will use this toolkit to put together a digital collections/repository app. It will not produce “shrinkwrap” software for non-developers; these are tools for developers.

⇒ Realistically, this probably requires a minimum team of 2-4 technical staff (likely including a devops/sysadmin-y role). I suspect this is already what is generally required for a successful hyrax or other samvera project. By acknowledging that there will be a need for a local development team, we can avoid the complexity of trying to make “declarative-config-only” layers.

⇒ “developers” doesn’t necessarily mean expert ruby/rails developers. We can try to target relatively beginner Rails devs, including those for whom this is their first Rails or ruby project. As with any endeavor, the more experience and skill you have, the more efficient and high-quality work you can do, but I am optimistic we can make the bar fairly low for getting started.

We try to add tools to make “just building a Rails app” easier for digital collections/repository domain, supplying components based on common needs in this domain.  But where possible, we do things “The Rails Way”, what is most natural or typical or common for Rails apps, providing enhancements to it rather than replacements.

⇒ Things people often do on top of Rails, such as “form objects” should be options that are ideally no harder to choose to use than with any other Rails app, but not required or assumed.

⇒ There’s already a lot to learn in just learning how to effectively build a performant, secure, and reliable app on Rails (or a web app in general). We will try to minimize extra abstractions and architectures.

⇒ Ideally making it feasible to add developers (or hire project-based consultancies) experienced in Rails but not our industry/domain to the project, and have them get up to speed relatively quickly (joining a legacy app is rarely easy even for experienced devs). And the path for beginner library developers learning to use the stack will, as much as possible, be learning Rails.

⇒ It’s not that we’re suggesting everything in Rails is designed perfectly or is always easy to work with or learn.  Rather we’re suggesting these kind of flexible APIs can be a hard design problem, and Rails is stable, mature, polished, well-understood, performance-tuned, with an ecosystem of tutorials and third-party code. Trying to reinvent “better” alternatives has significant development costs and will take time and iterations to reach Rails’ maturity, that are best avoided where possible.

This will be a library or toolkit of components, not a fully integrated “solution bundle” or framework. 

⇒ “You call a library, but a framework calls you“.  Rather than providing API to hook into and customize small parts of a large integrated stack, the library/component approach is to try to provide re-usable modular blocks (legos?) that you can put together to build what you (the developer) want.

⇒ In many cases, you’ll customize from the top down instead of from the bottom up. If you want to customize a view, you might override the entire view and then re-build it, possibly using the same components the original was built with but composed and configured in different ways, instead of trying to figure out a way to ‘override’ a part inside without touching anything else.

⇒ Rather than being a forwards-compatibility nightmare, we think targeting this mode of work can actually reduce your local maintenance burden and ease upgrade paths, by having simpler, more stable code in the shared dependencies. In our community experience, the alternate approach has not in practice seemed to result in better forwards-compatibility.

⇒ Out of the box, without writing code, we do aim to give you something that runs; this isn’t just components dropped on the floor. But it’ll probably be a “proof of concept”, maybe not even an “MVP” you could deploy in production at all. Think Rails scaffolding, but specifically for our domain. Many parts may need to be customized or added (and some parts will certainly need to be) to arrive at the app well-suited for your business and user needs, but scaffolding can be easier to get started with than a blank slate.

⇒ Again, we are assuming an institution interested in a local development team that can customize to your particular business and user needs. If you don’t want to do this, you might be better served by a more “shrinkwrap”, ready-to-go (but less flexible/customizable) solution — that, IMO, is likely not historical or future samvera software either, or any Rails-based approach. It is likely a paid-development-and-support product, perhaps proprietary and/or hosted.

(Incidentally, this is in some ways a reversal of some things I myself thought some years ago when I participated in developing early versions of Blacklight. Then I thought any generated code was a bad thing, because it could go out of date; and any copy-and-paste of code was a bad thing, for similar reasons — the goal was to have the local app have as little code in it as possible. I have come to realize that while that goal seems appealing, achieving it successfully is very hard, and a less successful attempt at it can actually be more costly; it may counter-intuitively make sense to try to minimize the code in the shared dependency instead.)

We prioritize solid modular building blocks over features.  There are very diverse needs in our domain. Rather than try to understand, let alone build in, all features needed across our community, we try to give development teams the tools to lower costs when developing what they have discovered about their local requirements and priorities. We will prioritize only the most common needs, and be hesitant about added complexity for less common needs — or needs we haven’t yet figured out how to meet with re-usable modular components.

⇒ This doesn’t get us out of understanding domain needs; we still need to figure out the common needs in our domain that can be addressed by our tools, and if we’re really wrong we won’t successfully produce a toolkit that lowers development cost.

⇒ But by trying to focus on the most common needs where we can provide modular, composable tools — we have less code, and thus more resources per abstraction to try to design them successfully to be re-composable to your needs, flexible beyond what could have been planned in explicitly. That takes skill and time, and is feasible only by ruthlessly focusing on simplicity. We will do our best to design our tools so it’s clear how you can use them to build additional things you need (a workflow engine?) on top.

⇒ The number of “built-in” features and components and higher-level abstractions will, if the project is successful, probably grow over time. But you can’t build a good skyscraper without a good foundation; the base-level components need to be solidly designed for re-use, in order to build higher-level things on top.

We are targeting a digital collections use case much like our organization’s. Staff-only editing, not self-deposit. Relatively small number of different pages.  There should be no problem scaling number of users or number of documents right from the start, but we are not scaling complexity of requirements, or user interfaces. Our needs are outwardly pretty simple (although the devil is in the details) — customize our metadata schemas and derivatives, let staff ingest and enter metadata, let users find and view and download items, in an app that works well with search engines and on the web in general, with modern best-practice performance characteristics. We are targeting other organizations with similar needs. 

⇒ While we aspire to provide a solid enough foundation that you could build more complex requirements on top of this, the toolkit would of course be a proportionally smaller aid in that situation. We think the toolkit can grow to handle more complex use cases eventually, and we’re keeping that horizon in view, but it’s not the priority.

⇒ We think this simpler digital collections approach — in addition to conveniently matching our local needs — is often close to a strict subset of more complex needs. If you have complicated workflow needs (whether involving staff or patron work), you probably need most of our simple use case target plus more. Starting here can get us something that works, we think, for a lot of the community, and can be an initial step toward more.

⇒ The toolkit is primarily focused on providing tools to aid in the ingesting and management of digital assets and metadata.  I think this is what people have come to samvera for, and is in some sense the “hard part”. While end-user discovery is very important — and the toolkit will provide basic discovery — I think it’s in some ways an easier problem, with more well-understood practices and techniques for development — especially if we can provide for it to be done in terms of ActiveRecord models.

We will think of developer and operational use cases from the start. We don’t want to have to be the experts in every end-user use case in the domain, but we want to try to be the experts in  developer and operational use cases in building apps in this domain.  If a design choice makes deployment or operations harder, it may not be a good choice. The users for this toolkit are developers, who use it to meet end-user needs.

⇒ We will plan for performance from the start (avoiding n+1 queries, providing clear paths to use standard Rails caching techniques, etc.).

⇒ We will plan for a clear “story” about deployment architecture from the start —  we will plan for cloud deployment from the start (such things as not requiring any persistent file systems–putting everything on eg S3–and not making assumptions about how many machines different services are divided upon) — but not multi-tenancy (too complicated). We will consider from the start ease of use and efficiency throughout your product life cycle.

⇒ With less code to document and maintain  in the toolkit, we hope it becomes more feasible to maintain good non-stale documentation, and components with very good long-term stability and compatibility — reducing total cost over time of both developing and maintaining not only the toolkit itself, but counter-intuitively an app based on the toolkit as well. In practice, less shared code can be more efficient use of both community and local developer resources than more.   

Outline of Architectural Components

Modelling/Persistence: attr_json-based

Use attr_json (which I wrote) to store object metadata in a schemaless fashion as serialized json. It supports nested/compound objects, with Rails-style form support, dirty tracking, and other features meant to let you use it as you would an ordinary ActiveRecord model.  attr_json is at the heart of this plan.
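
For a sense of what this looks like in practice, here is a minimal sketch based on attr_json’s documented API; the model and attribute names are just illustrative, and it assumes a table with a postgres jsonb column named json_attributes (attr_json’s default):

# Gemfile: gem "attr_json"

class Inscription
  include AttrJson::Model

  attr_json :location, :string
  attr_json :text, :string
end

class Work < ApplicationRecord
  include AttrJson::Record

  attr_json :title, :string, array: true
  attr_json :date_of_creation, :datetime
  # nested/compound model, serialized inside the same json column
  attr_json :inscriptions, Inscription.to_type, array: true
end

# Used like any ActiveRecord model, including dirty tracking:
#   w = Work.new(title: ["Untitled manuscript"])
#   w.inscriptions = [Inscription.new(location: "p. 3", text: "ex libris")]
#   w.changed?  # => true
#   w.save!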

Modelling will be based on PCDM (Collections; Objects; Files), but we may not feel the need to stick strictly to PCDM. For instance, we may make work->child relationship 1-to-many instead of n-to-m, if it significantly eases a robust implementation. Or allow an internal object representing a ‘file’ to be an element in the ‘member’ relation.

We will likely put all three core model types in one table using Rails Single Table Inheritance — that significantly eases ActiveRecord association modelling, if you want a Work/Object’s children to be both ordered, and polymorphically include other Work/Objects or Files (again not strictly PCDM). attr_json addresses some of the ordinary disadvantages of STI, since varying data can be in the shared json column. (I believe valkyrie’s activerecord adapter takes a similar one-table approach, although it does not use AR associations — I think AR associations are a powerful tool I’m loath to give up).
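
A hedged sketch of what that single-table STI modelling might look like; the table, class, and column names here are hypothetical, not settled decisions:

# All three core types share one table (a `type` column for STI, a jsonb column
# for attr_json data, and a `position` column for ordering members).
class BaseModel < ApplicationRecord
  self.table_name = "models"
  include AttrJson::Record

  belongs_to :parent, class_name: "BaseModel", optional: true
  has_many :members, -> { order(:position) },
           class_name: "BaseModel", foreign_key: :parent_id
end

class Collection < BaseModel; end
class Work       < BaseModel; end
class FileSet    < BaseModel; end  # an internal object representing a 'file'

# work.members can then include both child Works and FileSets, in order,
# loaded with ordinary ActiveRecord querying and eager-loading.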

We will plan from the start to prioritize efficient rdbms querying over ontological purity: dealing with avoiding n+1 queries (or worse) when doing expected things like displaying a list of objects/works with thumbnails, or displaying all a work’s members with thumbnails. In some cases we may resort to recursive CTEs; some solution will be built into the tool, this isn’t something to make developer-users figure out for themselves in each implementation.
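
As a rough illustration of the kind of querying meant here, continuing the hypothetical names from the sketch above: ordinary eager loading for list pages, and a postgres recursive CTE where a whole tree is needed.

# Avoiding n+1 when listing works with their members (e.g. to find thumbnails):
works = Work.includes(:members).limit(20)

# Fetching all descendants of one work in a single query with a recursive CTE:
descendants = BaseModel.find_by_sql([<<~SQL, work.id])
  WITH RECURSIVE tree AS (
    SELECT * FROM models WHERE parent_id = ?
    UNION ALL
    SELECT m.* FROM models m JOIN tree t ON m.parent_id = t.id
  )
  SELECT * FROM tree
SQL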

While attr_json doesn’t technically require postgres, we will likely require postgres for use of toolkit, so when providing out of the box solutions to common needs (including efficiency/performance), we can use features with database-specific elements like jsonb, recursive CTEs, and postgres full text search.

We may provide a six-alphanumeric-primary-key approach similar to sufia/hyrax/samvera, possibly implemented with postgres functions. One way or another, we need to migrate our existing data and keep URLs and possibly internal IDs consistent.
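
A minimal sketch of the Ruby-side version of that idea (a postgres-function implementation would move the same logic into the database); purely illustrative, with collision handling omitted:

class BaseModel < ApplicationRecord
  # Assumes a string (varchar) primary key column. Generates a 6-character
  # lowercase base-36 id, sufia/hyrax "noid"-style.
  before_create do
    self.id ||= rand(36**6).to_s(36).rjust(6, "0")
  end
end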

Controllers

Basic CRUD functionality, and possibly not a whole lot more, will be provided by controllers that look an awful lot like the out-of-the-box Rails scaffolding controllers.

Additional hooks will likely be included for customizing certain things (authorization, strong params) — but the goal is to keep the controller simple (and unchanging-over-time) enough that a developer-user wanting to customize will be safe to simply copy-paste the entire controller and modify it appropriately. Any complex or domain-specific functionality in the toolkit will be in service/helper objects that can be re-used in a local controller too, rather than the controller itself.
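
Concretely, the provided controller would look something like standard Rails scaffolding plus a couple of hooks. A hedged sketch, with all names illustrative (index, show, and destroy omitted for brevity):

class WorksController < ApplicationController
  before_action :set_work, only: [:show, :edit, :update, :destroy]
  before_action :authorize_action!   # hypothetical authorization hook

  def new
    @work = Work.new
  end

  def create
    @work = Work.new(work_params)
    if @work.save
      redirect_to @work, notice: "Work was successfully created."
    else
      render :new
    end
  end

  def update
    if @work.update(work_params)
      redirect_to @work, notice: "Work was successfully updated."
    else
      render :edit
    end
  end

  private

  def set_work
    @work = Work.find(params[:id])
  end

  # strong params hook, likely driven by the property registrations (see below)
  def work_params
    params.require(:work).permit(:title, :date_of_creation)
  end
end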

Config: chamber

Where our toolkit code needs to allow deployment/institution-specific configuration, I’m leaning toward using chamber. The built-in Rails stuff is kind of weird, and has been changing a lot with every Rails version (because it’s kind of weird and they keep trying to make it less so).  But chamber is suitably flexible that it should be able to meet almost any consuming institution’s needs. We may provide some extensions to chamber.

Asset management: shrine

In addition to metadata modelling and persistence, handling bytestreams/digital assets is the other fundamental core part of our domain. Both originals (often preservation copies) and derivatives.

Shrine is a “file attachment toolkit” for ruby, with Rails integration included. Shrine was motivated by a lack of flexibility in other existing file attachment solutions. The result is something that is accurately called a toolkit — while it can be a bit harder to get started with in a fresh Rails app, it provides composable components that can be rearranged to meet domain needs, which makes it very well-suited for our toolkit, where flexible asset-handling is important to reducing total cost of development. While Rails has introduced ActiveStorage as a built-in solution, we don’t think it’s flexible enough for our asset-centered domain.

Shrine already handles storing files in a back-end agnostic way (we will focus on S3 for production and optionally local file system for dev, but it should be feasible for you to choose other shrine adapters with minimal changes needed to other code). Shrine already handles streaming, full-file, and URL access APIs regardless of back-end. Shrine already has some architecture built out for ultra-flexible derivatives, storing checksums, etc. (Reflection on what derivatives are already there should make it easier to build out, say, a UI for a menu of downloads based on what derivatives you have created and tagged as a download).
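
For orientation, a minimal sketch of shrine configuration along these lines, using shrine’s documented API; the bucket names and the particular checksum metadata are just examples, and AWS credentials are assumed to come from the standard mechanisms:

require "digest"
require "shrine"
require "shrine/storage/s3"
require "shrine/storage/file_system"

s3_options = { bucket: "my-originals-bucket", region: "us-east-1" } # illustrative

Shrine.storages = {
  cache: Shrine::Storage::S3.new(prefix: "cache", **s3_options),
  store: Shrine::Storage::S3.new(**s3_options)
  # or, in development: Shrine::Storage::FileSystem.new("public", prefix: "uploads")
}

Shrine.plugin :activerecord
Shrine.plugin :determine_mime_type

class AssetUploader < Shrine
  plugin :add_metadata

  # store a checksum alongside shrine's standard metadata at ingest time
  add_metadata :sha512 do |io|
    Shrine.with_file(io) { |file| Digest::SHA512.file(file.path).hexdigest }
  end
end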

The current Shrine derivatives architecture (which it calls “versions”) doesn’t quite match my idea of requirements: I want to be able to optionally store derivatives in  a different backend than originals (eg different S3 bucket), or even different buckets for different derivatives. The shrine author has given me some advice on how to achieve that, and it also aligns in some ways with existing discussion on desired rewrite of the shrine “versions” plugin. I will likely spend significant time on developing (and documenting) a new shrine derivatives/versions plugin meeting our needs, and hopefully to be contributed back to shrine possibly as new standard plugin.

Additionally, while shrine itself prioritizes a solid, fairly low-level API (to achieve its composability/flexibility goals), I think our toolkit needs some slightly higher-level API, probably also developed as a shrine plugin, that lets you define derivatives in a more declarative way, making it easy and more DRY to do things like create/re-create/delete named derivatives for already existing objects, or declaratively specify foreground/background/on-the-fly processing.
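
Purely as a hypothetical illustration of the kind of declarative API meant here: none of these method names exist in shrine today, this is roughly the shape of the plugin we would be writing.

class AssetUploader < Shrine
  plugin :toolkit_derivatives   # hypothetical toolkit plugin

  # Each definition says how to build one named derivative from the original;
  # the plugin would handle storage location, re-creation for existing objects,
  # and foreground/background/on-the-fly processing.
  define_derivative :thumb_small, storage: :derivatives_bucket do |original_file|
    # return a File/Tempfile containing the derivative bytes,
    # e.g. produced with image_processing/vips
  end

  define_derivative :download_medium, background: true do |original_file|
    # ...
  end
end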

The toolkit will likely provide its own recommended combination(s) of shrine config, as ‘scaffolding’, either as toolkit-specific shrine plugins, generated code, or both. It will be easy to change much of this config locally.

A metadata properties configuration system

In some current samvera platforms, when you add a new metadata property, there can be a dozen+ different files — in some cases depending on local architecture decisions — you need to edit to mention the new property. This is not only annoying, but error-prone to try to keep changes consistent across all those files.

I think this is a big enough barrier to developer ease of use, common enough to almost all uses, that it deserves an architectural solution.

We will provide an extensible properties configuration system that lets you configure multiple things about the property in the original model. It might look something like this:

class Article < Work
  our_toolkit_property "title" do
    attr_json :string, array: true
    rdf_predicate "http://purl.org/dc/terms/title"

    indexing do
      to_field "title_text", first_only
    end

    edit_form do
      position 0
      simple_form wrapper: :small
    end

    include_in_results_list true
    include_in_detail_page true
  end
end

The biggest potential problem with such a “centralized” property configuration system is — what if you need different configuration in different situations? Say one form field type in one place and a different in another, or even multiple indexers for different contexts.

This system will be designed from the start to support that too — what is in the model could be considered defaults, but all components using values from the properties configuration will take them in a clearly documented format, such that a developer can choose to pass in alternate values instead of what was registered in the model.

Edit forms: simple_form

simple_form is a Rails form toolkit. We can use it to make forms ‘just work’ (with field-level error reporting etc), including for some of our custom toolkit features (nested models), in a composable and overrideable way.  We will follow in the footsteps of hyrax and many other projects in using it, probably set up for Bootstrap 4.

We will provide a custom form builder that can use (in some cases automatically, from properties definitions, see above) a suite of custom inputs for features built into the toolkit or common to our domain: including nested models, repeatable elements (perhaps using cocoon, which attr_json is already compatible with), and vocabulary auto-complete (probably based on questioning_authority, but built out to save vocabularies and IDs in nested models, instead of just saving values; we may need to send some PRs to qa if its APIs aren’t suitable).
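
As an example of the kind of component meant here, a rough sketch of a custom simple_form input for a repeatable (array) attribute, using simple_form’s documented custom-input mechanism; the class name and rendering are illustrative:

# app/inputs/repeatable_text_input.rb
class RepeatableTextInput < SimpleForm::Inputs::Base
  def input(wrapper_options = nil)
    merged_options = merge_wrapper_options(input_html_options, wrapper_options)

    # one text field per existing value, submitted back as an array param
    Array(object.public_send(attribute_name)).map { |value|
      @builder.text_field(attribute_name, merged_options.merge(value: value, multiple: true))
    }.join.html_safe
  end
end

# In a form view: f.input :title, as: :repeatable_text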

We’ll also provide custom simple_form inputs for file upload (supporting direct-to-s3 uploads in a shrine-compatible way; hopefully supporting browse_everything), and permissions editing (see below).

We will probably provide a wrapper form for all the inputs that can use information from the property registrations (see above) to specify order of inputs.

We’ll provide a custom simple_form input for setting relationships that lets you search for core model objects (collections, works, files) with an autocomplete-style UI, and assign them to associations.

If you want to customize this beyond the simple things you can do, you’ll just use your own form view where you compose the simple form inputs yourself. The built-in form can be considered scaffolding to quickly get started, although it may be sufficient for simple use cases.

Staff (back-end) UX

Staff UX needs go beyond just forms — for one thing you need a way to find/get things to click on to get their edit forms. We will provide some limited back-end UX: ability to list/search/sort collections/works/files, and limit to just things you have certain permissions on.

Back-end UX is going to be kept pretty limited. Specific applications can build it out themselves if they need something beyond the basic scaffold.

The toolkit will need UX for adding/removing/re-ordering members of a parent. It will probably also provide a batch ingest/edit function of some kind, because that does seem to be a common need, and is one my institution needs.

We’ll need to do a bit more local staff user/requirements analysis to see what the minimum we need is (info from others with similar use cases welcome too), and decide which additional parts the toolkit should provide or just make clear how you’d provide them yourself. We aspire to provide high-quality staff UX for the simple targeted use cases.

Authorization/Permissions

A flexible and sane permissions/authorization system is very challenging; many efforts have run aground on it, and a complete analysis of requirements can be very complex.

But we’re going to try to create a system anyway.  It will be based on an ACL model. Each ACL entry will relate an object (collection, work, file; “object”), a user or group (“subject”), and an operation (read, write, etc). They’ll be represented as ordinary normalized schema db objects (three-way join).

In addition to user or group as subject, we’ll have special subjects for “all logged in users”, and “public” (don’t even need to be logged in).  These will probably still be just ACL entry rows.
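
A hedged sketch of what such an ACL entry might look like as an ordinary normalized AR model; the names and columns are illustrative, continuing the hypothetical BaseModel from the modelling section:

class AccessGrant < ApplicationRecord
  OPERATIONS = %w[list read download add_member edit own].freeze

  # the "object" side: a collection, work, or file
  belongs_to :target, class_name: "BaseModel"

  # the "subject" side: a user or group; special subjects like "public" or
  # "all logged in users" could be rows with subject left nil and a flag set
  belongs_to :subject, polymorphic: true, optional: true

  validates :operation, inclusion: { in: OPERATIONS }
end

class BaseModel < ApplicationRecord
  has_many :access_grants, foreign_key: :target_id
end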

We’ll have a built-in list of hierarchical operations (hierarchical meaning that if you have a higher one, it implies all the lower ones). These will likely be, in order from least powerful to most powerful:

  • list (see in results and lists)
  • read (see show page)
  • download (download assets)
  • add_member (eg if Collection is object, gives you ability to add things to it)
  • edit (edit metadata but _not_ permissions)
  • own (edit permissions)

The built-in controllers (and end-user-facing discovery) will out of the box do the right things with these permissions, but these will also be developer-editable: you will be able to add additional operations to the hierarchical list, as well as use additional operations that are not in the hierarchical list but are just stand-alone.  You’d have to edit (i.e., probably create your own) controllers/views to make use of your new permissions — again, we plan all of this as scaffolding which can be modified by writing code using the tools we give you.

While there will be an out of the box input UI element allowing you to edit these permissions directly, it’s expected that many apps will instead set them as part of workflows or other events, within UI  constrained to less than the full flexibility of the ACL system. A developer would do this just by writing code to set things ‘automatically’ in controllers and/or writing a more limited UI component. This system is a low-level acl architecture, it should support higher-level application-specific architectures on top.

APIs need to support fetching from the db with arbitrary permission limits (custom AR scopes); fetching from Solr with arbitrary permission limits (not necessarily using blacklight_access_controls, we may likely create a new thing that works more the way we’d like, perhaps using solr joins, perhaps not); as well as checking can? on in-memory objects (possibly built on access_granted, which seems simpler and more performant than cancancan while still achieving what we need).
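
A rough sketch of the AR-scope side of that, using the hierarchical-operations idea and the illustrative AccessGrant model from above; group and “public” subjects would add further clauses, and all names remain hypothetical:

class BaseModel < ApplicationRecord
  has_many :access_grants, foreign_key: :target_id

  # Records a user can perform `operation` on: any grant at or above that
  # operation in the hierarchy counts.
  scope :permitted_to, ->(operation, user) {
    implied = AccessGrant::OPERATIONS.drop_while { |op| op != operation.to_s }
    joins(:access_grants).where(access_grants: {
      operation: implied,
      subject_type: "User", subject_id: user.id
    })
  }
end

# BaseModel.permitted_to(:edit, current_user)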

There will also be a superuser or admin class of user that can do everything.

The trickiest thing we do want to include as a requirement is a way for objects to inherit permissions from another object. This needs to be something set on the inheriting object, and generally be live, updating automatically if the inherited-from object’s permissions change. This is pretty tricky to get right, with decent performance on both writes and reads. Not sure if we’ll do it in a way that inherited-from permissions are “cached” on the inheriting object (and need to be updated on writes), or instead just a persisted “pointer” to the inherited-from object (which needs to be followed on access-checking reads).  This is really hard to figure out how to do simply and performantly, but it is also, we think, a key domain requirement, with previous workarounds that have led to lots of code complexity, so it’s important to figure it out from the start and not try to shoehorn it in later.

(Inherited permissions are even harder taking account of the fact that an object may want to inherit from another object, where that other object itself inherits. (file inherits permissions from work inherits permissions from collection?))

End-user (front-end) UI: Blacklight

We will use Blacklight (and Solr) for end-user discovery interface. But we will try to keep it as loosely coupled as possible, in case individual implementations using the toolkit would rather use something else (or nothing at all), or in the case the toolkit later chooses to support something else.

We’re going to try to make the staff/back-end UX actually not use Solr/Blacklight at all. Instead searching can be supported by postgres full-text search. This means you won’t get facets in the out of the box back-end UX (but can have limits in the UI). It also means if you have a very different front-end desired, you won’t need to run Solr at all for back-end functionality.
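
A minimal sketch of how that staff-facing search might work, assuming attr_json’s default json_attributes jsonb column; a real implementation would likely index a tsvector rather than computing it per row:

class BaseModel < ApplicationRecord
  scope :admin_search, ->(query) {
    where(
      "to_tsvector('english', json_attributes::text) @@ plainto_tsquery('english', ?)",
      query
    )
  }
end

# Work.admin_search("manuscript").order(updated_at: :desc)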

We won’t generally be automatically customizing the Blacklight UI, implementers will do that themselves using ordinary Blacklight techniques without mediation of toolkit-specific abstractions.

But there’s one big exception.  While the results list is of course based on solr results, data shown in views (both results and what you get when you click on a result) will not be based on solr — the app will take the IDs returned in the Solr response, and re-fetch them from the rdbms, even for results lists.

This may sound odd, but is actually how the popular generic Rails Solr support gem sunspot works, so there’s some precedent. I think it will allow the software architecture and developer’s mental model to be much simpler, with less duplication or parallel solr vs rdbms implementation — and the performance hit of the extra db query is minimal, especially in the context of legacy samvera performance. This approach lets you deal with n+1 issues purely in terms of the rdbms, not duplicated on the solr side — with the same techniques whether you are on a results list page or an individual show page. It also lets you index into solr only what you need for query results, and not try to also put enough in solr for efficient display — using solr for what it’s best at, and simplifying and focusing your solr indexing decisions.
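
A rough sketch of the mechanism; the hash keys follow Solr’s standard JSON response format, and the eager-loading and model names are illustrative:

# Take just the ids from the Solr response...
ids = solr_response["response"]["docs"].map { |doc| doc["id"] }

# ...then re-fetch from the rdbms with whatever eager loading the view needs,
# preserving Solr's relevance ordering.
by_id   = BaseModel.where(id: ids).includes(:members).index_by(&:id)
records = ids.map { |id| by_id[id] }.compact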

BL will be customized/overridden just enough to have the controller do this extra fetch, and use views based on our AR models, while providing customization hooks for local apps to customize views on a per-object basis.

Indexing to Solr

Indexing will use hooks into ActiveRecord model life-cycles adapted from sunspot. (Sunspot itself is way over-featured/heavyweight compared to what we need, and is looking for new maintainers, so we won’t be using it directly. But it’s a mature Rails/Solr integration solution that has had a lot of hours put into it, so we will be looking to it for ideas and in some cases code to copy).

Indexing will be based on traject. Some additional architecture in traject 3.0 (I am the principal traject developer) will make it easier to integrate here, but we may still need a few new pieces of architecture (like a “reader” based on ActiveRecord objects, and some transformation tools based on ruby objects as source records).  Basing it on traject should make it straightforward to have really performant bulk (re-)indexing routines, as well as the ordinary model-lifecycle-event indexing triggers.  You’ll be able to do simple indexing configuration in the model “properties registration”, or more complex stuff in standalone indexer objects.

You will of course easily be able to turn off indexing entirely if you aren’t using blacklight/solr, or, still using our AR-lifecycle-hooks, replace the indexer code with something entirely custom, say, for a non-solr indexing back-end.  You’ll probably turn it off simply by setting the indexer class to nil.
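
A hedged sketch of what those lifecycle hooks might look like; the indexer interface is hypothetical, and the real implementation would delegate the mapping to traject:

class BaseModel < ApplicationRecord
  class_attribute :indexer   # set to nil to turn off indexing entirely

  after_commit :update_index, on: [:create, :update]
  after_commit :remove_from_index, on: :destroy

  def update_index
    return unless self.class.indexer
    # could also be enqueued to a background job for async indexing
    self.class.indexer.index(self)
  end

  def remove_from_index
    return unless self.class.indexer
    self.class.indexer.delete(id)
  end
end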

Indexing should be configurable to happen asynchronously or synchronously. (Synchronous is required if you need changes to be immediately reflected on the editor’s next page view; which is one reason we’re trying to keep solr out of our back-end staff interfaces, because async makes things so much easier to do performantly.) Ideally it should also be set up in a ‘batched’ way so multiple solr doc changes that happen on one save can be sent to solr in one request, but we may not achieve that in the initial release, although we’ll keep in mind ways we might use traject APIs to achieve it.

To the extent we use Solr dynamic fields, we’ll try to use the ones already defined in default Solr schema. It will also be trivial to simply specify your custom solr field names. As much as possible, we’ll avoid need for any custom solr schemas.

Preservation-related features

Our current sufia-based app has only what we could consider basic preservation-related features, so the baseline minimum for an MVP 1.0 of the toolkit is likewise basic.

Mainly “fixity audit“. It is easy and documented to have shrine store checksum signatures (including with S3 direct upload and/or client-side calculation of checksums!). For checksums that can be calculated in a streaming fashion, we can even use shrine’s streaming API to make validating signatures much more efficient. We will support storing and checking a handful of checksums. The toolkit will support logging these checks and some UI visualization of them, based on the work I did to fix the feature in activefedora, and some of our local features. And tasks for bulk fixity checking, which can possibly be done an order of magnitude faster than in hyrax. Where it makes sense, some work can be off-loaded to S3.
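
A small sketch of the streaming verification using shrine’s IO-like interface on uploaded files, assuming a sha512 was stored in the file’s metadata at ingest time (as in the uploader sketch earlier):

require "digest"

# `uploaded_file` is a Shrine::UploadedFile; this works the same whether the
# bytes live on S3 or a local file system, without loading the whole file
# into memory.
def fixity_ok?(uploaded_file)
  expected = uploaded_file.metadata["sha512"]
  digest   = Digest::SHA512.new

  while (chunk = uploaded_file.read(16 * 1024))
    digest.update(chunk)
  end
  uploaded_file.close

  digest.hexdigest == expected
end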

As far as backup copies, our own current (Sufia/fedora-4-in-postgres) system is pretty much limited to bog-standard postgresql backups, and ordinary file system/S3 backups of digital bytestream assets. The initial toolkit release will not support anything more here.

Down the road

There are some features that will not be included in the initial MVP toolkit release (or our initial local app based on it), but for which we made architectural choices to try to facilitate down the line.

While fedora has some support for versioning, with some UI for it in hyrax, I think it’s got some oddities, and it is relatively little used in the samvera community (we don’t really use it at all). So it’s not targeted as an initial requirement. At a later point, we could use the standard paper_trail gem for actual metadata versioning (I’m not sure if fedora-sufia/hyrax supported that or not).  By using standard ActiveRecord, we can use paper_trail, which has many, many developer-hours put into it. It’s still not without some rough edges, which shows how challenging this feature is — particularly around associations (which aren’t always handled great in fedora/hyrax versioning either, but the way the data is modelled in fedora/af/hyrax makes some things easier here, while other things harder).  One reason we chose schemaless attr_json-based modelling is to limit the number of associations/relationships involved. For actual bytestream/asset versioning, it seems likely what is built into S3 is sufficient.

It would be great to have import/export based on (at least theoretically) implementation-independent formats, which can be used as a preservation copy. Eventually I would like to see round-trippable (export, and then import in our same software to restore) BagIt serialization, which could be used for preservation backup. This is pretty straightforward using a ‘just write ruby’ approach, but has some challenging parts around associations, among other things. (Note that theoretically implementation-independent doesn’t mean any existing alternate implementation necessarily exists that can import them, which is practically true for most of our community’s current attempts here; but it still has preservation value to try to make it easier to write such an alternate implementation.)

Other things not included (at least initially)

Let’s emphasize again, there will be no built-in workflow support. The goal is instead to provide an architecture that doesn’t get in your way when you want to write your own workflow implementation.

There will also be no “notification”/”activity feed” support in initial toolkit.

OAI-PMH is something our local app needs. It may or may not be included in the initial toolkit though, vs being a purely local implementation. Theoretically blacklight_oai_provider could be used (by a local implementation), but it may make some assumptions about the way you use your BL app that the typical toolkit app is unlikely to meet. The lower-level ruby-oai gem is also a possibility.

Embargoes/leases are probably not in the initial implementation, simply because our own app does not use them. If the toolkit comes to support them, I think they should be based on expressed boundary dates automatically enforced at read-time, rather than requiring a process to check leases/embargoes and set other access control information accordingly.

Our local app uses a custom locally-written JS pan-and-zoom paging viewer, instead of the popular UniversalViewer. The initial toolkit may not have built in support for either, but should have clear “developer stories” about how you’d add them.

At one point I had considered trying to use cells as a basis for the front-end, as I really like its architecture and think it would provide for easier customizability of initially shared resources — but I ultimately decided it was too new/unproven/new-to-me, kind of violated our “when in doubt stick to Rails” approach, and would raise the complexity (and difficulty of success) of the toolkit too much.

Similarly, I considered using a fancy modern Javascript view system for at least some parts of the UI, but decided against it on similar grounds. There’s just too much to figure out about best practice patterns in this context, at least from my experience, it would raise the difficulty of success too much.

In general, I don’t have a good handle on how we’re going to use the modern JS we will need in a Rails environment. Not sure if using the new Rails webpacker is the way to go, it might be, but I don’t have a good handle on it. The initial release may have a less than optimal javascript architecture.  

Analysis and Evaluation

If we are right that there’s a value proposition in a smaller, Rails-aligned, shared codebase (so we can make it really solid), and if we successfully figure out the right design of such a developer’s toolkit and pull off its implementation… then the proposition is that we’ll have a platform/toolkit for developing digital collections/repository applications very efficiently, throughout the entire application lifetime: from your initial product launch, including operational infrastructure, through continued maintenance and enhancement to meet evolving needs.

And further, while there are other approaches in our community that are in progress trying to reach this same goal, as discussed in the last post,  the proposition is that we can get to a mature, polished, efficient-developer-cost toolkit quicker starting over along these lines, compared to those other approaches.

But that’s a lot of ifs, and this is a potentially expensive project proposal. It’s hard to estimate, and more work needs to be done, but at this point I’d estimate 12-18 months to us having a 1.0 release of toolkit and an in-production app based on it.

How can one evaluate the chances of success? This post and the previous one try to provide enough information and argument that it is plausible. But ultimately, there’s no way around having experienced developers/engineers make a judgement call — like with all technical decisions, although this is a particularly weighty one. I could say that I personally have some demonstrated experience making developer tools which have proven to be low-TCO over a long time period (bento_search, traject), but this is a much more ambitious plan, and its success is not guaranteed.  It’s maybe a bit of a “moon shot”, but when you want to go to the moon, sometimes it’s worth a gamble.

I think there’s no way for you, dear reader, to evaluate whether this is a worthwhile thing to possibly participate in, except through the judgement of experienced developers/engineers. If your organization is making choices of technical platforms without basing them on an understanding of local business/user needs, and without giving significant decision-making weight to the technical judgements of experienced engineers, I would say you aren’t maximizing your chances of success in engaging in software development — and (I would argue) you are probably already doing software development if you are doing samvera-based apps.

There isn’t much to say about the “upside” other than that — these whole two articles have been an investigation of potential upside — but we can say a lot more about the risks and downsides. They are somewhat different for my own institution (starting with a sufia 7 app, and considering initiating this plan) than they might be for another institution considering coming on later in another context. And there are short-, medium-, and long-term categories of risk. I will try to delineate some of those risks and costs.

Our Institutional Calculus

We have a sufia 7.4 app (on Rails 5.0, with sufia not supporting more recent Rails), and have to do something.  That affects our cost/benefit/risk calculation.

We could try to upgrade/migrate to Hyrax 2.2.0. This would definitely take less time than writing this new toolkit, but I think could still easily take several months. At the end of it, we’re still on the fedora/hyrax architecture that we’ve found so difficult to work with, so while we have more supported dependencies, we haven’t necessarily reduced our TCO or agility to produce new features. We could be hoping for hyrax eventually being based on valkyrie — this is more likely if we contribute development effort to get there. How much is hard to predict, as is, in my opinion, how much we’re going to like where we get when we get there.  And architectural work on hyrax to resolve some of the other challenges we’ve had with it goes beyond just getting it on valkyrie.

We could try rewriting our app based on valkyrie, plum->figgy style. This actually isn’t too different than the proposal here, we’re still rewriting an app from scratch. Using valkyrie instead of just active_record (and the attr_json gem which is already fairly polished) — it’s not clear to me this will make the development any easier. On the one hand, possibly we’d be able to share more code than just valkyrie with other valkyrie-using institutions — but it’s not entirely clear at this point how much or how mature/polished we can make it. On the other hand, developing on fairly young valkyrie instead of mature well-understood ActiveRecord will, I think, create additional cost, and, in my judgement, additional ongoing cost making the architecture more expensive to work with. If we’re not actually all that excited about ending up on valkyrie (having no desire to be able to switch to fedora), it’s not clear what the benefit here would be.

One surprising, rarely-mentioned possibility: we could put in work to get sufia working on Rails 5.2 or the upcoming 6.0, and do an unexpected additional sufia release. It’s not entirely clear how much work this would take, but it may be the cheapest option (still a few months?) — at the end, though, while we’ve solved our problem of being on a maintained Rails version, we’re still stuck with relatively unsupported/unmaintained software, using a stack we’re not happy with, with such complexity that it will probably continue to “degrade”.  This is sort of the decision to do the minimal amount of immediate work possible, avoid making a decision, and just push it down the road further.

If this proposed development toolkit plan works, we’ll be in a great spot. But it requires some significant time when we are spending development time on things that do not help our current production app, to get to a point where increased efficiency lets us catch up and ‘pay for’ the time we spent. And it involves some non-trivial risk.

I think for my local institution, if we want to, and believe we have the resources/context to, take the risks to be innovators and “thought leaders” here, it’s worth taking the bet on trying to develop this new architecture. If we discover it’s not working, we try something else — you can only get the greatest benefit by taking some risk. But if we want to play it safe and don’t think we can afford (politically, budgetarily) taking a risk, it may not make sense.

Short-term Risks: Failure to launch

It’s possible we simply could fail to finish this project. It could become apparent at some point in the development process that we will not achieve our goals, or that it’s going to take much longer than we hoped (like foreseeing that after a year of effort we would still barely be partway there).

This could be exacerbated if institutional needs require us to reduce the amount of time we spend on building out the architecture, or minimize the amount of time we can spend investing in something whose returns may be a year or more away.

It could possibly be addressed by trying to reduce the scope of the re-usable toolkit to just focus on getting our app launched (which one could say is the approach Princeton and PSU are taking), but there’s still some risk that as we try, we find it’s just not going to work.

Medium-Term Risks: Failure to Catch On

It makes sense to try to build a re-usable toolkit, rather than just our app, both as a service to the community, and for enlightened-self-interest reasons, to get (eventually) more developers/institutions working on it, and building a community of mutual-support around it.

We could get to the “finish line” of a deployed app, but we could find that other institutions/developers do not actually find it easy to learn or work with. It may not be as flexible as we aspire to for more generalized use cases, may not be applicable to as many other institutions’ business requirements, limiting the potential community.  As we’re intentionally prioritizing high-quality modular tools over features, our tools need to be successful at lowering cost of building software with them, or the project is not a success.

ActiveRecord may end up not being a good foundation for our needs — in proposing to use Single-Table Inheritance, we’re already using a feature that is sometimes considered a bit off the beaten path of Rails. ActiveRecord could end up getting in our way more than anticipated, and thus increasing development costs.  We think we can make a toolkit which a fairly beginner developer can use, but we may fail and end up with one that is confusing to beginners.

We could successfully deploy our app based on the toolkit, but if we end up being the only institution using the toolkit, the cost/benefit proposition changes somewhat.

Succession Planning

Related to size of community is succession planning — if all your developers leave, and you need to bring new developers on to an existing project, how hard will that be?

There has been an assumption that by “doing what everyone else is doing”, you have a community of people who could step in. However, I think experience has shown this hope was very significantly inflated.

Taking over a ‘legacy’ project is always tough in software dev, always takes a somewhat experienced developer, and it’s a real issue to pay attention to in any software development effort.

The number of experienced-with-samvera developers who can easily take on a legacy project from another institution is fairly small, and most are centered in a few well-paying institutions. Last time we posted a developer position, we got no applicants with previous samvera experience at all, and almost no applicants with previous ruby or Rails experience.  Having to bring up a new developer with ruby, Rails, and samvera is not a low barrier. Both Esmé Cowles (“there are many more Rails developers than Samvera developers“) and Steven Anderson, formerly of BPL (“Have you tried showing someone else all that is involved in doing a good Hydra Head? …If either Eben or I left, it would take half a year for someone to become even ‘competent’ in our stack.”), spoke to the challenges of getting a developer previously familiar with Rails up to speed on existing samvera.

Regardless of whether a community develops around this toolkit, I actually feel pretty confident that succession planning is no worse, and probably better, for this approach than for all the other approaches being investigated that were discussed in the last post. This is one risk I think is actually not very high.

But one thing you get with samvera is also a community of people supporting your learning. Can we remain part of the samvera community of mutual-support doing a somewhat different thing? I hope so, and I think we’re all about to find out one way or another, because the community is trying different approaches with or without this one. But it’s not guaranteed.

Long-Term Risks: Failure to Support

Our proposition is that by creating a simpler, smaller-surface-area toolkit, we can design it well enough that it can maintain a very stable and backwards-compatible API. While I have had some success there with bento_search and traject, we could fail, in a way we only realize years down the road.

One of the ways we propose to keep our toolkit simple is by relying on existing software. If (eg) shrine were to stop being maintained by its existing developer (or have releases with significant backward incompatibilities), we’d be in some trouble, at best significantly increasing the cost of maintaining the toolkit. Same with some of our other dependencies. While attr_json tries to use public and likely-to-remain-stable Rails APIs, if Rails changes sufficiently to make it hard to keep attr_json working, that could also significantly increase development effort.  All of these things might only be discovered down the line.

On the other hand, the efficiencies of the toolkit could be enough to allow more of our institutional or community development time to be contributed back to general Rails dependencies, to help them stay maintained. The initial plan for instance involves contributing back a shrine plugin.

These long-term risks are to some extent common to any development project, and other approaches probably have similar levels of this kind of risk.

Snatching success from the jaws of failure?

Even if the project does not reach the level of success desired, in any of the ways outlined above, it might still provide ideas, patterns, approaches, that could influence other approaches in samvera community. In the best case, it could produce re-usable modular components that themselves could even be re-used in other approaches (valkyrie-based and/or hyrax? I’m not certain how plausible that is).

This can apply to “sharing with ourselves” too, if we decide to change approach in the middle or at any point, we may be able to re-use some of what we’ve already done anyway. (I think the shrine-based approach is a particular contender here; shrine itself doesn’t even require or depend on Rails).

Because we aim to produce reusable, modular, composable components with loose coupling, based on the commonality of Rails, it may increase the likelihood of some code-sharing. On the other hand, if other approaches aren’t using ActiveRecord, both they and we may find ourselves more coupled to our persistence layer API than we’d like — it can be hard to avoid persistence-layer-approach coupling.

Interested? What’s next?

While the final portion of this post was investigating risks and possible disaster scenarios, I actually do feel positive about this approach. While there is no guarantee of success, I think it has the best chance of getting us to a place of minimized engineering costs (compared to alternatives) within a 1-2 year timeframe.

But we have not yet committed to this approach here at my institution. We will be aiming to decide our directions in the next month or so.

Interest from other institutions could affect our decision-making, and of course collaboration would be welcome if we do embark on this plan; we’re interested in identifying potential collaborators as soon as they reveal themselves.

Initially, collaboration probably wouldn’t take the form of actually committing code. As Fred Brooks argues in The Mythical Man-Month, adding more developers to a project doesn’t always result in shorter time-lines, and this is more true the more architectural design work is involved. And we’ve got a lot of that right now. Initially, collaboration would probably take the form of expressing interest, reviewing documentation or progress, maybe code review, and most importantly trying out the code as it is produced, with feedback on whether it looks like something that would help you with your use cases and that you’re still interested in using. But there will certainly eventually be code collaboration opportunities as well, especially filling out certain use cases that you have but we may not.

So let us know if you find this plan exciting?

Additionally, if anyone has any ideas about grant opportunities, I guess it goes without saying that that could be useful.  Theoretically grant funding would be especially useful for relatively high-risk but high-potential-reward projects, they are the ones most likely to not be done without external support. The low-risk stuff, you’re going to do anyway!  I’m not sure granters in our sectors think that way, but be sure to let me know if you know of some who might.

I also really welcome anyone challenging or pushing back on anything in here, please feel free, in comments here, slack, email, whatever. From discussion and debate we spiral to a higher understanding.


2 thoughts on “Proposed Rails-based digital collections developer’s toolkit”

  1. Thanks for the feedback/discussion Esmé!

    I think it’s correct to pay attention to the potential maintenance burden/risk of attr_json. But I think the fact that I already have several interested non-LAM Rails developers (at least one of whom is already using it in production) is a good sign that it has _not_ been custom-fitted to our specific use cases; it is a general-purpose persistence solution, similar to valkyrie. (Do check out attr_json before forming an opinion, if you can!)

    In that sense, it’s like Valkyrie in that it’s a new thing (at both the modelling and persistence layers) that will require ongoing maintenance/development to be sustainable. With Valkyrie you’ve already demonstrated multiple institutions presently interested in doing development on/with it. attr_json’s _long term_ sustainability will probably require the same. As with any software, there is no guarantee that present interest will predict future commitments though. Just part of the risk of doing software dev, I think.

    I expect my proposed approach will have growing pains — just as I expect valkyrie to continue to. ANY new approach will. A variety of institutions have decided that the risks are worth several different new approaches. I’ve tried to explain in these posts why _I_ think this particular approach is preferable, and arguably lower cost overall, compared to other “new approaches”, but different technical teams will reasonably make different determinations there.

    I totally agree that at this point, our community needs experimentation with different directions. Thanks for saying that. I think the implications of that are that if you are an institution that is not prepared to do professional software development and have experienced engineers evaluating technical options…. things are going to be hard for you. One way or another, I think institutions realizing that doing open source like samvera requires professional software development teams is the necessary outcome — regardless of which directions those teams go in.
