very rough benchmarking of Solr update batching performance characteristics

In figuring out how I want to integrate a synchronized Solr index into my Rails application, I am doing some very rough profiling/benchmarking of batching Solr adds vs not, just to get a general sense of it.

(This is all _very rough estimates_ and may depend a lot on your environment and Solr setup, including how many records you have in Solr, if Solr is being simultaneously used for queries, etc).

One thing some Solr (or ElasticSearch) integration packages sometimes end up concentrating on is batching multiple index-change-needed events into fewer Solr update requests.

Based on my observations, I think it’s not actually the separate HTTP requests that are expensive (although I’m benchmarking with a Solr on localhost).

But the commits are — if you are doing them. In my benchmarks reindexing a whole bunch of things, if I’m not doing any commits, whether I batch into fewer HTTP update requests to Solr or not has no appreciable effect on speed.

But sending a softCommit per record/update makes it around 2.5x slower.

Sending a (hard) commit per record makes it around 4x slower.

Even without explicit commit directives, if you have your Solr set up to autocommit (soft or hard), it may of course occasionally pause to do some commits, so your measured time may depend on whether you hit one of those.

So if you don’t care about realtime/near-realtime, you may not have to care about batching. I had already gotten the sense from Solr’s documentation that Solr will really like it better if the client never sends commits, but just lets Solr’s autoCommit/autoSoftCommit/commitWithin configuration make sure updates become visible within a certain maximum amount of time. The reason to have the client send commits is generally because you need to guarantee that the updates will be visible to queries as soon as your code doing the update is finished.

The reason so many end up caring about batching updates might not be because individual HTTP requests to Solr are a problem, but because too many _commits_ are. So if for some reason it was more convenient, only sending a commit per X records might be just as good as actually batching HTTP requests (if you have to send commits from the client at all).
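To make the "commit per X records" idea concrete, here is a minimal sketch of it, not code from my benchmark. The class name ThrottledCommitIndexer, the core name example_core, and the injected sender callable are all made up for illustration; the sender stands in for a real HTTP POST so the sketch runs without a Solr server. It relies only on Solr's update handler accepting a JSON array of documents and a softCommit=true request parameter.

```ruby
require "json"
require "uri"

# Hypothetical helper (not from any Solr client gem): sends one update
# request per record, but only asks Solr for a soft commit on every Nth one.
class ThrottledCommitIndexer
  # sender: any callable taking (url, json_body); an assumption standing in
  # for a real HTTP POST with Content-Type application/json.
  def initialize(update_url:, commit_every:, sender:)
    @update_url = update_url
    @commit_every = commit_every
    @sender = sender
    @sent = 0
  end

  # Sends a single document, un-batched, in its own request.
  def add(doc)
    @sent += 1
    params = {}
    params["softCommit"] = "true" if (@sent % @commit_every).zero?
    @sender.call(url_with(params), JSON.generate([doc]))
  end

  private

  # Appends query parameters to the update URL only when there are any.
  def url_with(params)
    uri = URI.parse(@update_url)
    uri.query = URI.encode_www_form(params) unless params.empty?
    uri.to_s
  end
end

# Usage: collect the requests instead of sending them anywhere.
requests = []
indexer = ThrottledCommitIndexer.new(
  update_url: "http://localhost:8983/solr/example_core/update",
  commit_every: 100,
  sender: ->(url, body) { requests << [url, body] }
)
250.times { |i| indexer.add("id" => "record-#{i}") }
```

If you don't need a visibility guarantee at all, the same shape works with no commit parameter and Solr's own autoCommit/autoSoftCommit configuration doing the work.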

Our progress on new digital collections app, and introducing kithe

In September, I wrote a post on a “Proposed Rails-based digital collections developer’s toolkit”

What has happened since then?

Yes, we decided to go ahead with a rewrite of our digital collections app, with the new app not based on Hyrax or Valkyrie, but on a persistence layer based on ActiveRecord (making use of postgres-specific features where appropriate), and exposing ActiveRecord models to the app as a whole.

No, we are not going forward with trying to make that entire “toolkit”, with all the components mentioned there.

But Yes, unlike Alberta, we are taking some functionality and putting it in a gem that can be shared between institutions and applications. That gem is kithe. It includes some sharable modeling/persistence code, like Valkyrie (but with a very different approach than Valkyrie), but also includes some additional fundamental components too.

Scaling back the ambition—and abstraction—a bit

The total architecture outlined in my original post was starting to feel overwhelming to me. After all, we also need to actually produce and launch an app for ourselves, on a “reasonable” timeline, with fairly high chance of success.  I left my conversation with U Alberta (which was quite useful, thank you to the Alberta team!), concerned about potential over-reach and over-abstraction. Abstraction always has a cost and building shared components is harder and more time-consuming than building a custom app.

But then, also informed by my discussion with Alberta, I realized we basically just had to build a Rails app, and this is something I knew how to do, and we could, as we progressed, jettison anything that didn’t seem actually beneficial for that goal or seem feasible at the moment. And, also after discussion with a supportive local team, my anxiety about the project went down quite a bit: we can do this.

Even when writing the original proposal, I knew that some elements might be traps. Building a generalized ACL permissions system in an rdbms-based web app… many have tried, many have fallen. :)  Generalized controllers are hard, because they are a piece very tightly tied to your particular app’s UI flows, which will vary.

So we’ve scaled back from trying to provide a toolkit which can also be “scaffolding” for a complete starter app.  The goals of the original thought-experiment proposal — a toolkit which provides  pieces developers put together when building their own app — are better approached, for now, by scaling back and providing fewer shared tools, which we can make really solid.

After all, building shared code is always harder than building code for your app. You have more use cases to figure out and meet, and crucially, shared code is harder to change because it’s (potentially) got cross-institutional dependents, which you have to not break. For the code I am putting into kithe, I’m trying to make it solidly constructed and well-polished. In purely local code,  I’m more willing to do something experimental and hacky — it’s easy enough (comparatively!) to change local app code later.  As with all software, get something out there that works, iterating, using what you learn. (It’s just that this is a lot harder to do with shared dependencies without pain!)

So, on October 1st, we decided to embark on this project. We’re willing to show you our fairly informal sketch of a work plan, if you’d like to look.

Introducing kithe

But we’re not just building a local app, we are also trying to create some shareable components. While the costs and risks of shared code and abstractions are real,  I ultimately decided that “just Rails” would not get us to the most maintainable code after all. (And of course nothing is really just Rails, you are always writing code and using non-Rails dependencies; it’s a matter of degree, how much your app seems like a “typical” Rails app to developers).

It’s just too hard to model the data we ourselves already needed (including nested/compound/repeated models) in “just” ActiveRecord, especially in a way that lets you work with it sanely as “just” ActiveRecord, and is still performant. (So we use attr_json, which I also developed, for a No-SQLy approach without giving up rdbms or ActiveRecord benefits including real foreign-key-based associations). And in another example, ActiveStorage was not flexible/powerful enough for our file-handling needs (which are of course at the core of our domain!), and I wasn’t enthused about CarrierWave either — it makes sense to me to make some solid high-quality components/abstractions for some of our fundamental business/domain concerns, while being aware of the risks/costs.

So I’ve put into kithe the components I thought seemed appropriate on several considerations:

  • Most valuable to our local development effort
  • Handling the “trickiest” problems, most useful to share
  • Handling common problems, most likely to be shareable; and it’s hard to build a suite of things that work together without some modelling/persistence assumptions, so got to start there.
  • I had enough understanding of the use-cases (local and community) that I thought I could, if I took a reasonable amount of extra time, produce something well-polished, with a good developer experience, and a relatively stable API.

Kithe already includes the following. It is maybe not 1.0-production-ready, but it is released, well-tested and well-documented, and used in our own in-progress app:

  • A modeling and persistence layer tightly coupled to ActiveRecord, with some postgres-specific features, and recommending use of attr_json, for convenient “NoSQL”-like modelling of your unique business data (in common with existing samvera and valkyrie solutions, you don’t need to build out a normalized rdbms schema for your data). With models that are samvera/PCDM-ish (also like other community solutions).
    • Including pretty slick handling of “representatives”, dealing with the performance issues in figuring out representative to display with constant query time (using some pg-specific SQL to look up and set “leaf” representative on save).
    • Including UUIDs as actual DB pk/fks, but also a friendlier_id feature for shorter public URL identifiers, with logic to automatically create such if you wish.
  • A nice helper for building Rails forms with repeatable complex embedded values. Compare to the relevant parts of hydra-editor, but (I think) lighter and more flexible.
  • A flexible file-handling architecture based on shrine — meaning transparent cloud-storage support out of the box.
    • Along with a new derivatives architecture, which seems to me to have the right level of abstraction and affordances to provide a “polished” experience.
    • All file-handling support based on assuming expensive things happen in the background, and “direct upload” from browser pre-form-submit (possibly to cloud storage)
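As a rough illustration of the “leaf representative” idea from the list above, here is a plain-Ruby sketch. It is not kithe's implementation (which, as noted, uses pg-specific SQL); the Item struct, the in-memory store hash, and the method names are all hypothetical. The assumption is that a record may point at another record as its representative, that record may point at yet another, and the “leaf” is the end of that chain. Computing the leaf once on save means display code needs only a single lookup, however deep the chain is.

```ruby
# Hypothetical in-memory stand-in for database rows.
Item = Struct.new(:id, :representative_id, :leaf_representative_id)

# Follows representative pointers until reaching a record that has no
# representative of its own. In this sketch such a record counts as its
# own leaf (an assumption). Stops if a cycle or a missing record is hit.
def compute_leaf_id(item, store)
  visited = {}
  current = item
  while current.representative_id && !visited[current.id]
    visited[current.id] = true
    target = store[current.representative_id]
    break if target.nil?
    current = target
  end
  current.id
end

# On save, cache the leaf so reads never have to walk the chain.
def save_item(item, store)
  store[item.id] = item
  item.leaf_representative_id = compute_leaf_id(item, store)
  item
end

# Display-time lookup: one fetch, regardless of chain depth.
def display_representative(item, store)
  store[item.leaf_representative_id]
end
```

The sketch leaves out the harder half of the problem: when a record in the middle of a chain changes its representative, every record pointing at it needs its cached leaf updated too.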

It will eventually include some solr/blacklight support, including a traject-based indexing setup, and I would like to develop an intervention in blacklight so after solr results are returned, it immediately fetches the “hit” records from ActiveRecord (with specified eager-loading), so you can write your view code in terms of your actual AR models, and not need to duplicate data to solr and logic for dealing with it. This latter is taken from the design of sunspot.
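The “fetch the hit records from ActiveRecord” step might look roughly like the sketch below. This is not existing kithe or blacklight code, and the names HitRecord, records_for_hits and loader are hypothetical. The loader callable stands in for a single ActiveRecord query (in a real Rails app, something along the lines of a where(id: ids) with includes for eager-loading); it is injected here so the sketch runs without a database. The one real subtlety is that the database returns rows in arbitrary order, so they have to be put back in Solr's result order.

```ruby
# Hypothetical stand-in for an ActiveRecord model instance.
HitRecord = Struct.new(:id, :title)

# solr_ids: ids in the order Solr returned them (relevance order).
# loader:   callable that takes the ids and returns the matching records
#           in any order, ideally from one query with eager-loading.
def records_for_hits(solr_ids, loader)
  by_id = loader.call(solr_ids).each_with_object({}) do |record, index|
    index[record.id] = record
  end
  # Restore Solr's ordering; silently drop ids that are in the index
  # but no longer in the database (a stale index entry).
  solr_ids.map { |id| by_id[id] }.compact
end
```

View code can then work with real model objects, and Solr only needs to store enough to search on and an id.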

But before we get there, we’re going to spend a little bit of time on purely local features, including export/import routines (to get our data into the new app; with some solid testing/auditing to be confident we have), and some locally bespoke workflow support (I think workflow is something that works best just writing the Rails). 

We do have an application deployed as demo/staging, with a basic more-than-just-MVP-but-not-done-yet back-end management interface (note: it does not use Solr/Blacklight at all which I consider a feature), but not yet any non-logged-in end-user search front-end. If you’d like a guest login to see it, just ask.

Technical Evaluation So Far

We’ve decided to tie our code to Rails and ActiveRecord. Unlike Valkyrie, which provides a data-mapper/repository pattern abstraction, kithe expects the dependent code to use ActiveRecord APIs (along with some standard models and modelling enhancements kithe gives you).

This means, unlike Valkyrie, our solution is not “persistence-layer agnostic”. Our app, and any potential kithe apps, are tied to Rails/ActiveRecord, and can’t use fedora or other persistence mechanisms. We didn’t have much need/interest in that, we’re happy tying our application logic and storage to ActiveRecord/postgres, and perhaps later focusing on regularly exporting our data to be stored for preservation purposes in another format, perhaps in OCFL.

It’s worth noting that the data-mapper/repository pattern itself, along the lines valkyrie uses, is favored by some people for reasons other than persistence-swapability. In the Rails and ruby web community at large, there is a contingent that think the data-mapper/repository pattern is better than what Rails gives you, and gives you better architecture for maintainable code. Many of this contingent is big on hanami, and the dry-rb suite.  (I have never been fully persuaded by this contingent).

And to be sure, in building out our approach over the last 4 months, I sometimes ran right into the architectural issues with Rails “model-based” architecture and some of what it encourages like dreaded callbacks.  But often these were hypothetical problems, “What if someone wanted to do X,” rather than something I actually needed/wanted to do now. Take a breath, return to agility and “build our app”.

And a Rails/ActiveRecord-focused approach has huge advantages too. ActiveRecord associations and eager-loading support are very mature and powerful tools, that when exposed to the app as an API give you very mature, time-tested tools to build your app flexibly and performantly (at least for the architectures our community are used to, where avoiding n+1 queries still sometimes seems like an unsolved problem!).  You have a whole Rails ecosystem to rely on, which kithe-dependent apps can just use, making whatever choices they want (use reform or not?) as with most any Rails app, without having to work out as many novel approaches or APIs. (To be sure, kithe still provides some constraints and choices and novelty — it’s a question of degree).

Trying to build up an alternative based on data-mapper/repository, whether in hanami or valkyrie, I think you have a lot of work to do to be competitive with Rails’ mature solutions, sometimes reproducing features already in ActiveRecord or its ecosystem. And it’s not just work that’s “time implementing”, it’s work figuring out the right APIs and patterns. Hanami, for instance, is probably still not as mature as Rails, or as easy to use for a newcomer.

By not having to spend time re-inventing things that Rails already has solutions for, I could spend time on our actual (digital collections) domain-specific components that I wasn’t happy with existing solutions for. Like spending time on creating shareable file handling and derivatives solutions that seem to me to be well-polished, and able to be used for flexible use-cases without feeling like you’re fighting the system or being surprised by it. Components that hopefully can be re-used by other apps too.

I think schneem’s thoughts on “polish” are crucial reading when thinking about the true costs of shared abstractions in our community.  There is a cost to additional abstractions: in initial implementation, ongoing maintenance, developer on-boarding, and just figuring out the right architectures and APIs to provide that polish. Sometimes these costs are worthwhile in delivered benefits, of course.

I’d consider our kithe-based approach to be somewhere in between U Alberta’s approach and valkyrie, in the dimension of “how close do we stick to and tie our line to ‘standard’ Rails”.

Unlike Hyrax, we are building our own app, not trying to use a shared app or “solution bundle” like Hyrax. I would suggest we share that aspect with both the U Alberta approach as well as the several institutions building valkyrie-not-hyrax apps. But if you’ve had good experiences with the over-time maintenance costs of Hyrax, you have a use case/context where Hyrax has worked well for you — then that’s great, and there’s never anything wrong with doing what has worked for you.

Overall, 4 months in, while some things have taken longer to implement than I expected, and some unexpected design challenges have been encountered — I’m still happy with the approach we are taking.

If you are considering a based-on-valkyrie-no-hyrax approach, I think you might be in a good position to consider a kithe approach too.

How do we evaluate success?

Locally,

We want to have a replacement app launched in about a year.

I think we’re basically on target, although we might not hit it on the nose, I feel confident at this point that we’re going to succeed with a solid app, in around that timeline. (knock on wood).

When we were considering alternate approaches before committing to this one, we of course tried to compare how long this would take to various other approaches. This is very hard to predict, because you are trying to compare multiple hypotheticals, but we had to make some ballpark guesses (others may have other estimates).

Is this more or less time than it would have taken to migrate our sufia app to current hyrax? I think it’s probably taking more time to do it this new way, but I think migrating our sufia app to current hyrax (with all its custom functionality for current features) would not have been easy or quick, and we weren’t sure current hyrax was a place we wanted to end up.

Is it going to take more or less time than it would have taken to write an app on valkyrie, including any work we might contribute to valkyrie for features we needed? It’s always hard to guess these things, but I’d guess in the same ballpark, although I’m optimistic the “kithe” approach can lead to developer time-savings in the long-run.

(Of course, we hope if someone else wants to follow our path, they can re-use what’s now worked out in kithe to go quicker).

We want it to be an app whose long-term maintenance and continued development costs are good

In our sufia-based app, we found it could be difficult and time-consuming to add some of the features we needed. We also spent a lot of time trying to performance-tune to acceptable levels (and we weren’t alone), or figure out and work towards a manageable and cost-efficient cloud deployment architecture.

I am absolutely confident that our “kithe” approach will give us something with a lower TCO (“total cost of ownership”) than we had with sufia.

Will it be a lower TCO than if we were on the present hyrax (ignoring how to get there), with the custom features we needed? I think so, and I think that current hyrax isn’t different enough from the sufia we are used to, but again this is necessarily a guess, and others may disagree. In the end, technical staff just have to make their best predictions based on experience (individual and community). Hyrax probably will continue to improve under @no-reply’s steady leadership, but I think we have to make our decisions on what’s there now, and that potential rosy future also requires continued contribution by the community (like us) if it is to come to fruition, which is real time to be included in TCO too. I’m still feeling good about the “write our own app” approach vs “solution bundle”.

Will we get a lower TCO than if we had a non-hyrax valkyrie-based app? Even harder to say. Valkyrie has more abstractions and layers that have real ongoing maintenance costs (that someone has to do), but there’s an argument that those layers will lower your TCO over the long-term. I’m not totally persuaded by that argument myself, and when in doubt am inclined to choose the less-new-abstraction path, but it’s hard to predict the future.

One thing worth noting is that the main thing that forced our hand in doing something with our existing sufia-based app is that it was stuck on an old version of Rails that will soon be out-of-support, and we thought it would have been time-consuming to update, one way or another. (When Rails 6.0 is released, probably in the next few months, Rails maintenance policy says nothing before 5.2 will be supported.) Encouragingly, both kithe and its attr_json dependency (also by me) are testing green on Rails 6.0 beta releases and, I was gratified to see, didn’t take any code changes to do so; they just passed. (Valkyrie 1.x requires Rails 5.1, but a soon-to-be-released 2.0 is planned to work fine up to Rails 6; latest hyrax requires Rails 5.1 as well, but the hyrax team would like to add 5.2 and 6 soon).

We want easier on-boarding of new devs for succession planning

All developers will leave eventually (which is one reason I think if you are doing any local development, a one-developer team is a bad idea — you are guaranteeing that at some point 100% of your dev team will leave at once).

We want it to be easier to on-board new developers. We share U Alberta’s goal that what we could call a “typical Rails developer” should be able to come on and maintain and enhance the app.

Are we there? Well, while our local app is relatively simple rails code (albeit using kithe API’s), the implementation of  kithe and attr_json, which a dev may have to delve into, can get a bit funky, and didn’t turn out quite as simple as I would have liked.

But when I get a bit nervous about this, I reassure myself remembering that:

  • a) Our existing sufia-based app is definitely high-barrier for new devs (an experience not unique to us), I think we can definitely beat that.
    • Also worth pointing out that when we last posted a position, we got no qualified applicants with samvera, or even Rails, experience. We did make a great hire though, someone who knew back-end web dev and knew how to learn new tools; it’s that kind of person that we ideally need our codebase to be accessible to, and the sufia-based one was not.
  • b) Recruiting and on-boarding new devs is always a challenge for any small dev shop, especially if your salaries are not seen as competitive.  It’s just part of the risk and challenge you accept when doing local development as a small shop on any platform. (Whether that is the right choice is out of scope for this post!)

I think our code is going to end up more accessible to actually-existing newly onboarded devs  than a customized hyrax-based solution would be. More than Valkyrie? I do think so myself, I think we have fewer layers of “specialty” stuff than valkyrie, but it’s certainly hard to be sure, and everyone must judge for themselves.

I do think any competent Rails consultancy (without previous LAM/samvera expertise) could be hired to deal with our kithe-based app no problem; I can’t really say if that would be true of a Valkyrie-based app (it might be); I do not personally have confidence it would be true of a hyrax-based app at this point, but others may have other opinions (or experience?).

Evaluating success with the community?

Ideally, we’d of course love it if some other institutions eventually developed with the kithe toolkit, with the potential for sharing future maintenance of it.

Even if that doesn’t happen, I don’t think we’re in a terrible place. It’s worth noting that there has been some non-LAM-community Rails dev interest in attr_json, and occasional PRs; I wouldn’t say it’s in a confidently sustainable place if I left, but I also think it’s code someone else could step into and figure out. It’s just not that many lines of code, it’s well-tested and well-documented, and I’ve tried to be careful with its design, but take a look and decide for yourself! I cannot emphasize enough my belief that if you are doing local development at all (and I think any samvera-based app has always been such), you should have local technical experts doing evaluation before committing to a platform, whether hyrax, valkyrie, kithe, entirely homegrown, or whatever.

Even if no-one else develops with kithe itself, we’d consider it a success if some of the ideas from kithe influence the larger samvera and digital collections/repository communities. You are welcome to copy-paste-modify code that looks useful (It’s MIT licensed, have at it!). And even just take API ideas or architectural concepts from our efforts, if they seem useful.

We do take seriously participating in and giving back to the larger community, and think trying a different approach, so we and others can see how it goes, is part of that. Along with taking the extra time to do it in public and write things up, like this. And we also want to maintain our mutually-beneficial ties to samvera and LAM technologist communities; even if we are using different architectures, we still have lots of use-cases and opportunities for sharing both knowledge and code in common.

Take a look?

If you are considering development of a non-Hyrax valkyrie-based app, and have the development team to support that — I believe you have the development team to support a kithe-based approach too.

I would be quite happy if anyone took a look, and happy to hear feedback and have conversations, regardless of whether you end up using the actual kithe code or not. Kithe is not 1.0, but there’s definitely enough there to check it out and get a sense of what developing with it might be like, and whether it seems technically sound to you. And I’ve taken some time to write some good “guide” overview docs, both for potential “onboarding” of future devs here, and to share with you all.

We have a staging server for our in-development app based on kithe; if you’d like a guest login so you can check it out, just ask and I can share one with you.

Our local app should also probably be pretty easy for you to get installed (with dependencies) from a git checkout, so you can just run it and see how it goes. See: https://github.com/sciencehistory/scihist_digicoll/

Hope to hear from you!

On code-craft, and writing code for other programmers to use

The New Yorker this week has a profile of Google programmer pair Jeff Dean and Sanjay Ghemawat that includes some comments I find unusually insightful about some aspects of the craft of writing code. (If the annoying phrase “super star programmer” applies to anyone it’s probably these guys, who among other things conceived and wrote the original Google MapReduce implementation.) I was going to say “for a popular press piece”, but really even programmers talking to each other don’t talk about this sort of thing much. I recommend the article, but was especially struck by this passage:

At M.I.T., [Sanjay’s] graduate adviser was Barbara Liskov, an influential computer scientist who studied, among other things, the management of complex code bases. In her view, the best code is like a good piece of writing. It needs a carefully realized structure; every word should do work. Programming this way requires empathy with readers. It also means seeing code not just as a means to an end but as an artifact in itself. “The thing I think he is best at is designing systems,” Craig Silverstein said. “If you’re just looking at a file of code Sanjay wrote, it’s beautiful in the way that a well-proportioned sculpture is beautiful.”

…“Some people,” Silverstein said, “their code’s too loose. One screen of code has very little information on it. You’re always scrolling back and forth to figure out what’s going on.” Others write code that’s too dense: “You look at it, you’re, like, ‘Ugh. I’m not looking forward to reading this.’ Sanjay has somehow split the middle. You look at his code and you’re, like, ‘O.K., I can figure this out,’ and, still, you get a lot on a single page.” Silverstein continued, “Whenever I want to add new functionality to Sanjay’s code, it seems like the hooks are already there. I feel like Salieri. I understand the greatness. I don’t understand how it’s done.”

I aspire to write code like this, it’s a large part of what motivates me and challenges me.

I think it’s something that (at least for most of us, I don’t know about Dean and Ghemawat), can only be approached and achieved with practice — meaning both time and intention. But I think many of the environments that most working programmers work in are not conducive to this practice, and in some cases are actively hostile to it.  I’m not sure what to think or do about that.

It is most important when designing code for re-use, when designing libraries to be used in many contexts and by many people.  If you are only writing code for a particular business “seeing code not just as a means to an end but as an artifact in itself” may not be what’s called for.  It really is a means to an end of the business purposes. Spending too much time on “the artifact itself”, I think, has a lot of overlap with what is often derisively called “bike-shedding”.  But when creating an artifact that is intended to be used by lots of other programmers in lots of other contexts to build things to meet their business purposes — say, a Rails… or a samvera — “empathy with readers” (which is very well-said, and very related to:) and creating an artifact where “it seems like the hooks are already there” are pretty much indispensable to creating something successful at increasing the efficiency and success of those developers using the code.

It’s also not easy even if it is your intention, but without the intention, it’s highly unlikely to happen by accident. In my experience TDD can (in some contexts) actually be helpful to accomplishing it — but only if you have the intention, if you start from developer use-cases, and if you do the “refactor” step of “red-green-refactor”.  Just “getting the tests to pass” isn’t gonna do it. (And from the profile, I suspect Dean and Ghemawat may not write tests at all — TDD is neither necessary nor sufficient).  That empathy part is probably necessary — understanding what other programmers are going to want to do with your code, how they are going to come to it, and putting yourself in their place, so you can write code that anticipates their needs.

I’m not sure what to do with any of this, but I was struck by the well-written description of what motivates me in one aspect of my programming work.

Ruby Magic helps sponsor Rubyland News

I have been running the Rubyland.news aggregator for two years now, as just a hobby spare time thing. Because I wanted a ruby blog and news aggregator, and wasn’t happy with what was out there then,  and thought it would be good for the community to have it.

I am not planning or trying to make money from it, but it does have some modest monthly infrastructure fees that I like getting covered. So I’m happy to report that Ruby Magic has agreed to sponsor Rubyland.news for a modest $20/month for six months.

Ruby Magic is an email list you can sign up to for occasional emails about ruby. They also have an RSS feed, so I’ve been able to include them on Rubyland.news for some time.  I find their articles to often be useful introductions or refreshers to particular topics about ruby language fundamentals. (It tends not to be about Rails, I know some people appreciate some non-Rails-focused sources of ruby info).  Personally, I’ve been using ruby for years, and the way I got as comfortable with it as I am is by always asking “wait, how does that work then?” about things I run into, always being curious about what’s going on and what the alternatives are and what tools are available, starting with the ruby language itself and its stdlib.

These days, blogging, on a platform with an RSS feed too, seems to have become a somewhat rarer thing, so I’m also grateful that Ruby Magic articles are available through RSS feed, so I can include them in rubyland.news. And of course for the modest sponsorship of Rubyland.news, helping to pay infrastructure costs to keep the lights on.  As always, I value full transparency in any sponsorship of rubyland.news; I don’t intend it to affect any editorial policies (I was including Ruby Magic feed already); but I will continue to be fully transparent about any sponsorship arrangements and values, so you can judge for yourself (a modest $20/month from Ruby Magic; no commitment beyond a listing on About page, and this particular post you are reading now, which is effectively a sponsored post).

I also just realized I am two years into Rubyland.news. I don’t keep usage analytics (I was too lazy to set it up, and it’s not entirely clear how to do that when people might be consuming it as an RSS feed itself), although it’s got 156 followers on its twitter feed (all aggregated content is also syndicated to twitter, which I thought was a neat feature).  I’m honestly not sure how useful it is to anyone other than me, or what changes people might want; feedback is welcome!

Some notes on what’s going on in ActiveStorage

I work in library-archives-museum digital collections and preservation. This is of course a domain that is very file-centric (or “bytestream”-centric, as some might say): keeping track of originals and their metadata (including digests/checksums), and making lots of derivative files (or “variants” and/or “previews” as ActiveStorage calls them; of images, audio, video, or anything else).

So, building apps in this domain in Rails, I need to do a lot of things with files/bytestreams, ideally without having to re-invent wheels of basic bytestream management in rails, or write lots of boilerplate code. So I’m really interested in file attachment libraries for Rails. How they work, how to use them performantly and reliably without race conditions, how to use them flexibly to be able to write simple code to meet our business and user requirements.  I recently did a bit of a “deep dive” into some aspects of shrine;  now, I turn my attention to ActiveStorage.

The ActiveStorage guide (or in edge from master) is a great and necessary place to start (and you should read it before this; I love the Rails Guides), but there were some questions I had it didn’t answer. Here are some notes on just some things of interest to me related to the internals of ActiveStorage.

ActiveStorage is a-changing

One thing to note is that ActiveStorage has some pretty substantial changes in between the latest 5.2.1 release and master. Sadly there’s no way I could find to use github compare UI (which i love) limited just to the activestorage path in the rails repo.

If you check out Rails source, you can do: git diff v5.2.0...master activestorage. Not sure how intelligible you can make that output. You can also look at merged PRs to Rails mentioning “activestorage” to try and see what’s been going on; some PRs are more significant than others.

I’m mostly looking at 5.2.1, since that’s the one I’d be using were I to use it (until Rails 6 comes out, I forget if we know when we might expect that?), although when I realize that things have changed, I make note of it.

The DB Schema

ActiveStorage requires no changes to the table/model of a thing that should have attached files. Instead, the attached files are implemented as ActiveRecord has_many (or the rare has_one in case of has_one_attached) associations to other table(s), using ordinary relational modeling designs.  Most of the fancy modelling/persistence/access features and APIs (esp in 5.2.1) seem to be just sugar on top of ordinary AR associations (very useful sugar, don’t get me wrong).

ActiveStorage adds two tables/models.

The first we’ll look at is ActiveStorage::Blob, which actually represents a single uploaded file/bytestream/blob. Don’t be confused by “blob”, the bytestream itself is not in the db, rather there’s enough info to find it in whatever actual storage service you’ve configured. (local disk, S3, etc. Incidentally, the storage service configuration is app-wide, there’s no obvious way to use two different storage services in your app, for different categories of file).

The table backing ActiveStorage::Blob has a number of columns for holding information about the bytestream.

  • id (ordinary Rails default pk type)
  • key: basically functions as a UID to uniquely identify the bytestream, and find it in the storage. Storages may translate this to actual paths or storage-specific keys differently: the Disk service files them in directories by key prefix, whereas the S3 service just uses the key without any prefixes.
    • The key is generated with standard Rails “secure token” functionality–pretty much just a good random 24 char token. 
    • There doesn’t appear to be any way to customize the path on storage to be more semantic, it’s just the random filing based on the random UID-ish key.
  • filename: the original filename of the file on the way in
  • content_type: an analyzed MIME/IANA content type
  • byte_size: what it says on the tin
  • metadata: a Json serialized hash of arbitrary additional metadata extracted on ingest by ActiveStorage. Default AS migrations just put this in a text column and use db-agnostic Rails functions to serialize/deserialize Json, they don’t try to use a json or jsonb column type.
  • created_at: the usual. There is no updated_at column, perhaps because these are normally expected to be immutable (which means not expected to add metadata after point of creation either?).
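
To make the key handling above a bit more concrete, here is a minimal plain-Ruby sketch of how a random key could map to a storage location. The two-level directory layout is my reading of what the 5.2 Disk service does; generate_key, disk_path_for and s3_key_for are hypothetical names of mine, and SecureRandom.alphanumeric is only a stand-in for Rails' own secure token helper.

```ruby
require "securerandom"

# Stand-in for Rails' secure token generation: a random 24 character token.
# (Assumption: Rails uses its own helper; this only approximates it.)
def generate_key
  SecureRandom.alphanumeric(24)
end

# Disk-style layout: file the bytestream under directories named after the
# first characters of the key, so no single directory grows huge.
def disk_path_for(root, key)
  File.join(root, key[0..1], key[2..3], key)
end

# S3-style layout: the key is used as-is, with no prefix.
def s3_key_for(key)
  key
end

key = generate_key
puts disk_path_for("/var/storage", key)
puts s3_key_for(key)
```

Either way the location is a pure function of the random key, which is why nothing semantic (like a record id) ends up in the path.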

OK, so that table has got pretty much everything needed. So what’s the ActiveStorage::Attachment model?  Pretty much just a standard join table.  Using a standard Rails polymorphic association so it can associate an ActiveStorage::Blob with any arbitrary model of any class.  The purpose for this “extra” join table is presumably simply to allow you to associate one ActiveStorage::Blob with multiple domain objects. I guess there are some use cases for that, although it makes the schema somewhat more complicated, and the ActiveStorage inline comments warn you that “you’ll need to do your own garbage collecting” if you do that (A Blob won’t be deleted (in db or in storage) when you delete its referencing model(s), so you’ve got to, with your own code, make sure Blobs don’t hang around not referenced by any models unless in cases you want them to).

These extra tables do mean there are two associations to cross to get from a record to its attached file(s).  So if you are, say, displaying a list of N records with their thumbnails, you do have an n+1 problem (or a 2n+1 problem if you will :) ). The Active Storage guide doesn’t mention this (it probably should), but some of the inline AS comment docs do, and AS creates scopes for you to help do eager loading.

Indeed a dynamically generated with_attached_avatar (or whatever your attachment is called) scope is nothing but a standard ActiveRecord includes reaching across the join to the blob (for has_many_attached or has_one_attached).

And indeed if I try it out in my console, the inclusion scope results in three db queries, in the usual way you expect ActiveRecord eager loading to work.

irb(main):019:0> FileSet.with_attached_avatar.all
  FileSet Load (0.5ms)  SELECT  "file_sets".* FROM "file_sets" LIMIT $1  [["LIMIT", 11]]
  ActiveStorage::Attachment Load (0.8ms)  SELECT "active_storage_attachments".* FROM "active_storage_attachments" WHERE "active_storage_attachments"."record_type" = $1 AND "active_storage_attachments"."name" = $2 AND "active_storage_attachments"."record_id" IN ($3, $4)  [["record_type", "FileSet"], ["name", "avatar"], ["record_id", 19], ["record_id", 20]]
  ActiveStorage::Blob Load (0.5ms)  SELECT "active_storage_blobs".* FROM "active_storage_blobs" WHERE "active_storage_blobs"."id" IN ($1, $2)  [["id", 7], ["id", 8]]
=> #<ActiveRecord::Relation [#<FileSet id: 19, title: nil, asset_data: nil, created_at: "2018-09-27 18:27:06", updated_at: "2018-09-27 18:27:06", asset_derivatives_data: nil, standard_data: nil>, #<FileSet id: 20, title: nil, asset_data: nil, created_at: "2018-09-27 18:29:00", updated_at: "2018-09-27 18:29:08", asset_derivatives_data: nil, standard_data: nil>]>
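
To see why the eager-loading scope matters, here is a plain-Ruby simulation (no Rails, no database) that just counts pretend queries. The three arrays stand in for the three tables, and every name in it is hypothetical; it only illustrates the 2n+1 versus 3 arithmetic, not ActiveRecord itself.

```ruby
# In-memory stand-ins for the three tables; QUERIES records each simulated db hit.
RECORDS     = [{ id: 19 }, { id: 20 }].freeze
ATTACHMENTS = [{ record_id: 19, blob_id: 7 }, { record_id: 20, blob_id: 8 }].freeze
BLOBS       = [{ id: 7, filename: "a.png" }, { id: 8, filename: "b.png" }].freeze
QUERIES     = []

def query(label)
  QUERIES << label
  yield
end

# Naive: one query for the records, then two more per record (2n+1).
def filenames_lazily
  records = query(:records) { RECORDS }
  records.map do |record|
    attachment = query(:attachment) { ATTACHMENTS.find { |a| a[:record_id] == record[:id] } }
    blob = query(:blob) { BLOBS.find { |b| b[:id] == attachment[:blob_id] } }
    blob[:filename]
  end
end

# Eager: three queries total, however many records there are.
def filenames_eagerly
  records = query(:records) { RECORDS }
  ids = records.map { |r| r[:id] }
  attachments = query(:attachments) { ATTACHMENTS.select { |a| ids.include?(a[:record_id]) } }
  blob_ids = attachments.map { |a| a[:blob_id] }
  blobs = query(:blobs) { BLOBS.select { |b| blob_ids.include?(b[:id]) } }
  attachments.map { |a| blobs.find { |b| b[:id] == a[:blob_id] }[:filename] }
end

def count_queries
  QUERIES.clear
  yield
  QUERIES.size
end

puts count_queries { filenames_lazily }  # 5, i.e. 2n+1 with n = 2
puts count_queries { filenames_eagerly } # 3
```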

When is file created in storage, when are associated models created?

ActiveStorage expects your ordinary use case will be attaching files uploaded through a form: user.avatar.attach(params[:avatar]), meaning you get the file as an ActionDispatch::Http::UploadedFile. You can also attach a file directly, in which case you are required to supply the filename (and optionally a content-type):  user.avatar.attach(io: File.open("whatever"), filename: "whatever.png").  Or you can also pass an existing ActiveStorage::Blob to ‘attach’.

In all of these cases, ActiveStorage normalizes them to the same code path fairly quickly.

In Rails 5.2.1, if you call attach on an already persisted record, immediately (before any save), an ActiveStorage::Blob row and ActiveStorage::Attachment row have been persisted to the db, and the file has been written to your configured storage location.  There’s no need to call save on your original record, the update took place immediately. Your record will report it has the attachment (and of course ActiveStorage’s schema means no changes had to be saved for the row for your record itself, and your record does not think it has outstanding changes via changed?, since it does not).

If you call attach on a new (not yet persisted) record, the ActiveStorage::Blob row is _still_ created, and the bytestream is still persisted to your storage service. But an ActiveStorage::Attachment (join object) has not yet been created.  It will be when you save the record.

But if you just abandon the record without saving it… you have an ActiveStorage::Blob nothing is pointing to, along with the persisted bytestream in your storage service. I guess you’d have to periodically look for these and clean them up….
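
What "periodically look for these" might mean, as a rough plain-Ruby sketch over in-memory rows rather than real ActiveStorage models. Blob, Attachment and orphaned_blobs here are hypothetical stand-ins of mine; a real cleanup would query the two tables and also delete the file from the storage service.

```ruby
# Hypothetical in-memory rows standing in for the two ActiveStorage tables.
Blob = Struct.new(:id, :key, :created_at)
Attachment = Struct.new(:record_type, :record_id, :blob_id)

# Blobs that no attachment row points at, and that are older than the cutoff
# (the age check avoids sweeping up a blob whose record is about to be saved).
def orphaned_blobs(blobs, attachments, older_than:)
  attached_ids = attachments.map(&:blob_id)
  blobs.reject { |blob| attached_ids.include?(blob.id) }
       .select { |blob| blob.created_at < older_than }
end

now = Time.now
day = 86_400
blobs = [
  Blob.new(1, "aaa", now - 3 * day), # attached below, so kept
  Blob.new(2, "bbb", now - 3 * day), # old and unattached: an orphan
  Blob.new(3, "ccc", now)            # unattached but brand new, so left alone
]
attachments = [Attachment.new("User", 10, 1)]

orphaned_blobs(blobs, attachments, older_than: now - day).each do |blob|
  puts "would purge #{blob.key}"
end
```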

But master branch in Rails tries to improve this situation with a fairly sophisticated implementation of storing deltas prior to save. I’m not entirely sure if that applies to the “already persisted record” case too. In general, I don’t have a good grasp of how AS expects your record lifecycles to affect persistence of Blobs: if the record you were attaching it to failed validation, is the Blob expected to be there anyway? Or how are you expected to have validation on the uploaded file itself (like only certain content types allowed, say)? I believe the PR in Rails master is trying to improve all of that, but I don’t have a thorough grasp of how successful it is at making things “just work” how you might expect, without leaving “orphaned” db rows or storage service files.

Metadata

Content-type

ActiveStorage stores the IANA Media Type (aka “MIME type” or “content type”) in the dedicated content_type column in ActiveStorage::Blob. It uses the marcel gem (from the basecamp team) to determine content type.  Marcel looks like it uses file-style magic bytes, but also uses the user-agent-supplied filename suffix or content-type when it decides it’s necessary — trusting the user-agent supplied content-type if all else fails.  It does not look like there is any way to customize this process;  likely most people wouldn’t need that, but I may be one of the few that maybe does. Compare to shrine’s ultra-flexible content-type-determination configuration.
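
To illustrate the general idea of "magic bytes first, then fall back to what the user-agent told us", here is a toy plain-Ruby detector. It is emphatically not marcel's implementation or its exact precedence rules; the tiny lookup tables and the sniff_content_type name are assumptions of mine, following the order described above.

```ruby
require "stringio"

# Hypothetical, tiny magic-byte table; real detectors know far more formats.
MAGIC_BYTES = {
  "\x89PNG\r\n\x1A\n".b => "image/png",
  "%PDF-".b             => "application/pdf",
  "GIF8".b              => "image/gif"
}.freeze

EXTENSIONS = {
  ".png" => "image/png",
  ".pdf" => "application/pdf",
  ".txt" => "text/plain"
}.freeze

# Order of trust in this sketch: file contents first, then filename suffix,
# then whatever the user-agent declared, and finally a generic binary type.
def sniff_content_type(io, filename: nil, declared_type: nil)
  head = io.read(16).to_s.b
  io.rewind
  by_magic = MAGIC_BYTES.find { |magic, _type| head.start_with?(magic) }
  return by_magic.last if by_magic

  by_name = filename && EXTENSIONS[File.extname(filename).downcase]
  by_name || declared_type || "application/octet-stream"
end

# A PNG header wins even though the filename claims otherwise.
puts sniff_content_type(StringIO.new("\x89PNG\r\n\x1A\nrest".b), filename: "notes.txt")
```

Being able to swap in your own version of a function like this is the kind of customization I was looking for and didn't find.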

For reasons I’m not certain of, ActiveStorage uses marcel to identify content-type twice.

When (in Rails 5.2.1) you call some_model.attach, it calls ActiveStorage::Blob#create_after_upload!, which calls ActiveStorage::Blob#build_after_upload, which calls ActiveStorage::Blob.upload, which sets the content_type attribute to the result of extract_content_type method, which calls marcel.

Additionally, ActiveStorage::Attachment (the join table) has an after_create_commit hook which calls :identify_blob, which calls blob.identify, defined in ActiveStorage::Blob::Identifiable mixin, which also ends up using marcel, but only if it hasn’t already been identified (recorded by an identified key in the json serialized metadata column).   This second one only passes the first 4k of the file to marcel (perhaps because it may need to download it from remote storage), while the first one above seems to pass in the entire IO stream.

Normally this second marcel identify won’t be called at all, because the Blob model is already recorded as identified? as a result of the first one. In either case, the operation takes place in the foreground inline (not a bg job), although one of them is in an after-commit hook with a second save. (Ah wait, I bet the second one is related to the direct upload feature which I haven’t dived into. Some inline comment docs would still be nice!)

In Rails master, we get an identify: false argument to attach, which you can use to skip content-type-identification (it might just use the user-agent-supplied content-type, if any, in that case?)

Arbitrary Metadata

In addition to some file metadata that lives in dedicated database columns in the blob table, like content_type, recall that there is a metadata column with a serialized JSON hash, that can hold arbitrary metadata. If you upload an image, you’ll ordinarily find height and width values in there, for instance.  Which you can find eg with ‘model.avatar.metadata["width"]’ or ‘model.avatar.metadata[:width]’ (indifferent access, no shortcuts like ‘model.avatar.width’ though, so far as I know).

Where does this come from? It turns out ActiveStorage actually has a nice, abstract, content-type-specific, system for analyzer plugins.  It’s got a built-in one for images, which extracts height and width with MiniMagick, and one for videos, which uses ffprobe command line, part of ffmpeg.

So while this blog post suggests monkey-patching Analyzer::ImageAnalyzer to add in GPS metadata extracted from EXIF, in fact it oughta be possible in 5.2.1+ to use the analyzer plugin to add, remove, or replace analyzers to do your customization, no ugly forwards-compat-dangerous monkey-patching required.  So there are intentional API hooks here for customizing metadata extraction, pretty much however you like.
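
Here is a plain-Ruby model of how I understand the analyzer dispatch to work: each analyzer class answers accept? for a blob, the first one that accepts is used, and its metadata hash gets stored. FakeBlob, GpsImageAnalyzer, ANALYZERS and analyze are all hypothetical names for this sketch; in a real app the analyzers would subclass ActiveStorage's analyzer base class and be registered in its configured analyzer list.

```ruby
# Hypothetical stand-in for a blob: just the fields the analyzers look at.
FakeBlob = Struct.new(:filename, :content_type, :exif)

# Fallback that accepts anything and extracts nothing.
class NullAnalyzer
  def self.accept?(_blob)
    true
  end

  def initialize(blob)
    @blob = blob
  end

  def metadata
    {}
  end
end

# Custom analyzer: claims images and reports GPS data from (pretend) EXIF.
class GpsImageAnalyzer < NullAnalyzer
  def self.accept?(blob)
    blob.content_type.start_with?("image/")
  end

  def metadata
    { latitude: @blob.exif[:lat], longitude: @blob.exif[:lng] }
  end
end

# First analyzer that accepts the blob wins; the null analyzer is the fallback.
ANALYZERS = [GpsImageAnalyzer].freeze

def analyze(blob)
  analyzer_class = ANALYZERS.detect { |klass| klass.accept?(blob) } || NullAnalyzer
  analyzer_class.new(blob).metadata.merge(analyzed: true)
end

p analyze(FakeBlob.new("photo.jpg", "image/jpeg", { lat: 39.95, lng: -75.16 }))
p analyze(FakeBlob.new("notes.txt", "text/plain", {}))
```

Adding, removing, or reordering entries in the list is the whole customization story in this model, which is why no monkey-patching should be needed.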

Unlike content-type-identification which is done inline on attach, metadata analysis is done by ActiveStorage in a background ActiveJob. ActiveStorage::Attachment (the join object, not the blob), has an after_create_commit hook (reminding us that ActiveStorage never expects you to re-use a Blob db model with an altered bytestream/file), which calls blob.analyze_later (unless it’s already been analyzed).   analyze_later simply launches a perform_later ActiveStorage::AnalyzeJob with the (in this case) ActiveStorage::Blob as an argument.  Which just calls analyze on the blob.

So, at least in theory, this can accommodate fairly slow extraction, because it’s in the background. That does mean you could have an attachment which has not yet been analyzed; you can check to see if analysis has happened yet with analyzed?, which in the end is just an analyzed: true key in the arbitrary json metadata hash. (Good reminder that ActiveRecord::Store exists, a convenience for making cover methods for keys in a serialized json hash).

This design does assume only one bg job per model that could touch the serialized json metadata column exists at a time; if there were two operating concurrently (even with different keys), there’d be a race condition where one of the sets of changes might get lost as both processes race to 1) load from db, 2) merge in changes to hash, 3) save serialization of merged to db.  So actually, as long as “identified: true” is recorded in content-type-extraction, the identification step probably couldn’t be a bg job either, without taking care of the race condition, which is tricky.
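
The lost-update problem is easy to reproduce deterministically in plain Ruby, by forcing the unlucky interleaving by hand. The hash below stands in for a db row with a serialized JSON column; the method names are hypothetical and nothing here touches ActiveStorage.

```ruby
require "json"

# Both "jobs" load before either saves, then each writes back its own merge.
def simulate_lost_update(row)
  loaded_by_a = JSON.parse(row[:metadata])  # job A loads
  loaded_by_b = JSON.parse(row[:metadata])  # job B loads the same snapshot
  row[:metadata] = JSON.generate(loaded_by_a.merge("identified" => true)) # A saves
  row[:metadata] = JSON.generate(loaded_by_b.merge("analyzed" => true))   # B saves, clobbering A
  JSON.parse(row[:metadata])
end

# One job at a time: each load sees the previous job's save.
def simulate_serialized_updates(row)
  [{ "identified" => true }, { "analyzed" => true }].each do |changes|
    current = JSON.parse(row[:metadata])
    row[:metadata] = JSON.generate(current.merge(changes))
  end
  JSON.parse(row[:metadata])
end

p simulate_lost_update({ metadata: "{}" })        # the "identified" key is gone
p simulate_serialized_updates({ metadata: "{}" }) # both keys survive
```

Note the two jobs touched different keys and still lost data, because the unit being written is the whole serialized hash.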

I suppose if you changed your analyzer(s) and needed to re-analyze everything, you could do something like ActiveStorage::Blob.find_each(&:analyze!). analyze! is implemented in terms of update!, so should persist its changes to db with no separate need to call save.

Variants

ActiveStorage calls “variants” what I would call “derivatives” or shrine (currently) calls “versions” — basically thumbnails, resizes, and other transformations of the original attachment.

ActiveStorage has a very clever way of handling these that doesn’t require any additional tracking in the db.  Arbitrary variants are created “on demand”, and a unique storage location is derived based on the transformation asked for.

If you call avatar.variant(resize: "100x100"), what’s returned is an ActiveStorage::Variant.  No new file has yet been created if this is the first time you asked for that. The transformation will be done when you call the processed method. (ActiveStorage recommends or expects for most use cases that this will be done in controller action meant to deliver that specific variant, so basically on-demand).   processed will first see if the variant file has already been created, by checking processed?, which just checks if a file already exists in the storage with some key specific to the variant. The key specific to the variant is “variants/#{blob.key}/#{Digest::SHA256.hexdigest(variation.key)}”. That gives it some prefixes/directory nesting, but ultimately makes a SHA256 digest of variation.key.  You can see the code for that in ActiveStorage::Variation, and follow it through ActiveStorage.verifier, which is just an instance of ActiveSupport::MessageVerifier. In the end we’re basically just taking a signed (and maybe encrypted) digest of the serialization of the transformation arguments passed in in the first place, `{ resize: "100x100" }`.

That is, basically through a couple of cryptographic digests and some crypto security too, we’re just taking the transformation arguments and turning them into a unique-to-those-arguments key (file path).
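
A rough plain-Ruby imitation of that derivation, to show the shape of it. The real thing goes through ActiveSupport::MessageVerifier with the app's secret; here SECRET, variation_key and variant_storage_key are hypothetical, and the payload is JSON plus an HMAC rather than whatever serialization Rails actually uses.

```ruby
require "digest"
require "json"
require "openssl"

# Hypothetical secret; in Rails the verifier is derived from the app's secrets.
SECRET = "not-a-real-secret"

# Rough imitation of a signed message: encoded payload plus an HMAC signature.
def variation_key(transformations)
  payload = [JSON.generate(transformations)].pack("m0") # strict base64
  signature = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("SHA256"), SECRET, payload)
  "#{payload}--#{signature}"
end

# Same shape as the key quoted above: prefix, blob key, digest of the variation key.
def variant_storage_key(blob_key, transformations)
  "variants/#{blob_key}/#{Digest::SHA256.hexdigest(variation_key(transformations))}"
end

puts variant_storage_key("abc123", { resize: "100x100" })
puts variant_storage_key("abc123", { resize: "630x630" })
```

The same arguments always give the same path, and different arguments give a different one, which is all the "does this variant exist yet?" check needs.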

This has been refactored a bit in master vs 5.2.1 — and in master the hash that specifies the transformations, to be turned into a key, becomes anything supported by image_processing with either MiniMagick or vips processors instead of 5.2.1’s bespoke Minimagick-only wrapper. (And I do love me some vips, can be so much more performant for very large files).  But I think the basic semantics are fundamentally the same.

This is nice because we don’t need another database table/model to keep track of variants (don’t forget we already have two!) — we don’t in fact need to keep track of variants at all. When one is asked for, ActiveStorage can just check to see if it already exists in storage at the only key/path it necessarily would be at.

On the other hand, there’s no way to enumerate what variants we’ve already created, but maybe that’s not really something people generally need.

But also, as far as I can find there is no API to delete variants. What if we just created 100×100 thumbs for every product photo in our app, but we just realized that’s way too small (what is this, 2002?) and we really need something that’s 630×630. We can change our code and it will blithely create all those new 630×630 ones on demand. But what about all the 100x100s already created? They are there in our storage service (say S3).  Whatever ways there might be to find the old variants and delete them are going to be hacky, not to mention painful (it’s making a SHA256 digest to create filename, which is intentionally irreversible. If you want to know what transformation a given variant in storage represents, the only way is to try a guess and see if it matches, there’s no way to reverse it from just the key/path in storage).
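
The "guess and see if it matches" approach would look something like this sketch: recompute the digest for each transformation you remember having used, and compare against the keys actually in storage. variant_digest is a deliberately simplified stand-in for the real derivation (which goes through a signed message), and both method names are hypothetical.

```ruby
require "digest"
require "json"

# Stand-in for the real derivation; any deterministic function of the
# transformation hash is enough to illustrate the matching.
def variant_digest(transformations)
  Digest::SHA256.hexdigest(JSON.generate(transformations))
end

# Given every key found in storage, pick the ones that match a transformation
# we remember having used. Keys we cannot guess stay unidentified.
def stale_variant_keys(storage_keys, old_transformations)
  wanted = old_transformations.map { |t| variant_digest(t) }
  storage_keys.select do |key|
    prefix, _blob_key, digest = key.split("/")
    prefix == "variants" && wanted.include?(digest)
  end
end

old = { resize: "100x100" }
current = { resize: "630x630" }
keys = [
  "variants/abc123/#{variant_digest(old)}",
  "variants/abc123/#{variant_digest(current)}",
  "abc123" # an original, never touched
]
puts stale_variant_keys(keys, [old])
```

This only works if you still know every old transformation hash exactly, which is the painful part.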

Which seems like a common use case that’s going to come up to me? I wonder if I’m missing something. It almost makes me think you are intended to keep variants in a storage configured as a cache which deletes old files periodically (the variants system will just create them on demand if asked for again of course) — except the variants are stored in the same storage service as your originals, and you certainly don’t want to purge non-recently-used originals!  I’m not quite sure what people are doing with purging no-longer-used variants in the real world, or why it hasn’t come up if it hasn’t.

And something that maybe plenty of people don’t need, but I do — ability to create variants of files that aren’t images: PDFs, any sort of video or audio file, really any kind of file at all. There is a separate transformation system called previewing that can be used to create transformations of video and PDF out of the box — specifically to create thumbnails/poster images.  There is a plugin architecture, so I can maybe provide “previews” for new formats (like MS Word), or maybe I want to improve/customize the poster-image selection algorithm.

What I need aren’t actually “previews”, and I might need several of them. Maybe I have a video that was uploaded as an AVI, and I need to have variants as both mp4 and webm, and maybe choose to transcode to a different codec or even adjust lossy compression levels. Maybe I can still use ‘preview’ function nonetheless? Why is “preview” a different API than “variant” anyway? While it has a different name, maybe it actually does pretty much the same thing, but with previewer plugins? I don’t totally grasp what’s going on with previews, and am running out of steam.

I really gotta get down into the weeds with files in my app(s), in an ideal world, I would want to be able to express variants as blocks of whatever code I wanted calling out to whatever libraries I wanted, as long as the block returned an IO-like object, not just hashes of transformation-specifications. I guess one needs something that can be transformed into a unique key/path though. I guess one could imagine an implementation had blocks registered with unique keys (say, “webm”), and generated key/paths based on those unique keys.  I don’t think this is possible in ActiveStorage at the moment.
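
The imagined design from the paragraph above, sketched in plain Ruby: blocks registered under unique names, with the storage path derived from the name. This is entirely hypothetical (DerivativeRegistry is my invention, a Hash stands in for the storage service, and it ignores the race conditions a real on-demand implementation would have to handle).

```ruby
require "stringio"

# Hypothetical registry: each derivative has a stable name and a block that
# takes the original IO and returns an IO-like object.
class DerivativeRegistry
  def initialize
    @definitions = {}
  end

  def define(name, &block)
    @definitions[name.to_s] = block
  end

  # The storage path depends only on the blob key and the registered name,
  # so it can be checked for existence before doing any work.
  def storage_key(blob_key, name)
    "variants/#{blob_key}/#{name}"
  end

  # Create on demand: run the block only if nothing is stored at the key yet.
  def fetch(storage, blob_key, name, original_io)
    key = storage_key(blob_key, name)
    storage[key] ||= @definitions.fetch(name.to_s).call(original_io).read
  end
end

registry = DerivativeRegistry.new
registry.define(:upcased) { |io| StringIO.new(io.read.upcase) } # pretend transcode

storage = {} # stand-in for a storage service
puts registry.fetch(storage, "abc123", :upcased, StringIO.new("hello"))
puts storage.keys.inspect
```

A nice side effect of name-based paths is that old derivatives become enumerable and deletable by name, unlike the digest-based ones.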

Will I use ActiveStorage? Shrine?

I suspect the intended developer-user of ActiveStorage is someone in a domain/business/app for which images and attachments  are kind of ancillary. Sure, we need some user avatars, maybe even some product images, or shared screenshots in our basecamp-like app. But we don’t care too much about the details, as long as it mostly works.  Janko of Shrine told me some users thought it was already an imposition to have to add a migration to add a data column to any model they wanted to attach to, when ActiveStorage has a generic migration for a couple generic tables and you’re done (nevermind that this means extra joins on every query whose results you’ll have to deal with attachments on!). This sort of backs up that idea of the nature of the large ActiveStorage target market.

On the other hand, I’m working in a domain where file management is the business/domain. I really want to have lots of control over all of it.

I’m not sure ActiveStorage gives it to me. Could I customize the key/paths to be a little bit more human readable and reverse-engineerable, say having the key begin with the id of the database model? (Which is useful for digital preservation and recovery purposes). Maybe? With some monkey-patches? Probably not?

Will ActiveStorage do what I need as far as no-boundaries flexibility to variant creation of video/audio/arbitrary file types?  Possibly with custom “previewer” plugin (even though a downsampled webm of an original .avi is really not a “preview”), if I’m willing to make all transformations expressible as a hash of specifications?  Without monkey-patching ActiveStorage? Not sure?

What if I have some really slow metadata generation, that I really don’t want to do inline/foreground?  I guess I could not use the built-in metadata extraction, but just make my own json column on some model somewhere (that has_one_attached), and do it myself. Maybe I could do that for variants too, with additional app-specific models for variants (that each have a has_one_attached with the variant I created).  I’d have to be careful to avoid adding too many more tables/joins for common use cases.

If I only had, say, paperclip and carrierwave, I might choose ActiveStorage anyway, cause they aren’t so flexible either. But, hey, shrine! So flexible! It still doesn’t do everything I need, and the way it currently handles variants/derivatives/versions isn’t suitable for me (not set up to support on-demand generation without race conditions, which I realize ironically ActiveStorage is), but I think I’d rather build it on top of shrine, which is intended to let you build things on top of it, than ActiveStorage, where I’d likely have to monkey-patch and risk forwards-incompatibility.

On the other hand, if ActiveStorage is “good enough” for many people… is there a risk that shrine won’t end up with enough user/maintainer community to stay sustainable? Sure, there’s some risk. And relatively small risk of ActiveStorage going away.  One colleague suggested to me that “history shows” once something is baked into Rails, it leads to a “slow death of most competitors”, and eventually more features in the baked-into Rails version. Maybe, but…. as it happens, I kind of need to architect a file attachment solution for my app(s) now.

As with all dependency and architectural choices, you pays yer money and you takes yer chances. It’s programming. At best, we hope we can keep things clearly delineated enough architecturally, that if we ever had to change file attachment support solutions, it won’t be too hard to change.  I’m probably going with shrine for now.

One thing that I found useful looking at ActiveStorage is some, apparently, “good enough” baselines for certain performance/architectural issues. For instance, I was trying to figure out a way to keep my likely bespoke derivatives/variants solution from requiring any additional tables/joins/preloads (as shrine out of the box now requires zero extra) — but if ActiveStorage requires two joins/preloads to avoid n+1, I guess it’s probably okay if I add one. Likewise, I wasn’t sure if it was okay to have a web architecture where every attachment image view is going to result in a redirect… but if that’s ActiveStorage’s solution, it’s probably good enough.

Notes on deep diving with byebug

When using byebug to investigate some code, as I did here, and regularly do to figure out a complex codebase (including but not limited to parts of Rails), a couple Rails-related tips.

If there are ActiveJobs involved, ‘config.active_job.queue_adapter = :inline’ is a good idea to make them easier to ‘byebug’.

If there are after_commit hooks involved (as there were here), turning off Rails transactional tests (aka “transactional fixtures” before Rails 5) is a good idea. Theoretically Rails treats after_commit more consistently now even with transactional tests, but I found debugging this one I was not seeing the real stuff until I turned off transactional tests.  In Rspec, you do this with ‘config.use_transactional_fixtures = false’  in the rails_helper.rb rspec config file.

Notes on study of shrine implementation

Developing software that is both simple and very flexible/composable is hard, especially in shared dependencies. Flexibility and composability often lead to very abstract, hard to understand architecture. An architecture custom-fitted for particular use cases/domains has an easier time of remaining simple with few moving parts. I think this is a fundamental tension in software architecture.

shrine is a “File Attachment toolkit for Ruby applications”, developed with explicit goals of being more flexible than some of what came before. True to form, its internal architecture can be a bit confusing.

I want to work with shrine, and develop some new functionality based on it, related to versions/derivatives (hopefully for submission to shrine core), requiring some ‘under the hood’ work. When I want to understand some new complicated architecture (say, some part of Rails), one thing I do is trace through it with a debugger (while going back and forth with documentation and code-reading), and write down notes with a sort of “deep dive” tour through a particular code path. So that’s what I’ve done here, with shrine 2.12.0. It may or may not be useful to anyone else, part of the use for me is in writing it; but when I’ve done this before for other software others have found it useful, so I’ll publish it in case it is (and so I can keep finding it again later to refer to it myself, which I plan to do).

Some architectural overview

shrine uses a plugin system based on module mix-in overrides (basically, inheritance),  which is not my favorite form of extension (many others would agree). Most built-in shrine functionality is implemented as plugins, to support flexible configuration. This mixin-overridden-methods architecture can lead to some pretty tightly coupled and inter-dependent code, even in ostensibly independent plugins, and I think it has sometimes here.  Still, shrine has succeeded in already being more flexible than anything that’s come before (definitely including ActiveStorage). This is just part of the challenge of this kind of software development, I don’t think anyone else starting over is gonna get to a better overall place, I still think shrine is the best thing to work with at present if you need maximal flexibility in handling your uploaded assets.

Shrine has a design document that explains the different objects involved. I still found it hard to internalize a mental model, even with this document. After playing with shrine for a while, here’s my own current re-stating of some of the primary objects involved in shrine (hopefully my re-statement doesn’t have too many errors!).

An uploader (also called a “shrine” object, as the base class is just Shrine) is a  stateless object that knows how to take an IO stream and persist to some back-end.   You generally write a custom uploader class for your app, because a specific uploader is what has specifics about any validation, transformation, metadata extraction, etc, in ingesting a file. An uploader is totally  stateless though (or rather immutable, it may have some config state set on initialize); it’s sort of a pipeline for going from an IO object to a persisted file.  When you write a custom uploader, it isn’t hard-coded to a particular persistent back-end, rather a specific storage object is injected into an individual uploader instance at runtime.

A shrine attacher is the object that has state for the file. An attacher knows about the model object the file is attached to (a specific attacher instance is associated with a specific model instance).  An attacher has two uploaders injected into it: one for the temporary cache storage and one for the permanent store storage. These are expected to be the same class of uploader, just with different storages injected.  An attacher has ORM plugins that handle actual persistence to the db, as well as tracking changes, and just everything that needs to be done regarding the state of a particular file attachment.

In a typical model, you can get access to the attacher instance for an asset called avatar from a method called avatar_attacher. The avatar method itself is essentially delegated through the attacher too. The attacher is the thing managing access and mutation of the attached files for the model.  If you ask for avatar_attacher.store or avatar_attacher.cache, you get back an uploader object corresponding to that form of storage — to be used to process and persist files to either of those storages.

How do those methods avatar and avatar_attacher wind up in the model?  A ruby module is mixed in to the model with those methods. Shrine calls this mix-in module an “attachment”. When you do include MyUploader::Attachment.new(:name_of_column) in your model, that’s returning an attachment module and mixing it into your model.  I find “attachment” not the most clear name for this, especially since shrine documentation also calls an individual file/bytestream an “attachment” sometimes, but there it is.

And finally, there’s the simple UploadedFile, which is simply a model object representing an uploaded file! It can let you get various information about the uploaded file, or access it (via stream, downloaded file, or url).  An UploadedFile is more or less immutable. It’s what you get returned to you from the (eg) avatar method itself.  An UploadedFile can be round-trip serialized to json — the json that is persisted in your model _data column. So an UploadedFile is basically the deserialized model representation of what’s in your _data column.
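
To make the round-trip concrete, here is a minimal plain-Ruby stand-in. SketchUploadedFile is hypothetical (it is not Shrine::UploadedFile), but the three keys it serializes, id, storage and metadata, are what I see in the _data column.

```ruby
require "json"

# Hypothetical stand-in for Shrine::UploadedFile: an id within a storage,
# the storage's registered name, and a metadata hash.
SketchUploadedFile = Struct.new(:id, :storage, :metadata) do
  def to_json(*args)
    { "id" => id, "storage" => storage, "metadata" => metadata }.to_json(*args)
  end

  def self.from_json(json)
    data = JSON.parse(json)
    new(data["id"], data["storage"], data["metadata"])
  end
end

file = SketchUploadedFile.new("abc123.png", "store", { "size" => 1024, "mime_type" => "image/png" })
column_value = file.to_json                 # what would sit in the _data column
restored = SketchUploadedFile.from_json(column_value)

puts column_value
puts restored == file
```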

It’s important to remember that shrine uses a two-step file persistence approach. There is a temporary cache storage location that has files that may not pass validation and may not yet have been actually saved to a model (or may never be).  The file can be re-displayed to a user in a validation error when it’s in “cache” for instance. Then when the file is actually successfully and permanently persisted, attached to a model, it’s in a different storage location, called the store.
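
Putting those pieces together, here is a toy plain-Ruby model of the relationships: a stateless uploader with an injected storage, and a stateful attacher holding one uploader for cache and one for store. None of these classes are Shrine's (MemoryStorage, SketchUploader and SketchAttacher are hypothetical), and real shrine does much more at each step.

```ruby
require "stringio"

# Hypothetical in-memory storage; real storages talk to disk, S3, etc.
class MemoryStorage
  def initialize
    @files = {}
  end

  def upload(io, id)
    @files[id] = io.read
  end

  def open(id)
    StringIO.new(@files.fetch(id))
  end

  def exists?(id)
    @files.key?(id)
  end
end

# Stateless pipeline: the storage is injected, nothing about a file is remembered.
class SketchUploader
  def initialize(storage_name, storage)
    @storage_name = storage_name
    @storage = storage
  end

  def upload(io, id)
    @storage.upload(io, id)
    { "id" => id, "storage" => @storage_name.to_s }
  end

  def open(id)
    @storage.open(id)
  end
end

# Stateful: holds the current file data and one uploader per storage.
class SketchAttacher
  attr_reader :data

  def initialize(cache:, store:)
    @cache = cache
    @store = store
  end

  # Step one: the file lands in temporary storage.
  def assign(io, id)
    @data = @cache.upload(io, id)
  end

  # Step two: copy from cache to permanent storage (what a successful save triggers).
  def promote
    @data = @store.upload(@cache.open(@data["id"]), @data["id"])
  end
end

attacher = SketchAttacher.new(
  cache: SketchUploader.new(:cache, MemoryStorage.new),
  store: SketchUploader.new(:store, MemoryStorage.new)
)
attacher.assign(StringIO.new("bytes"), "abc.txt")
puts attacher.data["storage"]
attacher.promote
puts attacher.data["storage"]
```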

Tracing what happens internally when you attach a file to an ActiveRecord model using shrine

Most of this will be relevant regardless of ActiveRecord, but I focused on an ActiveRecord implementation. My demonstration app used to step through uses a bog-standard Shrine uploader, with no plugins (but :activerecord).

class StandardUploader < Shrine
  plugin :activerecord
end

Just to keep things consistent, we attach to a model on the “standard_data” column, with accessor called “standard”.

  include StandardUploader::Attachment.new(:standard)

What is shrine doing under the hood, what are the different parts, when we assign a new file to the model?  We’ll first do model.standard = File.open("something"), and then model.save.

First model.standard = File.open("something")

The #standard= is provided by the attachment module mix-in, and it calls standard_attacher.assign(io_object)

If it’s NOT a string, assign first does: `uploaded_file = [attacher.]cache!(value, action: :cache)` (What’s up with ‘not a string’? A string is assumed to be serialized json from a form representing an already existing file. The assign method assumes it’s either an IO object or serialized JSON from a form; there are other methods than `assign` to directly set an UploadedFile or what have you).

The cache! method calls uploaded_file = cache.upload(io). Here cache points to an instance of our StandardUploader configured to point at the configured ‘cache’ (temporary) storage, so we’re calling upload on an uploader.

[cache uploader]#upload calls processed to run the IO through any uploader-specific processing that is active on the “cache” stage.

Then it calls #store on itself, the uploader assigned as `cache`. “Uploads the file and returns an instance of Shrine::UploadedFile. By default the location of the file is automatically generated by #generate_location, but you can pass in `:location` to upload to a specific location. [ie path, the actual container storage location is fixed though]”  The implementation is via an indirection through #_store, which:

1.  calls get_metadata on itself (an uploader), which for a new IO object calls extract_metadata, which is overridden by custom metadata plugins. So metadata is normally assigned at the cache/assignment phase. This is perhaps so the metadata can be used in validation?  Not sure if there’s a way to make metadata be in the background, and/or be as part of the promotion step (when copying cache to store on save) instead. There’s some examples suggesting they are relevant here, but I don’t really understand them.

2. Calls #put on itself, the uploader. put by default does nothing but call #copy on the uploader, which actually calls #upload on the actual storage object itself (say a Shrine::Storage::FileSystem), to send the file to that storage adapter — in this case for the configured cache storage, since we started from cache on the attacher. (Some plugins may override put to do more than just call copy). 

3. Converts into a shrine UploadedFile object representing the persisted file, and returns that.

So at this point, after calling attacher.cache!, your file has been persisted to the temporary “cache” storage. attacher.cache! purely deals with the stateless uploader and persisting the file; next is making sure that is recorded in your model _data attribute.

[attacher].assign then does ‘[attacher.]set(uploaded_file)’, where uploaded_file is what was returned from the previous cache! call. set first stores the existing value (which could be nil or an UploadedFile) in the attacher instance variable @old (in part so it can be deleted from storage on model persistence, since it’s been replaced).  And then calls _set to convert the UploadedFile to a hash, and write it to the _data model attribute, so it’s there ready for persistence if/when the model is saved.

So after assignment (model.standard = File.open("whatever")), the file is persisted in your “cache” storage. The in-memory model has standard_data that points to it. But nothing about that is persisted to your model’s persistence/ORM.  If the model previously had a different file attached, it’s still there in the store storage.
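Here is a toy model of that assignment flow, just to pin down the moving parts. None of this is Shrine's real code: ToyAttacher and ToyFile are made-up names, the storages and the model are plain Hashes, and processing and metadata extraction are left out entirely.

```ruby
require "json"
require "securerandom"
require "stringio"

# Toy stand-ins only; names and structure are simplified assumptions.
ToyFile = Struct.new(:id, :storage_key)

class ToyAttacher
  attr_reader :old

  def initialize(record, cache_storage)
    @record = record         # a Hash standing in for the model
    @cache  = cache_storage  # a Hash standing in for the "cache" storage
  end

  # model.standard = io  =>  attacher.assign(io)
  def assign(io)
    set(cache!(io))
  end

  # Persist the bytes to temporary storage, return a reference to them.
  def cache!(io)
    id = SecureRandom.hex(4)
    @cache[id] = io.read
    ToyFile.new(id, "cache")
  end

  # Remember the previous value in @old, then write the in-memory attribute.
  def set(file)
    @old = get
    @record["standard_data"] = { "id" => file.id, "storage" => file.storage_key }.to_json
  end

  def get
    data = @record["standard_data"]
    return nil unless data
    parsed = JSON.parse(data)
    ToyFile.new(parsed["id"], parsed["storage"])
  end

  # "changed?" means @old has been set, even if it was set to nil.
  def changed?
    instance_variable_defined?(:@old)
  end
end

cache    = {}
record   = {}
attacher = ToyAttacher.new(record, cache)

changed_before = attacher.changed?           # nothing assigned yet
attacher.assign(StringIO.new("file bytes"))  # bytes land in cache, attribute is set in memory
```

After assign, the bytes are in the cache Hash and the record Hash points at them, but nothing has been "saved" anywhere, which is the state described above.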

Let’s see how persistence of the new file happens, by tracing the ActiveRecord ORM plugin specifically, when you call model.save.  First note the active_record plugin makes sure shrine’s validations get used by the model, so if they fail, ActiveRecord’s save is normally going to get a validation failure, and not go further. If we made it past there:

In an active_record before_save, it calls attacher.save if and only if the attacher is changed?, meaning set has recorded a previous value in the @old ivar (the previous value could be nil, but the ivar is set). However, the default/core implementation of save doesn’t actually do anything. It seems to be here mainly as a place for shrine plugins to hook into “before_save” in an ORM-agnostic way.  (Might have been less confusing to call it before_save, I dunno).  The file is not moved to the permanent storage (and the old file deleted from permanent storage) until after the model has been successfully persisted.

Then ActiveRecord’s own save has happened — the file data representing the new file persisted in temporary cache has now been persisted to the database.

Then in an active_record after_commit, finalize is called on the attacher. finalize is only called if  @old  is set — so only if the attached file was changed, basically.

The [attacher.]finalize method itself immediately returns if there is no “@old” instance variable set. (So the check with changed? in the hook is actually redundant, even if you call finalize every time, it’ll just return. Unless plugins change this).

Then finalize calls [attacher.]replace. If the @old instance variable is not nil (in which case it’s an UploadedFile object), and that old file was not in the cache storage (so it must be in store storage; this is checked simply by looking at the storage_key in the data hash), replace deletes the old value. “replace” in this case actually means “delete old value”: it doesn’t do anything with the new value, whether the new value is in cache or store. (Not to be confused with a different #replace method on UploadedFile, which actually only deals with uploading a new file. These are each one half of what I’d think of as “replacement”, and perhaps would have best had entirely different names, especially cause they both sound similar to the different “swap” method.)

The finalize method removes the @old ivar from the attacher, so the attacher no longer thinks it has an un-persisted change. (would this maybe be safer AFTER the next step?)

finalize calls `_promote(action: :store) if cached?` — that is, if the current UploadedFile exists, and is associated with the cache store.   [attacher.]#_promote just immediately calls promote —  both of these methods can take an uploaded_file argument, but here they are not given one, and default to the current UploadedFile in this attacher, via get

[attacher.]promote does a `stored_file = store!(uploaded_file, **options)`.  Remember the `cache!` method above? `store!` is just the same, but on the uploader configured as `store` storage instead of `cache` storage — except this time we’re passing in an UploadedFile instead of some not-yet-imported io object. Metadata extraction isn’t performed a second time, because, get_metadata has special behavior for UploadedFile input, to just copy existing metadata instead of re-extracting it.

At this point, the file has been copied/moved to the ‘store’ storage, but another copy of the file may still exist in cache storage (in some cases where the cache and store storages are compatible, the file really was moved rather than copied though), and no state changes have been made at all to the model, either in-memory or persisted, to point to this new file in permanent storage.

So to deal with both those things, [attacher].promote calls [attacher.]swap, which is commented as “Calls #update, overriden in ORM plugins, and returns true if the attachment was successfully updated.” In fact, the over-ridden attacher.update in the activerecord plugin just calls super, and then saves the AR model with validate:false. (I am not a fan of the thing going around my validations, wonder what that’s about).

Default update(uploaded_file) just calls _set(uploaded_file).

_set pretty much just converts the UploadedFile to its serializable json, and then calls write.

write just sets the model attribute to the serializable data (it’s still not persisted, until it gets to the ORM-specific update, where as a last line the model with new data is persisted).

so I think attacher.swap actually just takes the UploadedFile, serializes it to the _data column in the model, and saves/persists the model. Not sure why this is called swap. I think it might be more clear as “update” — oops, but we already have an update, which is by default all that swap calls. I’m not sure the different intent between swap and update, when you should use one vs the other.  (This is maybe one place to intervene to try to use some kind of optimistic or pessimistic locking in some cases)

If swap returns a falsey value (meaning it failed), then promote will go and delete the file persisted to the store storage, to try and keep it from hanging around if  it wasn’t persisted to model.  I don’t totally understand in what cases swap will return a falsey value though. I guess the backgrounding plugin will make it return nil if it thinks the persisted data has changed in db (or the model has been deleted), so a promotion can’t be done.
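To check my own understanding of the finalize sequence, here is a toy model of it in plain ruby. Again, this is not Shrine's code: the storages and the model are Hashes, ToyAttacher is a made-up name, and swap here always succeeds (the real one can apparently return a falsey value, which is what the cleanup line is for).

```ruby
# Toy stand-ins only; a simplified sketch of finalize/replace/promote/swap.
ToyFile = Struct.new(:id, :storage_key)

class ToyAttacher
  def initialize(record, storages, old:)
    @record   = record
    @storages = storages  # { "cache" => {...}, "store" => {...} }
    @old      = old       # previous file (or nil), as left behind by set
  end

  def get
    @record["file"]
  end

  def finalize
    return unless instance_variable_defined?(:@old)
    replace
    remove_instance_variable(:@old)
    promote if get && get.storage_key == "cache"
  end

  # Despite the name, this only deletes the old file, and only when the
  # old file lives outside the cache storage.
  def replace
    return unless @old && @old.storage_key != "cache"
    @storages[@old.storage_key].delete(@old.id)
  end

  # Copy the cached file to "store", then point the model at the copy.
  def promote
    cached = get
    @storages["store"][cached.id] = @storages["cache"][cached.id]
    stored = ToyFile.new(cached.id, "store")
    # If swap fails, delete the copy so it is not left orphaned.
    @storages["store"].delete(stored.id) unless swap(stored)
  end

  # Stand-in for swap/update: write the attribute and "persist" the model.
  def swap(file)
    @record["file"] = file
    @record["persisted"] = true
  end
end

storages = {
  "cache" => { "new1" => "new bytes" },
  "store" => { "old1" => "old bytes" }
}
record   = { "file" => ToyFile.new("new1", "cache") }
attacher = ToyAttacher.new(record, storages, old: ToyFile.new("old1", "store"))

attacher.finalize
```

After finalize, the old file is gone from the store Hash, the new bytes are in store, the record points at the store copy, and the cache copy is still sitting there waiting to be swept.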

overview cheatsheet

pseudo-code-ish chart of call stack of interesting methods, not real code

model.avatar=(io)   =>  avatar_attacher.assign(io)
  ↳ uploaded_file = avatar_attacher.cache!(io)
    ↳ avatar_attacher.cache.upload(io) => processes including extracting metadata and persists to storage, by calling avatar_attacher.cache.store(io)
      ↳ io = uploader.processed(io)
      ↳ io = uploader.store(io) => via uploader._store(io)
        ↳ get_metadata
        ↳ uploader.put(io) => actually persists the file to storage
        returns an UploadedFile
  ↳ avatar_attacher.set(uploaded_file)
    ↳ stores previous value in attacher ivar “@old”, puts serialized UploadedFile in in-memory avatar_data attribute

model.save
  an activerecord before_save triggers avatar_attacher.save iff attacher.changed? (has an @old ivar). Core attacher.save doesn’t do anything, but some plugins hook in.
  activerecord does the save, and commit.
  an active_record after_commit triggers avatar_attacher.finalize iff attacher.changed?
    ↳ attacher._promote/promote iff  attacher.changed?
      ↳ stored_file = avatar_attacher.store!( UploadedFile in-memory )
        ↳ see above at cache!: gets metadata, does other processing/transformation, persists file to store storage, updates in-memory UploadedFile and serialization.
      ↳ attacher.swap(newly persisted UploadedFile)
        ↳ attacher.update(newly persisted UploadedFile) => just calls _set(uploaded_file), which properly serializes it to in-memory data, and then in an activerecord plugin override, persists to db with activerecord.

Some notes

On method names/semantics

“Naming” things is often called (half-jokingly half-serious) one of the hardest problems in computer science, and that is truer the more abstract you get. Sometimes there just aren’t enough English words to go around, or words that correctly convey the meaning. In this architecture, I think both the replace methods probably should have been named something else to avoid confusion, as neither one does what I’d think of as a “replace” operation.

In general, if one needs to interact with some of these methods directly (rather than just through the existing plugins), either to develop a new plugin or to call some behavior directly without a plugin being involved, it’s not always clear to me which method to use. When should I use swap vs update, which in the base implementation kind of do the same thing, but which different plugins may change in different ways? I don’t understand the intended semantics, and the names aren’t helping me. (promote is similar, but with an UploadedFile which hasn’t yet been processed/persisted? Swap/update takes an UploadedFile which has already been persisted, for updating in model).

It is worth noting that all of these will both change the referenced attached file on a model and persist the whole model to the db. If you just want to set a new attached file in the in-memory model without persisting, you’d use “attacher.set(uploaded_file)” — which requires an UploadedFile object, not just an IO. Also if you call set multiple times without saving, only the penultimate one is in the @old variable — I’m not sure if that can lead to some persisted files not being properly deleted and being orphaned?

Shrine plugins do their thing by overriding methods in the core shrine — often the methods outlined above. Some particularly central/complicated plugins to look at are backgrounding and versions (although we’re hoping to change/replace “versions”) — they are very few lines of code, but so abstract I found it hard to wrap my head around.  I found that the understanding of what unadorned base shrine does above was necessary  to truly understand what these plugins were doing.

Are there ways to orphan attached files in shrine?  That is, a file still stored in a storage somewhere, but no longer referenced in a model?  For starters the “cache” storage is kind of designed to have orphaned files, and needs to have old files cleaned out periodically, like a “tmp” directory. While there is a plugin designed to try to clean up some files in “cache”, it can’t possibly catch everything, like a file in “cache” that was associated with a model that was never saved at all (perhaps cause of validation error). So I personally wouldn’t bother with it, and would just assume you need to sweep cache, like the docs suggest you do.

Are there other ways for files to end up orphaned in shrine, including in the “store” storage? If an exception is raised at just the wrong time?  I’m not sure, but I’d like to investigate more. An orphaned file is gonna be really hard to discover and ever delete, I think.

 

Another round of citation features in a sufia app

I reported before on our implementation of an RIS export feature in our sufia 7.4 app.

Since then, we’ve actually nearly completely changed our implementation. Why? Well, it started with us moving on to our next goal: on-page human-readable citation. This was something our user analysis had determined portions of our audience/users wanted.

Turns out that what seemed “good enough” metadata for an RIS export (meeting or exceeding user expectations; users were used to citation exports not being that great, and having to hand-edit them themselves) seemed not at all good enough when actually placed on the page as a human-readable citation (in Chicago format).

We ended up first converting our internal metadata to citeproc-json format/schema. Then using that intermediate metadata as a source for our RIS export, as well as for conversion to human-readable citation with citeproc-ruby.  The conversion/production happens at display-time, from data in our Solr index, which required us to add some data to the Solr index that wasn’t previously there.

On metadata and citations

Turns out getting the right machine-interpretable metadata for a really correct citation is pretty tricky.

It occurs to me that if citations are a serious use case, you should probably consider them when designing your metadata schema in the first place, to make sure you have everything you need in machine-readable/interpretable format. (As unrealistic as this suggestion sounds for many actual projects in our sector). Otherwise you can find you simply don’t have what you need for a reasonable citation.

We ended up adding a few metadata fields, including a “source” field for items in our digital collection that are excerpts from works (which are not in our collection), and need the container work identified in the citation.

In other cases, an excerpt is an independent work in our repo, but also has a ‘child’ relationship to a parent, that is its container for purposes of citation. But in yet other cases, there’s a work with a ‘parent’ work that is for organizational/arrangement purposes only, and is not a container for purposes of citation, but our metadata leaves the software no way to know which is which. (In this case we just treat them all like containers for purposes of citation, and tolerate the occasional not-really-correct-ness, as the “incorrect” citations still unambiguously identify the thing cited).

We also implemented a bunch of heuristics to convert various “just string” fields to parsed metadata. For instance our author (or publisher) names, while from FAST and other library vocabularies, are just in our system as plain single strings. The system doesn’t even record the original authority identifier. (I think this is typical for a sufia/hyrax app, while they use the qa gem to load terms, if the gem supplies identifiers from the original vocabulary, they aren’t recorded).

So, the name `Stayner, Heinrich, -1548` needs to be displayed in some parts of the citation (first author for instance) as Stayner, Heinrich, but in other parts (second author or publisher) as Heinrich Stayner, and in no case includes the dates in the citation, so we gotta try parsing it.  Which is harder than you’d think with all the stuff that can go into an AACR2-style name heading (question marks or the word “approximately”, or sometimes the word “active”, other idiosyncrasies).  And then a corporate name like an imaginary design firm Jones, Smith, Garcia is never actually Garcia Jones, Smith or something like that.
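To give a flavor of the kind of heuristic involved, here is a rough sketch in ruby. It is not our actual implementation: the method names are made up for illustration, and it only handles simple “Family, Given, dates” personal headings. Fed a corporate name like Jones, Smith, Garcia it would happily mangle it, so the caller has to already know the heading is a personal name.

```ruby
# Trailing segments that look like dates: digits, "approximately",
# "active", or a question mark. An illustrative pattern, not exhaustive.
DATE_PART = /\d|\bapproximately\b|\bactive\b|\?/i

def parse_personal_name(heading)
  parts = heading.split(",").map(&:strip).reject(&:empty?)
  # Drop trailing date-like segments ("-1548", "active 1520?").
  parts.pop while parts.length > 1 && parts.last.match?(DATE_PART)
  family, given = parts
  { "family" => family, "given" => given }
end

# "Stayner, Heinrich" style, for a first author.
def inverted(name)
  [name["family"], name["given"]].compact.join(", ")
end

# "Heinrich Stayner" style, for later authors or a publisher.
def direct(name)
  [name["given"], name["family"]].compact.join(" ")
end

name = parse_personal_name("Stayner, Heinrich, -1548")
inverted(name) # => "Stayner, Heinrich"
direct(name)   # => "Heinrich Stayner"
```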

Then there’s turning our dates from a custom schema into something that fits what a citation expects.

Our heuristics get good enough. In fact, I think our automatically-generated human readable citations end up as good as or better than anything else I’ve seen automatically generated on the web, including from major publishers. But they are definitely far from perfect, and have lots of errors in many edge cases. Hopefully all errors that don’t change or confuse about the thing cited, which of course is the point.

CSL, CSL-json, and ruby-citeproc

CSL, the Citation Style Language, is a system for automatically generating human-readable citations according to XML stylesheets for various citation formats/styles.

While I believe CSL originally came out of zotero, some code has been extracted (and is open source like zotero itself), and the standard itself exists as an independent standard. Whether via that code or via other implementations of the schema/standard, open source and not, it has been adopted by other software packages too (like Mendeley, which is not open source).

One part of CSL is a json format (defined with a json schema) to represent an individual “work to be cited”.  This also originally came from Zotero, and doesn’t seem to totally have a universal name yet, or a ton of documentation.  The schema in the repo is called “csl-data.json,” but I’ve also seen this format referred to as just “csl-json”, as well as “citeproc-json” (with or without the hyphens).  It also has even more adoption beyond zotero — it is one of the standard formats that CrossRef (and other DOI resolvers?) can return.  The common IANA/MIME “Content-Type” is `application/vnd.citationstyles.csl+json`, but historically another (incorrect?) form has sometimes been used, `application/citeproc+json`. Some of the names/content type(s) might confuse you into thinking this is a JSON representation of a CSL style (describing a citation format/style like “Chicago” or “MLA”), but it’s not, it’s a format of metadata about a particular “work to be cited”.  I kind of like to call it “csl-data-json” (after the schema URL) to avoid confusion.
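For a sense of what one of these looks like, here is a small made-up item built as a ruby Hash and serialized to JSON. The field names (id, type, title, author with family/given, issued with date-parts) follow the CSL data schema as I understand it. The work itself and its values are invented for the example.

```ruby
require "json"

# An illustrative "work to be cited" in csl-data-json form.
# The values are made up; only the field names come from the schema.
item = {
  "id"        => "example-work-1",
  "type"      => "book",
  "title"     => "An Example Title",
  "author"    => [{ "family" => "Stayner", "given" => "Heinrich" }],
  "publisher" => "Example Press",
  "issued"    => { "date-parts" => [[1540]] }
}

puts JSON.pretty_generate(item)
```

Note that names and dates are already broken into machine-readable parts, which is exactly what a citation formatter needs.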

Even apart from JSON serialization, this is a useful schema in that it separates out fields one will actually need to generate a citation (including machine-readable individual sub-elements for parts of a name or date).  Its best available documentation, in addition to the JSON schema itself, seems to be this document written for the original Javascript implementation and not entirely applicable to generic implementations.

There is, amazingly, a ruby CSL processor in the citeproc-ruby gem.  Not only can it take input in csl-json and format it as an individual citation in a desired style, but, as a standard CSL processor, it can also format a complete bibliography and footnotes in the context of a complete document (where some citation styles call for appropriate ibid use in the context of multiple citations, etc).  I was only interested in formatting an individual citation though.

Initially, I wasn’t completely sure the citeproc-ruby gem would work out for me, for performance or other reasons. But I still decided to split processing into two steps: translating our internal metadata into a csl-json compatible format, and then formatting a human readable citation. This two step process just makes sense for manageable code, trying to avoid an unholy mess of nested if-elsifs all jumbled together. And gives you clear separation if you need to generate in multiple human-readable styles, or change your mind about what style(s) to generate. The csl-json schema is great for an intermediate format even if you are going to format as human-readable by non-CSL means, as it’s been road-tested and proven as having the right elements you need to generate a citation.

However, I did end up using citeproc-ruby in the end.  Its author, @inkshuk, was amazingly helpful and generous with my questions on the GH issues. Initially it looked like there were some extreme performance problems, but using an alternate citeproc-ruby API to avoid re-loading/parsing XML style documents from disk every time (with one PR by me to make this work for locale XML style docs too) avoided those.

Citeproc-ruby can’t yet handle formatting of date ranges in a citation (inkshuk has started on the first steps to an implementation in response to my filed issue).  So when I have a date range in a work-to-be-cited, I just format it myself in my own ruby code, and include it in the csl-data-json as a date “literal”.
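A sketch of that workaround, assuming year-only dates: a single year goes in as structured date-parts, while a range is pre-formatted and passed through as a literal. The helper name csl_issued and the en-dash formatting are my own choices for illustration, not our real code.

```ruby
# Build the value for a CSL date field such as "issued".
def csl_issued(start_year, end_year = nil)
  if end_year.nil? || end_year == start_year
    # A single date can stay machine-readable.
    { "date-parts" => [[start_year]] }
  else
    # A range is formatted by hand and passed through as-is (\u2013 is an en dash).
    { "literal" => "#{start_year}\u2013#{end_year}" }
  end
end

single = csl_issued(1540)
range  = csl_issued(1540, 1548)
```

The cost is that a literal is opaque to the CSL processor, so the style can’t reformat it. You get whatever string you put in.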

CSL is amazing, and using a CSL processor handles all sorts of weird idiosyncratic edge cases for you. (One example, if a title already includes double-quotes, but is to be double-quoted in the citation, it changes the internal double quotes to single quotes for you. There are so many of these, that you’re not going to think of initially yourself in a custom cobbled-together unholy mess of if-elsif statement implementation).

Also, while I didn’t do it, you could hypothetically customize some of the existing styles in CSL XML if you need to for local context needs. I believe citeproc-ruby even gives you a way to override parts of an existing style in ruby code.

The particular and peculiar challenges of sufia/hyrax/samvera

There are two main, er, idiosyncrasies of the sufia/hyrax/samvera architecture that provided additional challenges. One: the difficulty of efficiently determining the parent work of a work-in-hand, and (in sufia but not hyrax) the collection(s) that contain a work. Two: The split architecture between Solr index data (used at display-time), and fedora data (used at index time), and the need to write code very differently to get data in each of these sources/times.

Initially, I was worried about citeproc-ruby performance. So I started out having our sufia app generate the human-readable citation at index time, and store it as text/html in the Solr index, so at display time it would just have to be retrieved and inserted on the page. Really, even if it only takes 10ms to format a citation, wouldn’t it be better to not add 10ms to the page delivery time? (Granted, 10ms may be nothing to many slow sufia/hyrax apps).

However, to generate citations in our context, we need access to both the container collection (for archival arrangement/location when an archival item), and the parent work, for “container” for citation purposes. These are very slow to get out of fedora. (Changed/improved for fetching parent collections but not parent works in hyrax; we’re still sufia). Like, with our data and infrastructure, it was taking multiple seconds to get the answer from fedora to “what are the parent work(s) for this item-in-hand” (even trying to use the fedora API feature that seemed suited for this, whose name I now forget).  While one can accommodate more slowness at index-time than display-time, several-seconds-per-item was outside our tolerance, when re-indexing our ~20K item collection already can take many hours on an empty solr index.

So you want to get that info from the Solr index instead of fedora, but trying to access the Solr index in the indexing operation leads you to all sorts of problems when generating an initial index, with whether there’s already enough in the index to answer your question you need to index the item-in-hand. We want our indexing operation to always be usable starting from an empty index, for fault recovery purposes among others.  And even ignoring this issue, I found that the sufia ‘actor stack’ info actually led to the right info not being in the Solr index at the right time for a particular item-in-hand-to-index when changing the parent or collection membership for item(s).

Stopping myself as I got into trying to debug the actor stack yet again, I decided to switch to a pure display-time approach.  Just generate the citation on-demand, from the solr index.  At this point I already had a map-metadata-to-csl-json implementation based on doing it at index-time with info from fedora.  I had actually forgotten when I wrote that that I wasn’t leaving my options open to switch to display-time — so I had to rewrite the thing to retrieve the slightly different info in slightly different ways from the Solr index at display time using a sufia “show presenter”.

Also had to add some things to our Solr index so they could be used at display time — we were including in our solr index only the dates-of-work as strings we wanted to display to user on our pages, but the citation metadata transformer needed all our original structured metadata so it could determine how best to convert them (differently) to dates for inclusion in citation. (I stored our original data objects serialized to json, and then have the presenter “re-hydrate” them to our original ruby model objects without touching fedora).
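The store-as-json-and-rehydrate trick looks roughly like this. DateOfWork here is a hypothetical Struct standing in for our real model class (whose attributes differ), and date_of_work_json is a made-up Solr field name.

```ruby
require "json"

# Hypothetical stand-in for one of our structured date model objects.
DateOfWork = Struct.new(:start, :start_qualifier, :finish, keyword_init: true) do
  def to_json(*args)
    to_h.to_json(*args)
  end

  # Rebuild the object from the JSON stored in the index.
  def self.from_json(json)
    new(**JSON.parse(json, symbolize_names: true))
  end
end

original = DateOfWork.new(start: "1540", start_qualifier: "circa", finish: "1548")

# Index time: store the serialized objects in a (multi-valued) Solr field.
solr_doc = { "date_of_work_json" => [original.to_json] }

# Display time: the presenter re-hydrates them without touching fedora.
dates = solr_doc["date_of_work_json"].map { |json| DateOfWork.from_json(json) }
```

The citation code can then work with the same kind of objects whether they came from fedora or from the index.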

Premature Abstraction

In our original implementation, I tried to provide a sort of generic “serialize to RIS”  base class, thinking it would make our code more readable, and potentially be of general use.

However, even originally it didn’t end up working quite as well as I’d hoped (needed custom logic more often than using the “built in” automatic mappings in the base class), and in fact this new implementation abandons it entirely. Instead, it first maps to CSL-json schema/format, and then the RIS serializer mostly just extracts the needed fields from there. (We wanted to take advantage of our improved citation data for on-screen human-readable to improve the RIS export too, of course).

No harm, no foul in our local codebase. You learn more about your requirements and you learn more about how particular architectural solutions work out, and you change your mind about implementation decisions and change them. This is a normal thing.

But if I had jumped to, say, add my “RIS Serializer base” abstraction to some shared codebase (say the hyrax gem, or even some kind of samvera-citations gem), it probably would have ended up not as generally useful as I thought at the time (it’s not even a good match for our needs/use case, it turns out!).  And it’s much harder to change your mind about an abstraction in a shared codebase, that many people may be relying upon, and can’t be changed without backwards incompatibility problems. (That in a local codebase aren’t nearly as problematic, you just change all your code in your repo and commit it and you’re done, no need to worry about versioning or coordinating the work of various developers using the shared code).

It’s good to remember to be even more cautious with abstractions in shared code in general.  Ideally, abstractions in shared code (ie, a gem) should be based on a good understanding of the domain from some experience, and have been proven in one (or better more) individual app(s) over some amount of time, before being enshrined into a shared codebase. The first abstraction that seems to be working well for you in a particular codebase may not stand the test of time and diverse requirements/use cases, and “the wrong abstraction can be worse than no abstraction at all”—and the wrong abstraction can be very expensive and painful to undo in a gem/shared codebase.

Our implementation

You can see the Pull Request here.  (It’s possible there were some subsequent bug fixes postdating the PR).

We have a class called CitableAttributes, which takes a display-time ‘work show presenter’ (which as above has been customized to have access to some original component models), and formats it into data compatible with csl-data-json (retrievable via individual public accessors), as well as an actual JSON document that is csl-data-json.

Our RISSerializer uses a CitableAttributes object to extract individual metadata fields, and put them in the right place in an RIS document. It also needs its own logic for some things that aren’t quite the same in RIS and csl-data-json (different ‘type’ vocabulary, no ability to describe date ranges machine-readably).  We wanted to take advantage of all the logic we had for transforming the metadata to something applicable to citations, to improve the RIS exports too.
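A stripped-down sketch of the idea, working from a csl-data-json-style Hash rather than our actual CitableAttributes object. The type mapping and the handful of fields are illustrative only; the real serializer handles a lot more. RIS lines are a two-letter tag, two spaces, a hyphen, a space, and the value.

```ruby
# Illustrative mapping from csl "type" values to RIS "TY" values.
CSL_TO_RIS_TYPE = {
  "book"            => "BOOK",
  "article-journal" => "JOUR",
  "graphic"         => "ART"
}.freeze

def ris_line(tag, value)
  "#{tag}  - #{value}"
end

def to_ris(csl)
  lines = [ris_line("TY", CSL_TO_RIS_TYPE.fetch(csl["type"], "GEN"))]
  lines << ris_line("TI", csl["title"]) if csl["title"]
  Array(csl["author"]).each do |a|
    lines << ris_line("AU", [a["family"], a["given"]].compact.join(", "))
  end
  lines << ris_line("PB", csl["publisher"]) if csl["publisher"]
  # Only structured dates yield a year; a "literal" date is skipped here.
  year = csl.dig("issued", "date-parts", 0, 0)
  lines << ris_line("PY", year) if year
  lines << ris_line("ER", "")
  # RIS files conventionally use CRLF line endings.
  lines.join("\r\n") + "\r\n"
end

example = {
  "type"   => "book",
  "title"  => "An Example Title",
  "author" => [{ "family" => "Stayner", "given" => "Heinrich" }],
  "issued" => { "date-parts" => [[1540]] }
}

ris = to_ris(example)
```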

Oh, one more interesting thing. We decided for photographs of “realia” (largely from our Museum’s collection), it was more appropriate and useful to cite them as photographs (taken by us, dated the date of the photo), rather than try to cite “realia” itself, which most citation styles aren’t really set up to do, and some here thought was inappropriate for these objects as seen in our website anyhow. So we have some custom logic to determine when an item in our collection is such, and cite appropriately using some clever OO polymorphism. This logic now carries over to the RIS export, hooray.

And a simple Rails helper just uses a CitableAttributes to get a csl-data-json, and then feeds it to citeproc-ruby objects to convert to the human-readable Chicago-style citation we want on screen.

There are definitely still a variety of idiosyncratic edge cases it gets not quite right, from weird punctuation to semantics. But I believe it’s still actually one of the best on-screen automatically-generated human-readable citation implementations around!

Some live diverse examples:

attachment filename downloads in non-ascii encodings, ruby, s3

You tell the browser to force a download, and pick a filename for the browser to ‘save as’ with a Content-Disposition header that looks something like this:

Content-Disposition: attachment; filename="filename.tiff"

Depending on the browser, it might open up a ‘Save As’ dialog with that being the default, or might just go ahead and save to your filesystem with that name (Chrome, I think).

If you’re having the user download from S3, you can deliver an S3 pre-signed URL that specifies this header — it can be a different filename than the actual S3 key, and even different for different users, for each pre-signed URL generated.

What if the filename you want is not strictly ascii? You might just stick it in there in UTF-8, and it might work just fine with modern browsers — but I was doing it through the S3 content-disposition download, and it was resulting in S3 delivering an XML error message instead of the file, with the message “Header value cannot be represented using ISO-8859-1.response-content-disposition”.

Indeed, my filename in this case happened to have a Φ (greek phi) in it, and indeed this does not seem to exist as a codepoint in ISO-8859-1, which perhaps is the (standard? de facto?) default for HTTP headers, as well as what S3 expects. (How do I know? In ruby, try `"Φ".encode("ISO-8859-1")`, which raises an error.) If it was unicode that could be trans-coded to ISO-8859-1, would S3 have done that for me? Not sure.
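Here is that check spelled out, as a small sketch. The é comparison is my own addition, to show a non-ascii character that does transcode.

```ruby
# "\u03A6" is Φ (greek capital phi); it has no ISO-8859-1 codepoint,
# so transcoding raises.
begin
  "\u03A6".encode("ISO-8859-1")
  phi_ok = true
rescue Encoding::UndefinedConversionError
  phi_ok = false
end

# "\u00E9" is é, which does exist in ISO-8859-1, as the single byte 0xE9.
e_acute = "\u00E9".encode("ISO-8859-1")
e_acute_bytes = e_acute.bytes
```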

But what’s the right way to do this?  Googling/Stack-overflowing around, I got different answers including “There’s no way to do this, HTTP headers have to be ascii (and/or ISO-8859-1)”, “Some modern browsers will be fine if you just deliver UTF-8 and change nothing else” [maybe so, but S3 was not], and a newer form that looks like filename*=UTF-8''#{uri-encoded utf8} [no double quotes allowed, even though they ordinarily are in a content-disposition filename], which however will break older browsers (maybe just leading to them ignoring the filename rather than actually breaking hard?).

The golden answer appears to be in this stackoverflow answer: you can provide a content-disposition header with both a filename=$ascii_filename (where $ascii_filename is ascii, or maybe can be ISO-8859-1?) and, following it, a filename*=UTF-8'' parameter. Modern browsers will use the UTF-8 one, and older browsers will use the ascii one. At this point, are any of these “older browsers” still relevant? Don’t know, but why not do it right.

Here’s how I do it in ruby, taking input and preparing a) a version that is straight ascii, replacing any non-ascii characters with _, and b) a version that is UTF-8, URI-encoded.

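require "erb"

ascii_filename = file_name.encode("US-ASCII", undef: :replace, replace: "_")
# URI.encode was removed in Ruby 3.0; ERB::Util.url_encode percent-encodes the UTF-8 bytes (a space becomes %20)
utf8_uri_encoded_filename = ERB::Util.url_encode(file_name)

something["Content-Disposition"] = "attachment; filename=\"#{ascii_filename}\"; filename*=UTF-8''#{utf8_uri_encoded_filename}"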

Seems to work. S3 doesn’t complain. I admit I haven’t actually tested this on an “older browser” (not sure how old one has to go, IE8?), but it does the right thing (include the  “Φ ” in filename) on every modern browser I tested on MacOS, Windows (including IE10 on Windows 7), and Linux.

One year of the rubyland.news aggregator

It’s been a year since I launched rubyland.news, my sort of modern take on a “planet” style aggregator of ruby news and blog RSS/atom feeds.

Is there still a place for an RSS feed aggregator in a social media world? I think I like it, and find it a fun hobby/side project regardless. And I’m a librarian by training and trade, and just feel an inner urge to collect, aggregate, and distribute information, heh. But do other people find it useful? Not sure! You can (you may or may not have known) follow rubyland.news on twitter instead, and it’s currently got 86 followers, that’s probably a good sign. I don’t currently track analytics on visits to the rubyland.news web page. It’s also possible to follow rubyland.news through its own aggregated RSS feed, which would be additionally hard to track.

Do you use it or like it? I’d love for you to let me know.

Thoughts on a year of developing/maintaining rubyland.news

I haven’t actually done too much maintenance, it kind of just keeps on chugging. Which is great.  I had originally planned to add a bunch of features, mainly including an online form to submit suggested feeds to include, and an online admin interface for me to approve and otherwise manage feeds. Never got to it, haven’t really needed it — it would take a lot of work over the no-login-no-admin-screen thing that’s there now, and adding feeds with a rake task has worked out fine. heroku run rake feeds:add[http://some/feed.rss], no problem.  So just keep feeling free to email me if you have a suggestion please. So far, I don’t get too many such suggestions, but I myself keep an eye on /r/reddit and add blogs when I see an interesting post from one of them there. I haven’t yet removed any feeds, but maybe I should; inactivity doesn’t matter too much, but feeds sometimes drift to no longer be so much about ruby.

If I was going to do anything at this point, it’d probably be trying to abstract the code a bit so I can use it for other aggregators, with their own names and CSS etc.

It’s kind of fun to have a very simple Rails app for a change. I’m not regretting using Rails here, I know Rails, and it works fine here (no performance problems, I’m just caching everything aggressively with Rails fragment caching, I don’t even bother with a CDN. Unless I set up cloudflare and forgot? I forget. The site only has like 4 pages!). I can do things like my first upgrade of an app to Rails 5.1 in a very simple but real testbed. (It was surprisingly not quite as trivial as I thought even to upgrade this very simple app from rails 5.0 to 5.1. Of course, that ended up not being just Rails 5.1, but doing things like switching to heroku’s supported free-for-hobby-dyno SSL endpoint (the hacky way it was doing it before no longer worked with rails 5.1), and other minor deferred maintenance. Took a couple hours probably.)

It’s fun working with RSS/Atom feeds, I enjoy it. Remember that dream of a “Web 2.0” world that was all about open information sharing through APIs?  We didn’t really get that, we got walled garden social media instead. (More like gated plantations than walled gardens actually, a walled garden sounds kind of nice and peaceful).

But somehow we’ve still got RSS and Atom, and they are still in fairly widespread use. So I get to kind of pretend I’m still in that world. They are in fairly widespread use… but usually as a sort of forgotten unmaintained stepchild. There are gaps in the specifications that will never be filled in, and we get to deal with them. (Can a ‘title’ be HTML, or must it be plain text? If it’s HTML, is there any way to know it is? Nope, not really). I run into all kinds of weirdness: can links in a feed be relative urls? If so, they are supposed to be… relative to what? You might think the feed url… but that’s not always how they go. I get to try to work around them all, which is kinda fun. Or sometimes ‘fun’.

I wish people would offer more tagged/subsection feeds, those seem pretty rare still. I wish medium would offer feeds that worked at all, they don’t really — medium has feeds for a person, but they include both posts and comments with no ways to distinguish, and are thus pretty useless for an aggregator. (I don’t want your out of context two-line comments in my aggregator).

I also get to do fun HTTP/REST kind of stuff — one of the reasons I chose to use Rails with a database as a backend, so I can keep state, is so I can actually do conditional GET requests of feeds and only fetch if a feed has changed. Around 66% of the feed URLs actually provide etags or last-modified so I can try. Then every once in a while I see a feed which reports “304 Not Modified” but it’s a lie, there is new content, the server is just broken. I usually just ignore em.
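For anyone who hasn’t done conditional GETs by hand, here is a minimal sketch of the idea using ruby’s stdlib Net::HTTP. It is not the actual rubyland.news code; the method names conditional_get_headers and fetch_if_changed are invented for illustration, and a real aggregator would also want timeouts, redirect handling and error handling.

```ruby
require "net/http"
require "uri"

# Build the validator headers from whatever was saved on the previous fetch.
# Either value may be nil if the server never sent it.
def conditional_get_headers(etag: nil, last_modified: nil)
  headers = {}
  headers["If-None-Match"] = etag if etag
  headers["If-Modified-Since"] = last_modified if last_modified
  headers
end

# Fetch a feed, returning nil when the server answers 304 Not Modified.
# Otherwise returns the body plus the validators to store for next time.
def fetch_if_changed(url, etag: nil, last_modified: nil)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri, conditional_get_headers(etag: etag, last_modified: last_modified))
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  return nil if response.is_a?(Net::HTTPNotModified)

  { body: response.body, etag: response["ETag"], last_modified: response["Last-Modified"] }
end
```

The stored etag and last_modified values are exactly the state that makes a database (or some other persistence) necessary for this.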

Keeping state also lets me refuse to let a site post-date its entries to keep em at the top of the list, and generally lets me keep the aggregated list in a consistent and non-changing order even if people change their dates on their posts. Oh, dealing with dates is another ‘fun’ thing, people deliver dates in all sorts of formats, with and without timezones, with and without times (just dates), and I get to try to normalize them all somewhat to keep things in a somewhat expected and persistent newest-on-top order. (In which state is also helpful, because I can know when I last fetched a feed, and what entries are actually new since then, to help me guess a “real” timestamp for screwy or timestamp-missing entries.)

Anyway, it’s both fun and “fun”.

Modest Sponsorship from Honeybadger

Rubyland.news is hosted on heroku, cause it’s easy, and even fun, and this is a side project. Its costs are low (one hobby dyno, a free postgres that I might upgrade to the lowest tier paid one at some point). Costs are low, but there are costs.

Fortunately covered by a modest $20/month sponsorship from Honeybadger. I think it’s important to be open about exactly how much they are paying, so you can decide for yourself if it’s likely influencing rubyland.news’s editorial decisions or whatever, and just everything is transparent. I don’t think it is, I do include honeybadger’s Developer Blog in the aggregator, but I think I’d stop if it started looking spammy.

When they first offered the modest sponsorship, I had no experience with honeybadger. But since then I’ve been using it both for rubyland.news (which has very few approaching zero uncaught exceptions) and a day job project (which has plenty). I’ve liked using it, I definitely recommend checking it out.  Honeybadger definitely keeps developing, adding and refining features, if there’s any justice I think it’ll be as successful in the market as bugsnag.  I think I like it better than bugsnag, although it’s been a while since I used bugsnag now. I think honeybadger pricing tends to be better than bugsnag’s, although it depends on your needs and sizes. They also offer a free “micro” plan for projects that are non-commercial open source, although you gotta email them to ask for it. Check em out!

Performance on a many-membered Sufia/Hyrax show page

We still run Sufia 7.3, haven’t yet upgraded/migrated to hyrax, in our digital repository. (These are digital repository/digital library frameworks, for those who arrived here and are not familiar; you may not find the rest of the very long blog post very interesting. :))

We have a variety of ‘manuscript’/‘scanned 2d text’ objects, where each page is a sufia/hyrax “member” of the parent (modeled based on PCDM). Sufia was originally designed as a self-deposit institutional repository, and (I didn’t quite realize this until recently) sufia/hyrax is known to still have a variety of problems, especially performance-related ones, with works with many members. But it mostly works out.

The default sufia/hyrax ‘show’ page displays a single list of all members on the show page, with no pagination. This is also where admins often find members to ‘edit’ or do other admin tasks on them.

For our current most-membered work, that’s 473 members, 196 of which are “child works” (each of which is only a single fileset–we use child works for individual “interesting” pages we’d like to describe more fully and have show up in search results independently).  In stock sufia 7.3 on our actual servers, it could take 4-6 seconds to load this page (just to get response from server, not including client-side time).  This is far from optimal (or even ‘acceptable’ in standard Rails-land), but… it works.

While I’m not happy with that performance, it was barely acceptable enough that before getting to worrying about that, our first priority was making the ‘show’ page look better to end-users: incorporating a ‘viewer’ launched by clicks on page thumbs, more options in a download menu, bigger images with an image-forward kind of design, etc. As we were mostly just changing sizes and layouts and adding a few more attributes and conditionals, I didn’t think this would affect performance much compared to the stock.

However, just as we were about to reach a deadline for a ‘soft’ mostly-internal release, we realized the show page times on that most-membered work had deteriorated drastically. To 12 seconds and up for a server response, no longer within the bounds of barely acceptable. (This shows why it’s good to have some performance monitoring on your app, like New Relic or Skylight, so you have a chance to notice performance degradation as a result of code changes as soon as it happens. Although we don’t actually have this at present.)

We thus embarked on a week+ of most of our team working together on performance profiling to figure out what was up and, I’m happy to say, fixing it, perhaps even getting slightly better perf than stock sufia in the end. Some of the things we found definitely apply to stock sufia and hyrax too, others may not; we haven’t spent the time to completely compare and contrast, but I’ll try to comment with my advice.

When I see a major perf degradation like this, my experience tells me it’s usually one thing that’s caused it. But that wasn’t really true in this case, we had to find and fix several issues. Here’s what we found, how we found it, and our local fixes:

N+1 Solr Queries

The N+1 query problem is one of the first and most basic performance problems many Rails devs learn about. Or really, many web devs (or those using SQL or similar stores) generally.

It’s when you are showing a parent and its children, and end up doing an individual db fetch for every child, one-per-child. Disastrous performance-wise; you need to find a way to do a single db fetch that gets everything you want instead.

So this was our first guess. And indeed we found that stock sufia/hyrax did do n+1 queries to Solr on a ‘show’ page, where n is the number of members/children.

If you were just fetching with ordinary ActiveRecord, the solution to this would be trivial, adding something like .includes(:members) to your ActiveRecord query.  But of course we aren’t, so the solution is a bit more involved, since we have to go through Solr, and actually traverse over at least one ‘join’ object in Solr too, because of how sufia/hyrax stores these things.

Fortunately Princeton University Library already had a local solution of their own, which folks in the always helpful samvera slack channel shared with us, and we implemented locally as well.

I’m not a huge fan of overriding that core member_presenters method, but it works and I can’t think of a better way to solve this.
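As a rough illustration of the difference (this is not Princeton’s code, and not the sufia/hyrax API), here is a sketch using a hypothetical FakeSolr stand-in that just counts how many requests it receives. In a real app the batched version would be one Solr query matching all the member ids at once; every class and method name below is made up.

```ruby
# Stand-in for a Solr connection that counts how many queries it receives.
# (Hypothetical; a real app would talk to Solr over HTTP.)
class FakeSolr
  attr_reader :query_count

  def initialize(docs)
    @docs = docs          # hash of id => document hash
    @query_count = 0
  end

  # Returns the documents for the given ids in a single "request".
  def find_by_ids(ids)
    @query_count += 1
    ids.map { |id| @docs[id] }.compact
  end
end

# N+1 style: one request per member.
def member_docs_one_by_one(solr, member_ids)
  member_ids.flat_map { |id| solr.find_by_ids([id]) }
end

# Batched: one request for all members, then put back into member order,
# since a search index will not necessarily return them in that order.
def member_docs_batched(solr, member_ids)
  by_id = solr.find_by_ids(member_ids).map { |doc| [doc["id"], doc] }.to_h
  member_ids.map { |id| by_id[id] }.compact
end
```

With 473 members, the first version makes 473 round trips and the second makes one, for the same resulting list of documents.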

We went and implemented this without even doing any profiling first, cause it was a low-hanging fruit. And were dismayed to see that while it did improve things measurably, performance was still disastrous.

Solrizer.solr_name turns out to be a performance bottleneck?(!)

I first assumed this was probably still making extra fetches to solr (or even fedora!), that’s my experience/intuition for most likely perf problem. But I couldn’t find any of those.

Okay, now we had to do some actual profiling. I created a test work in my dev instance that had 200 fileset members. Less than our slowest work in production, but should be enough to find some bottlenecks, I hoped. The way I usually start is by really clumsily and manually deleting parts of my templates to see which deletions make things faster. I don’t know if this is really a technique I’d recommend, but it’s my habit.

This allowed me to identify that indeed the biggest perf problem at this time was not in fetching the member-presenters, and indeed was in the rendering of them. But as I deleted parts of the partial for rendering each member, I couldn’t find any part that speeded up things drastically, deleting any part just speeded things up proportional to how much I deleted. Weird. Time for profiling with ruby-prof.

I wrapped the profiling just around the portion of the template I had already identified as the problem area. I like the RubyProf::GraphHtmlPrinter report from ruby-prof for this kind of work. (One of these days I’m going to experiment with GraphViz or compatible, but haven’t yet).

Surprisingly, the top culprit for taking up time was — Solrizer.solr_name. (We use Solrizer 3.4.1; I don’t believe as of this date newer versions of solrizer or other dependencies would fix this).

It makes sense Solrizer.solr_name is called a lot. It’s called basically every time you ask for any attribute from your Solr “show” presenter. I also saw it being called when generating an internal app link to a show page for a member, perhaps because that requires attributes. Anything you have set up to delegate …, to: :solr_document probably  also ends up calling Solrizer.solr_name in the SolrDocument.

While I think this would be a problem in even stock Sufia/Hyrax, it explains why it could be more of a problem in our customization — we were displaying more attributes and links, something I didn’t expect would be a performance concern; especially attributes for an already-fetched object oughta be quite cheap. Also explains why every part of my problem area seemed to contribute roughly equally to the perf problem, they were all displaying some attribute or link!

It makes sense to abstract the exact name of the Solr field (which is something like title_ssim), but I wouldn’t expect this call to be much more expensive than a hash lookup (which can usually be done thousands of times in 1ms). Why is it so much slower? I didn’t get that far, instead I hackily patched Solrizer.solr_name to cache based on arguments, so all calls after the first with the same argument would be just a hash lookup.

I don’t think this would be a great upstream PR, it’s a workaround. Would be better to figure out why Solrizer.solr_name is so slow, but my initial brief forays there didn’t reveal much, and I had to return to our app.
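The general shape of a cache-by-arguments patch is something like the following sketch. To keep it self-contained it patches a hypothetical SlowNamer module rather than Solrizer itself, and the suffix logic inside SlowNamer is invented, not Solrizer’s real behavior. The part that matters is the prepended module, which turns every repeat call with the same arguments into a hash lookup.

```ruby
# Hypothetical stand-in for a module with an expensive class-level method.
# The suffix mapping here is made up purely for the demo.
module SlowNamer
  def self.solr_name(field, *opts)
    @calls = (@calls || 0) + 1
    suffix = opts.include?(:stored_searchable) ? "_tesim" : "_ssim"
    "#{field}#{suffix}"
  end

  # How many times the real (uncached) method has run.
  def self.calls
    @calls || 0
  end
end

# Caching layer placed in front of the original method via prepend.
# The full argument list is the cache key.
module SolrNameCache
  def solr_name(*args)
    @solr_name_cache ||= {}
    @solr_name_cache.fetch(args) { @solr_name_cache[args] = super }
  end
end

SlowNamer.singleton_class.prepend(SolrNameCache)
```

This only makes sense if the method is a pure function of its arguments, and the plain Hash here is not guarded against concurrent writes from multiple threads, which is part of why it’s a workaround rather than a fix.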

Because while this did speed up my test case by a few hundred ms, my test case was still significantly slower compared to an older branch of our local app with better performance.

Using QuestioningAuthority gem in ways other than intended

We use the gem commonly referred to as “Questioning Authority”, but actually released as a gem called qa, for most of our controlled vocabularies, including “rights”. We wanted to expand the display of “rights” information beyond just a label; we wanted a nice graphic and user-facing shortened label à la rightsstatements.org.

It seemed clever some months ago to just add this additional metadata to the licenses.yml file already being used by our qa-controlled vocabulary.  Can you then access it using the existing qa API?  Some reverse-engineering led me to using CurationConcerns::LicenseService.new.authority.find(identifier).

It worked great… except after taking care of Solrizer.solr_name, this was the next biggest timesink in our perf profile. Specifically it seemed to be calling slow YAML.load a lot. Was it reloading the YAML file from disk on every call? It was!  And we were displaying licensing info for every member.

I spent some time investigating the qa gem. Was there a way to add caching and PR it upstream? A way that would be usable in an API that would give me what I wanted here? I couldn’t quite come up with anything without pretty major changes. The QA gem wasn’t really written for this use case, it is focused pretty laser-like on just providing auto-complete to terms, and I’ve found it difficult in the past to use it for anything else. Even in its use case, not caching YAML is a performance mistake, but since it would usually be done only once per request it wouldn’t be disastrous.

I realized, heck, reading from a YAML is not a complicated thing. I’m going to leave it in the licenses.yml for DRY of our data, but I’m just going to write my own cover logic to read the YAML in a perf-friendly way.
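A minimal sketch of what such cover logic can look like: parse the file once, index it by id, and serve every later lookup from memory. The class name LicenseLookup is hypothetical, and the assumed file layout (a top-level terms list whose entries each have an id key, plus whatever extra keys you added) is an assumption about the local file, not something taken from the qa gem’s code.

```ruby
require "yaml"

# Hypothetical read-only lookup over a licenses.yml-style file.
# Assumes the file has a top-level "terms" list of hashes with an "id" key.
class LicenseLookup
  def initialize(path)
    @path = path
  end

  # Returns the full hash for a term (label plus any extra metadata), or nil.
  def find(id)
    terms_by_id[id]
  end

  private

  # Parses the YAML on first use only; later calls reuse the in-memory index.
  def terms_by_id
    @terms_by_id ||= YAML.safe_load(File.read(@path)).fetch("terms").each_with_object({}) do |term, index|
      index[term["id"]] = term
    end
  end
end
```

Held in a constant or other long-lived object, one instance means the disk is touched once per process rather than once per member rendered. The tradeoff is that edits to the file need an app restart to show up, which seems fine for data that ships with the code.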

That trimmed off a nice additional ~300ms out of 2-3 seconds for my test data, but the code was still significantly slower compared to our earlier branch of local app.

[After I started drafting this post, Tom Johnson filed an issue on QA on the subject.]

Sufia::SufiaHelperBehavior#application_name is also slow

After taking care of that one, the next thing taking up the most time in our perf profile was, surprisingly, Sufia::SufiaHelperBehavior#application_name (I think Hyrax equivalent is here and similar).

We were calling that #application_name helper twice per member… just in a data-confirm attr on a delete link! `Deleting #{file_set} from #{application_name} is permanent. Click OK to delete this from #{application_name}, or Cancel to cancel this operation. ` 

If the original sufia code didn’t have this, or only had application_name once instead of twice, that could explain a perf regression in our local code, if application_name is slow. I’m not sure if it did or not, but this was the biggest bottleneck in our local code at this time either way.

Why is application_name so slow? This is another method I might expect would be fast enough to call thousands of times on a page, in the cost vicinity of a hash lookup. Is I18n.t just slow to begin with, such that you can’t call it 400 times on a page?  I doubt it, but it’s possible. What’s hiding in that super call, that is called on every invocation even if no default is needed?  Not sure.

At this point, several days into our team working on this, I bailed out and said, you know what, we don’t really need to tell them the application name in the delete confirm prompt.

Again, significant speed-up, but still significantly slower than our older faster branch.

Too Many Partials

I was somewhat cheered, several days in, to be into actual generic Rails issues, and not Samvera-stack-specific ones. Because after fixing above, the next most expensive thing identifiable in our perf profile was a Rails ‘lookup_template’ kind of method. (Sorry, I didn’t keep notes or the report on the exact method).

As our HTML for displaying “a member on a show page” got somewhat more complex (with a popup menu for downloads and a popup for admin functions), to keep the code more readable we had extracted parts to other partials. So the main “show a member thumb” type partial was calling out to three other partials. So for 200 members, that meant 600 partial lookups.

Seeing that line in the profile report reminded me, oh yeah, partial lookup is really slow in Rails. I remembered that from way back, and had sort of assumed they would have fixed it in Rails by now, but nope. In production configuration the compiled templates are cached, but every render partial: is still a live slow lookup, that I think even needs to check the disk in its partial lookup (touching disk is expensive!).

This would be a great thing to fix in Rails, it inconveniences many people. Perhaps by applying some kind of lookup caching, perhaps similar to what Bootsnap does for $LOAD_PATH and require, but for template lookup paths. Or perhaps by enhancing the template compilation so the exact result of template lookups are compiled in and only need to be done on template compilation.  If either of these were easy to do, someone would probably have done them already (but maybe not).

In any event, the local solution is simple, if a bit painful to code legibility. Remove those extra partials. The main “show a member” partial is invoked with render collection, so only gets looked-up once and is not a problem, but when it calls out to others, it’s one lookup per render every time.  We inlined one of them, and turned two more into helper methods instead of partials. 

At this point, I had my 200-fileset test case performing as well as or better than our older-more-performant-branch, and I was convinced we had it! But we deployed to staging, and it was still significantly slower than our more-performant-branch for our most-membered work. Doh! What was the difference? Ah right, our most-membered work has around 200 child works, my test case didn’t have child works.

Okay, new test case (it was kinda painful to figure out how to create a many-hundred-child-work test case in dev, and very slow with what I ended up with). And back to ruby-prof.

N+1 Solr queries again, for representative_presenter

Right before our internal/soft deadline, we had to at least temporarily bail out of using riiif for tiled image viewer and other derivatives too, for performance reasons.  (We ultimately ended up not using riiif, you can read about that too).

In the meantime, we added a feature switch to our app so we could have the riiif-using code in there, but turn it on and off.  So even though we weren’t really using riiif yet (or perf testing with riiif), there was some code in there preparing for riiif, that ended up being relevant to perf for works with child-works.

For riiif, we need to get a file_id to pass to riiif. And we also wanted the image height and width, so we could use lazysizes-aspectratio so the image would be taking up the proper space on the screen even if waiting for a slow riiif server to deliver it. (lazysizes for lazy image loading, and lazysizes-aspectratio, which can be used even without lazy loading, are both highly recommended; they work great).

We used polymorphism, for a fileset member, the height, width and original_file_id were available directly on the solr object fetched corresponding to the member. But for a child work, it delegated to representative_presenter to get them. And representative_presenter, of course, triggered a solr fetch. Actually, it seemed to trigger three solr fetches, so you could actually call this a 3n+1 query!

If we were fetching from ActiveRecord, the solution to this would possibly be as simple as adding something like .includes("members", "members.representative") . Although you’d have to deal with some polymorphism there in some ways tricky for AR, so maybe that wouldn’t work out. But anyway, we aren’t.

At first I spent some time thinking through if there was a way to bulk-eager-load these representatives for child works similarly to what you might do with ActiveRecord. It was tricky, because the solr data model is tricky, there’s the polymorphism, and solr doesn’t make “joins” quite as straightforward as SQL does. But then I figured, wait, use Solr like Solr. In Solr it’s typical to “de-normalize” your data so the data you want is there when you need it.

I implemented code to index a representative_file_id, representative_width, and representative_height directly on a work in Solr. At first it seemed pretty straightforward. Then we discovered it was missing some edge cases (a work that has as its representative a child work, that has nothing set as its representative?), and that there was an important omission: if a work has a child work as a representative, and that child work changes its representative (which now applies to the first work), the first work needs to be reindexed to have it. So changes to one work need to trigger a reindex of another. After around 10 more frustrating dev hours, some tricky code (which reduces indexing performance but better than bad end-user performance), some very-slow and obtuse specs, and a very weary brain, okay, got that taken care of too. (this commit may not be the last word, I think we had some more bugfixes after that).
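To give a flavor of the index-time part, here is a small sketch of resolving a work’s representative down to an actual file and producing the denormalized fields. It is not our actual indexing code: FileSetRecord, WorkRecord and representative_fields are hypothetical plain-ruby stand-ins, and the field names are just the three named above without any Solr dynamic-field suffixes.

```ruby
# Hypothetical stand-ins for repository objects.
FileSetRecord = Struct.new(:id, :file_id, :width, :height)
WorkRecord    = Struct.new(:id, :representative)

# Follows representative links until reaching a file set, and returns the
# fields to copy onto the work's own Solr document. Returns {} when the
# chain ends without a file set (nothing set) or loops back on itself.
def representative_fields(record, seen = [])
  case record
  when FileSetRecord
    {
      "representative_file_id" => record.file_id,
      "representative_width"   => record.width,
      "representative_height"  => record.height
    }
  when WorkRecord
    return {} if record.representative.nil? || seen.include?(record.id)

    representative_fields(record.representative, seen + [record.id])
  else
    {}
  end
end
```

This covers only the reading side. The harder half described above, noticing that a child work’s representative changed and reindexing every work that resolves through it, is not shown here.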

After a bulk reindex to get all these new values — our code is even a little bit faster than our older-better-performing-branch. And, while I haven’t spent the time to compare it, I wouldn’t be shocked if it’s actually a bit faster than the Stock sufia.  It’s not fast, still 4-5s for our most-membered-work, but back to ‘barely good enough for now’.

Future: Caching? Pagination?

My personal rules of thumb in Rails are that a response over 200ms is not ideal, over 500ms it’s time to start considering caching, and over 1s (uncached) I should really figure out why and make it faster even if there is caching.  Other Rails devs would probably consider my rules of thumb to already be profligate!

So 4s is still pretty slow. Very slow responses like this not only make the user wait, but load down your Rails server, filling up its processing queue and causing even worse problems under multi-user use. It’s not great.

Under a more standard Rails app, I’d definitely reach for caching immediately. View or HTTP caching is a pretty standard technique to make your Rails app as fast as possible, even when it doesn’t have pathological performance.

But the standard Rails html caching approaches use something they call ‘russian doll caching’, where the updated_at timestamp on the parent is touched when a child is updated. The issue is making sure the cache for the parent page is refreshed when a child displayed on that page changes.

class Product < ApplicationRecord
  has_many :games
end

class Game < ApplicationRecord
  belongs_to :product, touch: true
end

With touch set to true, any action which changes updated_at for a game record will also change it for the associated product, thereby expiring the cache.

ActiveFedora tries to be like ActiveRecord, but it does not support that “touch: true” on associations used in the example for russian doll caching. It might be easy to simulate with an after_save hook or something — but updating records in Fedora is so slow. And worse, I think (?) there’s no way to atomically update just the updated_at in fedora, you’ve got to update the whole record, introducing concurrency problems. I think this could be a whole bunch of work.

jcoyne in slack suggested that instead of russian-doll-style with touching updated_at, you could assemble your cache key from the updated_at values from all children. But I started to worry about child works, this might have to be recursive: if a child is a child work, you need to include all its children as well. (And maybe File children of every FileSet? Or how do fedora ‘versions’ affect this?). It could start getting pretty tricky. This is the kind of thing the russian-doll approach is meant to make easier, but it relies on quick and atomic touching of updated_at.
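To make the recursive version concrete, here is a sketch of building such a key. It is only an illustration of the idea under simplifying assumptions: RepoNode is a hypothetical in-memory stand-in (not ActiveFedora), every node is assumed to already have its updated_at and members loaded, and the File-children and versions questions are ignored.

```ruby
require "digest"
require "time"

# Hypothetical stand-in for a work or file set with a timestamp and members.
RepoNode = Struct.new(:id, :updated_at, :members)

# Collects "id@timestamp" for a node and all of its descendants, visiting
# each node once so shared or cyclic membership cannot loop forever.
def collect_stamps(node, seen = {})
  return [] if seen[node.id]

  seen[node.id] = true
  stamps = ["#{node.id}@#{node.updated_at.getutc.iso8601(6)}"]
  (node.members || []).each { |member| stamps.concat(collect_stamps(member, seen)) }
  stamps
end

# Cache key that changes whenever any descendant's updated_at changes,
# or when members are added, removed or reordered.
def recursive_cache_key(node)
  "show/#{node.id}/#{Digest::SHA256.hexdigest(collect_stamps(node).join('|'))}"
end
```

The catch is that computing the key means fetching a timestamp for every descendant on every request, so whether this is a win depends on how cheaply those timestamps can be had.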

We’ll probably still explore caching at some point, but I suspect it will be much less straightforward to work reliably than if this were a standard rails/AR app. And the cache failure mode of showing end-users old not-updated data is, I know from experience, really confusing for everyone.

Alternately or probably additionally, why are we displaying all 473 child images on the page at once in the first place?  Even in a standard Rails app, this might be hard to do performantly (although I’d just solve it with cache there if it was the UX I wanted, no problem). Mostly we’re doing it just cause stock sufia did it and we got used to it. Admins use ctrl-f on a page to find a member they want to edit. I kind of like having thumbs for all pages right on the page, even if you have to scroll a lot to see them (was already using lazysizes to lazy load the images only when scrolled to).  But some kind of pagination would probably be the logical next step, that we may get to eventually. One or more of:

  • Actual manual pagination. Would probably require a ‘search’ box on titles of members for admins, since they can’t use ctrl-f anymore.
  • Javascript-based “infinite scroll” (not really infinite) to load a batch at a time as user scrolls there.
  • Or using similar techniques, but actually load everything with JS immediately on page load, but a batch at a time.  Still going to use the same CPU on the server, but quicker initial page load, and splitting up into multiple requests is better for server health and capacity.

Even if we get to caching or some of these, I don’t think any of our work above is wasted — you don’t want to use this technique to workaround performance bottlenecks on the server, in my opinion you want to fix easily-fixable (once you find them!) performance bottlenecks or performance bugs on the server first, as we have done.

Another approach some would take is not rendering some/all of this HTML on the server at all, but switching to some kind of JS client-side rendering (react etc.). There are plusses and minuses to that approach, but it takes our team into kinds of development we are less familiar with; maybe we’ll experiment with it at some point.

Thoughts on the Hydra/Samvera stack

So. I find Sufia and the samvera stack quite challenging, expensive, and often frustrating to work with. Let’s get that out of the way. I know I’m not alone in this experience, even among experienced developers, although I couldn’t say if it’s universal.

I also enjoy and find it rewarding and valuable to think about why software is frustrating and time-consuming (expensive) to work with, what makes it this way, and how did it get this way, and (hardest of all), what can be done or done differently.

If you’re not into that sort of discussion, please feel free to drop out now. Myself, I think it’s an important discussion to have. Developing a successful collaborative open source shared codebase is hard, there are many things we (or anybody) haven’t figured out, and I think it can take some big-picture discussion and building of shared understanding to get better at it.

I’ve been thinking about how to have that discussion in as productive a way as possible. I haven’t totally figured it out — wanting to add this piece in but not sure how to do it kept me from publishing this blog post for a couple months after the preceding sections were finished — but I think it is probably beneficial to ground and tie the big picture discussion in specific examples — like the elements and story above. So I’m adding it on.

I also think it’s important to tell beginning developers working with Samvera, if you are feeling frustrated and confused, it’s probably not you, it’s the stack. If you are thinking you must not be very good at programming or assuming you will have similar experiences with any development project — don’t assume that, and try to get some experience in other non-samvera projects as well.

So, anyhow, this experience of dealing with performance problems on a sufia ‘show’ page makes me think of a couple bigger-picture topics:  1) The continuing cost of using a less established/bespoke data store layer (in this case Fedora/ActiveFedora/LDP) over something popular with many many developer hours already put into it like ActiveRecord, and 2) The idea of software “maturity”.

In this post, I’m going to ignore the first beyond that brief mention, and focus on the second, “maturity”.

Software maturity: What is it, in general?

People talk about software being “mature” (or “immature”) a lot, but googling around I couldn’t actually find much in the way of a good working definition of what is meant by this. A lot of what you find googling is about the “Capability Maturity Model”. The CMM is about organizational processes rather than product, it came out of the context of defense department contractors (a very different context than collaborative open source), and I find its language somewhat bureaucratic. It also has plenty of critique. I think organizational process matters, and CMM may be useful to our context, but I haven’t figured out how to make use of CMM to speak about software maturity in the way I want to here, so I won’t speak of it again here.

Other discussions I found also seemed to me kind of vague, hand-wavy, or self-referential, in ways I still didn’t know how to make use of to talk about what I wanted.

I actually found a random StackOverflow answer I happened across to be more useful than most; I found its focus on usage scenarios and shared understanding to be stimulating:

I would say, mature would add the following characteristic to a technology:

  1. People know how to use it, know its possibilities and limitations
  2. People know what the typical usage scenarios are, patterns, what are good usage scenarios for this technology so that it shows its best
  3. People have found out how to deal with limitations/bugs, there is a community knowledge and help out there
  4. The technology is trusted enough to be used not only by individuals but in productive commercial environment as well

In this way of thinking about it, mature software is software where there is shared understanding about what the software is for, what patterns of use it is best at and which are still more ‘unfinished’ and challenging; where you’re going to encounter those, and how to deal with them.  There’s no assumption that it does everything under the sun awesomely, but that there’s a shared understanding about what it does do awesomely.

I think the unspoken assumption here is that for the patterns of use the software is best at, it does a good job of them, meaning it handles the common use cases robustly with few bugs or surprises. (If it doesn’t even do a good job of those, that doesn’t seem to match what we’d want to call ‘maturity’ in software, right? A certain kind of ‘ready for use’; a certain assumption you are not working on an untested experiment in progress, but on something that does what it does well.)

For software meant as a tool for developing other software (any library or framework; I think sufia qualifies), the usage scenarios are at least as much about developers (what they will use the software for and how) as they are about the end-users those developers are ultimately developing software for.

Unclear understanding about use cases is perhaps a large part of what happened to me/us above. We thought sufia would support ‘manuscript’ use cases (which means many members per work if a page image is a member, which seems the most natural way to set it up) just fine. It appears to have the right functionality. Nothing in its README or other ‘marketing’ tells you otherwise. At the time we began our implementation, it may very well be that nobody else thought differently either.

At some point though, a year+ after the org began implementing the technology stack believing it was mature for our use case, and months after I started working on it myself, an understanding began to build that this use case would have trouble in sufia/hyrax. We started realizing, and realizing that maybe other developers had already realized, that it wasn’t really ready for prime time with many-membered works and would take lots of extra customization and workarounds to work out.

The understanding of what use cases the stack will work painlessly for, and how much pain you will have in what areas, can be something still being worked out in this community, and what understanding there is can be unevenly distributed, and hard to access for newcomers. The above description of software maturity as being about shared understanding of usage scenarios speaks to me; from this experience it makes sense to me that that is a big part of ‘software maturity’, and that the samvera stack still has challenges there.

While it’s not about ‘maturity’ directly, I also want to bring in some of what @schneems wrote about in a blog post about “polish” in software and how he tries to ensure it’s present in software he maintains.

Polish is what distinguishes good software from great software. When you use an app or code that clearly cares about the edge cases and how all the pieces work together, it feels right.…

…User frustration comes when things do not behave as you expect them to. You pull out your car key, stick it in the ignition, turn it…and nothing happens. While you might be upset that your car is dead (again), you’re also frustrated that what you predicted would happen didn’t. As humans we build up stories to simplify our lives, we don’t need to know the complex set of steps in a car’s ignition system so instead, “the key starts the car” is what we’ve come to expect. Software is no different. People develop mental models, for instance, “the port configuration in the file should win” and when it doesn’t happen or worse happens inconsistently it’s painful.

I’ve previously called these types of moments papercuts. They’re not life threatening and may not even be mission critical but they are much more painful than they should be. Often these issues force you to stop what you’re doing and either investigate the root cause of the rogue behavior or at bare minimum abandon your thought process and try something new.

When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent…

…In many ways I want my software to be boring. I want it to harbor few surprises. I want to feel like I understand and connect with it at a deep level and that I’m not constantly being caught off guard by frustrating, time stealing, papercuts.

This kind of “polish” isn’t the same thing as maturity — schneems even suggests that most software may not live up to his standards of “polish”.

However, this kind of polish is a continuum. On the dark opposite side, we’d have hypothetical software, where working with it is about near constant surprises, constantly “being caught off guard by frustrating, time-stealing papercuts”, software where users (including developer-users for tools) have trouble developing consistent mental models, perhaps because the software is not very consistent in its behavior or architecture, with lots of edge cases and pieces working together unexpectedly or roughly.

I think our idea of “maturity” in software does depend on being somewhere along this continuum toward the “polished” end. If we combine that with the idea about shared understanding of usage scenarios and maturity, we get something reasonable. Mature software has shared understanding about what usage scenarios it’s best at, generally accomplishing those usage scenarios painlessly and well. At least in those usage scenarios it is “polished”, people can develop mental models that let them correctly know what to expect, with frustrating “papercuts” few and far between.

Mature software also generally maintains backwards compatibility, with backwards breaking changes coming infrequently and in a well-managed way — but I think that’s a signal or effect of the software being mature, rather than a cause.  You could take software low on the “maturity” scale, and simply stop development on it, and thereby have a high degree of backwards compat in the future, but that doesn’t make it mature. You can’t force maturity by focusing on backwards compatibility, it’s a product of maturity.

So, Sufia and Samvera?

When trying to figure out how mature software is, we are used to taking certain signals as sort of proxy evidence for it. There are about 4 years between the release of sufia 1.0 (April 2013) and Sufia 7.3 (March 2017; beyond this point the community’s attention turned from Sufia to Hyrax, which combined Sufia and CurationConcerns). Much of sufia is of course built upon components that are even older: ActiveFedora 1.0 was Feb 2009, and the hydra gem was first released in Jan 2010. This software stack has been under development for 7+ years, and is used by several dozen institutions.

Normally, one might take these as signs predicting a certain level of maturity in the software. But my experience has been that it was not as mature as one might expect from this history or adoption rate.

From the usage scenario/shared understanding bucket, I have not found that there is as high a degree as I might have expected of easily accessible shared understanding of “know how to use it, know its possibilities and limitations,” “know what the typical usage scenarios are, patterns, what are good usage scenarios for this technology so that it shows its best.” Some people have this understanding to some extent, but this knowledge is not always very clear to newcomers or outsiders, and is not what they may have expected. As in this blog post, things I may assume are standard usage scenarios that will work smoothly may not be. Features I or my team assumed were long-standing, reliable, and finished sometimes are not.

On the “polish” front, I honestly do feel like I am regularly “being caught off guard by frustrating, time stealing, papercuts,” and finding inconsistent and unparallel architecture and behavior that makes it hard to predict how easy or successful it will be to implement something in sufia; past experience is no guarantee of future results, because similar parts often work very differently. It often feels to me like we are working on something at a more proof-of-concept or experimental level of maturity, where you should expect to run into issues frequently.

To be fair, I am using sufia 7, which has been superseded by hyrax (1.0 released May 2017, first 2.0 beta released Sep 2017, no 2.0 final release yet), which in some cases may limit me to older versions of other samvera stack dependencies too. Some of these rough edges may have been filed off in hyrax 1/2; one would expect/hope that every release is more mature than the last. But even with Sufia 7, being based on technology with 4-7 years of development history and adopted by dozens of institutions, one might have expected more maturity. Hyrax 1.0 was only released a few months ago, after all. My impression/understanding is that hyrax 1.0 by intention makes few architectural changes from sufia (although it may include some more bugfixes), and that the upcoming hyrax 2.0 is intended to have more improvements, but most of the difficult architectural elements I run into in sufia 7 still seem to be mostly the same when I look at the hyrax master repo. My impression is that hyrax 2.0 (not quite released) certainly has improvements, but does not make huge maturity strides.

Does this mean you should not use sufia/hyrax/samvera? Certainly not (and if you’re reading this, you’ve probably already committed to it at least for now), but it means this is something you should take account of when evaluating whether to use it, what you will do with it, and how much time it will take to implement and maintain.  I certainly don’t have anything universally ‘better’ to recommend for a digital repository implementation, open source or commercial. But I was very frustrated by assuming/expecting a level of maturity that I then personally did not find to be delivered.  I think many organizations are also surprised to find sufia/hyrax/samvera implementation to be more time-consuming (which also means “expensive”, staff time is expensive) than expected, including by finding features they had assumed were done/ready to need more work than expected in their app; this is more of a problem for some organizations than others.  But I think it pays to take this into account when making plans and timelines.   Again, if you (individually or as an institution) are having more trouble setting up sufia/hyrax/samvera than you expected, it’s probably not just you.

Why and what next?

So why are sufia and other parts of the samvera stack at a fairly low level of software maturity (for those who agree that they are)? Honestly, I’m not sure. What can be done to get things more mature and reliable and efficient (low TCO)? I know even less. I do not think it’s because any of the developers involved (including myself!) have anything but the best intentions and true commitment, or because they are “bad developers.” That’s not it.

Just some brainstorms about what might play into sufia/samvera’s maturity level. Other developers may disagree with some of these guesses, either because I misunderstand some things, or just due to different evaluations.

  • Digital repositories are just a very difficult or groundbreaking domain, and it just necessarily would take this number of years/developer-hours to get to this level of maturity. (I don’t personally subscribe to this really, but it could be)

 

  • Fedora and RDF are both (at least relatively) immature technologies themselves, that lack the established software infrastructure and best practices of more mature technologies (at the other extreme, SQL/rdbms, technology that is many decades old), and building something with these at the heart is going to be more challenging, time-consuming, and harder to get ‘right’.

 

  • I had gotten the feeling, from working with the code and from off-hand comments by developers who had worked with it longer, that Sufia had actually taken a significant move backwards in maturity at some point in the past. At first I thought this was about the transition from fedora/fcrepo 3 to 4. But from talking to @mjgiarlo (thanks buddy!), I now believe it wasn’t so much about that, as about some significant rewriting that happened between Sufia 6 and 7 to: take sufia from an app focused on self-deposit institutional repository with individual files to a more generalized app involving ‘works’ with ‘members’ (based on the newly created PCDM model); use data in Fedora that would be compatible with other apps like Islandora (a goal that has not been achieved and looks to me increasingly unrealistic); and explode the code into many more smaller-purpose, hypothetically decoupled component dependencies that could be recombined into different apps (an approach that, based on outcomes, was later reversed in some ways in Hyrax).
    • This took a very significant number of developer hours, literally over a year or two. These were hours that were not spent on making the existing stack more mature.
    • But so much was rewritten and reorganized that I think it may have actually been a step backward in maturity (both in terms of usage scenarios and polish), not only for the new usage scenarios, but even for what used to be the core usage scenario.
    • So much was re-written, and expected usage scenarios changed so much, that it was almost like creating an entirely new app (including entirely new parts of the dependency stack), so the ‘clock’ in judging how long Sufia (and some but not all other parts of the current dependency stack) has had to become mature really starts with Sufia 7 (first released 2016), rather than sufia 1.0.
    • But it wasn’t really a complete rewrite, “legacy” code still exists, some logic in the stack to this day is still based on assumptions about the old architecture that have become incorrect, leading to more inconsistency, and less robustness — less maturity.
    • The success of this process in terms of maturity and ‘total cost of ownership’ is, I think… mixed at best. And I think some developers are still dealing with some burnout as fallout from the effort.

 

  • Both sufia and the evolving stack as a whole have tried to do a lot of things and fit a lot of usage scenarios. Our reach may have exceeded our grasp. If an institution comes with a new usage scenario (for end-users or for how they want to use the codebase), whether they come with a PR or just a desire, the community very rarely says no, and almost always then tries to make the codebase accommodate it, perhaps in retrospect without sufficient regard for the cost of added complexity. This comes out of a community-minded and helpful motivation to say ‘yes’. But it can lead to lack of clarity on usage scenarios the stack excels at, or even lack of any usage scenarios that are very polished in the face of ever-expanding ambition. This happens in a context of limited developer resources, yes, but increased software complexity also has costs that can’t be handled easily, or sometimes at all, simply by adding developers (see The Mythical Man-Month).

 

  • Related, I think, sufia/samvera developers have often aspired to make software that can be used and installed by institutions without Rails developers, without having to write much or any code. This has not really been accomplished, or if it has, only in the sense that you need samvera developer(s) who are or become proficient in our bespoke stack, instead of just Rails developers. (Our small institution found we needed 1-2 developers plus 1 devops). While motivated by the best intentions (to reduce Total Cost of Ownership for small institutions), the added complexity in pursuit of this ambitious and still unrealized goal may have ironically led to less maturity and increased TCO for institutions of all sizes.

 

  • I think most successfully mature open source projects probably have one (or a small team of) lead developer/architect(s) providing vision as to the usage scenarios that are in or out, and a consistent architecture to accomplish them, with the authority and willingness to sometimes say ‘no’ when they think code might be taking the project in the wrong direction on the maturity axis. Samvera, due to some combination of practical resource limitations and ideology, has often not had this.

 

  • ActiveRecord is enormously complex software which took many many developer-hours to get to its current level of success and maturity. (I actually like AR okay myself). The thought that its API could be copied and reimplemented as ActiveFedora, with far fewer developer-hour resources, without encountering a substantial and perhaps insurmountable “maturity gap” may in retrospect have been mistaken. (See above about how basing an app on Fedora has challenges to achieving maturity).

 

What to do next, or different, or instead? I’m not sure! On the plus side we have a great community of committed and passionate developers, and institutions interested in cooperating to help each other.

I think improvements start with acknowledging the current level of maturity, collectively and in a public way that reaches non-developer stakeholders, decision-makers, and funders too.

We should be intentional about being transparent with the level of maturity and challenge the stack provides. We should resist any urge to “market” samvera or deemphasize the challenges, which is a disservice to people evaluating or making plans based on the stack, and to the existing community too. We don’t all have to agree about this either; I know some developers and institutions do have similar analysis to me here (but surely with some differences), others may not. But we have to be transparent and public about our experiences, to all layers of our community as well as external to it. We all have to see clearly what is, in order to make decisions about what to do next.

Personally, I think we need to be much more modest about our goals and the usage scenarios (both developer and end-user) we can support. This is not necessarily something that will be welcome to decision-makers and funders, who have reasons to want to always add on more instead. But this is why we need to be transparent about where we truly currently are, so decision-makers can operate based on accurate understanding of our current challenges and problems as well as successes.