OCFL and “source of truth” — two options

One of the great things about conferences is how different sessions can play off each other, and how lots of people interested in the same thing are in the same place (virtual or real) at the same time, to bounce ideas off each other.

I found both of those things coming into play to help elucidate what I think is an important issue in how software might use the Oxford Common File Layout (OCFL), prompted by the Code4Lib 2023 session The Oxford Common File Layout – Understanding the specification, institutional use cases and implementations, with presentations by Tom Wrobel, Stefano Cossu, and Arran Griffith (recorded video here).

OCFL is a specification for laying files out in a disk-like storage system in a way that is suitable for long-term preservation. It uses a standard, simple layout that is both human- and machine-readable, and that would allow someone (or some software) at a future point to reconstruct digital objects and metadata from the record left on disk.
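
For orientation, here is a minimal sketch of what an OCFL object root looks like on disk, per the specification (the file names under content/ are just placeholders, and the version number in the conformance marker may differ): a marker file, an inventory listing every file across versions with digests, and one directory per version holding only the files that changed.

```
object_root/
├── 0=ocfl_object_1.1            # "namaste" conformance marker
├── inventory.json               # manifest of all versions and file digests
├── inventory.json.sha512        # sidecar digest of the inventory
├── v1/
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content/
│       ├── metadata.xml
│       └── image.tiff
└── v2/
    ├── inventory.json
    ├── inventory.json.sha512
    └── content/
        └── metadata.xml         # only files changed in v2 are stored here
```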

The role of OCFL in a software system: Two choices

After the conference presentation, Matt Lincoln from JStor Labs asked a question in the Slack chat that had been rising in my mind too, but which Matt put more clearly than I could have at the time! This prompted a discussion on Slack, largely but not entirely between me and Stefano Cossu, which I found very productive, and which I’m going to detail here with my own additional glosses. But first, let’s start with Matt’s question.

(I will insert Slack links to quotes in this piece; you probably can’t see the sources unless you are a member of the Code4Lib workspace.)

For the OCFL talk, I’m still unclear what the relationship is/can/will be in these systems between the database supporting the application layer, and the filesystem with all the OCFL-laid-out objects. Does DB act as a source of truth and OCFL as a copy? OCFL as source of truth and DB as cache? No db at all, and just r/w directly to OCFL?  If I’m a content manager and edit an item’s metadata in the app’s web interface, does that request get passed to a DB and THEN to OCFL? Is the web app reading/writing directly to the OCFL filesystem without mediating DB representation? Something else?

Matt Lincoln

I think Matt, using the helpful term “source of truth”, accurately identifies two categories of use of OCFL in a software system — and in fact, different people in the OCFL community — even different presenters in this single OCFL conference session — had been taking different paths, maybe assuming that everyone else was on the same page as them, or at least not often drawing out the differences and consequences of these two paths.

Stefano Cossu, one of the presenters from the OCFL talk at Code4Lib, described it this way in a Slack response:

IMHO OCFL can either act as a source from which you derive metadata, or a final destination for preservation derived from a management or access system, that you don’t want to touch until disaster hits. It all depends on how your ideal information flow is. I believe Fedora is tied to OCFL which is its source of truth, upon which you can build indices and access services, but it doesn’t necessarily need to be that way.

Stefano Cossu

It turns out that both paths are challenging in different ways; there is no magic bullet. I think this is a foundational question for the software engineering of systems that use OCFL for preservation, with significant implications on the practice of digital preservation as a whole.

First, let’s say a little bit more about what the paths are.

“OCFL as a source of truth”

If you are treating OCFL as a “source of truth”, the files stored in OCFL are the primary location of your data.

When the software wants to add, remove, or change data, the change will probably happen in the OCFL first, or at any rate won’t be considered a successful change until it is reflected in OCFL.

There might be other layers on top providing alternate access to the OCFL, some kind of “index” for faster and/or easier access to the data, but these are considered “derivative”: the OCFL is “the data”, and everything else can be re-created from the OCFL on disk by an automated process.
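
To make “derivative, re-creatable by an automated process” concrete, here is a minimal sketch of a job that walks an OCFL storage root, reads each object’s inventory.json, and repopulates an index from scratch. The `index_object` callback is hypothetical; a real implementation would use an OCFL client library and whatever index your application actually has.

```python
import json
from pathlib import Path

def rebuild_index(storage_root: str, index_object) -> None:
    """Rebuild a derivative index purely from what's on disk in OCFL.

    `index_object` is a hypothetical callback that writes one object's data
    into whatever index or database the application uses (Solr, Postgres, etc.).
    """
    for inventory_path in Path(storage_root).rglob("inventory.json"):
        object_root = inventory_path.parent
        # Object roots are marked by a "0=ocfl_object_*" namaste file;
        # skip the per-version copies of inventory.json.
        if not any(object_root.glob("0=ocfl_object_*")):
            continue

        inventory = json.loads(inventory_path.read_text())
        head = inventory["head"]                   # e.g. "v3"
        version = inventory["versions"][head]
        index_object(
            object_id=inventory["id"],
            head_version=head,
            files=version["state"],                # digest -> logical file paths
            message=version.get("message"),
        )
```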

This may be what some of the OCFL designers were assuming everyone would do; as we’ll see, it makes certain things possible, and provides the highest level of confidence in our preservation activities.

“OCFL off to the side”

Alternately, you might write an application more or less using standard architectures for writing (for example) web applications. The data is probably in a relational database management system (RDBMS) like PostgreSQL or MySQL, or some other data store meant for supporting application development.

When the application makes a change to the data, it’s made to the primary data store.

Then the data is “mirrored” to OCFL, possibly after every change, or possibly periodically. The OCFL can be thought of as a kind of “backup” — a backup in a specific standard format meant to support long-term preservation and interoperability. I’m calling this “off to the side”; Stefano above calls it “final destination”; in either case it is contrasted with “source of truth”.
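
In code, the “off to the side” path might look something like the following sketch: the database-backed record is saved as usual, and a background job then writes a preservation copy into OCFL. All of the names here (Work, enqueue_job, serialize_for_preservation, ocfl_store) are hypothetical, not any real library’s API.

```python
# Hypothetical sketch: the RDBMS remains the source of truth;
# OCFL is mirrored after the fact, asynchronously.

def after_save(work) -> None:
    # The application's own save has already committed to the database.
    enqueue_job(export_to_ocfl, work.id)

def export_to_ocfl(work_id) -> None:
    work = Work.find(work_id)                              # re-read from the primary store
    files, metadata = serialize_for_preservation(work)     # only what we chose to preserve
    # Append a new OCFL version for this object's preservation copy.
    ocfl_store.write_new_version(
        object_id=work.preservation_id,
        files=files,
        metadata=metadata,
        message=f"Mirrored from application at {work.updated_at}",
    )
```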

It’s possible you haven’t stored all the data the application uses in OCFL, only the data you want to back up “for long-term preservation purposes”. (Stefano later suggests this is their practice, in fact.) Maybe there is some data you think is necessary only for the particular present application’s functionality (say, to support back-end accounts and workflows), which you think of as accidental, ephemeral, contextual, or system-specific and non-standard — and which you don’t see any use in storing for long-term preservation.

In this path, if ALL you have is the OCFL, you aren’t necessarily intending that you can stand your actual present application back up — maybe you didn’t store all the data you’d need for that; maybe you don’t have existing software capable of translating the OCFL back to the form the application actually needs it in to function. Or if you are intending that, the challenge of accomplishing it is greater, as we’ll see.

So why would you do this? Well, let’s start with that.

Why not OCFL as a source of truth?

There’s really only one reason — because it makes application development a lot harder. What do I mean by “a lot harder”? I mean it’s going to take more development time, and more care and more decisions; you’re going to have more trouble achieving reasonable performance in a large-scale system — and you’re going to make more mistakes, have more bugs and problems, more initial deliveries that have problems. It’s not all “up-front” cost or known cost: as you continue to develop the system, you’re going to keep struggling with these things. You honestly have an increased chance of failure.

Why?

In the Slack thread, Stefano Cossu spoke up for OCFL to be a “final destination”, not the “source of truth” for the daily operating software:

I personally prefer OCFL to be the final destination, since if it’s meant to be for preservation, you don’t want to “stir” the medium by running indexing and access traffic, increasing the chances of corruption.

Stefano Cossu

If you’re using it as the actual data store for a running application, instead of leaving it off to the side as a backup, it perhaps increases the chances of bugs affecting data reliability.

The problem with that setup [OCFL as source of truth] is that a preservation system has different technical requirements from an access system. E.g. you may not want store (and index) versioning information in your daily-churn system. Or you may want to use a low-cost, low-performance medium for preservation

Stefano Cossu

OCFL is designed to rebuild knowledge (not only data, but also the semantic relationships between resources) without any supporting software. That’s what I intend for long-term preservation. In order to do that, you need to serialize everything in a way that is very inefficient for daily use.

Stefano Cossu

The form that OCFL prescribes is cumbersome to use for ordinary daily functionality. It makes it harder to achieve the goals you want for your actually running software.

I think Stefano is absolutely right about all of this, by the way, and I also thank him for skillfully and clearly delineating a perspective that may, explicitly or not, actually run somewhat against the grain of some widespread OCFL assumptions.

One aspect of the cumbersomeness is that writes to OCFL need to be “synchronized” with regard to concurrency — the contents of a new version written to OCFL are expressed as deltas on the previous version, so if another version is added while you are preparing your additional version, your version will be wrong. You need to use some form of locking, whether optimistic locks or naive pessimistic locks.
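
As a concrete illustration, here is a minimal sketch of the optimistic variant, assuming a hypothetical ocfl_store client (none of these method names come from a real library): read the current head, prepare the new version as deltas against it, and only commit if the head hasn’t moved in the meantime.

```python
# Hypothetical sketch of optimistic locking around OCFL version writes.

class ConflictError(Exception):
    """Raised by the (hypothetical) store when the head version has moved."""

def write_version_optimistically(object_id, changes, max_retries=3):
    for _ in range(max_retries):
        inventory = ocfl_store.read_inventory(object_id)
        expected_head = inventory["head"]                 # e.g. "v4"

        # New version contents are prepared as deltas against expected_head.
        new_version = prepare_version(inventory, changes)

        try:
            # Commit only if the object's head is still expected_head;
            # otherwise another writer got there first and we must start over.
            ocfl_store.commit_version(object_id, new_version, if_head=expected_head)
            return
        except ConflictError:
            continue
    raise RuntimeError(f"Gave up writing {object_id}: too many concurrent writers")
```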

Whereas a relational database system is built on decades of work to ensure ACID (atomicity, consistency, isolation, durability) with regard to writes, while also trying to optimize performance within these constraints (which can be a real tension) — with OCFL we don’t have the built-up solutions (tools and patterns) for this to the same extent.

Application development gets a lot harder

In general, building a (say) web app on a relational database system is a known problem with a huge corpus of techniques, patterns, shared knowledge, and toolsets available. A given developer may be more or less experienced or skilled; different developers may disagree on optimal choices in some cases. But those choices are being made from a very established field, with deep shared knowledge on how to build applications rapidly (cheaply), with good performance and reliability.

When we switch to OCFL as the primary “source of truth” for an app, we are in some ways charting new territory and have to figure out and invent the best ways to do certain things, with much less support from tooling, from the “literature” (even including blogs you find on Google, etc.), and from a much smaller community of practice.

The Fedora repository platform is in some sense meant to be a kind of “middleware” to make this lift easier. In its version 6 incarnation, its own internal data store is OCFL. It doesn’t give you a user-facing app. It gives you a “middleware” you can access over a more familiar HTTP API with clear semantics, so you don’t have to deal with the underlying OCFL (or, in previous incarnations, other internal formats) yourself. (Seth Erickson’s ocfl_index could be thought of as a similar peer “middleware” in some ways, although it’s read-only; it doesn’t provide for writing.)
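
For a sense of what “middleware with an HTTP API” means in practice, here is a rough sketch of creating and then reading back a resource over a Fedora-style LDP/HTTP API using Python’s requests library. The base URL and path are assumptions for illustration only; consult the Fedora API documentation for the real details.

```python
import requests

BASE = "http://localhost:8080/rest"   # assumed local Fedora endpoint, for illustration

# Create (or update) a resource by PUTting some RDF metadata to it...
turtle = '<> <http://purl.org/dc/terms/title> "My digital object" .'
requests.put(
    f"{BASE}/objects/obj1",
    data=turtle,
    headers={"Content-Type": "text/turtle"},
)

# ...and read it back over HTTP. How it lands in OCFL underneath is Fedora's concern.
response = requests.get(f"{BASE}/objects/obj1", headers={"Accept": "text/turtle"})
print(response.text)
```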

But it’s still not the well-trodden path of rapid web application development on top of an rdbms.

I think the samvera (née hydra) community really learned this to some extent the hard way: trying to build on top of this novel architecture really raised the complexity, cost, and difficulty of implementing the user-facing application (with implications for succession, hiring, and retention too). I’m not saying this happened because the Fedora team did something wrong; I’m saying a novel architecture like this inherently and necessarily raises the difficulty over a well-trodden architectural path. (Although it’s possible to recognize the challenge and attempt to ameliorate it with features that make things easier on developers, it’s not possible to eliminate it.)

Some samvera peer institutions have left the Fedora-based architecture, I think as a result of this experience. Where I work at the Science History Institute, we left sufia/hydra/samvera to write something closer to a “just plain Rails app”, and I believe it seriously and successfully increased our capacity to meet organizational and business needs within our available software engineering capacity. I personally would be really reluctant to go back to attempting to use Fedora and/or OCFL as a “source of truth”, instead of more conventional web app data storage patterns.

So… that’s why you might not… but what do you lose?

What do you lose without OCFL as source of truth?

The trade-off is real though — I think some of the assumptions about what OCFL provides, and how, are actually based on the assumption of OCFL as the source of truth in your application.

Mike Kastellec’s Code4Lib presentation just before the OCFL one, on How to Survive a Disaster [Recovery], really got me thinking about backups and reliability.

Many of us have heard (or worse, found out ourselves the hard way) the adage: You don’t really know if you have a good backup unless you regularly go through the practice of recovery using it, to test it. Many have found that what they thought was their backup — was missing, was corrupt, or was not in a format suitable for supporting recovery. Because they hadn’t been verifying it would work for recovery, they were just writing to it but not using it for anything.

(Where I work, we try to regularly use our actual backups as the source of sync’ing from a production system to a staging system, in part as a method of incorporating backup recovery verification into our routine).

How is a preservation copy analogous? If your OCFL is not your source of truth, but just “off to the side” as a “preservation copy” — it can easily be a similar “write-only” copy. How do you know what you have there is sufficient to serve as a preservation copy?

Just as with backups, there are (at least) two categories of potential problem. It could be that there are bugs in your synchronization routines, such that what you thought was being copied to OCFL was not, or not on the schedule you thought, or was getting corrupted or lost in transit. But the other category is even worse: it could be that your design had problems, and what you chose to sync to OCFL left out some crucial things that future consumers of your preservation copy would have needed to fully restore and access the data. Stefano also wrote:

We don’t put everything in OCFL. Some resources are not slated for long-term preservation. (or at least, we may not in the future, but we do now)

If you are using the OCFL as your daily “source of truth”, you at least know the data you have stored in OCFL is sufficient to run your current system. Or at least you haven’t noticed any bugs with it yet, and if anyone notices any you’ll fix them ASAP.

The goal of preservation is that some future system will be able to use these files to reconstruct the objects and metadata in a useful way… It’s good to at least know it’s sufficient for some system, your current system. If you are writing to OCFL and not using it for anything… it reminds us of writing to a backup that you never restore from. How do you know it’s not missing things, by bug or by misdesign?

Do you even intend the OCFL to be sufficient to bring up your current system (I think some do, some don’t, some haven’t thought about it), and if you do, how do you know it meets your intents?

OCFL and Completeness and Migrations

The OCFL web page lists as one of its benefits (which I think can also be understood as design goals for OCFL):

Completeness, so that a repository can be rebuilt from the files it stores

If OCFL is your application’s “source of truth”, you have this necessarily, in the sense that it is almost the definition of OCFL being the “source of truth”. (Maybe suggesting at least some OCFL designers were assuming it as source of truth.)

But if your OCFL is “off to the side”… do you even have that? I guess it depends on whether you intended the OCFL to be transformable back to your application’s own internal source of truth, and whether that intention was successful. If we’re talking about data from your application being written “off to the side” to OCFL, and then later transformed back to your application — I think we’re talking about what is called “round-tripping” the data.

There was another Code4Lib presentation about a repository migration at Stanford; in the Slack discussion happening around that presentation, Stanford’s Justin Coyne and Mike Giarlo wrote:

I don’t recommend “round trip mappings”. I was a developer on this project.  It’s very challenging to not lose data when going from A -> B -> A

Justin Coyne

We spent sooooo much time on getting these round-trip mappings correct. Goodness gracious.

Mike Giarlo

So, if you want to make your OCFL “off to the side” provide this quality of completeness via round-trippability, you probably have to focus on it intentionally, and then it’s still going to be really hard, maybe one of the hardest (most time-consuming, most buggy) aspects of your application, or at least of its persistence layer.

I found this presentation about repository migration really connecting my neurons to the OCFL discussion generally — when I thought about it I realized: woah, is one description of “preservation” activities actually a practice of trying to plan and provide for unknown future migrations not yet fully spec’d?

So, while we were talking on Slack about repository migrations, and how challenging the data migrations were (several conference presentations dealt with data migrations in repositories), Seth Erickson made a point about OCFL:

One of the arguments for OCFL is that the repository software should upgradeable/changeable without having to migrate the data… (that’s the aspiration, anyways)

Seth Erickson

If the vision is that with nothing more than an OCFL storage system, we can point new software to it and be up and running without a data migration — I think we can see this is basically assuming OCFL as the “source of truth”, and also talking about the same thing the OCFL webpage calls “completeness” again.

And why is this vision aspirational? Well, to begin with, we don’t actually have very many repository systems that use OCFL as a source of truth. We may only have Fedora — that is, systems that use Fedora as middleware. Or maybe ocfl_index too, although since it’s read-only, and also middleware that doesn’t necessarily have user-facing software built on it yet, it’s probably currently a partial entry at most.

If we had multiple systems that could already do this, we’d be a lot more confident it would work out — but of course, the expense and difficulty of building a system using OCFL as the “source of truth” is probably a large part of why we don’t!

OK, do we at least have multiple systems based on Fedora? Well… yes. Even before Fedora was based on OCFL, it would hypothetically have been possible to upgrade or change repository software without a data migration if both source and target software were based on Fedora… except, in fact, it was not possible to do this between samvera sufia/hydra and Islandora, because even though they both used Fedora, the metadata they stored in Fedora (or OCFL) was not consistent. That’s a whole giant topic we’re not going to cover here, except to point out it’s a huge challenge for that vision of “completeness” providing for software changes without data migration, a huge challenge that we have seen in practice, without necessarily seeing a success in practice. (Even within hyrax alone, there are currently two different possible Fedora data layouts, using traditional ActiveFedora with the “wings” adapter or instead the valkyrie-fedora adapter, requiring data migration between them!)

And if we think of the practice of preservation as being trying to maximize chances of providing for migration to future unknown systems with unknown needs… then we see it’s all aspirational (that far-future digital preservation is an aspirational endeavor is of course probably not a controversial thing to say either).

But the little bit of paradox here is that while “completeness” makes it more likely you will be able to easily change systems without data loss, the added cost of developing systems that achieve “completeness” via OCFL as “source of truth” means — you will probably have far fewer, if any, choices of suitable systems to change to, or resources available to develop them!

So… what do we do? Can we split the difference?

I think the first step is acknowledging the issue, the tension here between completeness via “OCFL as source-of-truth” and, well, ease of software development. There is no magic answer that optimizes everything, there are trade-offs.

That quality of “completeness” of data (“source of truth”) is going to make your software much more challenging to develop: it will take longer, take more skill, and have more chance of problems and failures. And another way to say this is: within a given amount of engineering resources, you will be delivering fewer features that matter to your users and organization, because you are spending more of your resources building on a more challenging architecture.

What you get out of this is an aspirationally increased chance of successful preservation. This doesn’t mean you shouldn’t do it; digital preservation is necessarily aspirational. I’m not sure how one balances this cost and benefit — it might well be different for different institutions — but I think we should be careful not to routinely under-estimate the cost or over-estimate the size or confidence of benefits from the “source of truth” approach. Undoubtedly many institutions will still choose to develop OCFL as a source of truth, especially using middleware intended to ease the burden, like Fedora.

I will probably not be one of them at my current institution — the cost is just too high for us; we can’t give up the capacity to relatively rapidly meet other organizational and user needs. But I’d like to look at incorporating OCFL as an “off to the side” preservation copy in the future anyway.

(And Stefano and I are definitely not the only ones considering this or doing it. Many institutions are using an “off to the side” / “final destination” approach to preservation copies, if not with OCFL, then with some of its progenitors or peers like BagIt or Stanford’s Moab — the “off to the side” approach is not unusual, and for good reasons! We can acknowledge it and talk about it without shame!)

If you are developing instead with OCFL “off to the side” (or as a “final destination”), are there things you can do to try to get closer to the benefits of OCFL as “source of truth”?

The main thing I can think of involves “round-trippability”:

  • Yes, commit to storing all of your objects and metadata necessary to restore a working current system in your OCFL
  • And commit to storing it round-trippably
  • One way to ensure/enforce this would be — every time you write a new version to OCFL, run a job that reads those objects and metadata back from OCFL into your internal format, and verifies that the result is still equivalent to what you wrote. Verify the round-trip (a sketch of such a verification follows below).
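
A minimal sketch of what that verification job might look like; every helper name here (serialize_for_preservation, ocfl_store, load_from_ocfl, equivalent) is hypothetical, standing in for whatever your application and OCFL tooling actually provide.

```python
# Hypothetical sketch: verify the round-trip on every OCFL write.

def write_and_verify_round_trip(work) -> None:
    files, metadata = serialize_for_preservation(work)
    ocfl_store.write_new_version(work.preservation_id, files=files, metadata=metadata)

    # Read the version we just wrote back out of OCFL and map it into the
    # application's internal form...
    restored = load_from_ocfl(ocfl_store.read_head_version(work.preservation_id))

    # ...then check it is equivalent to the in-application original.
    if not equivalent(work, restored):
        # Fail loudly: better to learn about a lossy mapping now than during
        # a future migration or disaster recovery.
        raise RuntimeError(f"Round-trip verification failed for {work.preservation_id}")
```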

Round-trippability doesn’t just happen on its own, and ensuring it will definitely significantly increase the cost of your development — as the Stanford folks said from experience, round-trippability is a headache and a major cost! But it could conceivably get you a lot of the confidence in “completeness” that “source of truth” OCFL gets you. And since it is still “off to the side”, it still allows you to write your application using whatever standard (or innovative in different directions) architectures you want; you don’t have the novel data persistence architecture involved in all of the feature development you do to meet user and business needs.

This will perhaps arrive at a better cost/benefit balance for some institutions.

There may be other approaches or thoughts, this is hopefully the beginning of a long conversation and practice.
