In our digital collections app rewrite at Science History Institute, we took a moment to step back and be intentional about how we approach “fixity checking” features and UI, to make sure it’s well-supporting the needs it’s meant to. I think we do a good job of providing UI to let repository managers and technical staff get a handle on a reliable fixity checking service, that others may be interested in seeing as an example to consider. Much of our code was implemented by my colleague Eddie Rubeiz.
What is “fixity checking”?
In the field of digital preservation, “fixity” and “fixity checking” basically just means:
- Having a calculated checksum/digest value for a file
- Periodically recalculating that value and making sure it matches the recorded expected value, to make sure there has been no file corruption.
See more at the Digital Preservation Coalition’s Digital Preservation Handbook.
Do we really need fixity checking?
I have been part of some conversations with peers wondering if we really need to be doing all this fixity checking. Modern file/storage systems are pretty good at preventing byte corruption, whether on-premises or cloud PaaS, many have their own low-level “fixity checking” with built-in recovery happening anyway. And it can get kind of expensive doing all that fixity checking, whether in cloud platform fees or local hardware, or just time spent on the systems. Reports of actual fixity check failures (that are not false positives) happening in production are rare to possibly nonexistent.
However, I think everyone I’ve heard questioning is still doing it. We’re not sure we don’t need it, industry/field best practices still mostly suggest doing it, we’re a conservative/cautious bunch.
Myself, I was skeptical of whether we needed to do fixity checking — but when we did our data migration to a new system, it was super helpful to at least have the feature available to be able to help ensure all data was migrated properly. Now I think it’s probably worthwhile to have the feature in a digital preservation system; I think it’s probably good enough to “fixity check” files way less often than many of us do though, maybe as infrequently as once a year?
But, if we’re gonna do fixity checking, we might as well do it right, and reliably.
Pitfalls of Fixity Check Features, and Requirements
Fixity checks are something you need for reliablity, but might rarely use or even look at — and that means it’s easy to have them not working and have nobody notice. It’s a “requirements checklist” thing, institutions want to be able to say the app supports it, but some may not actually prioritize spending much time to make sure it’s working, or the exposed UI is good enough to accomplish the purpose of it.
And in fact, when we were implementing the first version of our app on sufia (the predecessor to hyrax) — we realized that the UI in sufia for reporting fixity check results on a given file object seemed to be broken, and we weren’t totally sure it was recording/keeping the results of it’s checks. (If a fixity check fails in a forest, and…) This may have been affecting other institutions who hadn’t noticed either, not sure. It’s sort of like thinking you have backups, but never testing them, it’s a pitfall of “just in case” reliability features. (I did spend a chunk of time understanding what was going on and submitting code to hyrax fix it up a bit.).
If you have an app that does regular fixity checking, it’s worth considering: Are you sure it’s happening, instead of failing to run (properly or at all) due to an error? How would you check that? Do you have the data and/or UX you need to be confident fixity checking is working as intended, in the absence of any fixity failures?
A fixity check system might send a “push” alert in case of a fixity check failure — but that will usually be rare to nonexistent. We decided that in addition to being able to look at current fixity check status on an individual File/Asset — we need some kind of “Fixity Health Summary” dashboard, that tells you how many fixity checks have been done, which Files (if any) lack fixity checks, if any haven’t gotten a fixity check in longer than expected, total count of any failing fixity check, etc.
This still relies on someone to look at it, but at least there is some way in the UI to answer the question “Are fixity checks happening as expected”.
Fixity Check record data model
Basically following the lead set by sufia/hyrax, we keep a history of multiple past fixity checks.
Having many instead of one fixity status on record did end up significantly complicating the code compared to keeping only the latest fixity check result. Because you often want to do SQL queries based on the date and/or status of the latest fixity check result, and needing to get this from “the record from the set of associated FixityChecks with the latest date” can be kind of tricky to do in SQL, especially when fetching or reporting over many/all of your Assets.
Still, it might be a good idea/requirement? I’m not really sure, or sure what I’d do if I had it to do over, but it ended up this way in our implementation.
Still, we don’t want to keep every past fixity check on record — it would eventually fill up our database if we’re doing regular fixity checks. So what do we want to keep? If a record keeps passing fixity every day, there’s no new info from keeping em all, we decided to mostly just keep fixity checks which established windows on status changes. (I think Hyrax does similar at present).
- The first fixity check
- The N most recent fixity checks (where N may be 1)
- Any failed checks.
- The check right before or right after any failed check, to establish the maximum window that the item may have been failing fixity, as a sort of digital provenance context. (The idea is that maybe something failed, and then you restored it from backup, and then it passed again).
We have some code that looks through all fixity checks for a given work, and deletes any checks not spec’d as keepable as above.Which we normally call after recording any additional fixity check.
Our “FixityCheck” database table includes a bunch of description of exactly what happened: the date of the fixity check, status (success or failure), expected and actual digest values, the location of the file checked (ie S3 bucket and path), as well as of course the foreign key to the Asset “model” object that the file corresponds to.
We also store the digest algorithm used. We use SHA512, due to general/growing understanding that MD5 and SHA1 are outdated and should not be used, and SHA512 is a good one. But want to record this in the database for record-keeping purposes, and to accomodate any future changes to digest algorithm, which may require historical data points using different algorithms to coexist in the database.
The Check: Use shrine API to calculate via streaming bytes from cloud storage
The process of doing a fixity check is pretty straightforward — you just have to compute a checksum!
Because we’re going to be doing this a lot, on some fairly big files (generally we store ~100MB TIFFs, but we have some even larger ones) we want the code that does the check to be as efficient as possible.
Our files are stored in S3, and we thought doing it as efficiently as possible means calculating the SHA512 from a stream of bytes being read from S3, without ever storing them to disk. Reading/writing from disk is actually a pretty slow thing for a process to do, and also risks clogging up disk IO pipelines if lots of processes are doing it at once. And by streaming, calculating iteratively based on the bytes as we fetch them them over the network (which the SHA512 algorithm and most other modern digesting algorithms are capable of), we can get a computation faster.
We are careful to use the proper shrine API to get a stream from our remote storage, avoid shrine caching read bytes to disk, and pass it to the proper ruby OpenSSL::Digest API to calculate the SHA512 from streamed bytes. Here is our implementation. (Shrine 3.0 may make this easier).
Calculate for 1/Nth of all assets per night
If our goal is to fixity check every file once every 7 days, then we want to spread that out by checking 1/7th of our assets every night. In fact we wanted to parameterize that to
N==7 for us at present, we want the freedom to make it a lot higher without a code rewrite. To keep it less confusing, I’ll keep writing as if N is 7.
At first, we considered just taking an arbitrary 1/7th of all Assets, just take the Asset PK, turn it into an integer with random distribution (say MD5 it, I dunno, whatever), and modulo 7.
But we decided that instead taking the 1/7th of Assets that have been least recently checked (or never checked; sort nulls first) has some nice properties. You always check the things most in need of being checked, including recently created assets without yet a check. If some error keeps some thing from being checked or having a check recorded, it’ll still be first in line for the next nightly check.
A little bit tricky to find that list of things to check in SQL cause of our data model, but a little “group by” will do it, here’s our code. We use ActiveRecord find_each to make sure we’re being efficient with memory use when iterating through thousands+ of records.
And some batching in postgres transactions writing results to try to speed things up yet further (not actually sure how well that works). Here’s our rake task for doing nightly fixity checking.— which can show a nice progress bar when run interactively. We think it’s important to have good “developer UI” for all this stuff, if you actually want it to be used regularly — the more frustrating it is to use, the less it will get used, developers are users too!
It ends up taking somewhere around 1-2s per File to check fixity and record check for our files which are typically 100MB or so each. The time it takes to fixity check a file mainly scales with the size of the file. We think mainly simply waiting on streaming the bytes from S3 to calculate a digest (even more than the CPU time of actually calculating the digest). So it should be pretty parallelizable, although we haven’t really tried parallelizing it, cause this is fast enough for us at our scale. (We have around ~25K Files, 1.5TB of total original content).
If a fixity check fails, we want a “push” notification, to actually contact someone and tell them it failed. Currently we do that with both an email and register an error to the Honeybadger error reporting service we already used. (since we already have honeybadger errors being reported to a slack channel with a honeybadger integration, this means it goes to our Slack too).
Admin UI for individual asset fixity status
In the administration page for an individual Asset, you want to be able to confirm the fixity check status, and when the last time a check happened was. Also, you might want to see when the earliest fixity check on record is, and look at the complete recorded history of fixity checks (what’s the point of keeping them around if you aren’t going to show them in any admin UI?)
That “Fixity check history” link is a little expand/contract collapsible control, the history underneath it does not start out expanded. Note it also confirms the digest algorithm used (sha512), and what the actual recorded digest checksum at that timestamp was.
As you can see we also give a “Schedule a check now” button — this actually queues up a fixity check as a background ActiveJob, it’s usually complete within 10 or 20 seconds. This “schedule now” button is useful if you have any concerns, or are trying to diagnose or debug something.
If there’s a failure, you might need a bit more information:
The actual as well as expected digest value; the postgres PK for the table recording this logged info, for a developer to really get into it; and a reverse engineered AWS S3 Console URL that will (after you login to your AWS account with privs) take you to the S3 console view of the key, so you can investigate the file directly from S3, download it, whatever.
(Yeah, all our files are in S3).
Fixity Health Dashboard Admin UI
As discussed, we decided it’s important not just to be able to see fixity check info for a specified known item, but to get a sense of general “fixity health”.
So we provide a dashboard that most importantly will tell you:
- If any assets have a currently failing fixity check
- If any assets haven’t been fixity checked in longer than expected (for us at present, last 7 days).
- But there may be db records for Assets that are still in process of ingesting; these aren’t expected to have fixity checks (although if they are days old and still not fully ingested, it may indicate a problem in our ingest pipeline!)
- And if an Asset was ingested only in the past 24 hours, maybe it just hasn;t gotten it’s first nightly check, so that’s also okay.
It gives some large red or green thumbs-up or thumbs-down icons based on these values, so a repository/collections manager that may not look at this page very often or be familiar with the details of what everything means can immediately know if fixity check health is good or bad.
In addition to the big green/red health summary info at the top, there’s some additional “Asset and Fixity Descriptive Statistics” that will help an administrator, especially a more technical staff member, get more of a sense of what’s going on with our assets and their fixity checks in general/summary, perhaps especially useful for diagnosing a ‘red’ condition.
Here’s another example from a very unhealthy development instance. You can see the list of assets failing fixity check is hyperlinked, so you can go to the administrative page for that asset to get more info, as above.
The nature of our schema (a one-to-many asset-to-fixity-checks instead of a single fixity check on record) makes it a bit tricky to write the SQL for this, involving
GROUP BYs and inner subqueries and such.
The SQL is also a bit expensive, despite trying to index what can be indexed — I think whole-table-aggregate-statistics are just inherently expensive (at least in postgres) — our fixity health summary report page can take ~2 seconds to return in production, which is not terrible by some standards, but not great — and we have much smaller corpus than some, it will presumably scale slower linearly with number of Assets. One approach to dealing with that I can think of is caching (possibly with calculation in a bg job), but it’s not bad enough for us to require that attention at present.
So, we’re pretty happy with this feature set — although fixity check features are something we don’t actually use or look at that much, so I’m not sure what being happy with it means — but if we’re going to do fixity checking, we might as well do our best to make it reliable and give collection/repository managers the info they need to know if it’s being done reliably and we think we’ve done pretty good, and better than a lot of things we’ve seen.
There are a couple outstanding mysteries in our code.
- While we thought we wrote things and set things up to fixity check 1/7th of the collection every night… it seems to be checking 100% of the collection every night instead. Haven’t spent the time to get to the bottom of that and find the bug.
- For a while, we were getting fixity check failures that seemed to be false positive failures. After a failure, if we went to the Asset detail page for the failed asset and clicked “schedule fixity check now” — it would pass. (This is one reason that “fixity check now” button is a useful feature!). Not sure if there’s a race condition or some other kind of bug in our code (or shrine code) that fetches bytes. OR it also could have just been byproduct of some of our syncing/migration logic that was in operation before we went totally live with new site — I don’t believe we’ve gotten any fixity failures since we actually cut over fully to the newly launched site, so possibly we won’t ever again and won’t have to worry about it. But in the interests of full disclosure, wanted to admit it.