Exploring and planning with Sufia/Hyrax/Fedora fixity validation

“Fixity” validation (basically validating a checksum against bytes on disk to make sure a file is still exactly as it was on ingest) is an important part of any digital preservation collection, and my understanding is that it’s a key marketing point of the fedora+hydra stack.

However, I found it somewhat challenging to figure out how/if current Sufia/Hyrax supported this with already-built features. If there are reliable and up to date docs, I did not find them. So, since understanding what’s really going on here seems important for preservation responsibilities, I spent a couple days reverse engineering and debugging what’s there (thanks to various people in Hydra Slack channel for pointing me to the right places to look). What I found had some parts somewhat unexpected to me, and not necessarily quite right at least for what I understand as our needs.

I figured I’d write up what I discovered, and what our current plans (for our local app) are based on it, as an aid to other people wanting to know what’s up, and as a discussion/planning aid in considering any changes to the shared gems.

Hydra component write-ups seem to be very version-sensitive; things tend to change a lot. This was investigated under Sufia 7.3.0, CurationConcerns 1.7.7, ActiveFedora 11.1.6. I believe it has not changed substantially in Hyrax as of this date, except for class name changes (including generally using the term ‘fixity’ instead of ‘audit’ in class names, as well as the Hyrax namespace of course), but I am not totally sure.

There is an existing service to fixity audit a single FileSet, in CurationConcerns at FileSetAuditService.

CurationConcerns::FileSetAuditService.new(fs).audit

So you might run that on every fileset to do a bulk audit, like FileSet.find_each { |fs| CurationConcerns::FileSetAuditService.new(fs).audit } — which is just what (in Sufia rather than CC) Sufia::RepositoryAuditService does, nothing more nothing less.

CurationConcerns::FileSetAuditService actually uses several other objects to do the actual work, and later I’ll go into what they do and how. But the final outcome will be:

  • an ActiveRecord ChecksumAuditLog row created — I believe one for every file checked, in cases where a fileset has multiple files. It seems to have a pass (integer column) of 1 if the object had a good checksum, or a 0 if not.
    • It cleans up after itself in that table, not leaving infinitely growing historical ChecksumAuditLog rows there; generally I think only the most recent two are kept, although there may be more if there are failures. AuditJob calls ChecksumAuditLog.prune_history.
    • While the ChecksumAuditLog record has columns for expected_result and actual_result, nothing in sufia/CC stack fills these out, all you get is the pass value (recall, we think 0 or 1), a file_set_id, a file_id, and version string.
      • I’m not sure what the version string is for, or if it gives you any additional unique data that file_id doesn’t, or if the version string is just a different representation uniquely identifying the same thing file_id does. A version string might look like: `http://127.0.0.1:8080/rest/dev/37/72/0c/72/37720c723/files/214f68af-e5ed-41bd-9898-b8923fd6d018/fcr:versions/version1`
  • In cases of failure, it sends an internal app message to the person listed as the depositor of the fileset, assuming the email address recorded for the depositor still matches the email address of a Sufia account. This is set up by Sufia registering a CurationConcerns callback to run the Sufia::AuditFailureService; that callback is triggered by the CurationConcerns::AuditJob (AuditJob gets run by the FileSetAuditService).
    • The internal message does include the FileSet title and file_set.original_file.uri.to_s. If the file set had multiple versions, the message does not say which one failed the checksum (or how to get to it in the UI).
    • It’s not clear to me in what use cases one wants the depositor (and only if they still have a registered account) to be the only one who gets the fixity failure notice. It seems like an infrastructure problem, and we at least would want a notification sent instead to infrastructure admins who can respond to it, perhaps via an email or an external error-tracking service like Bugsnag or Honeybadger. Fortunately, the architecture makes it pretty easy to customize this.
    • The callback is using CurationConcerns::Callback::Registry, which only supports one callback per event, so setting another one will replace the one set by default by Sufia. Which is fine.
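
To make the “one callback per event” point concrete, here is a small stand-in. TinyCallbackRegistry is a hypothetical class written for this post, not the real CurationConcerns::Callback::Registry; it only mimics the set/replace/run behaviour described above. The event name :after_audit_failure, the block arguments, and the admin address are assumptions for illustration.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for a registry that holds one callback per event.
# This is NOT the real CurationConcerns::Callback::Registry.
class TinyCallbackRegistry
  def initialize
    @callbacks = {}
  end

  # Setting a callback for an event replaces any earlier one.
  def set(event, &block)
    @callbacks[event] = block
  end

  def run(event, *args)
    callback = @callbacks.fetch(event) { raise KeyError, "no callback set for #{event}" }
    callback.call(*args)
  end
end

registry = TinyCallbackRegistry.new
log_io = StringIO.new
logger = Logger.new(log_io)
notified = []

# Default behaviour, roughly what Sufia sets up: notify the depositor.
registry.set(:after_audit_failure) do |file_set_id, depositor, _time|
  notified << [:depositor, depositor, file_set_id]
end

# Local override: write to the log and notify admins instead.
registry.set(:after_audit_failure) do |file_set_id, _depositor, time|
  logger.error("Fixity failure for FileSet #{file_set_id} at #{time}")
  notified << [:admins, "repo-admins@example.org", file_set_id]
end

registry.run(:after_audit_failure, "c247ds31q", "depositor@example.org", Time.now)
```

After the run, only the second block has fired, which is the replacement behaviour I’d rely on for a local override.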

I did intentionally corrupt a file on my dev copy, and then verified that it was caught and that the things listed above happened: basically, the callback sends an internal notification to the depositor, and a ChecksumAuditLog is stored in the database with a 0 value for pass, and the relevant file_set_id and file_id.

While the ChecksumAuditLog objects are all created, there is no admin UI I could find for, say, “show me all ChecksumAuditLog records representing failed fixity checks”.

There is an area on the FileSet “show” page that says Audit Status: Audits have not yet been run on this file. I believe this is intended to show information based on ChecksumAuditLog rows, possibly as a result of something in Sufia calling this line. However, this appears broken in current sufia/hyrax: the status keeps saying “Audits have not yet been run” no matter how many times you’ve run audits. I found this problem had already been reported in November 2016 on the Sufia issue tracker, and imported to the Hyrax issue tracker.

So in current Sufia (and Hyrax?), although the ChecksumAuditLog AR records are created, I believe there is no UI that displays them in any way — a developer could manually interrogate them from a console, otherwise all you’ve got is the (by default) internal notification sent to depositor.

While past versions of Sufia may have run some fixity checks automatically on-demand when a file was viewed, this functionality does not seem to still be in sufia 7.3/hyrax. I’m not sure if this is a desired function anyway: it seems to me you need to be running periodic bulk/mass audits regardless (you don’t want to skip checking files that haven’t been viewed), and if you are doing so, additional on-the-fly checking when viewed/downloaded seems superfluous.

Also note that the “checksum” displayed in the Fileset “show” view is not the checksum used by fedora internally. At least not in our setup, where we haven’t tried to customize this at all. The checksum displayed in the Sufia view is, we believe, calculated on upload even before fedora ingest, and appears to be an MD5, and does not match fedora’s checksum used for the fedora fixity service, which seems to be SHA1.

How is this implemented: Classes involved

The CurationConcerns::FileSetAuditService actually calls out to CurationConcerns::AuditJob to do the bulk of its work.

  • FileSetAuditService calls AuditJob as perform_later; there is no way to configure it to run synchronously.
    • When I ran an audit of every file on our staging server (with a hand-edit to do them synchronously so I could time it more easily and clearly), it took about 3.8 hours to check 8077 FileSets on staging.
    • This means a bulk audit using Resque bg jobs could clog up the resque job queue for up to 3.8 hours (less with more resque workers, though not necessarily scaling linearly), making other jobs (like derivatives creation) take a long time to complete, perhaps at the end of the queue 3.8 hours later. Clogging up the job queue for a bulk fixity audit seems problematic. One could imagine changing it to use a different queue name with dedicated workers, but for a bulk fixity check I’m not sure there is a reason for this to be in the bg job queue at all; doing it all synchronously seems fine/preferable.
    • It’s not entirely clear to me what rationale governs the split of logic between FileSetAuditService and AuditJob, or if it’s completely rational. I guess one thing is that the FileSetAuditService is for a whole FileSet, but the AuditJob is for an individual file. The FileSetAuditService does schedule audits for every file version if there is more than one in the FileSet.
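
For scale, 3.8 hours over 8077 FileSets works out to roughly 1.7 seconds per FileSet. A foreground bulk run could look something like the sketch below. The `check` argument stands in for whatever performs a single fixity check (in the real app, presumably something wrapping ActiveFedora::FixityService); the lambda and the ids here are made up for illustration.

```ruby
# Rough per-FileSet cost from the staging run: 3.8 hours for 8077 FileSets.
seconds_per_file_set = (3.8 * 3600) / 8077

# Run every check in the current process, one after another, and collect
# the ids that fail, instead of enqueueing a background job per file.
def bulk_audit(ids, check)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  failed = ids.reject { |id| check.call(id) }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { checked: ids.size, failed: failed, elapsed_seconds: elapsed }
end

# Hypothetical checker: pretend exactly one file is corrupt.
fake_check = ->(id) { id != "bad-file" }
report = bulk_audit(%w[file-a bad-file file-b], fake_check)

puts format("%.2f seconds per FileSet on staging", seconds_per_file_set)
puts "checked #{report[:checked]}, failed: #{report[:failed].inspect}"
```

Nothing here touches the job queue, which is the whole point: a long-running rake task or cron job could call something shaped like bulk_audit directly.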

The CurationConcerns::AuditJob in turn calls out to ActiveFedora::FixityService to do the actual fixity check.

How is the fixity check done?

  • ActiveFedora::FixityService simply asks Fedora for a fixity check on a URL (for an individual File, I think). The asset is not downloaded or examined by Hydra stack code; a simple HTTP request, “do a fixity check on this file and tell me the result”, is sent to Fedora.
    • This means we are trusting that A) even if the asset has been corrupted, Fedora’s stored checksum for the asset is still okay, and B) that the Fedora fixity service actually works. I guess these are safe assumptions for a reliable fixity service?
    • It looks at the RDF body returned by the Fedora fixity service to interpret whether the fixity check was good or not.
    • While the Fedora fixity service RDF response body includes some additional information (such as original and current checksum), this information is not captured and sent up the stack to be reported or logged. ActiveFedora::FixityService just returns true or false, which explains why ChecksumAuditLog records always have blank expected_result and actual_result attributes.
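
As a rough idea of what capturing more than true/false might look like, here is a sketch that pulls the outcome and any digests out of a fixity response. Everything specific in it is an assumption: the sample response is hand-written in N-Triples style rather than captured from a real server, the PREMIS predicate names and the "BAD_CHECKSUM" outcome value are what I believe Fedora uses but should be checked against your own responses, and a real implementation would use an RDF library rather than regular expressions.

```ruby
# Result object carrying more than a bare true/false.
FixityResult = Struct.new(:passed, :outcome, :digests)

# Assumed PREMIS namespace and predicate names; verify against a real response.
PREMIS = "http://www.loc.gov/premis/rdf/v1#"
OUTCOME_RE = /<#{Regexp.escape(PREMIS)}hasEventOutcome>\s+"([^"]+)"/
DIGEST_RE  = /<#{Regexp.escape(PREMIS)}hasMessageDigest>\s+<([^>]+)>/

# Hand-written illustrative response, not real server output.
SAMPLE_FIXITY_RESPONSE = <<~NT
  <http://127.0.0.1:8080/rest/dev/example/files/f1> <http://www.loc.gov/premis/rdf/v1#hasFixity> <http://127.0.0.1:8080/rest/dev/example/files/f1#fixity/1> .
  <http://127.0.0.1:8080/rest/dev/example/files/f1#fixity/1> <http://www.loc.gov/premis/rdf/v1#hasEventOutcome> "SUCCESS" .
  <http://127.0.0.1:8080/rest/dev/example/files/f1#fixity/1> <http://www.loc.gov/premis/rdf/v1#hasMessageDigest> <urn:sha1:da39a3ee5e6b4b0d3255bfef95601890afd80709> .
NT

# Extract the outcome string and all message digests from the response text.
def parse_fixity(ntriples)
  outcome = ntriples[OUTCOME_RE, 1]
  digests = ntriples.scan(DIGEST_RE).flatten
  FixityResult.new(outcome == "SUCCESS", outcome, digests)
end

good = parse_fixity(SAMPLE_FIXITY_RESPONSE)
bad  = parse_fixity(SAMPLE_FIXITY_RESPONSE.sub("SUCCESS", "BAD_CHECKSUM"))

puts "good: passed=#{good.passed} digests=#{good.digests.inspect}"
puts "bad:  passed=#{bad.passed} outcome=#{bad.outcome}"
```

A result object like this is what would let the expected_result and actual_result columns, and the failure notification, carry something useful.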

What do we need or want locally that differs from standard setup

We decided that trusting the Fedora fixity service was fine — we know of no problems with it, if we did we’d report them upstream to Fedora, who would hopefully fix them quickly since fixity is kind of a key feature for preservation. Ideally, one might want to store a copy of the original checksums elsewhere to make sure they were still good in Fedora, but we decided we weren’t going to do this for now either. We will run some kind of bulk fixity-all-the-things task periodically, and do want to receive notifications.

  1. Different notification on fixity failure than the default internal notification to depositor. This should be easy to do in current architecture with a local setting though, hooray.
  2. Get the bulk fixity check not to create a bg job for every file audited, filling up the bg job queue. For a bulk fixity check in our infrastructure, just one big long-running foreground process seems fine.
  3. Get the Fedora fixity check response details to be recorded and passed up the stack for inclusion in the ChecksumAuditLog and in the notification. Expected checksum and actual checksum, at least. This requires changes to ActiveFedora, or using something new instead of what’s in ActiveFedora. (The current fedora registered checksum may be necessary for recovery, see below.) Not sure if there should be a way to mark a failed ChecksumAuditLog row as ‘resolved’, for ongoing admin overview of fixity status. Probably not: if the same file gets a future ChecksumAuditLog row as ‘passing’, that’s enough indication of ‘resolved’.
  4. Ideally, fix bug where “Audit status” never gets updated and always says “no audits have yet been done”.
  5. Failed audits should be logged to standard rails log as well as other notification methods.
  6. It has been suggested that we might only want to be fixity-auditing the most recent version of any file, and that there’s no need to audit older versions. I’m not sure if this is sound from a preservation standpoint; those old versions might be part of the archival history? But it might simplify one recovery strategy, see below.
  7. Ideally, clean up the code a little bit in general. I don’t entirely understand why logic is split between classes as it is, and don’t understand what all the methods are doing. I don’t understand why ChecksumAuditLog has an integer pass instead of a boolean. The code is harder to follow than seems necessary for the relatively simple functionality here.
  8. Ideally, perhaps, an admin UI for showing “current failed fixity checks”, in case you missed the notification.
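
For the ‘resolved’ question and the admin overview in the list above, a report only needs the newest row per file version: if the newest row passes, any earlier failure counts as resolved. A minimal sketch, using a plain Struct rather than the real ChecksumAuditLog model. The field names mirror the columns mentioned earlier; created_at and the sample ids are assumptions.

```ruby
# Plain stand-in for an audit log row; not the real ActiveRecord model.
AuditRow = Struct.new(:file_set_id, :file_id, :version, :pass, :created_at)

# Keep only the newest row per (file_id, version), then return the ones
# whose newest result is still a failure (pass == 0).
def current_failures(rows)
  rows.group_by { |row| [row.file_id, row.version] }
      .map { |_key, group| group.max_by(&:created_at) }
      .select { |row| row.pass.zero? }
end

rows = [
  AuditRow.new("fs1", "f1", "version1", 0, Time.at(1)),
  AuditRow.new("fs1", "f1", "version1", 1, Time.at(2)), # later pass: resolved
  AuditRow.new("fs2", "f2", "version1", 1, Time.at(1)),
  AuditRow.new("fs2", "f2", "version1", 0, Time.at(3))  # newest row is a failure
]

failures = current_failures(rows)
puts failures.map { |row| "#{row.file_set_id} #{row.file_id} #{row.version}" }
```

Only the second file shows up, since the first file’s failure was followed by a passing check.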

And finally, an area that I’ll give more than a bullet point to: RECOVERY

While I expect fixity failures to be very rare (possibly we will literally never see one in the lifetime of this local project), doing fixity checks without having a tested process for recovery from a discovered corrupted file seems pointless. What’s the point of knowing a file is corrupt if you can’t do anything about it?

I’m curious if any other hydra community people have considered this, and have a recovery process.

We do have disk backups of the whole fedora server. In order to try and recover an older non-corrupted version, we have to know where it is on disk. Knowing fedora’s internally computed SHA1 (which I think is the same thing it uses for fixity checking) seems like what you need to find the file on disk, since files are stored on disk under the SHA1 taken at time of ingest.

Once you’ve identified a known-good backup copy that passes the SHA1 checksum (by computing SHA1s yourself, in the same way fedora does, presumably), how do you actually restore it? I haven’t been able to find anything in sufia/hyrax or fedora itself meant to help you here.
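
The “compute SHA1s yourself” step is at least easy with Ruby’s standard library. The sketch below assumes Fedora’s stored value is a plain SHA1 of the file bytes, which I believe but haven’t confirmed; matching_backups, the file names, and the contents are all made up for illustration.

```ruby
require "digest/sha1"
require "tmpdir"

# Return the candidate paths whose SHA1 matches the expected hex digest.
def matching_backups(paths, expected_sha1)
  paths.select { |path| Digest::SHA1.file(path).hexdigest == expected_sha1.downcase }
end

Dir.mktmpdir do |dir|
  # Two pretend backup copies: one intact, one with a changed byte.
  good = File.join(dir, "backup-monday.bin")
  bad  = File.join(dir, "backup-tuesday.bin")
  File.binwrite(good, "original bytes")
  File.binwrite(bad,  "original bytez")

  # In real use this would be the checksum Fedora has on record.
  expected = Digest::SHA1.hexdigest("original bytes")
  puts matching_backups([good, bad], expected)
end
```

That only finds the good copy; it says nothing about how to get it back into Fedora, which is the part I don’t have an answer for.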

We can think of two ways. We could literally replace the file on disk in the fedora file system. This seems nice, but not sure if we should be messing with fedora’s internals like that. Or we could upload a new “version”, the known-good one, to sufia/fedora. This is not messing with fedora internals, but the downside is the old corrupt version is still there, and still failing fixity checks, and possibly showing up in your failed fixity check reports and notifications etc, unless you build more stuff on top to prevent them. False positive “fixity check failures” would be bad, and lead to admins ignoring “fixity check failure” notices as is human nature.

Curious if Fedora/fcrepo itself has any intended workflow here, for how you recover from a failed fixity check, when you have an older known-good version. Anyone know?

I think most of these changes, at least as options, would be good to send upstream: the current code seems not quite right in the generic case to me, and I don’t think my use cases are special. The challenge with an upstream PR here is that the code spans both Hyrax and ActiveFedora, which would need to be changed in a synchronized fashion, and that I’m not quite sure of the intention of the existing code, or which parts that look like weird architecture to me are actually used or needed by someone or something. Both of these make it more challenging, and more time-consuming, to send upstream. So I’m not sure yet how much I’ll be able to send upstream, and how much will be just local.


2 thoughts on “Exploring and planning with Sufia/Hyrax/Fedora fixity validation”

  1. This gets harder the more I work with it. The different parts of the code are all making different assumptions about what’s going on — and meant to work in a variety of different situations, including fedora ‘versions’ turned on or off.

    The reason the ‘audit status’ doesn’t work is that it’s looking for ChecksumAuditLog rows with a `file_id` of `original_file`, but that never happens in current sufia, where file_ids look more like “c247ds31q/files/c3e8d90c-e47b-4c3b-9cff-50d20b5b0583”. It’s a remnant of some past architecture.

    So, can we look up which file is the “original”? Well, at present _default_ Sufia/Hyrax only _has_ one file in a fileset, and it’s always original. But it might have multiple versions. What do we do to make this work for people who have extended Sufia/Hyrax to allow uploaded derivatives as additional files? Unclear, because who knows how they extended it, or how we might figure out which is the ‘original’, and do we only want to audit or report status for ‘original’ anyway?

    A FileSet might have multiple files, each of which might have multiple versions, and the current code is wholly inadequate at translating that into some kind of report. It has to be entirely rethought and redone.

    Additionally, the `prune` functionality that keeps only recent ChecksumAuditLog rows around prunes all but the last two rows based on *file_id*, but if there are multiple versions, that might mean it’s deleting _current_ audit rows; say there are six versions and they all have current audit rows, it’ll only keep 2 of them.

    And the error callback, which by default sends an in-app notification to the depositor, only tells the depositor which FileSet failed, not which file and which version in that FileSet failed. So it’s not really even enough to tell you what failed. It doesn’t even pass enough information to the callback to report what’s needed.

    Nearly every function of every part of the fixity-related code is behaving improperly.

  2. Also, while there’s logic in there for a fedora with versions turned off, I’m not totally sure it actually works, and I’m not sure how to test it without turning off versions in my fedora, which I guess I can figure out how to do. It seems to me to be trying to treat a string as an Array.
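
To make the pruning problem from the first comment concrete, here is a toy version. It is not the actual prune_history code, just a sketch of the same idea with plain Structs: keep the newest two rows per group, where the only thing that changes is what a “group” is. The Row fields are assumptions; the file_id is the example one from above.

```ruby
# Plain stand-in for an audit row; created_at is a simple counter here.
Row = Struct.new(:file_id, :version, :created_at)

# Keep the newest `keep` rows per group. The block decides the grouping key.
def prune(rows, keep: 2, &key)
  rows.group_by(&key).flat_map { |_k, group| group.max_by(keep, &:created_at) }
end

# One file with six versions, each with a single current audit row.
file_id = "c247ds31q/files/c3e8d90c-e47b-4c3b-9cff-50d20b5b0583"
rows = (1..6).map { |n| Row.new(file_id, "version#{n}", n) }

by_file    = prune(rows) { |row| row.file_id }
by_version = prune(rows) { |row| [row.file_id, row.version] }

puts "grouped by file_id: #{by_file.size} rows kept"
puts "grouped by file_id and version: #{by_version.size} rows kept"
```

Grouping by file_id alone throws away four rows that were the only current result for their version; grouping by file_id plus version keeps all six.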
