Some notes on what’s going on in ActiveStorage

I work in a library-archives-museum digital collections and preservation. This is of course a domain that is very file-centric (or “bytestream”-centric, as some might say). Keeping track of originals and their metadata (including digests/checksums), making lots of derivative files (or “variants” and/or “previews” as ActiveStorage calls them; of images, audio, video, or anything else)

So, building apps in this domain in Rails, I need to do a lot of things with files/bytestreams, ideally without having to re-invent wheels of basic bytestream management in rails, or write lots of boilerplate code. So I’m really interested in file attachment libraries for Rails. How they work, how to use them performantly and reliably without race conditions, how to use them flexibly to be able to write simple code to meet our business and user requirements.  I recently did a bit of a “deep dive” into some aspects of shrine;  now, I turn my attention to ActiveStorage.

The ActiveStorage guide (or in edge from master) is a great and necessary place to start (and you should read it before this; I love the Rails Guides), but there were some questions I had it didn’t answer. Here are some notes on just some things of interest to me related to the internals of ActiveStorage.

ActiveStorage is a-changing

One thing to note is that ActiveStorage has some pretty substantial changes in between the latest 5.2.1 release and master. Sadly there’s no way I could find to use github compare UI (which i love) limited just to the activestorage path in the rails repo.

If you check out Rails source, you can do: ​git diff v5.2.0...master activestorage. Not sure how intelligible you can make that output. You can also look at merged PR’s to Rails mentioning “activestorage” to try and see what’s been going on, some PR’s are more significant than others.

I’m mostly looking at 5.2.1, since that’s the one I’d be using were I use it (until Rails 6 comes out, I forget if we know when we might expect that?), although when I realize that things have changed, I make note of it.

The DB Schema

ActiveStorage requires no changes to the table/model of a thing that should have attached files. Instead, the attached files are implemented as ActiveRecord has_many (or the rare has_one in case of has_one_attached) associations to other table(s), using ordinary relational modeling designs.  Most of the fancy modelling/persistence/access features and APIs (esp in 5.2.1) are seem to be just sugar on top of ordinary AR associations (very useful sugar, don’t get me wrong).

ActiveStorage adds two tables/models.

The first we’ll look at is ActiveStorage::Blob, which actually represents a single uploaded file/bytestream/blob. Don’t be confused by “blob”, the bytestream itself is not in the db, rather there’s enough info to find it in whatever actual storage service you’ve configured. (local disk, S3, etc. Incidentally, the storage service configuration is app-wide, there’s no obvious way to use two different storage services in your app, for different categories of file).

The table backing ActiveStorage::Blob has a number of columns for holding information about the bytesteam.

  • id (ordinary Rails default pk type)
  • key: basically functions as a UID to uniquely identify the bytestream, and find it in the storage. Storages may translate this to actual paths or storage-specific keys differently, the Disk storage files in directories by key prefix, whereas the S3 service just uses the key without any prefixes.
    • The key is generated with standard Rails “secure token” functionality–pretty much just a good random 24 char token. 
    • There doesn’t appear to be any way to customize the path on storage to be more semantic, it’s just the random filing based on the random UID-ish key.
  • filename: the original filename of the file on the way in
  • content_type: an analyzed MIME/IANA content type
  • byte_size: what it says on the tin
  • metadata: a Json serialized hash of arbitrary additional metadata extracted on ingest by ActiveStorage. Default AS migrations just put this in a text column and use db-agnostic Rails functions to serialize/deserialize Json, they don’t try to use a json or jsonb column type.
  • created_at: the usual. There is no updated_at column, perhaps because these are normally expected to be immutable (which means not expected to add metadata after point of creation either?).

OK, so that table has got pretty much everything needed. So what’s the ActiveStorage::Attachment model?  Pretty much just a standard join table.  Using a standard Rails polymorphic association so it can associate an ActiveStorage::Blob with any arbitrary model of any class.  The purpose for this “extra” join table is presumably simply to allow you to associate one ActiveStorage::Blob with multiple domain objects. I guess there are some use cases for that, although it makes the schema somewhat more complicated, and the ActiveStorage inline comments warn you that “you’ll need to do your own garbage collecting” if you do that (A Blob won’t be deleted (in db or in storage) when you delete it’s referencing model(s), so you’ve got to, with your own code, make sure Blob’s don’t hang around not referenced by any models unless in cases you want them to).

These extra tables do mean there are two associations to cross to get from a record to it’s attached file(s).  So if you are, say, displaying a list of N records with their thumbnails, you do have an n+1 problem (or a 2n+1 problem if you will :) ). The Active Storage guide doesn’t mention this — it probably should — but AS some of the inline AS comment docs do, and scopes AS creates for you to help do eager loading.

Indeed a dynamically generated with_attached_avatar (or whatever your attachment is called) scope is nothing but a standard ActiveRecord includes  reaching across the join to the blog. (for has_many_attached or has_one_attached).

And indeed if I try it out in my console, the inclusion scope results in three db queries, in the usual way you expect ActiveRecord eager loading to work.

irb(main):019:0> FileSet.with_attached_avatar.all
  FileSet Load (0.5ms)  SELECT  "file_sets".* FROM "file_sets" LIMIT $1  [["LIMIT", 11]]
  ActiveStorage::Attachment Load (0.8ms)  SELECT "active_storage_attachments".* FROM "active_storage_attachments" WHERE "active_storage_attachments"."record_type" = $1 AND "active_storage_attachments"."name" = $2 AND "active_storage_attachments"."record_id" IN ($3, $4)  [["record_type", "FileSet"], ["name", "avatar"], ["record_id", 19], ["record_id", 20]]
  ActiveStorage::Blob Load (0.5ms)  SELECT "active_storage_blobs".* FROM "active_storage_blobs" WHERE "active_storage_blobs"."id" IN ($1, $2)  [["id", 7], ["id", 8]]
=> #<ActiveRecord::Relation [#<FileSet id: 19, title: nil, asset_data: nil, created_at: "2018-09-27 18:27:06", updated_at: "2018-09-27 18:27:06", asset_derivatives_data: nil, standard_data: nil>, #<FileSet id: 20, title: nil, asset_data: nil, created_at: "2018-09-27 18:29:00", updated_at: "2018-09-27 18:29:08", asset_derivatives_data: nil, standard_data: nil>]>

When is file created in storage, when are associated models created?

ActiveStorage expects your ordinary use case will be attaching files uploaded through a form user.avatar.attach(params[:avatar]), where params[:avatar] is a meaning you get the file as a ActionDispatch::Http::UploadedFile. You can also attach a file directly, in which case you are required to supply the filename (and optionally a content-type):  user.avatar.attach(io: File.open("whatever"), filename: "whatever.png").  Or you can also pass an existing ActiveStorage::Blob to ‘attach’.

In all of these case, ActiveStorage normalizes them to the same code path fairly quickly.

In Rails 5.2.1, if you call attach on an already persisted record, immediately (before any save), an ActiveStorage::Blob row and ActiveStorage::Attachment row have been persisted to the db, and the file has been written to your configured storage location.  There’s no need to call save on your original record, the update took place immediately. Your record will report it has (and of course ActiveStorage’s schema means no changes had to be saved for the row for your record itself — and your record does not think it has outstanding changes via changed?, since it does not).

If you call attach on a new (not yet persisted) record, the ActiveStorage::Blob row is _still_ created, and the bytestream is still persisted to your storage service. But an ActiveStorage::Attachment (join object) has not yet been created.  It will be when you save the record.

But if you just abandon the record without saving it… you have an ActiveStorage::Blob nothing is pointing to, along with the persisted bytestream in your storage service. I guess you’d have to periodically look for these and clean then up….

But master branch in Rails tries to improve this situation with a fairly sophisticated implementation of storing deltas prior to save. I’m not entirely sure if that applies to the “already persisted record” case too. In general, I don’t have a good grasp of how AS expects your record lifecycles to effect persistence of Blobs — like if the record you were attaching it to failed validation, is the Blob expected to be there anyway? Or how are you expected to have validation on the uploaded file itself (like only certain content types allowed, say). I believe the PR in Rails master is trying to improve all of that, I don’t have a thorough grasp of how successful it is at making things “just work” how you might expect, without leaving “orphaned” db rows or storage service files.

Metadata

Content-type

ActiveStorage stores the IANA Media Type (aka “MIME type” or “content type”) in the dedicated content_type column in ActiveStorage::Blob. It uses the marcel gem (from the basecamp team) to determine content type.  Marcel looks like it uses file-style magic bytes, but also uses the user-agent-supplied filename suffix or content-type when it decides it’s necessary — trusting the user-agent supplied content-type if all else fails.  It does not look like there is any way to customize this process;  likely most people wouldn’t need that, but I may be one of the few that maybe does. Compare to shrine’s ultra-flexible content-type-determination configuration.

For reasons I’m not certain of, ActiveStorage uses marcel to identify content-type twice.

When (in Rails 2.5.1) you call ​some_model.attach, it calls ActiveStorage::Blob#create_after_upload!, which calls ActiveStorage::Blob#build_after_upload, which calls ActiveStorage::Blob.upload, which sets the content_type attribute to the result of extract_content_type method, which calls marcel.

Additionally, ActiveStorage::Attachment (the join table) has an after_create_commit hook which calls :identify_blob, which calls blob.identify, defined in ActiveStorage::Blob::Identifiable mixin, which also ends up using marcel — only if it already hasn’t been identified (recorded by an identified key in the json serialized metadata column).   This second one only passes the first 4k of the file to marcel (perhaps because it may need to download it from remote storage), while the first one above seems to pass in the entire IO stream.

Normally this second marcel identify won’t be called at all, because the Blob model is already recorded as identified? as a result of the first one. In either case, the operations takes place in the foreground inline (not a bg job), although one of them in an after-commit hook with a second save. (Ah wait, I bet the second one is related to the direct upload feature which I haven’t dived into. Some inline comment docs would still be nice!)

In Rails master, we get an identify:false argument to attach, which can be used to skip which you can use to skip content-type-identification (it might just use the user-agent-supplied content-type, if any, in that case?)

Arbitrary Metadata

In addition to some file metadata that lives in dedicated database columns in the blob table, like content_type, recall that there is a metadata column with a serialized JSON hash, that can hold arbitrary metadata. If you upload an image, you’ll ordinarily find height and width values in there, for instance.  Which you can find eg with ‘model..avatar.metadata[“width”]’ or  ‘model.avatar.metadata[:width]’ (indifferent access, no shortcuts like ‘model.avatar.width’ though, so far as I know).

Where does this come from? It turns out ActiveStorage actually has a nice, abstract, content-type-specific, system for analyzer plugins.  It’s got a built-in one for images, which extracts height and width with MiniMagick, and one for videos, which uses ffprobe command line, part of ffmpeg.

So while this blog post suggests monkey-patching Analyzer::ImageAnalyzer to add in GPS metadata extracted from EXIF, in fact it oughta be possible in 5.2.1+ to use the analyzer plugin to add, remove, or replace analyzers to do your customization, no ugly forwards-compat-dangerous monkey-patching required.  So there are intentional API hooks here for customizing metadata extraction, pretty much however you like.

Unlike content-type-identification which is done inline on attach, metadata analysis is done by ActiveStorage in a background ActiveJob. ActiveStorage::Attachment (the join object, not the blog), has an after_create_commit hook (reminding us that ActiveStorage never expects you to re-use a Blob db model with an altered bytestream/file), which calls blob.analyze_later (unless it’s already been analyzed).   analyze_later simply launches a perform_later ActiveStorage::AnalyzeJob with the (in this case) ActiveStorage::Blob as an argument.  Which just calls analyze on the blob.

So it, at least in theory, this can accommodate fairly slow extraction, because it’s in the background. That does mean you could have an attachment which has not yet been analyzed; you can check to see if analyzation has happened yet with analyzed? — which in the end is just an analyzed: true key in the arbitrary json metadata hash. (Good reminder that ActiveRecord::Store exists, a convenience for making cover methods for keys in a serialized json hash).

This design does assume only one bg job per model that could touch the serialized json metadata column exists at a time — if there were two operating concurrency (even with different keys), there’d be a race condition where one of the sets of changes might get lost as both processes race to 1) load from db, 2) merge in changes to hash, 3) save serialization of merged to db.  So actually, as long as “identified: true” is recorded in content-type-extraction, the identification step probably couldn’t be a bg job either, without taking care of the race condition, which is tricky.

I suppose if you changed your analyzer(s) and needed to re-analyze everything, you could do something like ActiveStorage::Blob.find_each(&:analyze!). analyze! is implemented in terms of update!, so should persist it’s changes to db with no separate need to call save.

Variants

ActiveStorage calls “variants” what I would call “derivatives” or shrine (currently) calls “versions” — basically thumbnails, resizes, and other transformations of the original attachment.

ActiveStorage has a very clever way of handling these that doesn’t require any additional tracking in the db.  Arbitrary variants are created “on demand”, and a unique storage location is derived based on the transformation asked for.

If you call avatar.variant(resize: "100x100"), what’s returned is an ActiveStorage::Variant.  No new file has yet been created if this is the first time you asked for that. The transformation will be done when you call the processed method. (ActiveStorage recommends or expects for most use cases that this will be done in controller action meant to deliver that specific variant, so basically on-demand).   processed will first see if the variant file has already been created, by checking processed?. Which just checks if a file already exists in the storage with some key specific to the variant. The key specific to the variant is  “variants/#{blob.key}/#{Digest::SHA256.hexdigest(variation.key)}“. Gives it some prefixes/directory nesting, but ultimately makes a SHA256 digest of variation.key.  Which you can see the code in ActiveStorage::Variation, and follow it through ActiveStorage.verifier, which is just an instance of ActiveSupport::MessageVerifier — in the end we’re basically just taking a signed (and maybe encyrpted) digest of the serialization of the transformation arguments passed in in the first place,  `{ resize: “100×100” }`.

That is, basically through a couple of cryptographic digests and some crypto security too, were just taking the transformation arguments and turning them into a unique-to-those-arguments key (file path).

This has been refactored a bit in master vs 5.2.1 — and in master the hash that specifies the transformations, to be turned into a key, becomes anything supported by image_processing with either MiniMagick or vips processors instead of 5.2.1’s bespoke Minimagick-only wrapper. (And I do love me some vips, can be so much more performant for very large files).  But I think the basic semantics are fundamentally the same.

This is nice because we don’t need another database table/model to keep track of variants (don’t forget we already have two!) — we don’t in fact need to keep track of variants at all. When one is asked for, ActiveStorage can just check to see if it already exists in storage at the only key/path it necessarily would be at.

On the other hand, there’s no way to enumerate what variants we’ve already created, but maybe that’s not really something people generally need.

But also, as far as I can find there is no API to delete variants. What if we just created 100×100 thumbs for every product photo in our app, but we just realized that’s way too small (what is this, 2002?) and we really need something that’s 630×630. We can change our code and it will blithely create all those new 630×630 ones on demand. But what about all the 100x100s already created? They are there in our storage service (say S3).  Whatever ways there might be to find the old variants and delete them are going to be hacky, not to mention painful (it’s making a SHA256 digest to create filename, which is intentionally irreversible. If you want to know what transformation a given variant in storage represents, the only way is to try a guess and see if it matches, there’s no way to reverse it from just the key/path in storage).

Which seems like a common use case that’s going to come up to me? I wonder if I’m missing something. It almost makes me think you are intended to keep variants in a storage configured as a cache which deletes old files periodically (the variants system will just create them on demand if asked for again of course) — except the variants are stored in the same storage service as your originals, and you certainly don’t want to purge non-recently-used originals!  I’m not quite sure what people are doing with purging no-longer-used variants in the real world, or why it hasn’t come up if it hasn’t.

And something that maybe plenty of people don’t need, but I do — ability to create variants of files that aren’t images: PDFs, any sort of video or audio file, really any kind of file at all. There is a separate transformation system called previewing that can be used to create transformations of video and PDF out of the box — specifically to create thumbnails/poster images.  There is a plugin architecture, so I can maybe provide “previews” for new formats (like MS Word), or maybe I want to improve/customize the poster-image selection algorithm.

What I need aren’t actually “previews”, and I might need several of them. Maybe I have a video that was uploaded as an AVI, and I need to have variants as both mp4 and webm, and maybe choose to transcode to a different codec or even adjust lossy compression levels. Maybe I can still use ‘preview’ function nonetheless? Why is “preview” a different API than “variant” anyway? While it has a different name, maybe it actually does pretty much the same thing, but with previewer plugins? I don’t totally grasp what’s going on with previews, and am running out of steam.

I really gotta get down into the weeds with files in my app(s), in an ideal world, I would want to be able to express variants as blocks of whatever code I wanted calling out to whatever libraries I wanted, as long as the block returned an IO-like object, not just hashes of transformation-specifications. I guess one needs something that can be transformed into a unique key/path though. I guess one could imagine an implementation had blocks registered with unique keys (say, “webm”), and generated key/paths based on those unique keys.  I don’t think this is possible in ActiveStorage at the moment.

Will I use ActiveStorage? Shrine?

I suspect the intended developer-user of ActiveStorage is someone in a domain/business/app for which images and attachments  are kind of ancillary. Sure, we need some user avatars, maybe even some product images, or shared screenshots in our basecamp-like app. But we don’t care too much about the details, as long as it mostly works.  Janko of Shrine told me some users thought it was already an imposition to have to add a migration to add a data column to any model they wanted to attach to, when ActiveStorage has a generic migration for a couple generic tables and you’re done (nevermind that this means extra joins on every query whose results you’ll have to deal with attachments on!) — this sort of backs up that idea of the native of the large ActiveStorage target market.

On the other hand, I’m working in a domain where file management is the business/domain. I really want to have lots of control over all of it.

I’m not sure ActiveStorage gives it to me. Could I customize the key/paths to be a little bit more human readable and reverse-engineerable, say having the key begin with the id of the database model? (Which is useful for digital preservation and recovery purposes).Maybe? With some monkey-patches? Probably not?

Will ActiveStorage do what I need as far as no-boundaries flexibility to variant creation of video/audio/arbitrary file types?  Possibly with custom “previewer” plugin (even though a downsampled webm of an original .avi is really not a “preview”), if I’m willing to make all transformations expressable as a hash of specifications?  Without monkey-patching ActiveStorage? Not sure?

What if I have some really slow metadata generation, that I really don’t want to do inline/foreground?  I guess I could not use the built-in metadata extraction, but just make my own json column on some model somewhere (that has_one_attachment), and do it myself. Maybe I could do that variants too, with additional app-specific models for variants (that each have a has_one_attached with the variant I created).  I’d have to be careful to avoid adding too many more tables/joins for common use cases.

If I only had, say, paperclip and carrierwave, I might choose ActiveStorage anyway, cause they aren’t so flexible either. But, hey, shrine! So flexible! It still doesn’t do everything I need, and the way it currently handles variants/derivatives/versions isn’t suitable for me (not set up to support on-demand generation without race conditions, which I realize ironically ActiveStorage is) — but I think I’d rather build it on top of shrine, which is intended to let you build things on top of it, than ActiveStorage, where I’d likely have to monkey-patch and risk forwards-incompatible.

On the other hand, if ActiveStorage is “good enough” for many people… is there a risk that shrine won’t end up with enough user/maintainer community to stay sustainable? Sure, there’s some risk. And relatively small risk of ActiveStorage going away.  One colleague suggested to me that “history shows” once something is baked into Rails, it leads to a “slow death of most competitors”, and eventually more features in the baked-into Rails version. Maybe, but…. as it happens, I kind of need to architect a file attachment solution for my app(s) now.

As with all dependency and architectural choices, you pays yer money and you takes yer chances. It’s programming. At best, we hope we can keep things clearly delineated enough architecturally, that if we ever had to change file attachment support solutions, it won’t be too hard to change.  I’m probably going with shrine for now.

One thing that I found useful looking at ActiveStorage is some, apparently, “good enough” baselines for certain performance/architectural issues. For instance, I was trying to figure out a way to keep my likely bespoke derivatives/variants solution from requiring any additional tables/joins/preloads (as shrine out of the box now requires zero extra) — but if ActiveStorage requires two joins/preloads to avoid n+1, I guess it’s probably okay if I add one. Likewise, I wasn’t sure if it was okay to have a web architecture where every attachment image view is going to result in a redirect… but if that’s ActiveStorage’s solution, it’s probably good enough.

Notes on deep diving with byebug

When using byebug to investigate some code, as I did here, and regularly do to figure out a complex codebase (including but not limited to parts of Rails), a couple Rails-related tips.

If there are ActiveJobs involved, ‘config.active_job.queue_adapter = :inline’ is a good idea to make them easier to ‘byebug’.

If there are after_commit hooks involved (as there were here), turning off Rails transactional tests (aka “transactional fixtures” before Rails 5) is a good idea. Theoretically Rails treats after_commit more consistently now even with transactional tests, but I found debugging this one I was not seeing the real stuff until I turned off transactional tests.  In Rspec, you do this with ‘config.use_transactional_fixtures = false’  in the rails_helper.rb rspec config file.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s