Memo on Technical Operational Considerations for IIIF in a Sufia/Hyrax app

(A rather lengthy, as is my wont, memo I wrote for internal use, which I also share with you)

IIIF (International Image Interoperability Framework) is a standard API for a server which delivers on-demand image transformations.

What sort of transformations are we interested in (and IIIF supports)?

  • Changing image formats
  • Resizing images (to produce thumbnails or various other display or delivery sizes)
  • Creating tiled image sources to support high-res zoom-in without having to deliver enormous original source images. (such an operation will involve resizing too to create tiles at different zoom levels, as well as often format changes if the original source is not in JPG or other suitable web format)
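To make those three operations concrete: in the IIIF Image API, each of them is expressed as a segment of the request URL, in the order identifier/region/size/rotation/quality.format. Below is a minimal sketch of building such URLs. The host name and the identifier "abc123" are made up for illustration, and the defaults follow the 2.x API (as far as I know, version 3 of the API spells the full size "max" rather than "full").

```ruby
require "erb"

# Builds an IIIF Image API request URL of the form
#   {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
# The base URL and identifiers used with this helper are hypothetical.
def iiif_image_url(base, identifier, region: "full", size: "full",
                   rotation: 0, quality: "default", format: "jpg")
  # Identifiers may contain slashes or colons, so percent-encode them.
  encoded_id = ERB::Util.url_encode(identifier)
  "#{base}/#{encoded_id}/#{region}/#{size}/#{rotation}/#{quality}.#{format}"
end

base = "https://example.org/image-service"

# Resizing: a thumbnail 200 pixels wide, height scaled to match.
puts iiif_image_url(base, "abc123", size: "200,")

# Tiling: one 512x512 region from the top-left corner of the source.
puts iiif_image_url(base, "abc123", region: "0,0,512,512", size: "512,")

# Format change: the whole image delivered as PNG.
puts iiif_image_url(base, "abc123", format: "png")
```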

@jcoyne has created Riiif, an IIIF server in ruby that uses imagemagick to do the heavy lifting. It is a Rails engine that can turn any Rails app into an IIIF server. In addition to it being nice that we know ruby and so can tweak it if needed, this also lets it use your existing ruby logic for looking up original source images from app ids, and for access controls. It’s unclear how you’d handle these things with an external IIIF server in a sufia/hyrax app; to my knowledge, nobody is using anything but riiif.

Keep in mind that you only need a tiled image source when the full-resolution image (or the image at the resolution you desire to allow zoom to) in a JPG format is going to be too large to deliver in its entirety to the browser (at least with reasonable performance). If this isn’t true, you can allow pan and zoom in a browser with JS without needing a tiled image source.

And keep in mind that the primary reason you need an on-demand image transformation service (whether for a tiled image source or other transformations) is that storing all the transformations you want would take more disk space than you can afford, or than is otherwise feasible. (There are digital repositories with hundreds of thousands or millions of images, each of which needs various transformations.)

There is additionally some development/operational convenience to an on-demand transformation aside from disk space issues, but there is a trade-off in additional complexity in other areas — mainly in dealing with caching and performance.

The first step is defining what UI/UX we want for our app, before being able to decide if an on-demand image transformation server is useful in providing that. But here, we’ll skip that step, assume we’ve arrived at a point from UI/UX to wanting to consider an on-demand image transformation service, and move on to consider some operational issues with deploying RIIIF.

Server/VM separation?

riiif can conceivably be quite resource-intensive. Lots of CPU taken calling out to imagemagick to transform images. Lots of disk IO in reading/writing images (affected by cache and access strategies, see below). Lots of app server http connections/threads taken by clients requesting images, some of which, depending on caching strategies, can be quite slow-returning requests.

In an ideal scenario, one wouldn’t want this running on the same server(s) handling ordinary Rails app traffic; one would want to segregate it so it does not interfere with the main Rails app, and so each can be scaled independently.

This would require some changes to our ansible/capistrano deploy scripts, and some other infrastructure/configuration/deploy setup. The riiif server would probably still need to be deployed as the entire app, so it has access to app-located authorization and retrieval logic; but be limited to only serving riiif routes. This is all do-able, just a bunch of tweaking and configuring to set up.
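One way to picture "the entire app, but limited to only serving riiif routes" is a small piece of Rack middleware that is switched on only on the image server hosts. This is a sketch of the idea, not something riiif provides: the class name, the "/image-service" mount path, and the IMAGE_SERVER_ONLY environment variable are all assumptions made for illustration.

```ruby
# Rack middleware that, when enabled, answers only requests under the image
# service path and returns 404 for everything else. The prefix and the
# environment variable name are hypothetical.
class ImageServerOnly
  def initialize(app, prefix: "/image-service",
                 enabled: ENV["IMAGE_SERVER_ONLY"] == "true")
    @app = app
    @prefix = prefix
    @enabled = enabled
  end

  def call(env)
    path = env["PATH_INFO"].to_s
    allowed = path == @prefix || path.start_with?("#{@prefix}/")
    if @enabled && !allowed
      return [404, { "content-type" => "text/plain" }, ["Not Found"]]
    end
    @app.call(env)
  end
end

# Stand-in for the full Rails app, so the sketch runs on its own.
app = ->(env) { [200, { "content-type" => "text/plain" }, ["served #{env['PATH_INFO']}"]] }
restricted = ImageServerOnly.new(app, enabled: true)

restricted.call("PATH_INFO" => "/image-service/abc123/full/200,/0/default.jpg").first # => 200
restricted.call("PATH_INFO" => "/catalog").first                                       # => 404
```

In a Rails app something like this would presumably be added with config.middleware.use, with the deploy scripts setting the environment variable only on the image server hosts.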

This may not be necessary even if strictly ideal.

Original image access

The riiif server needs access to the original image bytestreams, so it can transform them.

In the most basic setup, the riiif server somehow has access to the file system that fedora bytestreams are stored on, and knows how to find a bytestream for a particular fedora entity on disk.

The downsides of this are that shared file systems are… icky. As is having to reverse engineer fedora’s file storage.

Alternately, riiif can be set up to request the original bytestreams from fedora via http, on demand, and cache them in the local (riiif server) file system. The downsides of this are:

  • performance — if a non-cached transformation is requested, and the original source image is also not in the local file system cache, riiif first must download it from fedora, before moving on to transform it, and only then delivering it to the client.
  • cache management. Cache management as a general rule can get surprisingly complicated. If you did not trim/purge the local ‘original image source’ file system cache at all, it would of course essentially grow to be the size of the complete corpus of images (which are quite large uncompressed TIFFs in our case), kind of defeating the purpose of saving file space with an on-demand image transformer in the first place (the actual transformed products are almost always going to be in a compressed format and a fraction of the size of the original TIFFs).

    • There is no built-in routine to trim the original source file cache. Although the basic approach is straightforward, the devil can be in the details.
    • To do an LRU cache, you’d need your file system tracking access times. Linux file systems are not infrequently configured with ‘noatime’ for performance these days, which wouldn’t work. Or alternately, you’d need to add code to riiif to track last access time by some other means.
    • When trimming, you have to be careful not to trim sources currently being processed by an imagemagick transformation.
    • Even if trimming/purging regularly, there is a danger of bursts of access filling up the cache quickly, and possibly exceeding volume space (unless the volume is big enough to hold all original sources of course). For instance, if using riiif for derivatives, one could imagine googlebot or another web spider visiting much of the corpus fairly quickly. (Ideally this is a use case we want to support; the site ought to be easily spiderable.)
      • There is of course a trade-off between cache size and overall end-user responsiveness percentiles.
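To give a feel for what the "basic approach" to trimming looks like, here is a rough sketch of an LRU trim over a cache directory. It is not part of riiif; the function names are mine. It assumes the code that reads a cached original calls mark_used each time (bumping mtime, so it does not rely on the file system tracking atime), and it uses a grace period as a crude guard against deleting a source that imagemagick may still be working on. It does nothing about the burst problem, which needs headroom on the volume or more frequent trimming.

```ruby
require "fileutils"
require "tmpdir"

# Record a "last used" time ourselves by setting mtime, so the sketch does
# not depend on atime (often disabled with noatime).
def mark_used(path, now: Time.now)
  File.utime(now, now, path)
end

# Delete least-recently-used files until the cache is at or under max_bytes.
# Files used within grace_seconds are skipped, on the assumption that a
# transformation may still be reading them.
def trim_cache(dir, max_bytes:, grace_seconds: 300, now: Time.now)
  files = Dir.glob(File.join(dir, "**", "*")).select { |p| File.file?(p) }
  entries = files.map do |p|
    stat = File.stat(p)
    { path: p, size: stat.size, used: stat.mtime }
  end
  total = entries.sum { |e| e[:size] }
  removed = []

  entries.sort_by { |e| e[:used] }.each do |e|
    break if total <= max_bytes
    next if now - e[:used] < grace_seconds
    File.delete(e[:path])
    total -= e[:size]
    removed << e[:path]
  end
  removed
end

# Try it against a throwaway directory holding three 1000-byte files.
Dir.mktmpdir do |dir|
  now = Time.now
  { "old.tif" => 3600, "mid.tif" => 1800, "new.tif" => 10 }.each do |name, age|
    path = File.join(dir, name)
    File.write(path, "x" * 1000)
    mark_used(path, now: now - age)
  end
  removed = trim_cache(dir, max_bytes: 1500, now: now)
  puts removed.map { |p| File.basename(p) }.inspect
  # old.tif and mid.tif are removed, which brings the cache under budget
end
```

A sturdier version would probably use a lock file or similar per source rather than a time-based grace period, which is where the devil in the details starts to show up.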

It is unclear to me how many institutions are using riiif in production, but my sense is that most or even all of them take the direct file system access approach rather than http access with local file cache. Anyone I could find using riiif at ahc was taking this approach, one way or another.

Transformed product caching

Recall a main motivation for using an on-demand image transformer is not having to store every possible derivative (including tiles) on disk.

But there can be a significant delay in producing a transformation. It can depend on size and characteristics of original image; on whether we are using local file system access or http downloading as above (and on whether the original is in local cache if latter); on network speed, disk I/O speed, and imagemagick (cpu) speed.

  • It’s hard to predict what this latency would be, but in the worst case with a very large source
    image one could conceive of it being a few seconds — note that’s per image,
    and you could pay it each time you move from page to page in a multi-page work,
    or even, pathological case, each time you pan or zoom in a pan-and-zoom viewer.

As a result, riiif tries to cache its transformation output.

It uses an ActiveSupport::Cache::Store to do so, by default the one being used by your entire Rails app as Rails.cache. It probably makes sense to separate the riiif cache, so a large volume of riiif products isn’t pushing your ordinary app cache content out of the cache and vice versa, and both caches can be sized appropriately, and can even use different cache backends.

ActiveSupport::Cache::Store supports caching in file system, local app memory, or a Memcached instance; or hypothetically you can easily write an adapter for any back-end store you want. But for this use case, anything but file system probably doesn’t make sense; it would get too expensive for the large quantity of bytes involved. (Although one could consider things like an S3 store instead of immediate file system; that has its own complications but could be considered.)
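To picture what a dedicated file-system cache for transformed products amounts to, here is a small sketch of mapping one requested transformation to a path under its own cache root, separate from the app's general cache. This is not riiif's or ActiveSupport's actual layout; the function name, the cache root, and the two-level directory sharding are assumptions for illustration.

```ruby
require "digest"

# Map one requested transformation to a file path in a dedicated cache
# directory. The key covers every parameter that changes the output bytes.
def product_cache_path(cache_root, id, region:, size:, rotation:, quality:, format:)
  key = [id, region, size, rotation, quality, format].map(&:to_s).join("\0")
  digest = Digest::SHA256.hexdigest(key)
  # Two levels of sharding keep any single directory from holding
  # millions of entries.
  File.join(cache_root, digest[0, 2], digest[2, 2], "#{digest}.#{format}")
end

# Hypothetical cache root and identifier.
thumb = product_cache_path("/var/cache/riiif-products", "abc123",
                           region: "full", size: "200,", rotation: 0,
                           quality: "default", format: "jpg")
large = product_cache_path("/var/cache/riiif-products", "abc123",
                           region: "full", size: "1200,", rotation: 0,
                           quality: "default", format: "jpg")

puts thumb
puts large # a different size yields a different path
```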

So we have the same issues to consider as we did with the http original source cache: performance, and cache management.

  • Even when something is in the riiif image cache, it’s not going to be as fast as an ordinary web-server-served image. ActiveSupport::Cache::Store does not support streaming, so the entire product needs to be read from the cache into local app memory before a byte of it goes to the server. (One could imagine writing an ActiveSupport::Cache::Store adapter that extends the API to support streaming).
    • How much slower? Hard to say. I’d guess in the hundreds of ms, maybe less, probably not usually more but there could be pathological edge cases.
    • Not actually sure how this compares to serving from fedora, I don’t know for sure if the serving from fedora case also needs a local memory copy before streaming to browser. I know some people work around this with nginx tricks, where the nginx server also needs access to fedora filesystem.
  • And there is still a cache management issue, similar to cache management issues above.
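For contrast with the read-it-all-into-memory behaviour described above, this is roughly what a streaming response looks like at the Rack level: Rack only requires that a response body respond to each and yield strings, so a file-backed product can be handed over in fixed-size chunks. This is a generic sketch (the class name is mine), not an extension of ActiveSupport::Cache::Store.

```ruby
require "tempfile"

# A Rack-style response body that yields a file in fixed-size chunks
# instead of loading the whole thing into memory first.
class ChunkedFileBody
  def initialize(path, chunk_size: 64 * 1024)
    @path = path
    @chunk_size = chunk_size
  end

  def each
    File.open(@path, "rb") do |file|
      # IO#read with a length returns nil at end of file.
      while (chunk = file.read(@chunk_size))
        yield chunk
      end
    end
  end
end

# Stream a 150,000-byte stand-in for a cached product.
Tempfile.create("product") do |f|
  f.binmode
  f.write("j" * 150_000)
  f.flush
  sizes = []
  ChunkedFileBody.new(f.path).each { |chunk| sizes << chunk.bytesize }
  puts sizes.inspect # => [65536, 65536, 18928]
end
```

At no point does more than one chunk need to be held in app memory, which is the property the cache store API lacks.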

Consider: Third-party CDN

Most commercial sector web apps these days use a third party (or at least external) CDN (Content Delivery Network) — certainly especially image-heavy ones.

A CDN is basically a third-party cloud-hosted HTTP cache, which additionally distributes the cache geographically to provide very fast access globally.

Using a CDN you effectively can “cache everything”; they usually have pricing structures (in some cases free) that do not limit your storage space significantly. One could imagine putting a CDN in front of some or all of our delivered image assets (originals, derivatives, and tile sources). You could actually turn off riiif’s own image caching, and just count on the CDN to cache everything.

This could work out quite well, and would probably be worth considering for our image-heavy site even if we were not using an on-demand IIIF image server — a specialized CDN can serve images faster than our Rails or local web server can.

Cloudflare is a very popular CDN (significant portions of the web are cached by cloudflare) which offers a free tier that would probably do everything we need.

One downside of a CDN is that it only works for public images; access-controlled images only available to some users don’t work in a CDN. In our app, where images are either public or still ‘in process’, one could imagine pointing at cloudflare CDN cached images for public images, but serving staff-only in-process images locally.
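The public versus in-process split could be as simple as choosing the host when generating an image URL, and sending matching Cache-Control headers from the origin. A minimal sketch follows; both host names and both helper names are hypothetical, and the one-year max-age is just a placeholder value.

```ruby
CDN_HOST = "https://cdn.example.org" # hypothetical CDN hostname
APP_HOST = "https://app.example.org" # hypothetical origin hostname

# Public images are linked through the CDN; staff-only in-process images
# are linked directly to the app, where access control still applies.
def image_url_for(path, is_public:)
  host = is_public ? CDN_HOST : APP_HOST
  "#{host}#{path}"
end

# Headers the origin would send so shared caches only keep public images.
def cache_headers_for(is_public:, max_age: 31_536_000)
  if is_public
    { "Cache-Control" => "public, max-age=#{max_age}" }
  else
    { "Cache-Control" => "private, no-store" }
  end
end

path = "/image-service/abc123/full/200,/0/default.jpg"
puts image_url_for(path, is_public: true)
puts image_url_for(path, is_public: false)
puts cache_headers_for(is_public: false).inspect
```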

Another downside is it would make tracking download counts somewhat harder, although probably not insurmountable, there are ways.

Image-specializing CDN or cloud image transformation service

In addition to general purpose CDNs, there exist a number of fairly successful cloud-hosted on-demand image transformation services that effectively function as image-specific CDNs with on-demand transformation services. They basically give you what a CDN gives you (including virtually unlimited cache so they can cache everything), plus what an on-demand image transformation service gives you, combined.

One popular one I have used before is imgix. Imgix supports all the features an IIIF server like riiif gives you, although it does not actually support the IIIF API. Nonetheless, one could imagine using imgix instead of a local IIIF server, even with tools like JS viewers that expect IIIF, by writing a translation gateway, or writing a plugin to (eg) OpenSeadragon to read from imgix. (OpenSeadragon’s IIIF support was not original, and was contributed by the hydra community). (One could even imagine convincing imgix.com to support the IIIF API natively).

imgix is not free, but its pricing is pretty reasonable: “$3 per 1,000 master images accessed each month. 8¢ per GB of CDN bandwidth for images delivered each month.” It’s difficult for me to estimate how much bandwidth we’d end up paying for (recall our derivatives will be substantially smaller than the original uncompressed TIF sources).
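To get a rough feel for the quoted pricing, here is the arithmetic as a tiny function. The two prices come from the quote above; the usage figures in the example are invented, and the sketch assumes both charges scale linearly (I have not checked how imgix rounds partial thousands).

```ruby
# Prices from the quoted text: $3 per 1,000 master images accessed per
# month, and 8 cents per GB of CDN bandwidth per month.
PRICE_PER_1000_MASTERS = 3.00
PRICE_PER_GB = 0.08

# Estimated monthly bill in dollars, assuming linear pricing.
def monthly_cost(masters_accessed:, gb_delivered:)
  masters = (masters_accessed / 1000.0) * PRICE_PER_1000_MASTERS
  bandwidth = gb_delivered * PRICE_PER_GB
  (masters + bandwidth).round(2)
end

# Hypothetical month: 20,000 distinct masters accessed, 500 GB delivered.
puts monthly_cost(masters_accessed: 20_000, gb_delivered: 500) # => 100.0
```

In that made-up month, $60 of the total is for masters accessed and $40 is for bandwidth.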

An image transformation CDN like imgix would almost entirely get us out of worrying about cache management (it takes care of it for us), as well as managing disk space ourselves for storing derivatives, and CPU and other resource issues. It has the same access control and analytics issues as the general CDN.

Consider the lowest-tech solution

Is it possible we can get away without an on-demand image transformation service at all?

For derivatives (alternate formats and sizes of the whole image), we can if
we can feasibly manage the disk space to simply store them all.

For pan-and-zoom, we only need a tile-source if our full-resolution images (or images at as high
a resolution as we desire to support zooming to in a browser) are too big to deliver
to a browser.

Note that in both cases (standard derivative or derived tile-source) the JPGs we’re delivering
to the browser are significantly smaller than the uncompressed source TIFFs.
In one simple experiment a 100MB source TIF I chose from our corpus turned into
a 3.8MB JPG, and that’s without focusing on making the smallest usable/indistinguishable
JPG possible.

At least hypothetically, one could even pre-render and store all the sub-images necessary
for a tiling pan-and-zoom viewer, without using an on-demand image transformation service.
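To get a sense of how many sub-images "all" would be, here is a sketch that counts the tiles in a simple pyramid: the image is halved at each level until it fits in a single tile. The 512 pixel tile size and the 8000x6000 example image are assumptions; real viewers and tile formats differ in tile size and in how far down the pyramid goes.

```ruby
# Count the tiles needed to pre-render a full pyramid, halving the image
# at each level until it fits inside a single tile.
def tile_pyramid(width, height, tile_size: 512)
  levels = []
  w = width
  h = height
  loop do
    cols = (w + tile_size - 1) / tile_size # integer ceiling division
    rows = (h + tile_size - 1) / tile_size
    levels << { width: w, height: h, tiles: cols * rows }
    break if cols == 1 && rows == 1
    w = (w + 1) / 2
    h = (h + 1) / 2
  end
  levels
end

# Hypothetical 8000x6000 pixel source image.
levels = tile_pyramid(8000, 6000)
levels.each { |l| puts "#{l[:width]}x#{l[:height]}: #{l[:tiles]} tiles" }
puts "total: #{levels.sum { |l| l[:tiles] }} tiles" # => total: 257 tiles
```

So for an image of that size it is a few hundred small JPGs per image, which is a lot of files but not necessarily a lot of bytes.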

(PS: We might consider storing our original source TIFs losslessly compressed. I believe they are entirely uncompressed now. Lossless compression could store the images with
substantially smaller footprints, losing no original data or resolution).

Conclusion

We have a variety of potentially feasible paths. It’s important to remember that none of them are going to be just “install it and flip the switch”; they are all going to take some planning and consideration, and some time spent configuring, tweaking, and/or developing.

I guess the exception would be installing riiif in the most naive way possible, and incurring the technical debt of dealing with problems (performance and/or resource consumption) later when they arrive. Although even this would still require some UI/UX design work.

 
