On open source, consensus, vision, and scope

From around minute 27 of Building Rails ActionDispatch::SystemTestCase Framework, by Eileen Uchitelle:

What is unique to open source is that the stakeholders you are trying to find consensus with have varying levels of investment in the end result…

…but I wasn’t prepared for all the other people who would care. Of course caring is good, I got a lot of productive and honest feedback from community members, but it’s still really overwhelming to feel like I needed to debate — everyone.

Rails’ ideology of simplicity differs a lot from capybara’s ideology of lots of features. And all the individuals who were interested in the feature had differing opinions as well… I struggled with how to respect everyone’s opinions while building system tests, but also maintain my sense of ownership.

I knew that if I tried to please all groups and build system tests by consensus, then I would end up pleasing no one. Everyone would end up unhappy because consensus is the enemy of vision. Sure, you end up adding everything everyone wants, but the feature will lose focus, and the code will lose style, and I will lose everything that I felt was important.

I needed to figure out a way to respect everyone’s opinions without making system tests a hodgepodge of ideologies, or feeling like I threw out everything I cared about. I had to remind myself that we all had one goal: to integrate system testing into Rails. Even if we disagreed about the implementation, this was our common ground.

With this in mind, there are a few ways you can keep your sanity when dealing with multiple ideologies in the open source world. One of the biggest things is to manage expectations. In open source there are no contracts; you can’t hold anyone else accountable (except for yourself), and nobody else is going to hold you accountable either… You are the person who has to own the scope, and you are the person who has to say ‘no’. There were a ton of extra features suggested for system tests that I would love to see, but if I had implemented all of them it still wouldn’t be in Rails today. I had to manage the scope and the expectations of everyone involved to keep the project in budget…

…When you are building open source features, you are building something for others. If you are open to suggestions, the feature might change for the better. Even if you don’t agree, you have to be open to listening to the other side of things. It’s really easy to get cagey about the code that you’ve worked so hard to write. I still have to fight the urge to be really protective of the system test code… but I also have to remember that it’s no longer mine, and never was mine; it now belongs to everyone that uses Rails…



On choices

In a blog essay about non-rational devotion to software choices (the author argues it’s inevitable), a quote sprang out at me that reminds me of many decisions I’ve seen made at large institutions, as well as in distributed open source development:

As Neo realizes in The Matrix: the problem is choice. The problem is always about choice. People don’t like to choose, because that makes them accountable. It’s far easier to make someone else make the choice and just follow, creating the delusion that you made a “rational” choice because “the group” validates it.

I don’t think avoiding the choices serves us well. (Or pretending to; there’s always a choice.) Even if the choices aren’t ever going to be somehow 100% verifiably rational or best (that’s the thing with choices, they always involve some risk), we do our best, while trying to avoid putting more time into a choice than it’s worth.

I’m not sure you’re ever going to teach large institutional administrators that though. Avoiding accountability for choices seems to be good for their careers. Maybe for all of our careers in the current environment, which is part of the challenge. “Nobody ever got fired for choosing IBM” indeed.  Doesn’t mean there was no risk to your mission or purpose in choosing IBM; but perhaps minimized risk to your career.

Eventually, going with what everyone else is going with (or what you thought they were going with), or going with a consultant/contractor to avoid accountability for the product… is going to result in a catastrophic failure. And then maybe things will change. Or maybe the failure won’t be one that harms anyone’s career, and then maybe they won’t.


Memo on Technical Operational Considerations for IIIF in a Sufia/Hyrax app

(A rather lengthy memo, as is my wont, that I wrote for internal use and also share with you here.)

IIIF (International Image Interoperability Framework) is a standard API for a server which delivers on-demand image transformations.

What sort of transformations are we interested in (and IIIF supports)?

  • Changing image formats
  • Resizing images (to produce thumbnails and other various display or delivery sizes)
  • Creating tiled image sources to support high-res zoom-in without having to deliver enormous original source images. (Such an operation involves resizing too, to create tiles at different zoom levels, and often format changes if the original source is not in JPG or another suitable web format.)
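
For concreteness, here is roughly what IIIF Image API requests look like (the host and identifier below are made up for illustration); the URL path encodes region, size, rotation, quality, and format:

{scheme}://{server}/{prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}

https://iiif.example.org/images/abc123/full/!300,300/0/default.jpg          (a thumbnail fit within 300x300)
https://iiif.example.org/images/abc123/1024,1024,512,512/512,/0/default.jpg (one 512px-wide tile cut from a region of the original)
https://iiif.example.org/images/abc123/info.json                            (the metadata document a tiling viewer uses to plan tile requests)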

@jcoyne has created Riiif, an IIIF server in ruby that uses imagemagick to do the heavy lifting, and is a Rails engine that can turn any Rails app into an IIIF server. In addition to it being nice that we know ruby and can tweak it if needed, this also lets it use your existing ruby logic for looking up original source images from app ids, and your existing access controls. It’s unclear how you’d handle these things with an external IIIF server in a sufia/hyrax app; to my knowledge, nobody is using anything but riiif.

Keep in mind that you only need a tiled image source when the full-resolution image (or the image at the resolution you want to allow zooming to), as a JPG, would be too large to deliver in its entirety to the browser (at least with reasonable performance). If that isn’t the case, you can allow pan and zoom in a browser with JS without needing a tiled image source.

And keep in mind that the primary reason you need an on-demand image transformation service (whether for a tiled image source or other transformations) is when storing all the transformations you want would take more disk space than you can afford, or is otherwise infeasible. (There are digital repositories with hundreds of thousands or millions of images, each of which needs various transformations.)

There is additionally some development/operational convenience to an on-demand transformation aside from disk space issues, but there is a trade-off in additional complexity in other areas — mainly in dealing with caching and performance.

The first step is defining what UI/UX we want for our app, before deciding whether an on-demand image transformation server is useful in providing it. But here we’ll skip that step, assume we’ve gotten from UI/UX to wanting to consider an on-demand image transformation service, and move on to some operational issues with deploying riiif.

Server/VM separation?

riiif can conceivably be quite resource-intensive: lots of CPU spent calling out to imagemagick to transform images; lots of disk IO reading/writing images (affected by the cache and access strategies, see below); and lots of app server http connections/threads taken up by clients requesting images — some of which, depending on caching strategies, can be quite slow-returning requests.

In an ideal scenario, one wouldn’t want this running on the same server(s) handling ordinary Rails app traffic; one would want to segregate it so it does not interfere with the main Rails app, and so each can be scaled independently.

This would require some changes to our ansible/capistrano deploy scripts, and some other infrastructure/configuration/deploy setup. The riiif server would probably still need to be a deploy of the entire app, so it has access to app-located authorization and retrieval logic, but be limited to serving only riiif routes. This is all do-able, just a bunch of tweaking and configuring to set up.

This may not be necessary, even if it would be the ideal setup.

Original image access

The riiif server needs access to the original image bytestreams, so it can transform them.

In the most basic setup, the riiif server somehow has access to the file system fedora bytestreams are stored on, and knows how to find the bytestream for a particular fedora entity on disk.

The downsides of this are that shared file systems are… icky. As is having to reverse engineer fedora’s file storage.

Alternately, riiif can be set up to request the original bytestreams from fedora via http, on demand, and cache them in the local (riiif server) file system. The downsides of this are:

  • performance — if a non-cached transformation is requested, and the original source image is also not in the local file system cache, riiif first must download it from fedora, before moving on to transform it, and only then delivering it to the client.
  • cache management. Cache management as a general rule can get surprisingly complicated. If you did not trim/purge the local ‘original image source’ file system cache at all, it would of course essentially grow to be the size of the complete corpus of images (which are quite large uncompressed TIFFs in our case). Kind of defeating the purpose of saving file space with an on-demand image transformer in the first place (the actual transformed products are almost always going to be in a compressed format and a fraction of the size of the original TIFFs).

    • There is no built-in routine to trim the original source file cache; the basic approach is straightforward, but the devil can be in the details (a rough sketch follows this list).
    • To do an LRU cache, you’d need your file system to track access times. Linux file systems are not infrequently configured with ‘noatime’ for performance these days, which wouldn’t work; alternately, you’d need to add code to riiif to track last access time by some other means.
    • When trimming, you have to be careful not to trim sources currently being processed by an imagemagick transformation.
    • Even if trimming/purging regularly, there is a danger of bursts of access filling up the cache quickly, and possibly exceeding volume space (unless the volume is big enough to hold all original sources, of course). For instance, if using riiif for derivatives, one could imagine googlebot or another web spider visiting much of the corpus fairly quickly. (A use case we ideally want to support; the site ought to be easily spiderable.)
      • There is of course a trade-off between cache size and overall end-user responsiveness percentiles.
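
As a rough illustration of the kind of trim routine you’d have to write yourself, here is a minimal sketch that evicts least-recently-accessed files until the cache is under a size budget. The directory and budget are made up for illustration, it assumes the file system is recording atime, and it does not handle the “file currently being transformed” problem noted above:

require 'fileutils'

CACHE_DIR = "/var/cache/riiif_originals" # hypothetical location
MAX_BYTES = 50 * 1024**3                 # illustrative 50GB budget

files = Dir.glob(File.join(CACHE_DIR, "**", "*")).select { |f| File.file?(f) }
total = files.sum { |f| File.size(f) }

# evict oldest-accessed files first until we're under budget
files.sort_by { |f| File.atime(f) }.each do |f|
  break if total <= MAX_BYTES
  total -= File.size(f)
  FileUtils.rm_f(f)
end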

It is unclear to me how many institutions are using riiif in production, but my sense is that most or even all of them take the direct file system access approach rather than http access with a local file cache. Everyone I could find using riiif was taking this approach, one way or another.

Transformed product caching

Recall a main motivation for using an on-demand image transformer is not having to store every possible derivative (including tiles) on disk.

But there can be a significant delay in producing a transformation. It can depend on the size and characteristics of the original image; on whether we are using local file system access or http downloading as above (and, if the latter, on whether the original is already in the local cache); and on network speed, disk I/O speed, and imagemagick (cpu) speed.

  • It’s hard to predict what this latency would be, but in the worst case, with a very large source image, one could conceive of it being a few seconds — note that’s per image, and you could pay it each time you move from page to page in a multi-page work, or even, in a pathological case, each time you pan or zoom in a pan-and-zoom viewer.

As a result, riiif tries to cache its transformation output.

It uses an ActiveSupport::Cache::Store to do so, by default the one being used by your entire Rails app as Rails.cache. It probably makes sense to use a separate cache for riiif, so a large volume of riiif products isn’t pushing your ordinary app cache content out of the cache and vice versa; that way both caches can be sized appropriately, and they can even use different cache backends.

ActiveSupport::Cache::Store supports caching in the file system, local app memory, or a Memcached instance; or you can write an adapter for any back-end store you want. But for this use case, anything but the file system probably doesn’t make sense; it would get too expensive for the large quantity of bytes involved. (Although one could consider something like an S3 store instead of the immediate file system; that has its own complications, but could be considered.)
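
As a sketch of the “separate file system cache” idea (the path and expiry here are illustrative, and how you actually hand this store to riiif depends on riiif’s own configuration, which I’m not reproducing from memory):

# config/initializers/riiif_cache.rb (hypothetical)
RIIIF_IMAGE_CACHE = ActiveSupport::Cache::FileStore.new(
  Rails.root.join("tmp", "riiif_cache").to_s,
  expires_in: 1.week # illustrative; tune to disk budget and traffic
)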

So we have the same issues to consider we did with http original source cache: performance, and cache management.

  • Even when something is in the riiif image cache, it’s not going to be as fast as an ordinary web-server-served image. ActiveSupport::Cache::Store does not support streaming, so the entire product needs to be read from the cache into local app memory before a byte of it goes to the browser. (One could imagine writing an ActiveSupport::Cache::Store adapter that extends the API to support streaming.)
    • How much slower? Hard to say. I’d guess in the hundreds of ms, maybe less, probably not usually more but there could be pathological edge cases.
    • Not actually sure how this compares to serving from fedora, I don’t know for sure if the serving from fedora case also needs a local memory copy before streaming to browser. I know some people work around this with nginx tricks, where the nginx server also needs access to fedora filesystem.
  • And there is still a cache management issue, similar to cache management issues above.

Consider: Third-party CDN

Most commercial sector web apps these days use a third party (or at least external) CDN (Content Delivery Network) — certainly especially image-heavy ones.

A CDN is basically a third-party cloud-hosted HTTP cache, which additionally distributes the cache geographically to provide very fast access globally.

Using a CDN, you effectively can “cache everything”; they usually have pricing structures (in some cases free) that do not limit your storage space significantly. One could imagine putting a CDN in front of some or all of our delivered image assets (originals, derivatives, and tile sources). You could actually turn off riiif’s own image caching, and just count on the CDN to cache everything.

This could work out quite well, and would probably be worth considering for our image-heavy site even if we were not using an on-demand IIIF image server — a specialized CDN can serve images faster than our Rails or local web server can.

Cloudflare is a very popular CDN (significant portions of the web are cached by cloudflare) which offers a free tier that would probably do everything we need.

One downside of a CDN is that it only works for public images; access-controlled images only available to some users don’t work with a CDN. In our app, where images are either public or still ‘in process’, one could imagine pointing at cloudflare-CDN-cached images for the public images, but serving staff-only in-process images locally.

Another downside is it would make tracking download counts somewhat harder, although probably not insurmountable, there are ways.

Image-specializing CDN or cloud image transformation service

In addition to general-purpose CDNs, there exist a number of fairly successful cloud-hosted on-demand image transformation services that effectively function as image-specific CDNs with on-demand transformation built in. They basically give you what a CDN gives you (including a virtually unlimited cache, so they can cache everything), plus what an on-demand image transformation service gives you, combined.

One popular one I have used before is imgix. Imgix supports all the features an IIIF server like riiif gives you — although it does not actually support the IIIF API. Nonetheless, one could imagine using imgix instead of a local IIIF server, even with tools like JS viewers that expect IIIF, by writing a translation gateway, or by writing a plugin for (eg) OpenSeadragon to read from imgix. (OpenSeadragon’s IIIF support was not original, and was contributed by the hydra community.) (One could even imagine convincing imgix.com to support the IIIF API natively.)

imgix is not free, but its pricing is pretty reasonable: “$3 per 1,000 master images accessed each month. 8¢ per GB of CDN bandwidth for images delivered each month.” It’s difficult for me to estimate how much bandwidth we’d end up paying for (recall that our derivatives will be substantially smaller than the original uncompressed TIF sources).

An image transformation CDN like imgix would almost entirely get us out of worrying about cache management (it takes care of it for us), as well as managing disk space ourselves for storing derivatives, and CPU and other resource issues. It has the same access control and analytics issues as the general CDN.

Consider the lowest-tech solution

Is it possible we can get away without an on-demand image transformation service at all?

For derivatives (alternate formats and sizes of the whole image), we can, if we can feasibly manage the disk space to simply store them all.

For pan-and-zoom, we only need a tile source if our full-resolution images (or images at as high a resolution as we want to support zooming to in a browser) are too big to deliver to the browser.

Note that in both cases (standard derivatives or a derived tile source), the JPGs we’re delivering to the browser are significantly smaller than the uncompressed source TIFFs. In one simple experiment, a 100MB source TIF I chose from our corpus turned into a 3.8MB JPG, and that’s without focusing on making the smallest usable/indistinguishable JPG possible.
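
For illustration, pre-generating such a derivative is basically a one-liner shelled out to ImageMagick; the paths below are placeholders and the size/quality values arbitrary:

# a minimal sketch of pre-generating a web-sized JPG derivative, no on-demand server involved
source_tiff_path    = "/path/to/original.tif"   # placeholder
derivative_jpg_path = "/path/to/derivative.jpg" # placeholder

system(
  "convert", source_tiff_path,
  "-resize", "2048x2048>", # only shrink, never enlarge
  "-quality", "85",
  derivative_jpg_path
) or raise "imagemagick convert failed for #{source_tiff_path}"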

At least hypothetically, one could even pre-render and store all the sub-images necessary for a tiling pan-and-zoom viewer, without using an on-demand image transformation service.

(PS: We might consider storing our original source TIFFs losslessly compressed. I believe they are entirely uncompressed now. Lossless compression could store the images with substantially smaller footprints, losing no original data or resolution.)

Conclusion

We have a variety of potentially feasible paths. It’s important to remember that none of them are going to be just “install it and flip the switch”, they are all going to take some planning and consideration, and some time spent configuring, tweaking, and/or developing.

I guess the exception would be installing riiif in the most naive way possible, and incurring the technical debt of dealing with the problems (performance and/or resource consumption) later, when they arise. Although even this would still require some UI/UX design work.

 


Exploring and planning with Sufia/Hyrax/Fedora fixity validation

“Fixity” validation — basically validating a checksum against bytes on disk to make sure a file is still exactly as it was on ingest — is an important part of any digital preservation collection, and my understanding is that it’s a key marketing point of the fedora+hydra stack.

However, I found it somewhat challenging to figure out how/if current Sufia/Hyrax supported this with already-built features. If there are reliable and up to date docs, I did not find them. So, since understanding what’s really going on here seems important for preservation responsibilities, I spent a couple days reverse engineering and debugging what’s there (thanks to various people in Hydra Slack channel for pointing me to the right places to look). What I found had some parts somewhat unexpected to me, and not necessarily quite right at least for what I understand as our needs.

I figured I’d write up what I discovered, and what our current plans (for our local app) are based on it: as an aid to other people wanting to know what’s up, and as a discussion/planning aid in considering any changes to the shared gems.

Hydra component write-ups seem to be very version-sensitive, things tend to change a lot. This was investigated under Sufia 7.3.0, CurationConcerns 1.7.7, ActiveFedora 11.1.6. I believe it has not changed substantially in hyrax as of this date, except for class name changes (including generally using the term ‘fixity’ instead of ‘audit’ in class names, as well as Hyrax namespace of course), but am not totally sure.

There is an existing service to fixity audit a single FileSet, in CurationConcerns at FileSetAuditService.

CurationConcerns::FileSetAuditService.new(fs).audit

So you might run that on every fileset to do a bulk audit, like FileSet.find_each { |fs| CurationConcerns::FileSetAuditService.new(fs).audit } — which is just what (in Sufia rather than CC) Sufia::RepositoryAuditService does, nothing more nothing less.

CurationConcerns::FileSetAuditService actually uses several other objects to do the actual work, and later I’ll go into what they do and how. But the final outcome will be:

  • an ActiveRecord ChecksumAuditLog row created — I believe one for every file checked, in cases where a fileset has multiple files. It has a pass (integer) column, with a value of 1 if the object had a good checksum, or 0 if not.
    • It cleans up after itself in that table, not leaving infinitely growing historical ChecksumAuditLog rows there; generally I think only the most recent two are kept, although there may be more if there are failures. AuditJob calls ChecksumAuditLog.prune_history
    • While the ChecksumAuditLog record has columns for expected_result and actual_result, nothing in the sufia/CC stack fills these out; all you get is the pass value (recall, we think 0 or 1), a file_set_id, a file_id, and a version string.
      • I’m not sure what the version string is for, or if it gives you any additional unique data that file_id doesn’t, or if the version string is just a different representation uniquely identifying the same thing file_id does. A version string might look like: `http://127.0.0.1:8080/rest/dev/37/72/0c/72/37720c723/files/214f68af-e5ed-41bd-9898-b8923fd6d018/fcr:versions/version1`
  • In cases of failure, it sends an internal app message to the person listed as the depositor of the fileset — assuming the depositor email address recorded still matches the email address of a Sufia account. This is set up by Sufia registering a CurationConcerns callback to run the Sufia::AuditFailureService; that callback is triggered by the CurationConcerns::AuditJob (AuditJob gets run by the FileSetAuditService).
    • The internal message does include the FileSet title and file_set.original_file.uri.to_s. If the file set has multiple versions, which one failed the checksum (or how to get to it in the UI) is not included.
    • It’s not clear to me in what use cases one would want the depositor (and only if they (still) have a registered account) to be the only one who gets the fixity failure notice. It seems like an infrastructure problem; we at least would want a notification sent instead to infrastructural admins who can respond to it — perhaps via an email or an external error-tracking service like Bugsnag or Honeybadger. Fortunately, the architecture makes it pretty easy to customize this.
    • The callback goes through CurationConcerns::Callback::Registry, which only supports one callback per event, so setting another one will replace the one set by default by Sufia. Which is fine.

I did intentionally corrupt a file on my dev copy, and then verified that it was caught and that the things listed above happened — basically, the callback sends an internal notification to the depositor, and a ChecksumAuditLog row is stored in the database with a 0 value for pass, and the relevant file_set_id and file_id.

While the ChecksumAuditLog objects are all created, there is no admin UI I could find for, say, “show me all ChecksumAuditLog records representing failed fixity checks”.
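
You can at least interrogate them from a rails console; something along these lines, assuming the column semantics described above (pass of 0 meaning failure) and the usual ActiveRecord timestamps:

# most recent failed fixity checks first (sketch)
ChecksumAuditLog.where(pass: 0).order(created_at: :desc).each do |log|
  puts "FileSet #{log.file_set_id}, file #{log.file_id}, failed check at #{log.created_at}"
end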

There is an area on the FileSet “show” page that says “Audit Status: Audits have not yet been run on this file.” I believe this is intended to show information based on ChecksumAuditLog rows, possibly as a result of something in Sufia calling this line. However, this appears broken in current sufia/hyrax: it keeps saying “Audits have not yet been run” no matter how many times you’ve run audits. I found this problem had already been reported in November 2016 on the Sufia issue tracker, and imported to the Hyrax issue tracker.

So in current Sufia (and Hyrax?), although the ChecksumAuditLog AR records are created, I believe there is no UI that displays them in any way — a developer could manually interrogate them from a console, otherwise all you’ve got is the (by default) internal notification sent to depositor.

While past versions of Sufia may have run some fixity checks automatically on-demand when a file was viewed, this functionality does not seem to still be in sufia 7.3/hyrax. I’m not sure it’s a desired function anyway — it seems to me you need to be running periodic bulk/mass audits regardless (you don’t want to skip checking files that haven’t been viewed), and if you’re doing that, additional on-the-fly checking when a file is viewed/downloaded seems superfluous.

Also note that the “checksum” displayed in the Fileset “show” view is not the checksum used by fedora internally. At least not in our setup, where we haven’t tried to customize this at all. The checksum displayed in the Sufia view is, we believe, calculated on upload even before fedora ingest, and appears to be an MD5, and does not match fedora’s checksum used for the fedora fixity service, which seems to be SHA1.

How is this implemented: Classes involved

The CurationConcerns::FileSetAuditService actually calls out to CurationConcerns::AuditJob to do the bulk of its work.

  • FileSetAuditService calls AuditJob as perform_later, there is no way to configure it to run synchronously.
    • When I ran an audit of every file on our staging server (with a hand-edit to do them synchronously, so I could time it more easily and clearly), it took about 3.8 hours to check 8077 FileSets.
    • This means a bulk audit using Resque bg jobs could clog up the resque job queue for up to 3.8 hours (less with more resque workers, though not necessarily scaling linearly), making other jobs (like derivatives creation) take a long time to complete, perhaps finishing at the end of the queue 3.8 hours later. Clogging up the job queue for a bulk fixity audit seems problematic. One could imagine changing it to use a different queue name with dedicated workers — but for a bulk fixity check, I’m not sure there is a reason for this to be in the bg job queue at all; doing it all synchronously seems fine/preferable.
    • It’s not entirely clear to me what rationale governs the split of logic between FileSetAuditService and AuditJob, or if it’s completely rational. I guess one thing is that FileSetAuditService is for a whole FileSet, while AuditJob is for an individual file. FileSetAuditService does schedule audits for every file version, if there is more than one in the FileSet.

The CurationConcerns::AuditJob in turn calls out to ActiveFedora::FixityService to actually do the fixity check.

How is the fixity check done?

  • ActiveFedora::FixityService simply asks Fedora for a fixity check on a URI (for an individual File, I think). The asset is not downloaded or examined by hydra stack code; a simple HTTP request “do a fixity check on this file and tell me the result” is sent to Fedora.
    • This means we are trusting that A) even if the asset has been corrupted, Fedora’s stored checksum for the asset is still okay, and B) the Fedora fixity service actually works. I guess these are safe assumptions for a reliable fixity service?
    • It looks at the RDF body returned by the Fedora fixity service to interpret whether the fixity check was good or not.
    • While the Fedora fixity service’s RDF response body includes some additional information (such as the original and current checksum), this information is not captured and sent up the stack to be reported or logged — ActiveFedora::FixityService just returns true or false (which explains why ChecksumAuditLog records always have blank expected_result and actual_result attributes).

What do we need or want locally that differs from standard setup

We decided that trusting the Fedora fixity service was fine — we know of no problems with it, if we did we’d report them upstream to Fedora, who would hopefully fix them quickly since fixity is kind of a key feature for preservation. Ideally, one might want to store a copy of the original checksums elsewhere to make sure they were still good in Fedora, but we decided we weren’t going to do this for now either. We will run some kind of bulk fixity-all-the-things task periodically, and do want to receive notifications.

  1. Different notification on fixity failure than the default internal notification to depositor. This should be easy to do in current architecture with a local setting though, hooray.
  2. Get the bulk fixity check not to create a bg job for every file audited, filling up the bg job queue. For a bulk fixity check in our infrastructure, just one big long-running foreground process seems fine.
  3. Get the fedora fixity check response details recorded and passed up the stack, for inclusion in ChecksumAuditLog and in notifications: expected checksum and actual checksum, at least. This requires changes to ActiveFedora, or using something new instead of what’s in ActiveFedora. (The current fedora-registered checksum may be necessary for recovery, see below.) Not sure if there should be a way to mark a failed ChecksumAuditLog row as ‘resolved’, for ongoing admin overview of fixity status. Probably not; if the same file gets a future ChecksumAuditLog row as ‘passing’, that’s enough indication of ‘resolved’.
  4. Ideally, fix bug where “Audit status” never gets updated and always says “no audits have yet been done”.
  5. Failed audits should be logged to standard rails log as well as other notification methods.
  6. It has been suggested that we might only want to be fixity-auditing the most recent version of any file; there’s no need to audit older versions. I’m not sure if this is sound from a preservation standpoint; those old versions might be part of the archival history? But it might simplify one recovery strategy, see below.
  7. Ideally, clean up the code a little bit in general. I don’t entirely understand why logic is split between classes as it is, and don’t understand what all the methods are doing. I don’t understand why ChecksumAuditLog has an integer pass instead of a boolean. The code is harder to figure out than seems necessary for relatively simple functionality.
  8. Ideally, perhaps, an admin UI for showing “current failed fixity checks”, in case you missed the notification.

And finally, an area that I’ll give more than a bullet point to: recovery. While I expect fixity failures to be very rare (possibly we will literally never see one in the lifetime of this local project), doing fixity checks without having a tested process for recovery from a discovered corrupted file seems pointless. What’s the point of knowing a file is corrupt if you can’t do anything about it?

I’m curious if any other hydra community people have considered this, and have a recovery process.

We do have disk backups of the whole fedora server. In order to try to recover an older non-corrupted version, we have to know where it is on disk. Knowing fedora’s internally computed SHA1 — which I think is the same thing it uses for fixity checking — seems like what you need to find the file on disk; files are stored on disk under the SHA1 taken at time of ingest.

Once you’ve identified a known-good, passing-SHA1-checksum version in your backups (by computing SHA1s yourself, in the same way fedora does, presumably) — how do you actually restore it? I haven’t been able to find anything in sufia/hyrax or fedora itself meant to help you here.
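
The “computing SHA1s yourself” part, at least, is just the ruby standard library. A sketch, with the backup path as a placeholder:

require 'digest'

# hex digest of a candidate backup file, to compare against fedora's recorded SHA1
Digest::SHA1.file("/backups/fedora/path/to/candidate_file").hexdigest
# => a 40-character hex string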

We can think of two ways. We could literally replace the file on disk in the fedora file system. This seems nice, but I’m not sure if we should be messing with fedora’s internals like that. Or we could upload a new “version”, the known-good one, to sufia/fedora. This is not messing with fedora internals, but the downside is that the old corrupt version is still there, still failing fixity checks, and possibly still showing up in your failed-fixity-check reports and notifications, unless you build more stuff on top to prevent that. False-positive “fixity check failures” would be bad, and would lead to admins ignoring “fixity check failure” notices, as is human nature.

Curious if Fedora/fcrepo itself has any intended workflow here, for how you recover from a failed fixity check, when you have an older known-good version. Anyone know?

I think most of these changes, at least as options, would be good to send upstream — the current code seems not quite right in the generic case to me; I don’t think it’s anything special about our use cases. The challenge with an upstream PR here is that the code spans both Hyrax and ActiveFedora, which would need to be changed in a synchronized fashion. And I’m not quite sure of the intention of the existing code: which parts that look like weird architecture to me are actually used or needed by someone or something. Both of these make it more challenging, and more time-consuming, to send upstream. So I’m not sure yet how much I’ll be able to send upstream, and how much will stay just local.


On the graphic design of rubyland.news

I like to pay attention to design, and enjoy good design in the world, graphic and otherwise. A well-designed printed page, web page, or physical tool is a joy to interact with.

I’m not really a trained designer in any way, but in my web development career I’ve often effectively been the UI/UX/graphic designer of the apps I work on. I do my best (our users deserve good design), and I try to develop my skills by paying attention to graphic design in the world, reading up (I’d recommend Donald Norman’s The Design of Everyday Things, Robert Bringhurst’s The Elements of Typographic Style, and one free online resource, Butterick’s Practical Typography), and trying to improve my practice; I think my graphic design skills are always improving. (I also learned a lot looking at and working with the productions of the skilled designers at Friends of the Web, where I worked last year.)

Implementing rubyland.news turned out to be a great opportunity to practice some graphic and web design. Rubyland.news has very few graphical or interactive elements, it’s a simple thing that does just a few things. The relative simplicity of what’s on the page, combined with it being a hobby side project — with no deadlines, no existing branding styles, and no stakeholders saying things like “how about you make that font just a little bit bigger” — made it a really good design exercise for me, where I could really focus on trying to make each element and the pages as a whole as good as I could in both aesthetics and utility, and develop my personal design vocabulary a bit.

I’m proud of the outcome, while I don’t consider it perfect (I’m not totally happy with the typography of the mixed-case headers in Fira Sans), I think it’s pretty good typography and graphic design, probably my best design work. It’s nothing fancy, but I think it’s pleasing to look at and effective.  I think probably like much good design, the simplicity of the end result belies the amount of work I put in to make it seem straightforward and unsophisticated. :)

My favorite element is the page-specific navigation (and sometimes info) “side bar”.

[Screenshot: the page-specific “side bar” navigation next to the main content]

At first I tried to put these links in the site header, but there wasn’t quite enough room for them, and I didn’t want to make the header two lines — on desktop or wide tablet displays, I think vertical space is a precious resource not to be squandered. And I realized that maybe it was better anyway for the header to hold only unchanging site-wide links, and to put page-specific links elsewhere.

Perhaps encouraged by the somewhat hand-written look (especially of all-caps text) in Fira Sans, the free font I was trying out, I got the idea of trying to include these as a sort of ‘margin note’.

[Screenshot: the page-specific links styled as a “margin note”]

The CSS got a bit tricky, with screen-size responsiveness (flexbox is a wonderful thing). On wide screens, the main content is centered in the screen, as you can see above, with the links to the left: The ‘like a margin note’ idea.

On somewhat narrower screens, where there’s not enough room to have margins on both sides big enough for the links, the main content column is no longer centered.

[Screenshot: a somewhat narrower screen, where the main content column is no longer centered]

And on very narrow screens, where there’s not even room for that, such as most phones, the page-specific nav links switch to being above the content. On narrow screens, which are typically phones that are much higher than they are wide, it’s horizontal space that becomes precious, with some more vertical to spare.

[Screenshot: a narrow (phone-width) screen, with the page-specific nav links above the content]

Note that on really narrow screens, which is probably most phones (especially held in vertical orientation), the margins on the main content disappear completely; you get actual content, with its white border, from edge to edge. This seems an obvious thing to do on phone-sized screens: why waste any horizontal real estate with different-colored margins, or provide a visual distraction with even a few pixels of different-colored margin or border jammed up against the edge? I’m surprised it seems to be a relatively rare thing to do in the wild.

[Screenshot: a very narrow screen, with the content running edge-to-edge, no colored margins]

Nothing too fancy, but I quite like how it turned out. I don’t remember exactly what CSS tricks I used to make it so. And I still haven’t really figured out how to write clear, maintainable CSS code; I’m less proud of the actual messy CSS source than I am of the result. :)


One way to remove local merged tracking branches

My git workflow involves creating a lot of git feature branches, as remote tracking branches on origin. They eventually get merged and deleted (via github PR), but I still have dozens of them lying around.

Via googling, getting StackOverflow answers, and sort of mushing together some stuff I don’t totally understand, here’s one way to deal with it: create an alias git-prune-tracking. In your ~/.bash_profile:

alias git-prune-tracking='git branch --merged | grep -v "*" | grep -v "master" | xargs git branch -d; git remote prune origin'

And periodically run git-prune-tracking from a git project dir.

I must admit I do not completely understand what this is doing, and there might be a better way? But it seems to work. Does anyone have a better way, one where they understand what it’s doing? I’m kinda surprised this isn’t built into the git client somehow.


Use capistrano to run a remote rake task, with maintenance mode

So the app I am now working on is still in its early stages, not even live to the public yet, but we’ve got an internal server. We periodically have to change a bunch of data in our (non-rdbms) “production” store. (First devops unhappiness: I think there should be no scheduled downtime for planned data transformation. We’re working on it. But for now it happens.)

We use capistrano to deploy. Previously/currently, the process for one of these scheduled-downtime maintenance windows looked like:

  • on your workstation, do a cap production maintenance:enable to start some downtime
  • ssh into the production machine, cd to the cap-installed app, and run a rake task with bundle exec. Which could take an hour+.
  • Remember to come back when it’s done and `cap production maintenance:disable`.

A couple more devops unhappiness points here: 1) In my opinion you should ideally never be ssh’ing to production, at least in a non-emergency situation. 2) You have to remember to come back and turn off maintenance mode — and if I start the task at 5pm to avoid disrupting internal stakeholders, I’ve got to come back after business hours to do that! I also think that anything you have to do ‘outside business hours’ that’s not an emergency is a sign of a not-yet-done ops environment.

So I decided to try to fix this. Since the existing maintenance mode stuff was already done through capistrano, and I wanted to do this without a manual ssh to the production machine, capistrano seemed a reasonable tool. I found a plugin to execute rake via capistrano, but it didn’t do quite what I wanted, and its implementation was so simple that I saw no reason not to copy-and-paste it and make it do just what I wanted.

I’m not gonna maintain this for the public at this point (make a gem/plugin out of it, nope), but I’ll give it to you in a gist if you want to use it. One of the tricky parts was figuring out how to get “streamed” output from cap, since my rake tasks use ruby-progressbar — it’s got decent non-TTY output already, and I wanted to see it live in my workstation console. I managed to do that! Although I never figured out how to get a cap recipe to require files from another location (I have no idea why I couldn’t make it work), so the custom class is inlined, somewhat ugly, in the recipe.

I also ordinarily want maintenance mode to be turned off even if the task fails, but still want a non-zero exit code in those cases (anticipating future further automation — really what I need is to be able to execute this all via cron/at too, so we can schedule downtime for the middle of the night without having to be up then).
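
The core of the approach looks roughly like this. To be clear, this is a reconstruction/sketch rather than the actual recipe in the gist: the invoke:rake_with_maintenance naming is made up, it assumes Capistrano 3’s DSL and the existing maintenance:enable/disable tasks, and it skips the streamed-output and exit-code handling discussed above.

# lib/capistrano/tasks/rake_with_maintenance.rake (hypothetical file and task names)
namespace :invoke do
  desc "Run a rake task on the app server, wrapped in maintenance mode"
  task :rake_with_maintenance do
    invoke "maintenance:enable"
    begin
      on primary(:app) do
        within current_path do
          with rails_env: fetch(:rails_env) do
            # e.g. cap production invoke:rake_with_maintenance TASK=data:transform
            execute :bundle, :exec, :rake, ENV.fetch("TASK")
          end
        end
      end
    ensure
      # turn maintenance mode back off even if the rake task failed
      invoke "maintenance:disable"
    end
  end
end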

Anyway, here’s the gist of the cap recipe. This file goes in ./lib/capistrano/tasks in a local app, and then you’ve got these recipes. Any tips on how to organize my cap recipes better are quite welcome.


Hash#map ?

I have frequently griped that Hash doesn’t have a useful map/collect function, something allowing me to transform the hash keys or values (usually values) into another, transformed hash. I even go looking for it in ActiveSupport::CoreExtensions sometimes; surely they’ve added something, everyone must want to do this… nope.

Thanks to a realization triggered by an example in BigBinary’s blog post about the new ruby 2.4 Enumerable#uniq… I realized: duh, it’s already there!

olympics = {1896 => 'Athens', 1900 => 'Paris', 1904 => 'Chicago', 1906 => 'Athens', 1908 => 'Rome'}
olympics.collect { |k, v| [k, v.upcase]}.to_h
# => {1896=>"ATHENS", 1900=>"PARIS", 1904=>"CHICAGO", 1906=>"ATHENS", 1908=>"ROME"}

Just use ordinary Enumerable#collect, with two block args — it works to get key and value. Return an array from the block, to get an array of arrays, which can be turned to a hash again easily with #to_h.

It’s a bit messy, but not really too bad. (I somehow learned to prefer collect over its synonym map, but I think maybe I’m in the minority? collect still seems more descriptive to me of what it’s doing. But this is one place where I wouldn’t have held it against Matz if he had decided to give the method only one name, so we were all using the same one!)

(Did you know Array#to_h turns an array of duples into a hash? I am not sure I did! I knew about Hash(), but I don’t think I knew about Array#to_h… ah, it looks like it was added in ruby 2.1.0. The equivalent before that would have been more like Hash[ hash.collect { |k, v| [k, v] } ], which I think is too messy to want to use.)
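
For the record, that looks like:

# Array#to_h on an array of two-element arrays, ruby >= 2.1
[[1896, 'Athens'], [1900, 'Paris']].to_h
# => {1896=>"Athens", 1900=>"Paris"}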

I’ve been writing ruby for 10 years, and periodically thinking “damn, I wish there was something like Hash#collect” — and I didn’t realize that Array#to_h, added in 2.1, makes this pattern a lot more readable. I’ll definitely be using it next time I have that thought. Thanks, BigBinary, for using something similar in your Enumerable#uniq example and making me realize: oh, yeah.

 


“Polish”; And, What makes well-designed software?

Go check out Schneem’s post on “polish”. (No, not the country).

Polish is what distinguishes good software from great software. When you use an app or code that clearly cares about the edge cases and how all the pieces work together, it feels right. Unfortunately, this is the part of the software that most often gets overlooked, in favor of more features or more time on another project…

…When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent.…

…In many ways I want my software to be boring. I want it to harbor few surprises. I want to feel like I understand and connect with it at a deep level and that I’m not constantly being caught off guard by frustrating, time stealing, papercuts.

I have definitely experienced the difference between working with and on a project that has this kind of ‘polish’ (truly, experiencing a deep-level connection with the code that lets me be crazy effective with it), and working on or with projects that don’t have it. And on projects that started out with it, but lost it! (An irony is that it takes a lot of time, effort, skill, and experience to design an architecture that seems like the only way it would make sense to do it: obvious, and, as schneems says, “boring”!)

I was going to say “We all have experienced the difference…”, but I don’t know if that’s true. Have you?

What do you think one can do to work towards a project with this kind of “polish”, and keep it there?  I’m not entirely sure, although I have some ideas, and so does schneems. Tolerating edge-case bugs is a contraindication — and even though I don’t really believe in ‘broken windows theory’ when it comes to neighborhoods, I think it does have an application here. Once the maintainers start tolerating edge case bugs and sloppiness, it sends a message to other contributors, a message of lack of care and pride. You don’t put in the time to make a change right unless the project seems to expect, deserve, and accommodate it.

If you don’t even have well-enough-defined behavior/architecture to have any idea what behavior is desirable or undesirable (what’s a bug?), you’ve clearly gone down a wrong path incompatible with this kind of ‘polish’, and I’m not sure it can be recovered from. A Fred Brooks Mythical Man-Month quote I think is crucial to this idea of ‘polish’: “Conceptual integrity is central to product quality.” (He goes on to say that having an “architect” is the best way to get conceptual integrity; I’m not certain. I’d like to believe this isn’t true, because so many formal ‘architect’ roles are so ineffective, but I think my experience may indeed be that a single architect, or a tight team of architects, formal or informal, does correlate…)

There’s another Fred Brooks quote that I now can’t find, and I really wish I could, because I’ve wanted to return to it and meditate on it for a while. It’s about how the power of a system is measured by what you can do with it divided by the number of distinct architectural concepts: a powerful system is one that can do a lot with few architectural concepts. (If anyone can find the original quote, I’ll buy you a beer, or a case of it.)

I also know you can’t do this until you understand the ‘business domain’ you are working in — programmers as interchangeable cross-industry widgets is a lie. (‘business domain’ doesn’t necessarily mean ‘business’ in the sense of making money, it means understanding the use-cases and needs of your developer users, as they try to meet the use cases and needs of their end-users, which you need to understand too).

While I firmly believe in general in the caution against throwing out a system and starting over, a lot of that caution is about losing the domain knowledge encoded in the system (really, go read Joel’s post). But if the system was originally architected by people (perhaps past you!) who (in retrospect) didn’t have very good domain knowledge (or the domain has changed drastically?), and you now have a team (and an “architect”?) that does, and your existing software is consensually recognized as having the opposite of the quality of ‘polish’, and is becoming increasingly expensive to work with (“technical debt”) with no clear way out — that sounds like a time to seriously consider it. (Although you will have to be willing to accept that it’ll take a while to get to feature parity, if those were even the right features.) (Interestingly, Fred Brooks was, I think, the originator of the ‘build one to throw away’ idea that Joel is arguing against. I think both have their place, and the place of domain knowledge is a crucial concept in both.)

All of these are more vague, hand-wavy ideas than easy-to-follow directions. I don’t have any easy-to-follow directions, or know if any exist.

But I know that the first step is being able to recognize ‘polish’: a well-designed, parsimoniously architected system that feels right to work with and lets you effortlessly accomplish things with it. Which means having experience with such systems. If you’ve only worked with ball-of-twine, difficult-to-work-with systems, you don’t even know what you’re missing, or what is possible, or what it looks like. You’ve got to find a way to get exposure to good design to become a good designer, and this is something we don’t know how to do as well with software architecture as with classic design (design school consists of exposure to design, right?).

And the next step is desiring and committing to building such a system.

Which also can mean pushing back on or educating managers and decision-makers.  The technical challenge is already high, but the social/organizational challenge can be even higher.

Because it is harder to build such a system than not to, designing and implementing good software is not easy; it takes care, skill, and experience. Not every project deserves or can have the same level of ‘polish’. But if you’re building a project meant to meet complex needs, to be used by a distributed (geographically and/or organizationally) community, and to hold up for years, this is what it takes. (Whether that means a polished end-user UX, a polished developer-user UX (which means API), or both, depends on the nature of that distributed community.)


Command-line utility to visit github page of a named gem

I’m working in a Rails stack that involves a lot of convolutedly inter-related dependencies, which I’m not yet all that familiar with.

I often want to go visit the github page of one of the dependencies, to check out the README, issues/PRs, generally root around in the source, etc.  Sometimes I want to clone and look at the source locally, but often in addition to that I’m wanting to look at the github repo.

So I wrote a little utility that lets me do gem-visit name_of_gem, and it’ll open up the github page in a browser, using the MacOS open utility to open it in your default browser.

The gem doesn’t need to be installed; it uses the rubygems.org API (hooray!) to look up the gem by name, and looks at its registered “Homepage” and “Source Code” links. If either one is a github.com link, it’ll prefer that. Otherwise, it’ll just send you to the Homepage, or, failing that, to the rubygems.org page.
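
The core of the idea is only a few lines. Here’s a from-scratch sketch of the approach (not the actual script from the gist below); it assumes the rubygems.org v1 gems endpoint and its homepage_uri/source_code_uri fields:

#!/usr/bin/env ruby
# gem-visit NAME -- open the github page (or homepage) of a gem. A sketch, not the original script.
require 'json'
require 'open-uri'

name = ARGV.first or abort("usage: gem-visit NAME")
info = JSON.parse(URI.open("https://rubygems.org/api/v1/gems/#{name}.json").read)

candidates = [info["source_code_uri"], info["homepage_uri"]].compact.reject { |u| u.to_s.empty? }
url = candidates.find { |u| u.include?("github.com") } ||
      candidates.first ||
      "https://rubygems.org/gems/#{name}"

system("open", url) # MacOS `open` launches the default browser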

It’s working out well for me.

I wrote it in ruby (naturally), but with no dependencies, so you can just drop it in your $PATH, and it just works. I put ~/bin on my $PATH, so I put it there.

I’ll give you a gist of it (yes, I’m not maintaining it at present), but I’ve been thinking about the best way to distribute this if I wanted to maintain it… Distributing it as a ruby gem doesn’t actually seem great: with the way people use ruby version switchers, you’d have to make sure to get it installed for every version of ruby you might be using, for the command line to work in them all.

Distributing it as a homebrew package seems to actually make a lot of sense for a very simple command-line utility like this. But homebrew doesn’t really like niche/self-submitted stuff, and this doesn’t really meet the requirements for what they’ll tolerate. Homebrew does theoretically support third-party ‘taps’, though, and I think a github repo itself can be such a ‘tap’, and hypothetically I could set things up so you could brew tap my_github_url, and install (and later upgrade or uninstall!) with brew… but I haven’t been able to figure out how to do that from the docs/examples, just verified that it oughta be possible! Anyone want to give me any hints or model examples? Of a homebrew formula that does nothing but install a simple one-file script, and is a third-party tap hosted on github?
