Exploring and planning with Sufia/Hyrax/Fedora fixity validation

“Fixity” validation — basically validating a checksum against bytes on disk to make sure a file is still exactly as it was on ingest — is an important part of any digital preservation collection, and my understanding is that it’s a key marketing point of the fedora+hydra stack.

However, I found it somewhat challenging to figure out how/if current Sufia/Hyrax supports this with already-built features. If there are reliable and up-to-date docs, I did not find them. So, since understanding what’s really going on here seems important for preservation responsibilities, I spent a couple days reverse engineering and debugging what’s there (thanks to various people in the Hydra Slack channel for pointing me to the right places to look). Some of what I found was unexpected to me, and not necessarily quite right, at least for what I understand as our needs.

I figured I’d write up what I discovered and what our current plans (for our local app) are based on what I discovered. As an aid to other people wanting to know what’s up, and as a discussion/planning aid in considering any changes to the shared gems.

Hydra component write-ups seem to be very version-sensitive, things tend to change a lot. This was investigated under Sufia 7.3.0, CurationConcerns 1.7.7, ActiveFedora 11.1.6. I believe it has not changed substantially in hyrax as of this date, except for class name changes (including generally using the term ‘fixity’ instead of ‘audit’ in class names, as well as Hyrax namespace of course), but am not totally sure.

There is an existing service to fixity audit a single FileSet, in CurationConcerns at FileSetAuditService.

CurationConcerns::FileSetAuditService.new(fs).audit

So you might run that on every fileset to do a bulk audit, like FileSet.find_each { |fs| CurationConcerns::FileSetAuditService.new(fs).audit } — which is just what (in Sufia rather than CC) Sufia::RepositoryAuditService does, nothing more nothing less.

CurationConcerns::FileSetAuditService actually uses several other objects to do the actual work, and later I’ll go into what they do and how. But the final outcome will be:

  • an ActiveRecord ChecksumAuditLog row created — I believe one for every file checked, in cases where a fileset has multiple files. It seems to have a pass (integer column) of 1 if the object had a good checksum, or a 0 if not.
    • It cleans up after itself in that table, not leaving infinitely growing historical ChecksumAuditLog rows there; generally I think only the most recent two are kept, although there may be more if there are failures. AuditJob calls ChecksumAuditLog.prune_history.
    • While the ChecksumAuditLog record has columns for expected_result and actual_result, nothing in the sufia/CC stack fills these out; all you get is the pass value (recall, we think 0 or 1), a file_set_id, a file_id, and a version string.
      • I’m not sure what the version string is for, or if it gives you any additional unique data that file_id doesn’t, or if the version string is just a different representation uniquely identifying the same thing file_id does. A version string might look like: `http://127.0.0.1:8080/rest/dev/37/72/0c/72/37720c723/files/214f68af-e5ed-41bd-9898-b8923fd6d018/fcr:versions/version1`
  • In cases of failure, it sends an internal app message to the person listed as the depositor of the fileset — assuming the recorded email address of the depositor still matches the email address of a Sufia account. This is set up by Sufia registering a CurationConcerns callback to run the Sufia::AuditFailureService; that callback is triggered by the CurationConcerns::AuditJob (AuditJob gets run by the FileSetAuditService).
    • The internal message does include the FileSet title and file_set.original_file.uri.to_s. If the file set had multiple versions, the message does not say which one failed the checksum, or how to get to it in the UI.
    • It’s not clear to me in what use cases one would want the depositor (and only if they (still) have a registered account) to be the only one who gets the fixity failure notice. It seems like an infrastructure problem, so we at least would want a notification sent instead to infrastructural admins who can respond to it — perhaps via an email or an external error-tracking service like Bugsnag or Honeybadger. Fortunately, the architecture makes it pretty easy to customize this.
    • The callback is using CurationConcerns::Callback::Registry, which only supports one callback per event, so setting another one will replace the one set by default by Sufia. Which is fine.
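So a local app can swap in its own notification with a few lines in an initializer. A hedged sketch, not tested code: the event name (:after_audit_failure, my best reading of the CurationConcerns 1.x source) and the block arguments are inferred from the code rather than documented API, and the Honeybadger call is just one hypothetical notification target:

```ruby
# config/initializers/fixity_failure_callback.rb (hypothetical sketch)
# The registry holds one callback per event, so `set` replaces Sufia's
# default depositor-notification callback with ours.
CurationConcerns.config.callback.set(:after_audit_failure) do |file_set, file_id, checked_uri|
  # Notify infrastructural admins instead of the depositor -- here via an
  # error-tracking service, but email or a logger would work the same way.
  Honeybadger.notify("Fixity check FAILED",
    context: { file_set_id: file_set.id, file_id: file_id, checked_uri: checked_uri })
end
```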

I did intentionally corrupt a file on my dev copy, and then verified it was caught and that the things listed above happened — basically, the callback sends an internal notification to the depositor, and a ChecksumAuditLog is stored in the database with a 0 value for pass, and the relevant file_set_id and file_id.

While the ChecksumAuditLog objects are all created, there is no admin UI I could find for, say, “show me all ChecksumAuditLog records representing failed fixity checks”.

There is an area on the FileSet “show” page that says Audit Status: Audits have not yet been run on this file. I believe this is intended to show information based on ChecksumAuditLog rows, possibly as a result of something in Sufia calling this line. However, this appears broken in current sufia/hyrax: the page keeps saying “Audits have not yet been run” no matter how many times you’ve run audits. I found this problem had already been reported in November 2016 on the Sufia issue tracker, and imported to the Hyrax issue tracker.

So in current Sufia (and Hyrax?), although the ChecksumAuditLog AR records are created, I believe there is no UI that displays them in any way — a developer could manually interrogate them from a console, otherwise all you’ve got is the (by default) internal notification sent to depositor.
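So if you do want to see them, a console query against that table is about all there is. A sketch, using the columns described above:

```ruby
# In a Rails console: list recorded fixity failures, newest first.
ChecksumAuditLog.where(pass: 0).order(created_at: :desc).each do |log|
  puts "#{log.created_at} file_set=#{log.file_set_id} file=#{log.file_id}"
end
```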

While past versions of Sufia may have run some fixity checks automatically on-demand when a file was viewed, this functionality does not seem to still be in sufia 7.3/hyrax. I’m not sure it’s a desired function anyway — it seems to me you need to be running periodic bulk/mass audits regardless (you don’t want to skip checking files that haven’t been viewed), and if you are doing so, additional on-the-fly checking when files are viewed/downloaded seems superfluous.

Also note that the “checksum” displayed in the Fileset “show” view is not the checksum used by fedora internally. At least not in our setup, where we haven’t tried to customize this at all. The checksum displayed in the Sufia view is, we believe, calculated on upload even before fedora ingest, and appears to be an MD5, and does not match fedora’s checksum used for the fedora fixity service, which seems to be SHA1.

How is this implemented: Classes involved

The CurationConcerns::FileSetAuditService actually calls out to CurationConcerns::AuditJob to do the bulk of its work.

  • FileSetAuditService calls AuditJob as perform_later, there is no way to configure it to run synchronously.
    • When I ran an audit of every file on our staging server  (with a hand-edit to do them synchronously so I could time it more easily and clearly), it took about 3.8 hours to check 8077 FileSets on staging.
    • This means a bulk audit, using Resque bg jobs to do it, could clog up the resque job queue for up to 3.8 hours (less with more resque workers, not necessarily scaling directly), making other jobs (like derivatives creation) take a long time to complete, perhaps landing at the end of the queue 3.8 hours later. Clogging up the job queue for a bulk fixity audit seems problematic. One could imagine changing it to use a different queue name with dedicated workers — but for a bulk fixity check, I’m not sure there is a reason for this to be in the bg job queue at all; doing it all synchronously seems fine/preferable.
    • It’s not entirely clear to me what rationale governs the split of logic between FileSetAuditService and AuditJob, or if it’s completely rational. I guess one thing is that the FileSetAuditService is for a whole FileSet, but the AuditJob is for an individual file. The FileSetAuditService does schedule audits for every file version if there is more than one in the FileSet.
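Put together, a synchronous bulk audit would look something like this hypothetical rake task. Since FileSetAuditService hard-codes perform_later, this calls AuditJob with perform_now directly; the AuditJob argument list here is my reading of the CurationConcerns 1.7 code, not a documented API:

```ruby
# lib/tasks/fixity.rake (hypothetical sketch)
desc "Fixity-check every file of every FileSet, synchronously"
task fixity_audit_all: :environment do
  FileSet.find_each do |file_set|
    file_set.files.each do |file|
      # perform_now runs in this process instead of enqueueing to Resque,
      # so the bulk audit never touches the bg job queue.
      CurationConcerns::AuditJob.perform_now(file_set, file.id, file.uri.to_s)
    end
  end
end
```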

The CurationConcerns::AuditJob in turn calls out to ActiveFedora::FixityService to do the actual fixity check.

How is the fixity check done?

  • ActiveFedora::FixityService simply asks Fedora for a fixity check on a URL (for an individual File, I think). The asset is not downloaded or examined by Hydra stack code; a simple HTTP request, “do a fixity check on this file and tell me the result”, is sent to Fedora.
    • This means we are trusting that A) even if the asset has been corrupted, Fedora’s stored checksum for the asset is still okay, and B) that the Fedora fixity service actually works. I guess these are safe assumptions for a reliable fixity service?
    • It looks at the RDF body returned by the Fedora fixity service to interpret whether the fixity check was good or not.
    • While the Fedora fixity service RDF response body includes some additional information (such as original and current checksum), this information is not captured and sent up the stack to be reported or logged — ActiveFedora::FixityService just returns true or false (which explains why ChecksumAuditLog records always have blank expected_result and actual_result attributes).
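To make that concrete, here’s a self-contained sketch of what the whole exchange boils down to. The fcr:fixity endpoint is real Fedora 4 API; the sample response bodies are abbreviated and made up for illustration (a real response is a fuller RDF graph, using premis vocabulary, carrying the original and computed checksums), and scanning the serialized body for a status literal is a simplification of ActiveFedora’s actual RDF handling:

```ruby
require "net/http"
require "uri"

# Ask Fedora to run the fixity check itself: one GET, no bytes downloaded.
def fixity_ok?(file_uri)
  response = Net::HTTP.get_response(URI("#{file_uri}/fcr:fixity"))
  fixity_success?(response.body)
end

# All the stack ultimately extracts is pass/fail; the checksums in the
# response are discarded (hence the blank expected_result/actual_result).
def fixity_success?(rdf_body)
  rdf_body.include?("SUCCESS")
end

# Abbreviated, illustrative response bodies (not real Fedora output):
SAMPLE_GOOD = '<> <info:example/status> "SUCCESS" .'
SAMPLE_BAD  = '<> <info:example/status> "BAD_CHECKSUM" .'
```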

What do we need or want locally that differs from standard setup

We decided that trusting the Fedora fixity service was fine — we know of no problems with it, if we did we’d report them upstream to Fedora, who would hopefully fix them quickly since fixity is kind of a key feature for preservation. Ideally, one might want to store a copy of the original checksums elsewhere to make sure they were still good in Fedora, but we decided we weren’t going to do this for now either. We will run some kind of bulk fixity-all-the-things task periodically, and do want to receive notifications.

  1. Different notification on fixity failure than the default internal notification to depositor. This should be easy to do in current architecture with a local setting though, hooray.
  2. Get the bulk fixity check not to create a bg job for every file audited, filling up the bg job queue. For a bulk fixity check in our infrastructure, just one big long-running foreground process seems fine.
  3. Get the Fedora fixity check response details to be recorded and passed up the stack, for ChecksumAuditLog inclusion and inclusion in notifications. Expected checksum and actual checksum, at least. This requires changes to ActiveFedora, or using something new instead of what’s in ActiveFedora. (The currently registered fedora checksum may be necessary for recovery, see below.) Not sure if there should be a way to mark a failed ChecksumAuditLog row as ‘resolved’, for ongoing admin overview of fixity status. Probably not: if the same file gets a future ChecksumAuditLog row as ‘passing’, that’s enough indication of ‘resolved’.
  4. Ideally, fix bug where “Audit status” never gets updated and always says “no audits have yet been done”.
  5. Failed audits should be logged to standard rails log as well as other notification methods.
  6. It has been suggested that we might only want to be fixity-auditing the most recent version of any file; there’s no need to audit older versions. I’m not sure if this is sound from a preservation standpoint; those old versions might be part of the archival history? But it might simplify one recovery strategy, see below.
  7. Ideally, clean up the code a little bit in general. I don’t entirely understand why logic is split between classes as it is, and don’t understand what all the methods are doing. Don’t understand why ChecksumAuditLog has an integer pass instead of a boolean. The code is harder to follow than seems necessary for relatively simple functionality.
  8. Ideally, perhaps, an admin UI for showing “current failed fixity checks”, in case you missed the notification.

And finally, an area that I’ll give more than a bullet point to — recovery. While I expect fixity failures to be very rare (possibly we will literally never see one in the lifetime of this local project), doing fixity checks without having a tested process for recovery from a discovered corrupted file seems pointless. What’s the point of knowing a file is corrupt if you can’t do anything about it?

I’m curious if any other hydra community people have considered this, and have a recovery process.

We do have disk backups of the whole fedora server. In order to try and recover an older non-corrupted version, we have to know where it is on disk. Knowing fedora’s internal computed SHA1 — which I think is the same thing it uses for fixity checking — seems like what you need to find the file on disk, they are filed on disk by the SHA1 taken at time of ingest.

Once you’ve identified a known-good passing-SHA1-checksum version backup (by computing SHA1’s yourself, in the same way fedora does, presumably) — how do you actually restore it? I haven’t been able to find anything in sufia/hyrax or fedora itself meant to help you here.

We can think of two ways. We could literally replace the file on disk in the fedora file system. This seems nice, but I’m not sure we should be messing with fedora’s internals like that. Or we could upload a new “version”, the known-good one, to sufia/fedora. This is not messing with fedora internals, but the downside is that the old corrupt version is still there, still failing fixity checks, and possibly showing up in your failed fixity check reports and notifications etc., unless you build more stuff on top to prevent that. False-positive “fixity check failures” would be bad, and lead to admins ignoring “fixity check failure” notices, as is human nature.

Curious if Fedora/fcrepo itself has any intended workflow here, for how you recover from a failed fixity check, when you have an older known-good version. Anyone know?

I think most of these changes, at least as options, would be good to send upstream — the current code seems not quite right in the generic case to me; I don’t think it’s any special use case I have. The challenge with an upstream PR here is that the code spans both Hyrax and ActiveFedora, which would need to be changed in a synchronized fashion. And I’m not quite sure of the intention of the existing code — whether parts that look like weird architecture to me are actually used or needed by someone or something. Both of which make it more challenging, and more time-consuming, to send upstream. So I’m not sure yet how much I’ll be able to send upstream, and how much will be just local.


On the graphic design of rubyland.news

I like to pay attention to design, and enjoy good design in the world, graphic and otherwise. A well-designed printed page, web page, or physical tool is a joy to interact with.

I’m not really a trained designer in any way, but in my web development career I’ve often effectively been the UI/UX/graphic designer of apps I work on. I do my best (our users deserve good design), and try to develop my skills by paying attention to graphic design in the world, reading up (I’d recommend Donald Norman’s The Design of Everyday Things, Robert Bringhurst’s The Elements of Typographic Style, and one free online one, Butterick’s Practical Typography), and trying to improve my practice; I think my graphic design skills are always improving. (I also learned a lot looking at and working with the productions of the skilled designers at Friends of the Web, where I worked last year.)

Implementing rubyland.news turned out to be a great opportunity to practice some graphic and web design. Rubyland.news has very few graphical or interactive elements, it’s a simple thing that does just a few things. The relative simplicity of what’s on the page, combined with it being a hobby side project — with no deadlines, no existing branding styles, and no stakeholders saying things like “how about you make that font just a little bit bigger” — made it a really good design exercise for me, where I could really focus on trying to make each element and the pages as a whole as good as I could in both aesthetics and utility, and develop my personal design vocabulary a bit.

I’m proud of the outcome, while I don’t consider it perfect (I’m not totally happy with the typography of the mixed-case headers in Fira Sans), I think it’s pretty good typography and graphic design, probably my best design work. It’s nothing fancy, but I think it’s pleasing to look at and effective.  I think probably like much good design, the simplicity of the end result belies the amount of work I put in to make it seem straightforward and unsophisticated. :)

My favorite element is the page-specific navigation (and sometimes info) “side bar”.

[Screenshot: the page-specific nav links rendered as a side bar]

At first I tried to put these links in the site header, but there wasn’t quite enough room for them, I didn’t want to make the header two lines — on desktop or wide tablet displays, I think vertical space is a precious resource not to be squandered. And I realized that maybe anyway it was better for the header to only have unchanging site-wide links, and have page-specific links elsewhere.

Perhaps encouraged by the somewhat hand-written look (especially of all-caps text) in Fira Sans, the free font I was trying out, I got the idea of trying to include these as a sort of ‘margin note’.

[Screenshot: nav links styled as a “margin note” beside the content]

The CSS got a bit tricky, with screen-size responsiveness (flexbox is a wonderful thing). On wide screens, the main content is centered in the screen, as you can see above, with the links to the left: The ‘like a margin note’ idea.

On somewhat narrower screens, where there’s not enough room to have margins on both sides big enough for the links, the main content column is no longer centered.

[Screenshot: narrower screen, with the main content column no longer centered]

And on very narrow screens, where there’s not even room for that, such as most phones, the page-specific nav links switch to being above the content. On narrow screens, which are typically phones that are much higher than they are wide, it’s horizontal space that becomes precious, with some more vertical to spare.

[Screenshot: very narrow screen, with the page-specific nav links above the content]

Note that on really narrow screens, which is probably most phones especially held in vertical orientation, the margins on the main content disappear completely; you get actual content with its white border edge-to-edge. This seems an obvious thing to do on phone-sized screens: Why waste any horizontal real estate with different-colored margins, or provide a visual distraction with even a few pixels of different-colored margin or border jammed up against the edge? I’m surprised it seems a relatively rare thing to do in the wild.

[Screenshot: edge-to-edge content on a very narrow screen]

Nothing too fancy, but I quite like how it turned out. I don’t remember exactly what CSS tricks I used to make it so. And I still haven’t really figured out how to write clear maintainable CSS code; I’m less proud of the actual messy CSS source code than I am of the result. :)

One way to remove local merged tracking branches

My git workflow involves creating a lot of git feature branches, as remote tracking branches on origin. They eventually get merged and deleted (via github PR), but I still have dozens of them lying around.

Via googling, getting StackOverflow answers, and sort of mushing together some stuff I don’t totally understand, here’s one way to deal with it: create an alias git-prune-tracking. In your ~/.bash_profile:

alias git-prune-tracking='git branch --merged | grep -v "*" | grep -v "master" | xargs git branch -d; git remote prune origin'

And periodically run git-prune-tracking from a git project dir.

I do not completely understand what this is doing, I must admit, and there might be a better way? But it seems to work. Anyone have a better way, one where you understand what it’s doing? I’m kinda surprised this isn’t built into the git client somehow.
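For whatever it’s worth, here is my best reading of the one-liner, expanded into a commented function (the behavior should be identical to the alias):

```shell
#!/bin/sh
git_prune_tracking() {
  # `git branch --merged` lists local branches whose tips are already
  #   reachable from HEAD, i.e. branches that have been merged.
  # `grep -v '\*'` drops the current branch (git marks it with "*").
  # `grep -v master` makes sure we never delete master.
  # `xargs git branch -d` safe-deletes what's left; -d (unlike -D)
  #   refuses branches git doesn't consider merged, a second safety check.
  git branch --merged | grep -v '\*' | grep -v master | xargs git branch -d

  # Drop local remote-tracking refs (origin/foo) for branches that no
  # longer exist on the origin remote.
  git remote prune origin
}
```

One caveat I can see: if the greps leave nothing, xargs may still run `git branch -d` with no arguments and print a usage error; GNU xargs has a -r flag for that, which the original alias doesn’t use.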

Use capistrano to run a remote rake task, with maintenance mode

So the app I am now working on is still in its early stages, not even live to the public yet, but we’ve got an internal server. We periodically have to change a bunch of data in our (non-rdbms) “production” store. (First devops unhappiness: I think there should be no scheduled downtime for planned data transformation. We’re working on it. But for now it happens.)

We use capistrano to deploy. Previously/currently, the process for one of these scheduled-downtime maintenance windows looked like:

  • on your workstation, do a cap production maintenance:enable to start some downtime
  • ssh into the production machine, cd to the cap-installed app, and run a rake task with bundle exec. Which could take an hour+.
  • Remember to come back when it’s done and `cap production maintenance:disable`.

A couple more devops unhappiness points here: 1) In my opinion you should ideally never be ssh’ing to production, at least in a non-emergency situation. 2) You have to remember to come back and turn off maintenance mode — and if I start the task at 5pm to avoid disrupting internal stakeholders, I gotta come back after business hours to do that! I also think everything you have to do ‘outside business hours’ that’s not an emergency is a sign of a not-yet-done ops environment.

So I decided to try to fix this. Since the existing maintenance mode stuff was already done through capistrano, and I wanted to do it without a manual ssh to the production machine, capistrano seemed a reasonable tool. I found a plugin to execute rake via capistrano, but it didn’t do quite what I wanted, and its implementation was so simple that I saw no reason not to copy-and-paste it and make it do just what I wanted.

I’m not gonna maintain this for the public at this point (make a gem/plugin out of it, nope), but I’ll give it to you in a gist if you want to use it. One of the tricky parts was figuring out how to get “streamed” output from cap, since my rake tasks use ruby-progressbar — it’s got decent non-TTY output already, and I wanted to see it live in my workstation console. I managed to do that! Although I never figured out how to get a cap recipe to require files from another location (I have no idea why I couldn’t make it work), so the custom class is inlined, ugly, right in the recipe.

I also ordinarily want maintenance mode to be turned off even if the task fails, but still want a non-zero exit code in those cases (anticipating future further automation — really what I need is to be able to execute this all via cron/at too, so we can schedule downtime for the middle of the night without having to be up then).

Anyway here’s the gist of the cap recipe. This file goes in ./lib/capistrano/tasks in a local app, and now you’ve got these recipes. Any tips on how to organize my cap recipe better quite welcome.
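In outline, the recipe does something like the following. This is a condensed hypothetical sketch in Capistrano 3 DSL, not the actual gist; the task name and the maintenance-mode task names are stand-ins:

```ruby
# lib/capistrano/tasks/maintenance_rake.rake (hypothetical sketch)
namespace :app do
  desc "Run a remote rake task wrapped in maintenance mode"
  task :maintenance_rake do
    invoke "maintenance:enable"
    begin
      on roles(:app) do
        within current_path do
          with rails_env: fetch(:rails_env) do
            # e.g. TASK=data:transform cap production app:maintenance_rake
            execute :rake, ENV.fetch("TASK")
          end
        end
      end
    ensure
      # Maintenance mode comes back off even if the rake task blew up;
      # the exception still propagates, so cap exits non-zero on failure.
      invoke "maintenance:disable"
    end
  end
end
```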

Hash#map ?

I frequently have griped that Hash didn’t have a useful map/collect function, something allowing me to transform the hash keys or values (usually values) into another, transformed hash. I even go looking for it in ActiveSupport::CoreExtensions sometimes; surely they’ve added something, everyone must want to do this… nope.

Thanks to realization triggered by an example in BigBinary’s blog post about the new ruby 2.4 Enumerable#uniq… I realized, duh, it’s already there!

olympics = {1896 => 'Athens', 1900 => 'Paris', 1904 => 'Chicago', 1906 => 'Athens', 1908 => 'Rome'}
olympics.collect { |k, v| [k, v.upcase] }.to_h
# => {1896=>"ATHENS", 1900=>"PARIS", 1904=>"CHICAGO", 1906=>"ATHENS", 1908=>"ROME"}

Just use ordinary Enumerable#collect, with two block args — it works to get key and value. Return an array from the block, to get an array of arrays, which can be turned to a hash again easily with #to_h.

It’s a bit messy, but not really too bad. (I somehow learned to prefer collect over its synonym map, but I think maybe I’m in the minority? collect still seems more descriptive to me of what it’s doing. But this is one place where I wouldn’t have held it against Matz if he had decided to give the method only one name, so we were all using the same one!)

(Did you know Array#to_h turns an array of two-element arrays into a hash? I am not sure I did! I knew about Hash(), but I don’t think I knew about Array#to_h… ah, it looks like it was added in ruby 2.1.0. The equivalent before that would have been more like Hash[ hash.collect { |k, v| [k, v] } ], which I think is too messy to want to use.)

I’ve been writing ruby for 10 years, and periodically thinking “damn, I wish there was something like Hash#collect” — and didn’t realize that Array#to_h was added in 2.1, and makes this pattern a lot more readable. I’ll def be using it next time I have that thought. Thanks BigBinary for using something similar in your Enumerable#uniq example that made me realize, oh, yeah.
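And for the values-only case (the common one for me), it’s worth noting ruby 2.4 also added Hash#transform_values, which skips the array-of-pairs round trip entirely:

```ruby
olympics = { 1896 => "Athens", 1900 => "Paris" }

# The collect-then-to_h pattern from above:
olympics.collect { |k, v| [k, v.upcase] }.to_h
# => {1896=>"ATHENS", 1900=>"PARIS"}

# Ruby 2.4+: transform just the values, keys untouched:
olympics.transform_values(&:upcase)
# => {1896=>"ATHENS", 1900=>"PARIS"}
```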

 

“Polish”; And, What makes well-designed software?

Go check out schneems’ post on “polish”. (No, not the country.)

Polish is what distinguishes good software from great software. When you use an app or code that clearly cares about the edge cases and how all the pieces work together, it feels right. Unfortunately, this is the part of the software that most often gets overlooked, in favor of more features or more time on another project…

…When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent.…

…In many ways I want my software to be boring. I want it to harbor few surprises. I want to feel like I understand and connect with it at a deep level and that I’m not constantly being caught off guard by frustrating, time stealing, papercuts.

I definitely have experienced the difference between working with and on a project that has this kind of ‘polish’, truly experiencing a deep-level connection with the code that lets me be crazy effective with it, and working on or with projects that don’t have this. And on projects that started out with it, but lost it! (An irony is that it takes a lot of time, effort, skill, and experience to design an architecture that seems like the only way it would make sense to do it: obvious, and, as schneems says, “boring”!)

I was going to say “We all have experienced the difference…”, but I don’t know if that’s true. Have you?

What do you think one can do to work towards a project with this kind of “polish”, and keep it there?  I’m not entirely sure, although I have some ideas, and so does schneems. Tolerating edge-case bugs is a contraindication — and even though I don’t really believe in ‘broken windows theory’ when it comes to neighborhoods, I think it does have an application here. Once the maintainers start tolerating edge case bugs and sloppiness, it sends a message to other contributors, a message of lack of care and pride. You don’t put in the time to make a change right unless the project seems to expect, deserve, and accommodate it.

If you don’t even have well-defined enough behavior/architecture to have any idea what behavior is desirable or undesirable, or what counts as a bug, you’ve clearly gone down a wrong path incompatible with this kind of ‘polish’, and I’m not sure it can be recovered from. A Fred Brooks “Mythical Man Month” quote I think is crucial to this idea of ‘polish’: “Conceptual integrity is central to product quality.” (He goes on to say that having an “architect” is the best way to get conceptual integrity. I’m not certain; I’d like to believe this isn’t true, because so many formal ‘architect’ roles are so ineffective, but I think my experience may indeed be that a single architect, or a tight team of architects, formal or informal, does correlate…)

There’s another Fred Brooks quote that now I can’t find and I really wish I could cause I’ve wanted to return to it and meditate on it for a while, but it’s about how the power of a system is measured by what you can do with it divided by the number of distinct architectural concepts. A powerful system is one that can do a lot with few architectural concepts.  (If anyone can find this original quote, I’ll buy you a beer or a case of it).

I also know you can’t do this until you understand the ‘business domain’ you are working in — programmers as interchangeable cross-industry widgets is a lie. (‘business domain’ doesn’t necessarily mean ‘business’ in the sense of making money, it means understanding the use-cases and needs of your developer users, as they try to meet the use cases and needs of their end-users, which you need to understand too).

While I firmly believe in general in the caution against throwing out a system and starting over, a lot of this caution is about losing the domain knowledge encoded in the system (really, go read Joel’s post). But if the system was originally architected by people (perhaps past you!) who (in retrospect) didn’t have very good domain knowledge (or the domain has changed drastically?), and you now have a team (and an “architect”?) that does, and your existing software is consensually recognized as having the opposite of the quality of ‘polish’, and is becoming increasingly expensive to work with (“technical debt”) with no clear way out — that sounds like a time to seriously consider it. (Although you will have to be willing to accept that it’ll take a while to get feature parity, if those were even the right features.) (Interestingly, Fred Brooks was I think the originator of the ‘build one to throw away’ idea that Joel is arguing against. I think both have their place, and the place of domain knowledge is a crucial concept in both.)

All of these are vague hand-wavy ideas rather than easy-to-follow directions; I don’t have any easy-to-follow directions, or know if any exist.

But I know that the first step is being able to recognize “polish”, a well-designed, parsimoniously architected system that feels right to work with and lets you effortlessly accomplish things with it. Which means having experience with such systems. If you’ve only worked with ball-of-twine, difficult-to-work-with systems, you don’t even know what you’re missing, or what is possible, or what it looks like. You’ve got to find a way to get exposure to good design to become a good designer, and this is something we don’t know how to do as well with computer architecture as with classic design (design school consists of exposure to design, right?).

And the next step is desiring and committing to building such a system.

Which also can mean pushing back on or educating managers and decision-makers.  The technical challenge is already high, but the social/organizational challenge can be even higher.

Because it is harder to build such a system than to not, designing and implementing good software is not easy, it takes care, skill, and experience.  Not every project deserves or can have the same level of ‘polish’. But if you’re building a project meant to meet complex needs, and to be used by a distributed (geographically and/or organizationally) community, and to hold up for years, this is what it takes. (Whether that’s a polished end-user UX, or developer-user UX, which means API, or both, depending on the nature of this distributed community).

Command-line utility to visit github page of a named gem

I’m working in a Rails stack that involves a lot of convolutedly inter-related dependencies, which I’m not yet all that familiar with.

I often want to go visit the github page of one of the dependencies, to check out the README, issues/PRs, generally root around in the source, etc.  Sometimes I want to clone and look at the source locally, but often in addition to that I’m wanting to look at the github repo.

So I wrote a little utility to let me do gem-visit name_of_gem and it’ll open up the github page in a browser, using the MacOS open utility to open your default browser window.

The gem doesn’t need to be installed; it uses the rubygems.org API (hooray!) to look up the gem by name, and looks at its registered “Homepage” and “Source Code” links. If either one is a github.com link, it’ll prefer that. Otherwise, it’ll just send you to the Homepage, or failing that, to the rubygems.org page.
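The lookup logic can be sketched roughly like this. This is a simplified reconstruction, not my actual script: the rubygems.org endpoint and the `source_code_uri`/`homepage_uri` JSON fields are real, but the helper names are mine.

```ruby
require "net/http"
require "json"
require "uri"

# Fetch gem metadata from the public rubygems.org API.
def gem_info(name)
  res = Net::HTTP.get_response(URI("https://rubygems.org/api/v1/gems/#{name}.json"))
  res.is_a?(Net::HTTPSuccess) ? JSON.parse(res.body) : nil
end

# Prefer a github.com link from the "Source Code" or "Homepage" fields;
# otherwise fall back to the homepage, or failing that the rubygems.org page.
def best_url(name, info)
  candidates = [info["source_code_uri"], info["homepage_uri"]].compact.reject(&:empty?)
  candidates.find { |u| u.include?("github.com") } ||
    candidates.first ||
    "https://rubygems.org/gems/#{name}"
end

if __FILE__ == $0 && ARGV[0]
  info = gem_info(ARGV[0]) or abort "gem not found: #{ARGV[0]}"
  system("open", best_url(ARGV[0], info)) # MacOS `open` launches the default browser
end
```

Since there are no dependencies beyond the standard library, a file like this really can just be dropped on your $PATH with a shebang line and a chmod +x.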

It’s working out well for me.

I wrote it in ruby (naturally), but with no dependencies, so you can just drop it in your $PATH, and it just works. I put ~/bin on my $PATH, so I put it there.

I’ll just give you a gist of it (yes, I’m not maintaining it at present), but I’ve been thinking about the best way to distribute this if I wanted to maintain it…. Distributing it as a ruby gem doesn’t actually seem great: with the way people use ruby version switchers, you’d have to make sure it got installed for every version of ruby you might be using to get the command line to work in them all.

Distributing as a homebrew package seems to actually make a lot of sense, for a very simple command-line utility like this. But homebrew doesn’t really like niche/self-submitted stuff, and this doesn’t really meet the requirements for what they’ll tolerate. But homebrew theoretically supports a third-party ‘tap’, and I think a github repo itself can be such a ‘tap’, and hypothetically I could set things up so you could brew tap my_github_url, and install (and later upgrade or uninstall!) with brew… but I haven’t been able to figure out how to do that from the docs/examples, just verified that it oughta be possible! Anyone want to give me any hints or model examples? Of a homebrew formula that does nothing but install a simple single-file script, and is a third-party tap hosted on github?
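For what it’s worth, my best guess at what such a formula might look like, from homebrew’s formula DSL. This is untested, and every name, URL, and checksum below is a hypothetical placeholder; the idea is that a github repo named something like homebrew-tools can be tapped with brew tap someuser/tools.

```ruby
# Hypothetical Formula/gem-visit.rb in a github repo named "homebrew-tools".
# Everything here is illustrative, not a working tap.
class GemVisit < Formula
  desc "Open the github page of a named rubygem in your browser"
  homepage "https://github.com/someuser/gem-visit"
  url "https://github.com/someuser/gem-visit/archive/v1.0.0.tar.gz"
  sha256 "0000000000000000000000000000000000000000000000000000000000000000" # checksum of the release tarball

  def install
    # The whole "build" is just copying the one-file script into brew's bin.
    bin.install "gem-visit"
  end
end
```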

Monitoring your Rails apps for professional ops environment

It’s 2017, and I suggest that operating a web app in a professional way necessarily includes 1) finding out about problems before you get an error report from your users (or at best, before they even affect your users at all), and 2) having the right environment to be able to fix them as quickly as possible.

Finding out before an error report means monitoring of some kind, which is also useful for diagnosis to get things fixed as quickly as possible. In this post, I’ll be talking about monitoring.

If you aren’t doing some kind of monitoring to find out about problems before you get a user report, I think you’re running a web app like a hobby or side project, not like a professional operation — and if your web app is crucial to your business/organization/entity’s operations, mission, or public perception, it means you don’t look like you know what you’re doing. Many library ‘digital collections’ websites get relatively little use, and a disproportionate amount of that use comes from non-affiliated users who may not be motivated to figure out how to report problems; they just move on. If you don’t know about problems until they get reported by a human, a problem could have existed for literally months before you find out about it. (Yes, I have seen this happen.)

What do we need from monitoring?

What are some things we might want to monitor?

  • error logs. You want to know about every fatal exception (resulting in a 500 error) in your Rails app. This means someone was trying to do or view something, and got an error instead of what they wanted.
    • Even things that are caught without 500ing, but represent something not right that your app managed to recover from, you might want to know about. These often correspond to error or warn (rather than fatal) logging levels.
    • In today’s high-JS apps, you might want to get these from JS too.
  • Outages. If the app is totally down, it’s not going to be logging errors, but it’s even worse. The app could be down because of a software problem on the server, or because the server or network is totally borked, or whatever, you want to know about it no matter what.
  • pending danger or degraded performance.
    • Disk almost out of capacity.
    • RAM excessively swapping, or almost out of swap. (or above quota on heroku)
    • SSL certificates about to expire
    • What’s your median response time? 90th or 95th percentile? Have they suddenly gotten a lot worse than typical? This can be measured on the server (time the server took to respond to an HTTP request), or in the browser; browser measurements can capture actual perceived load time and time-to-onready.
  • More?
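As a toy illustration of what the “outage” piece boils down to, an HTTP uptime probe is conceptually just the following. This is a from-scratch sketch, not any particular product’s implementation; real services add retries, checks from multiple regions, and alerting on top.

```ruby
require "net/http"
require "uri"

# A status in the 2xx/3xx range counts as "up"; anything else is an outage.
def healthy_status?(code)
  (200..399).cover?(code)
end

# One probe of a URL. Network errors and timeouts count as down too,
# since an app that is unreachable can't log its own errors.
def probe(url, timeout: 5)
  uri = URI(url)
  res = Net::HTTP.start(uri.host, uri.port,
                        use_ssl: uri.scheme == "https",
                        open_timeout: timeout, read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  [healthy_status?(res.code.to_i), "HTTP #{res.code}"]
rescue StandardError => e
  [false, e.class.name]
end
```

A cron job running something like this against your app, emailing on a false result, is the hobbyist floor; the commercial services below do this and much more, reliably, from infrastructure that isn’t yours.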

Some good features of a monitoring environment:

  • It works, there are minimal ‘false negatives’ where it misses an event you’d want to know about. The monitoring service doesn’t crash and stop working entirely, or stop sending alerts without you knowing about it. It’s really monitoring what you think it’s monitoring.
  • Avoid ‘false positives’ and information overload. If the monitoring service is notifying or logging too much stuff, and most of it doesn’t really need your attention after all — you soon stop paying attention to it altogether. It’s just human nature, the boy who cried wolf. A monitoring/alerting service that staff ignores doesn’t actually help us run professional operations.
  • Sends you alert notifications (email, SMS, whatever) when things look screwy, configurable, and at least well-tuned enough to give you what you need and not more.
  • Some kind of “dashboard” that can give you overall project health, or an at-a-glance view of the current status, including things like “are there lots of errors being reported”.
  • Maybe uptime or other statistics you can use to report to your stakeholders.
  • Low “total cost of ownership”: you don’t want to have to spend hours banging your head against the wall to configure simple things or to get it working to your needs. The monitoring service that is easiest to use will get used the most, and again, a monitoring service nobody sets up isn’t doing you any good.
  • public status page of some kind, that provides your users (internal or external) a place to look for “is it just me, or is this service having problems” — you know, like you probably use regularly for the high quality commercial services you use. This could be automated, or manual, or a combination.

Self-hosted open source, or commercial cloud?

While one could imagine things like a self-hosted proprietary commercial solution, I find that in reality people are usually choosing between ‘free’ open source self-hosted packages, and for-pay cloud-hosted services.

In my experience, I have come to unequivocally prefer cloud-hosted solutions.

One problem with open-source self-hosted solutions is that it’s very easy to misconfigure them so they aren’t actually working. (Yes, this has happened to me.) You are also responsible for keeping them working — a monitoring service that is down does not help you. Do you need a monitoring service for your monitoring service now? What if a network or data center event takes down your monitoring service at the same time it takes down the service it was monitoring? Again, now it’s useless and you don’t find out. (Yes, this has happened to me too.)

There exist a bunch of high-quality commercial cloud-hosted offerings these days. Their prices are reasonable (if not cheap; but compare to your staff time in getting this right), their uptime is outstanding (let them handle their own monitoring of the monitoring service they are providing to you, avoid infinite monitoring recursion); many of them are customized to do just the right thing for Rails; and their UI’s are often great, they just tend to be more usable software than the self-hosted open source solutions I’ve seen.

Personally, I think if you’re serious about operating your web apps and services professionally, commercial cloud-hosted solutions are an expense that makes sense.

Some cloud-hosted commercial services I have used

I’m not using any of these currently at my current gig. And I have more experience with some than others. I’m definitely still learning about this stuff myself, and developing my own preferred stack and combinations. But these are all services I have at least some experience with, and a good opinion of.

There’s no one service (that I know of) that does everything I’d want in the way I’d want it, so it probably does require using (and paying for) multiple services. But price is definitely something I consider.

Bugsnag

  • Captures your exceptions and errors, that’s about it. I think Rails was their first target and it’s especially well tuned for Rails, although you can use it for other platforms too.
  • Super easy to include in your Rails project, just add a gem, pretty much.
  • Gives you stack traces and other (configurable) contextual information (logged in user)
  • But additional customization is possible in reasonable ways.
  • Including manually sending ‘errors’
  • Importantly, groups the same error repeatedly together as one line (you can expand), to avoid info overload and crying wolf.
  • Has some pretty graphs.
  • Lets you prioritize, filter, and ‘snooze’ errors to pop up only if they happen again after the ‘snooze’ time, and other features that again let you avoid info overload and actually respond to what needs responding to.
  • Email/SMS alerts in various ways including integrations with services like PagerDuty
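The Rails integration really is about this small. This is a sketch from memory of the bugsnag gem’s documented setup at the time — check their current docs before relying on it; the rescued error class and the risky_operation method are hypothetical stand-ins.

```ruby
# Gemfile
gem "bugsnag"

# config/initializers/bugsnag.rb
Bugsnag.configure do |config|
  config.api_key = "YOUR_API_KEY" # placeholder; from your bugsnag project settings
end

# Uncaught Rails exceptions are then reported automatically. You can also
# manually report handled errors you still want visibility into:
begin
  risky_operation # hypothetical method
rescue SomeRecoverableError => e
  Bugsnag.notify(e)
end
```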

Honeybadger

Very similar to bugsnag, it does the same sorts of things in the same sorts of ways. From some quick experimentation, I like some of its UX better, and some worse. But it does have more favorable pricing for many sorts of organizations than bugsnag — and offers free accounts “for non-commercial open-source projects”; not sure how many library apps would qualify.

Also includes optional integration with the Rails uncaught exception error page, to solicit user comments on the error that will be attached to the error report, which might be kind of neat.

Honeybadger also has some limited HTTP uptime monitoring functionality, ability to ‘assign’ error log lines to certain staff, and integration with various issue-tracking software to do the same.

All in all, if you can only afford one monitoring provider, I think honeybadger’s suite of services and price make it a likely contender.

(disclosure added in retrospect 28 Mar 2017. Honeybadger sponsors my rubyland project at a modest $20/month. Writing this review is not included in any agreement I have with them. )

New Relic

New Relic really focuses on performance monitoring, and is quite popular in Rails land, with easy integration into Rails apps via a gem. New Relic also has javascript instrumentation, so it can measure actual perceived-by-user browser load times, as affected by network speed, JS and CSS efficiency, etc. New Relic has a sophisticated alerting setup that allows you to get alerted when performance is below the thresholds you’ve set as acceptable, or below typical long-term trends (I think I remember this last one can be done, although can’t find it now).

The New Relic monitoring tool also includes some error and uptime/availability monitoring; for whatever reasons, many people using New Relic for performance monitoring seem to use another product for these features, and I haven’t spent enough time with New Relic to know why. (These choices were already made for me at my last gig, and we didn’t generally use New Relic for error or uptime monitoring.)

Statuscake

Statuscake isn’t so much about error log monitoring or performance, as it is about uptime/availability.

But Statuscake also includes a linux daemon that can be installed to monitor internal server state, not just HTTP server responsiveness. Including RAM, CPU, and disk utilization.  It gives you pretty graphs, and alerting if metrics look dangerous.

This is especially useful on a non-heroku/PaaS deployment, where you are fully responsible for your machines.

Statuscake can also monitor looming SSL cert expiry, SMTP and DNS server health, and other things in the realm of infrastructure-below-the-app environment.

Statuscake also optionally provides a public ‘status’ page for your users — I think this is a crucial and often neglected piece, that really makes your organization seem professional and meet user needs (whether internal staff users or external). But I haven’t actually explored this feature myself.

At my previous gig, we used Statuscake happily, although I didn’t personally have need to interact with it much.

Librato

My previous gig at the Friends of the Web consultancy used Librato on some projects happily, so I’m listing it here — but I honestly don’t have much personal experience with it, and don’t entirely understand what it does. It really focuses on graphing metrics over time though — I think. I think its graphs sometimes helped us notice when we were under a malware bot attack of various sorts, or otherwise were getting unusual traffic (in volume or nature) that should be taken account of. It can use its own ‘agents’, or accept data from other open source agents.

Heroku metrics

Haven’t actually used this too much either, but if you are deploying on heroku with paid heroku dynos, you already have some metric collection built in, for basic things like memory, CPU, server-level errors, deployment of new versions, server-side response time (not client-side like New Relic), and request timeouts.

You’ve got ’em, but you’ve got to actually look at them — and probably set up some notification/alerting on thresholds and unusual events — to get much value from this! So just a reminder that it is there, and one possibly budget-conscious option if you are already on heroku.

Phusion Union Station

If you already use Phusion Passenger for your Rails app server, then Union Station is Phusion’s non-free monitoring/analytics solution that integrates with Passenger. I don’t believe you have to use the enterprise (paid) edition of Passenger to use Union Station, but Union Station is not free.

I haven’t been in a situation using Passenger and prioritizing monitoring for a while, and don’t have any experience with this product. But I mention it because I’ve always had a good impression of Phusion’s software quality and UX, and if you do use Passenger, it looks like it has the potential to be a reasonably-priced (although priced based on number of requests, which is never my favorite pricing scheme) all-in-one solution monitoring Rails errors, uptime, server status, and performance (don’t know if it offers Javascript instrumentation for true browser performance).

Conclusion

It would be nice if there were a product that could do all of what you need, for a reasonable price, so you just need one. Most actual products seem to start focusing on one aspect of monitoring/notification — and sometimes try to expand to be ‘everything’.

Some products (New Relic, Union Station) seem to be trying to provide an “all your monitoring needs” solution, but my impression of the general ‘startup sector’ is that most organizations still put together multiple services to give them the complete monitoring/notification they need.

I’m not sure why. I think some of this is just that few people want to spend time learning/evaluating/configuring a new solution, and if they have something that works, or have been told by a trusted friend that it works, they stick with it. Also, perhaps the ‘all in one’ solutions don’t provide as good UX and functionality in the particular areas they weren’t originally focusing on as other tools that were originally focused there. And if you are a commercial entity making money (or aiming to do so), even an ‘expensive’ suite of monitoring services is a small percentage of your overall revenue/profit, and worth it to protect that revenue.

Personally, if I were starting from virtually no monitoring, in a very budget-conscious non-commercial environment, I’d start with Honeybadger (including making sure to use its HTTP uptime monitoring), and then consider adding one or both of New Relic or Statuscake next.

Anyone, especially from the library/cultural/non-commercial sector, want to share your experiences with monitoring? Do you do it at all? Have you used any of these services? Have you tried self-hosted open source monitoring solutions? And found them great? Or disastrous? Or somewhere in between?  Any thoughts on “Total cost of ownership” of self-hosted solutions, do you agree or disagree with me that they tend to be a losing proposition? Do you think you’re providing a professionally managed web app environment for your users now?

One way or another, let’s get professional on this stuff!

“This week in rails” is a great idea

At some point in the past year or two (maybe even earlier? But I know it’s not many years old) the Rails project started releasing roughly weekly ‘news’ posts, which mostly cover interesting or significant changes or bugfixes made to master (meaning they aren’t in a release yet, but will be in the next; I think the copy could make this more clear, actually!).

This week in Rails

(Note, aggregated in rubyland too!).

I assume this was a reaction to a recognized problem with non-core developers and developer-users keeping up with “what the heck is going on with Rails” — watching the github repo(s) wasn’t a good solution, too overwhelming, info overload. And I think it worked to improve this situation! It’s a great idea! It makes it a lot easier to keep up with Rails, and a lot easier for a non-committing or rarely-contributing developer to maintain an understanding of the Rails codebase. Heck, even for newbie developers, it can serve as pointers to areas of the Rails codebase they might want to look into and develop some familiarity with, or at least know exist!

I think it’s been quite successful at helping in those areas. Good job and thanks to Rails team.

I wonder if this is something the Hydra project might consider?  It definitely has some issues for developers in those areas.

The trick is getting the developer resources to do it of course. I am not sure who writes the “This Week in Rails” posts — it might be a different core team dev every week? Or what methodology they use to compile it, if they’re writing on areas of the codebase they might not be familiar with either, just looking at the commit log and trying to summarize it for a wider audience, or what.  I think you’d have to have at least some familiarity with the Rails codebase to write these well, if you don’t understand what’s going on yourself, you’re going to have trouble writing a useful summary. It would be interesting to do an interview with someone on Rails core team about how this works, how they do it, how well they think it’s working, etc.


searching through all gem dependencies

Sometimes in your Rails project, you can’t figure out where a certain method or config comes from. (This is especially an issue for me as I learn the sufia stack, which involves a whole bunch of interrelated gem dependencies).  Interactive debugging techniques like source_location are invaluable and it pays to learn how to use them, but sometimes you just want/need to grep through ALL the dependencies.

As several stackoverflow answers can tell you, you can do that like this, from your Rails (or other bundler-using) project directory:

grep "something" -r $(bundle show --paths)
# Or if you want to include the project itself at '.' too:
grep "something" -r . $(bundle show --paths)

You can make a bash function to do this for you; I’m calling it gemgrep for now. Put this in your bash configuration file (probably ~/.bash_profile on OSX):

gemgrep () { grep "$@" -r . $(bundle show --paths); }

Now I can just gemgrep something from a project directory, and search through the project AND all gem dependencies. It might be slow.

I also highly recommend setting your default grep to color in interactive terminals, with this in your bash configuration:

export GREP_OPTIONS='--color=auto'