It’s 2017, and I suggest that operating a web app professionally necessarily includes 1) finding out about problems before you get an error report from your users (or better yet, before they affect your users at all), and 2) having the right environment to be able to fix them as quickly as possible.
Finding out before an error report means monitoring of some kind, which is also useful for diagnosis to get things fixed as quickly as possible. In this post, I’ll be talking about monitoring.
If you aren’t doing some kind of monitoring to find out about problems before you get a user report, I think you’re running a web app like a hobby or side project, not like a professional operation — and if your web app is crucial to your business/organization/entity’s operations, mission, or public perception, it means you don’t look like you know what you’re doing. Many library ‘digital collections’ websites get relatively little use, and a disproportionate amount of that use comes from non-affiliated users who may not be motivated to figure out how to report problems; they just move on. If you don’t know about problems until a human reports them, a problem could have existed for literally months before you find out about it. (Yes, I have seen this happen.)
What do we need from monitoring?
What are some things we might want to monitor?
- error logs. You want to know about every fatal exception (resulting in a 500 error) in your Rails app. This means someone was trying to do or view something, and got an error instead of what they wanted.
- Even things that are caught without 500ing, but represent something not right that your app managed to recover from, you might want to know about. These often correspond to `warn` (rather than `fatal`) logging levels.
- In today’s high-JS apps, you might want to get these from JS too.
- Outages. If the app is totally down, it’s not going to be logging errors, but it’s even worse. The app could be down because of a software problem on the server, or because the server or network is totally borked, or whatever; you want to know about it no matter what.
- pending danger or degraded performance.
- Disk almost out of capacity.
- RAM excessively swapping, or almost out of swap. (or above quota on heroku)
- SSL certificates about to expire
- What’s your median response time? 90th or 95th percentile? Have they suddenly gotten a lot worse than typical? This can be measured on the server (time server took to respond to HTTP request), or on the browser, and actual browser measurements can include actual browser load and time-to-onready.
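To make the response-time metrics above concrete, here’s a minimal pure-Ruby sketch of computing percentiles from a set of request durations, using the nearest-rank method. (Illustrative only — any monitoring service computes these for you continuously; the method name and sample data are mine.)

```ruby
# Sketch: percentile calculation over request durations (milliseconds),
# nearest-rank method.
def percentile(durations, pct)
  return nil if durations.empty?
  sorted = durations.sort
  # Index of the smallest value such that pct% of samples fall at or below it.
  index = ((pct / 100.0) * sorted.length).ceil - 1
  sorted[[index, 0].max]
end

durations = [120, 95, 300, 110, 2500, 105, 130, 98, 115, 101]
percentile(durations, 50)  # => 110  (the typical request)
percentile(durations, 95)  # => 2500 (the slow tail)
```

Note how the single 2500ms outlier leaves the median untouched but dominates the 95th percentile — which is exactly why tail percentiles are worth watching alongside the median.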
Some good features of a monitoring environment:
- It works, there are minimal ‘false negatives’ where it misses an event you’d want to know about. The monitoring service doesn’t crash and stop working entirely, or stop sending alerts without you knowing about it. It’s really monitoring what you think it’s monitoring.
- Avoid ‘false positives’ and information overload. If the monitoring service is notifying or logging too much stuff, and most of it doesn’t really need your attention after all — you soon stop paying attention to it altogether. It’s just human nature, the boy who cried wolf. A monitoring/alerting service that staff ignores doesn’t actually help us run professional operations.
- Sends you alert notifications (email, SMS, whatever) when things look screwy, configurable and well-tuned enough to give you what you need and no more.
- Some kind of “dashboard” that can give you overall project health, or an at-a-glance view of current status, including things like “are there lots of errors being reported”.
- Maybe uptime or other statistics you can use to report to your stakeholders.
- Low “total cost of ownership”, you don’t want to have to spend hours banging your head against the wall to configure simple things or to get it working to your needs. The monitoring service that is easiest to use will get used the most, and again, a monitoring service nobody sets up isn’t doing you any good.
- public status page of some kind, that provides your users (internal or external) a place to look for “is it just me, or is this service having problems” — you know, like you probably use regularly for the high quality commercial services you use. This could be automated, or manual, or a combination.
Self-hosted open source, or commercial cloud?
While one could imagine things like a self-hosted proprietary commercial solution, I find that in reality people are usually choosing between ‘free’ open source self-hosted packages, and for-pay cloud-hosted services.
In my experience, I have come to unequivocally prefer cloud-hosted solutions.
One problem with open-source self-hosted, is that it’s very easy to misconfigure them so they aren’t actually working. (Yes, this has happened to me). You are also responsible for keeping them working — a monitoring service that is down does not help you. Do you need a monitoring service for your monitoring service now? What if a network or data center event takes down your monitoring service at the same time it takes down the service it was monitoring? Again, now it’s useless and you don’t find out. (Yes, this has happened to me too).
There exist a bunch of high-quality commercial cloud-hosted offerings these days. Their prices are reasonable (if not cheap; but compare to your staff time in getting this right); their uptime is outstanding (let them handle their own monitoring of the monitoring service they are providing to you, avoiding infinite monitoring recursion); many of them are customized to do just the right thing for Rails; and their UIs are often great: they just tend to be more usable software than the self-hosted open source solutions I’ve seen.
Personally, I think if you’re serious about operating your web apps and services professionally, commercial cloud-hosted solutions are an expense that makes sense.
Some cloud-hosted commercial services I have used
I’m not currently using any of these at my gig, and I have more experience with some than others. I’m definitely still learning about this stuff myself, and developing my own preferred stack and combinations. But these are all services I have at least some experience with, and a good opinion of.
There’s no one service (that I know of) that does everything I’d want in the way I’d want it, so it probably does require using (and paying for) multiple services. But price is definitely something I consider.
Bugsnag
- Captures your exceptions and errors; that’s about it. I think Rails was their first target and it’s especially well tuned for Rails, although you can use it for other platforms too.
- Super easy to include in your Rails project, just add a gem, pretty much.
- Gives you stack traces and other (configurable) contextual information (e.g. the logged-in user)
- But additional customization is possible in reasonable ways.
- Including manually sending ‘errors’
- Importantly, groups the same error repeatedly together as one line (you can expand), to avoid info overload and crying wolf.
- Has some pretty graphs.
- Lets you prioritize, filter, and ‘snooze’ errors to pop up only if they happen again after the ‘snooze’ time, and other features that again let you avoid info overload and actually respond to what needs responding to.
- Email/SMS alerts in various ways including integrations with services like PagerDuty
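For Rails, the basic setup really is about as small as “add a gem.” A minimal sketch, assuming you have an API key from your Bugsnag dashboard (`risky_operation` and `SomeRecoverableError` are placeholders for your own code):

```ruby
# Gemfile
gem 'bugsnag'

# config/initializers/bugsnag.rb
Bugsnag.configure do |config|
  config.api_key = ENV['BUGSNAG_API_KEY']
end

# Uncaught exceptions are reported automatically; you can also manually
# report an error you rescued yourself (a warn-level event, say):
begin
  risky_operation             # placeholder for your own code
rescue SomeRecoverableError => e
  Bugsnag.notify(e)
  # ...then recover gracefully
end
```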
Honeybadger
Very similar to Bugsnag, it does the same sorts of things in the same sorts of ways. From some quick experimentation, I like some of its UX better, and some worse. But it does have more favorable pricing for many sorts of organizations than Bugsnag — and offers free accounts “for non-commercial open-source projects”; I’m not sure how many library apps would qualify.
Also includes optional integration with the Rails uncaught exception error page, to solicit user comments on the error that will be attached to the error report, which might be kind of neat.
Honeybadger also has some limited HTTP uptime monitoring functionality, the ability to ‘assign’ error log lines to certain staff, and integration with various issue-tracking software to do the same.
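Manual reporting with extra context looks much like it does elsewhere; a sketch using the honeybadger gem’s notify/context calls (`current_user`, `risky_operation`, and `SomeRecoverableError` are placeholders):

```ruby
# Attach contextual info that rides along with any subsequent error reports
Honeybadger.context(user_id: current_user.id)   # current_user: placeholder

begin
  risky_operation                               # placeholder for your own code
rescue SomeRecoverableError => e
  Honeybadger.notify(e)
end
```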
All in all, if you can only afford one monitoring provider, I think honeybadger’s suite of services and price make it a likely contender.
(Disclosure, added in retrospect 28 Mar 2017: Honeybadger sponsors my rubyland project at a modest $20/month. Writing this review is not included in any agreement I have with them.)
New Relic
The New Relic monitoring tool also includes some error and uptime/availability monitoring; for whatever reasons, many people using New Relic for performance monitoring seem to use another product for these features. I haven’t spent enough time with New Relic to know why. (These choices were already made for me at my last gig, and we didn’t generally use New Relic for error or uptime monitoring.)
Statuscake
Statuscake isn’t so much about error log monitoring or performance as it is about uptime/availability.
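An uptime monitor needs a URL to poll, and it’s worth giving it one that exercises more than “the web server process is up.” A sketch of a dedicated health-check endpoint in Rails — the `/health` path and controller name are my own convention, not anything Statuscake requires:

```ruby
# config/routes.rb
get '/health', to: 'health#show'

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
    # Touch the database, so the check fails when the DB is down,
    # not only when the app process itself is dead.
    ActiveRecord::Base.connection.execute('SELECT 1')
    render plain: 'OK'
  rescue StandardError
    render plain: 'UNHEALTHY', status: :service_unavailable
  end
end
```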
But Statuscake also includes a linux daemon that can be installed to monitor internal server state, not just HTTP server responsiveness. Including RAM, CPU, and disk utilization. It gives you pretty graphs, and alerting if metrics look dangerous.
This is especially useful on a non-heroku/PaaS deployment, where you are fully responsible for your machines.
Statuscake can also monitor looming SSL cert expiry, SMTP and DNS server health, and other things in the realm of infrastructure-below-the-app environment.
Statuscake also optionally provides a public ‘status’ page for your users — I think this is a crucial and often neglected piece, that really makes your organization seem professional and meet user needs (whether internal staff users or external). But I haven’t actually explored this feature myself.
At my previous gig, we used Statuscake happily, although I didn’t personally have need to interact with it much.
Librato
My previous gig at the Friends of the Web consultancy used Librato happily on some projects, so I’m listing it here — but I honestly don’t have much personal experience with it, and don’t entirely understand everything it does. It really focuses on graphing metrics over time, though — I think. I think its graphs sometimes helped us notice when we were under a malware bot attack of various sorts, or otherwise getting unusual traffic (in volume or nature) that should be taken account of. It can use its own ‘agents’, or accept data from other open source agents.
Heroku metrics
I haven’t actually used these too much either, but if you are deploying on heroku with paid heroku dynos, you already have some metric collection built in, for basic things like memory, CPU, server-level errors, deployment of new versions, server-side response time (not client-side like New Relic), and request timeouts.
You’ve got ’em, but you’ve got to actually look at them — and probably set up some notification/alerting on thresholds and unusual events — to get much value from them! So just a reminder that they’re there, and one possibly budget-conscious option if you are already on heroku.
Union Station
If you already use Phusion Passenger for your Rails app server, then Union Station is Phusion’s non-free monitoring/analytics solution that integrates with Passenger. I don’t believe you have to use the enterprise (paid) edition of Passenger to use Union Station, but Union Station is not free.
It would be nice if there were a product that could do all of what you need, for a reasonable price, so you just need one. Most actual products seem to start focusing on one aspect of monitoring/notification — and sometimes try to expand to be ‘everything’.
Some products (New Relic, Union Station) seem to be trying to provide an “all your monitoring needs” solution, but my impression of the general ‘startup sector’ is that most organizations still put together multiple services to give them the complete monitoring/notification they need.
I’m not sure why. I think some of this is just that few people want to spend time learning/evaluating/configuring a new solution; if they have something that works, or that a trusted friend says works, they stick with it. Also, perhaps the ‘all in one’ solutions don’t provide UX and functionality as good, in the areas they weren’t originally focusing on, as tools that started out focused on those areas. And if you are a commercial entity making money (or aiming to do so), even an ‘expensive’ suite of monitoring services is a small percentage of your overall revenue/profit, and worth it to protect that revenue.
Personally, if I were getting started from virtually no monitoring, in a very budget-conscious non-commercial environment, I’d start with Honeybadger (including making sure to use its HTTP uptime monitoring), and then consider adding one or both of New Relic or Statuscake next.
Anyone, especially from the library/cultural/non-commercial sector, want to share your experiences with monitoring? Do you do it at all? Have you used any of these services? Have you tried self-hosted open source monitoring solutions? And found them great? Or disastrous? Or somewhere in between? Any thoughts on “Total cost of ownership” of self-hosted solutions, do you agree or disagree with me that they tend to be a losing proposition? Do you think you’re providing a professionally managed web app environment for your users now?
One way or another, let’s get professional on this stuff!