debugging in a complex system of inter-related components

…which are under various parties' control.

Eric Hellman recounts the process of tracking down an odd, hard-to-reproduce bug in a link resolver process, one he describes as a "funny joke", and indeed the particular nature of the bug is kind of funny, in a hard-to-describe, geeky kind of way.

But this got me thinking again about something I've been thinking about for a while: the added difficulty of tracking down bugs in the complex, inter-related 'chains' of software that are increasingly responsible for providing our library services. Chains of software from various sources, hosted in various places: open source, licensed, and free-with-no-contractual-support whatsoever.

What may not be hilarious is that if Eric hadn't known the top tech guy at the company, he might have sent it to their support, who would, 99 times out of 100, have found the problem too mysterious and closed the issue as non-reproducible. (Eric describes a number of other 99-out-of-100 odds that led to this bug, which affected probably around a million interactions, not being found or solved until now.) Meaning it could have taken until around 100,000,000 problem cases for it to be solved, instead of a million. (Actually, I think that math is wrong and it would take even longer for the problem to be solved, but I forget my probability arithmetic, and all these numbers are just made up anyway. Suffice it to say it would have made it even less likely for the problem ever to actually be discovered by those with the power to solve it. –added 14 Dec 2009)
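For what it's worth, here's the back-of-the-envelope version of that arithmetic. Every number in it is as made up as the ones above: the 1-in-100 odds of a report being taken seriously, and the one-report-per-million-problem-interactions rate, are illustrative guesses, not data.

```python
def expected_tries(p_success: float) -> float:
    """Mean of a geometric distribution: the expected number of
    independent attempts before the first success, when each
    attempt succeeds with probability p_success."""
    return 1.0 / p_success

# If support takes a hard-to-reproduce report seriously only
# 1 time in 100, you expect about 100 reports before one sticks.
reports_needed = expected_tries(1 / 100)

# And if (hypothetically) one report gets filed per ~1,000,000
# problem interactions, the expected number of interactions before
# anyone with the power to fix the bug even looks at it:
interactions_before_fix = reports_needed * 1_000_000

print(int(reports_needed))           # 100
print(int(interactions_before_fix))  # 100000000
```

Under this toy model the expectation happens to come out at the 100,000,000 figure; the "even longer" worry would apply if the long odds compound further down the chain, which they plausibly do.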

As a local systems-librarian type, it can be VERY frustrating troubleshooting problems that involve so many different pieces in a chain like this. With the link resolver I'm responsible for, we've got:

  1. (usually) the source of the openurl link.
  2. The link resolver itself. (In my case that consists of 2a, the Umlaut open-source front-end, and 2b, the proprietary licensed "knowledge base" product behind Umlaut.)
  3. Possibly CrossRef and/or Pubmed (which I often forget to consider; this example shows why not to forget that.)
  4. The target destination: the platform hosting the content you're directing the user to.
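To make the hand-off between #1 and #2 concrete: the source hands the resolver nothing but a URL, a citation serialized as key/value pairs. A minimal sketch, where the resolver hostname and the citation are made up, and the field names follow the OpenURL 1.0 (NISO Z39.88-2004) KEV journal format:

```python
from urllib.parse import urlencode

# Hypothetical resolver base URL (e.g. an Umlaut install).
RESOLVER_BASE = "https://resolver.example.edu/umlaut"

# An invented citation, expressed as Z39.88-2004 KEV fields.
citation = {
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    "rft.jtitle": "Journal of Made-Up Results",
    "rft.issn": "1234-5678",
    "rft.volume": "12",
    "rft.issue": "3",
    "rft.spage": "45",
    "rft.date": "2009",
}

openurl = RESOLVER_BASE + "?" + urlencode(citation)
print(openurl)
```

When one of these links misbehaves, comparing what the source actually put in those rft.* fields against what the resolver thinks it received is often the first narrowing step in the debugging chain.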

In some cases there can be a couple more pieces too. #1, the source of the link, can be our own federated search product, which then subdivides into:

  1. The source database being searched by the federated search product.
  2. The Xerxes open-source federated search front-end we use.
  3. The proprietary licensed federated search engine backing Xerxes up.

So we’ve got up to 8 separate components involved that could be the source of the problem. Some of these components are entirely accessible to my debugging (the open source, self-hosted ones); some are partially accessible (the proprietary licensed components that are still self-hosted); others are only barely accessible (the free or licensed components in ‘the cloud’, where I can use Live HTTP Headers to examine the HTTP transactions, and that’s about it).

So it can seriously take me many hours to get to the bottom of a reported problem — or as far to the bottom as I can get. Sometimes you just end up against the walls of a licensed/cloud “black box”, sometimes in a way that lets me be pretty sure _which_ “black box” component is at fault (maybe or maybe not with any detailed insight into exactly _how_), but other times not being able to decisively narrow it down, for lack of ability to peer inside the black box. And that many hours is only because I’ve become pretty darn familiar with these systems; if it were the me of three years ago instead of today, those many hours would possibly be many days instead.

If the component at fault is an open source component, then figuring out the problem was the hard part; the subsequent fix, 9 times out of 10, is easy.

But if the component at fault is one of the licensed products (locally hosted or in the cloud), then those many hours of debugging were the EASY part. The hard part is getting the vendors of the problematic component to pay attention: figuring out the proper support channel (easy when it’s the link resolver or federated search engine; harder when it’s one of the many dozens of licensed platforms we have, any one of which we only very occasionally need to contact support about); convincing them there really IS a problem (even though it’s hard to reproduce), and that it really IS their software’s fault (many times they’ll try to blame it on another component in the chain). Being successful at this can take many more (even less fun) hours, no exaggeration. (And it often involves being pushed up a support chain once or twice, at which point you may have to start from scratch and re-deploy the evidence you had already deployed at a lower level of support, to re-convince a new person of what’s up.)

And once you’ve succeeded there, what happens is you’re told “This problem has been forwarded to our developers.”  And you get to hope that they’ll get around to fixing it some time before you retire, and maybe even courteously let you know when they do.

Hmm, this joke isn’t actually so funny to me anymore.

And it gives one some understanding of why, 99 times out of 100, the Local Expert or Responsible Party finds an excuse to dismiss the problem instead of getting to the bottom of it. Because even once you’ve gotten to the bottom of it (and I actually agree with Eric that this kind of debugging is kind of fun), the ACTUAL hard work has barely begun (and I don’t find interminable exchanges with support people, where you try to convince them there really is a problem and it really is likely in their product, to be much fun at all). And all this is for a bug where you’re actually not sure exactly how many users it’s affecting, and which is just one of probably many such “heisenbugs” your software exhibits. Geez, is it really worth it? Except the whole problem is that these many hard-to-reproduce bugs add up, meaning any given user has (maybe? all of these 1-in-100 things are just guesses, not based on any actual evidence) a good chance of running into a few of them in any given, say (again making this up), month of use. Phew.
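The "add up" intuition is easy to sanity-check with the same style of made-up numbers (ten distinct 1-in-100 heisenbugs across the chain, and fifty uses a month, are inventions for illustration, like everything else here):

```python
n_bugs = 10          # hypothetical: distinct 1-in-100 heisenbugs in the chain
p_each = 1 / 100     # chance any one bug bites on a given use
uses_per_month = 50  # hypothetical uses by one patron in a month

# Probability a single use dodges every bug, then that a whole
# month of uses stays clean, then the complement: the chance the
# patron hits at least one heisenbug this month.
p_clean_use = (1 - p_each) ** n_bugs
p_clean_month = p_clean_use ** uses_per_month
p_hit_this_month = 1 - p_clean_month

print(round(p_hit_this_month, 3))  # ≈ 0.993 with these toy numbers
```

Even with each individual bug at long odds, the chance of a given user hitting at least one in a month comes out near certainty under these toy numbers, which is exactly the "they add up" problem.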

Now, the good news is that lately I’ve noticed some of our vendors getting better at acknowledging, understanding, and even fixing bugs when I report them. I don’t know if it’s entirely fair to mention them by name, because the fact that I have developed such good relationships with their support and developers means their products somewhat regularly have problems I find. But, hey, nothing is bug free, and the positive is that they are unusually easy to work with and generally (eventually) fix the problem. So I’ll mention both Jeff Lang at Thomson Reuters and Mason Golden at Gale as being absolutely a pleasure to work with. And Chuck Koscher at CrossRef is also pretty great. This list isn’t meant to be exclusive, but I feel like I complain so much, it’s worth pointing out by name some people at vendors who have been really responsive, easy to communicate with, and who actually take responsibility for getting a fix in on their end and letting us know when it’s been made. (There may very well be such people at nearly every vendor we license something from — but finding and developing a relationship with the right person at each of the dozens of vendors we deal with is a chore of its own; they aren’t necessarily who you first get connected to when you file a support ticket.)


2 thoughts on “debugging in a complex system of inter-related components”

  1. I think this is an issue for individual users as much as it is for large institutional service providers, and will become an even bigger issue in the days ahead as we move more and more of our “stuff” into the cloud and rely on a variety of service providers and – as you rightly point out – LINK all of these services and “stuff” hosters together.

    I’ve long since lost track of all the accounts I have out on the web, and as I use more and more widgets and plugins and filters and automatic notifiers and whosits and whatsits, it’s become ever more difficult to track down the source of an issue when something “goes wrong” or doesn’t behave as I expect it to. I’m not sure I have any answers, other than that as an end user I’m trying to do a better job of at least _documenting_ some of these services/links, so when something goes wrong I at least have a list of places to look. But as the complexity of the systems and services we use proliferates, I think problems are going to be ever harder to track down.

  2. At least you feel like you can do something. I can’t even go and kick a box. In “public services” all I can do is e-mail you (and you respond immediately, thank you!) or our IT desk (response varies) or heaven help me, the vendor. When none of these work, I blog or tweet it or put it on a listserv and then I usually get a call or e-mail within about an hour if it’s the vendor’s fault.
