Gem review: HTTPClient is a nice http client for ruby

There are lots of gem options for making HTTP requests in ruby. Too many. In part that’s because the ruby stdlib net/http ends up being a bit too low-level an API for common use cases, and we ruby developers like our APIs at the appropriate level of abstraction for concise, DRY code.

After examination, I think httpclient is a really good general purpose option for a great many use cases, as I’ll try to show here. Contrary to somewhat popular belief, its performance is actually quite good in most cases, and better than its peers in many (sometimes significantly so), so long as you re-use an HTTPClient instance if you’re going to be doing lots of requests. HTTPClient instances are safe for multi-threaded concurrent use, so that’s no problem!

I am not sure how popular HTTPClient is compared to its peers, but it doesn’t get a lot of exposure in the ruby blog universe you find when google searching. I think HTTPClient’s main author, nahi, just isn’t much of a self-promoter. He did provide that giant chart of ruby http client options linked above, trying to show that HTTPClient is the most feature complete, but it’s kind of a wall of information that probably isn’t the most useful.

This is my attempt to rectify that. I like to know how things work, and think others do too, so this is kind of a code review of some aspects of implementation too. So many people have gotten some aspects of an http client wrong that I wanted to see what was going on under the hood to feel good about it, and get a feel for the code if I ever need to patch it. So, yeah, we get a bit long here. I try to write concisely and clearly, but I hope I have readers who don’t insist on “tl;dr” twitter-length reading material only!

One nice thing: it’s pure ruby, so you’ll never run into problems getting a C extension compiled on a box that’s missing certain dev headers or has the wrong version.

A Good API

Common cases are easy, less common cases are almost always still possible (and frequently easy). The docs are okay but could be better. Some fairly common cases:

http = HTTPClient.new
http.get "http://example.com?foo=bar"

# OR send query params in a hash, your choice!
http.get "http://example.com", "foo" => "bar"
# It won't convert hashes and arrays to url params for you, that's
# app-dependent, do that yourself. 
# You can use an array of key/value pairs to repeat the same
# query param key more than once. 

# Custom headers, no problem. Pass nil as the second
# argument to skip the query params.
http.get "http://example.com", nil, "Content-Type" => "application/xml"

# follow redirects? sure.
http.get('http://example.com/', :follow_redirect => true)

What you get back is an HTTP::Message object. That’s a class defined by httpclient; I’m not sure why it’s not namespaced under HTTPClient, and I got confused thinking it was stdlib at first.

I’m not sure if there’s a way to stream back the HTTP response, which you might need sometimes. I think there might be, but I’m not sure I’ve found the right thing yet. httpclient’s docs could be a bit better; that’s one critique.

You want to send a post or put with data? No problem.

http = HTTPClient.new

# for a post, second arg hash is x-www-form-urlencoded, as in
# an html form post, instead of query params to add onto URI. 
http.post "http://example.com", "foo" => "bar"

Doing a multipart file upload is also relatively straightforward, as is posting (or put’ing) your own string with your own content-type (say, when you have to post an XML body). See the docs under “How to Post”.

Other standard HTTP methods are also supported, `http.head`, `http.options`, etc.

If for some reason you need a non-standard extension HTTP method that httpclient’s api doesn’t recognize (‘PATCH’?), you might be able to use the #request method directly; I haven’t tried it. (All github sourcecode links in this blog post are fixed to the current codebase in ‘master’ at the time I’m writing this essay.)

By default, it does store and use cookies, within a certain HTTPClient instance. You can easily turn this off, to store no cookies, or give it a file system path to persist cookies between instances. (Yeah, it might be nice to be able to store cookies somewhere persistently other than the filesystem. pull request?)

It does handle HTTP authentication, and not just ‘basic’ — also ‘digest’ and even NTLM. It will send all traffic through an HTTP proxy if you want (I never have).

It does have an API for making http requests async in the background (using a ruby Thread), and checking on progress or waiting on completion. (See main docs under ‘invoking HTTP methods asynchronously’).

It even properly sets ruby 1.9 char encoding, based on http response headers, hooray.

It will even transparently uncompress gzip’d responses for you. 

It does support mocking under test using either bare WebMock or VCR.

It does a few other things ‘right’ (IMO) under the hood, that have given me trouble with other http clients.

Timeouts

Some other ruby http options have no timeout by default. Some (‘open-uri’ under ruby 1.8.7 I think!) don’t even have a way to set timeouts.

The problem with having no timeouts, is if the server you are talking to misbehaves and takes 60 seconds, 5 minutes, 10 minutes, who knows, to return a response — your app will happily hang waiting for it. (Yes, this has happened to me).  Just about any I/O operation needs a timeout.

httpclient has default timeouts, and makes it easy to set your own values too.

Best of all, it implements timeouts with a single ruby thread per HTTPClient instance, watching for all timeouts, even under multi-threaded use of HTTPClient. That’s the right way to do it. Implementations that use stdlib Timeout.timeout end up creating a timeout-watching thread for every invocation. Thread creation is expensive, and if you end up creating hundreds you’re adding a burden for the thread scheduler too.
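To make the single-watcher idea concrete, here is a rough sketch of it in plain ruby. This is not HTTPClient’s code: the class name `TimeoutWatcher` and its methods are made up for illustration, and it simply polls on a fixed interval, which is a simplification.

```ruby
require 'timeout'

# One background thread watches the deadlines of every caller, instead of
# spawning a fresh watcher thread per call. (Hypothetical class, not
# HTTPClient's implementation.)
class TimeoutWatcher
  Entry = Struct.new(:deadline, :thread)

  def initialize(poll_interval = 0.01)
    @lock = Mutex.new
    @entries = []
    @poll_interval = poll_interval
    @watcher = Thread.new { watch }
  end

  # Runs the block; raises Timeout::Error in the calling thread if the
  # block is still running after `seconds`.
  def with_timeout(seconds)
    entry = Entry.new(monotonic_now + seconds, Thread.current)
    @lock.synchronize { @entries << entry }
    begin
      yield
    ensure
      # Deregister by identity so the watcher can no longer interrupt us.
      @lock.synchronize { @entries.delete_if { |e| e.equal?(entry) } }
    end
  end

  private

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  # The single watcher loop: wake up, find expired entries, interrupt them.
  def watch
    loop do
      sleep @poll_interval
      @lock.synchronize do
        now = monotonic_now
        expired, @entries = @entries.partition { |e| e.deadline <= now }
        expired.each { |e| e.thread.raise(Timeout::Error, 'execution expired') }
      end
    end
  end
end
```

Any number of threads can call `with_timeout` on one shared `TimeoutWatcher`, and there is still only ever one extra thread. Interrupting a thread with `Thread#raise` has well-known sharp edges, so treat this as a picture of the design rather than something to drop into production.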

Persistent HTTP Connections

If a server supports HTTP 1.1 persistent connections, httpclient does store and re-use them for subsequent requests. It seems to do a decent job of catching if the server has closed the connection and re-establishing transparently.

Best of all for me, unlike some of its peers, it will even re-use persistent connections across threads, in a thread-safe way.

Does that matter?  Well, it depends on the use case. If you are actually manually creating your own threads, and you want to re-use persistent connections across them for efficiency, then of course it matters.  But even if you aren’t, it may matter. Let’s say you have a Rails application, and in an action method you do an HTTP request somewhere. Do you want persistent connections re-used across requests?  Whether each request winds up in a new thread may depend on your ruby app/web server choice (I believe mongrel, different threads; thin same thread; passenger not sure and it may change in passenger 3.2 or future versions), and rails config (config.threadsafe!).

If I’m making http requests in an action method (I am), and want persistent connections re-used between requests where possible (I do), I don’t want to have to think about the internals of my web server (which may change, if I change web servers or future versions do things differently), I want it to just work, regardless. So I want connections to be re-usable across threads.  httpclient will do that for me.

If you for some reason don’t want persistent connections shared between threads, simply don’t share an HTTPClient instance between threads.

persistent connection cleanup

One question I’ve had before when dealing with http clients that keep persistent connections — do I need to do anything to clean up these stored/cached connections when I’m done with them?  Is it bad if I just let them all ‘leak’?  I guess eventually the remote server would close the connection, but I’m still going to have my local ruby objects with socket connections sitting around, could I run out of local sockets or something?

In some implementations, depending on use, they’ll get garbage collected when they go out of scope, which I guess is good enough?  But since the whole point of persistent connections is to keep em around, I’m likely to want to keep my HTTPClient around, nothing’s gonna get garbage collected.  Some other implementations give you a way to manually tell the client to close all connections, but it can be tricky to find the right place to do that.

What does HTTPClient do? Well, every time you use a connection, HTTPClient will (thread-safely) examine its existing cached connections (for any host, not just the one you are about to make a request to), and clean up (properly closing their sockets) any that haven’t been used within a timeout value (by default 15 seconds, the same as what apache will timeout a persistent connection).

On the one hand, this will minimize trying to use a connection that’s been closed by the server. On the other, it will to some extent take care of ‘leaked’ connections, although only if/when you re-try a connection to the same host. This is probably good enough: it seems better than ‘leaking’ all connections, and at least it doesn’t seem to have caused any problems for current users. If it weren’t good enough, it seems feasible to just add your own timeout thread that calls `#scrub_cached_session` on an HTTPClient’s session object (you would probably have to `synchronize` that method; it’s not synchronized now, but it’s only called from inside synchronization in get_cached_session).
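The scrubbing idea is simple enough to sketch in a few lines of plain ruby. To be clear, this is my own illustration, not HTTPClient’s code: `IdleConnectionCache`, `CachedConn` and the injectable `clock` are hypothetical names, and the only things taken from the description above are the 15 second default and the "close anything idle too long, for any host" rule.

```ruby
# A cached connection plus the time it was last used.
CachedConn = Struct.new(:host, :socket, :last_used)

class IdleConnectionCache
  # `clock` is injectable only so the behaviour is easy to test.
  def initialize(keep_alive_timeout = 15, clock = -> { Time.now })
    @keep_alive_timeout = keep_alive_timeout
    @clock = clock
    @lock = Mutex.new
    @cached = []
  end

  # Store a connection for later re-use, stamping it with the current time.
  def store(host, socket)
    @lock.synchronize { @cached << CachedConn.new(host, socket, @clock.call) }
  end

  # Close and drop every cached connection (for any host) that has sat
  # idle longer than the keep-alive timeout. Returns how many were closed.
  def scrub
    @lock.synchronize do
      cutoff = @clock.call - @keep_alive_timeout
      stale, @cached = @cached.partition { |c| c.last_used < cutoff }
      stale.each { |c| c.socket.close }
      stale.size
    end
  end

  def size
    @lock.synchronize { @cached.size }
  end
end
```

With a structure like this, the "add your own timeout thread" option amounts to something like `Thread.new { loop { sleep 5; cache.scrub } }`, which works here because `scrub` takes the lock itself.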

no max/upper bound on connections

Do note that HTTPClient has no upper bound on how many persistent connections it’ll simultaneously open to a server. Ideally an upper bound might be nice, but it’s a pain to implement (esp thread-safely), I’m not sure if it’s necessary, and I don’t think any of its peers have this either. Remember that if you weren’t using a persistent-connection caching gem at all, for the same logic as whatever you’re doing, you’d still get as many or more simultaneous connections to the server, and you’d be putting more load on it.

an implementation detail on multi-threading

I was curious to see how HTTPClient managed thread-safety in sharing persistent connections between threads. And wanted to see the code to make sure it looked like it was indeed doing this. (I confess I can’t quite understand the possibly relevant tests).

I had considered how to write my own code to do this, and it seemed tricky.

In fact, it’s not: HTTPClient has a cleverly elegant solution. When a connection is ‘checked out’ from the store of cached persistent connections (in a synchronized block), it’s simply removed from the pool. (Yeah, slice! with an exclamation point actually removes the selected item(s) from the array. Took me a while to catch on there, don’t see that method much.)

So no other thread will see it until it’s ‘checked back in’.  If no cached connections are available, the caller will simply create a new one — and either way, “check it back in” to the pool when done. Quite elegant!  Note that if an uncaught exception is raised, the connection might never be checked back in and will be ‘leaked’ — with a DB connection pool that might be a problem, but with http connections, and especially with no upper bound on number of connections to be created, I don’t think it’s likely to cause a problem.
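Here is the check-out/check-in pattern boiled down to a toy, to show why it is thread-safe with so little code. This is my own sketch of the technique, not HTTPClient’s source: `CheckoutPool` and `Conn` are hypothetical names, and a real connection would wrap a socket rather than just a host name.

```ruby
# Stand-in for a real persistent connection.
Conn = Struct.new(:host)

class CheckoutPool
  def initialize(&factory)
    @factory = factory   # builds a new connection for a host
    @lock = Mutex.new
    @idle = []
  end

  # Remove an idle connection for this host from the pool, or build a new
  # one if none is available. While checked out, no other thread can see it.
  def checkout(host)
    conn = @lock.synchronize do
      index = @idle.index { |c| c.host == host }
      @idle.slice!(index) if index   # slice! removes and returns the element
    end
    conn || @factory.call(host)
  end

  # Put a connection back so a later checkout can re-use it.
  def checkin(conn)
    @lock.synchronize { @idle << conn }
  end

  def idle_count
    @lock.synchronize { @idle.size }
  end
end

pool = CheckoutPool.new { |host| Conn.new(host) }

first  = pool.checkout('example.com')  # pool is empty, so a new connection is built
second = pool.checkout('example.com')  # first is still checked out, so another is built
pool.checkin(first)
reused = pool.checkout('example.com')  # the idle connection is handed back out
```

The only shared state is the array of idle connections, and it is only touched inside the lock. A checked-out connection belongs to exactly one caller, so nothing about using it needs synchronizing.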

SSL

HTTPClient handles SSL ‘https’ requests automatically, with no special API, just pass in an ‘https’ URI. This is important to me (and a pain with stdlib net/http), I want to be able to switch http to https with config, without having to write separate paths for http/https.  Great!

HTTPClient by default will verify chain-of-trust and validity of server’s certs. You can easily turn this off if you want on a per-client basis, but you really don’t want to, even if you think you might. (But `client.ssl_config.verify_mode = OpenSSL::SSL::VERIFY_NONE` if you do.)

trusted cert store

HTTPClient takes an unusual approach to its trusted cert store though: rather than try to use the OS host’s trusted cert store, the gem itself actually ships with its own trusted cert store, based on what ships with Java.

As I’ve sometimes had to deploy on servers with out of date trusted cert stores (that don’t include, for instance DigiCert), I find this convenient.  And it is, apparently, what Java does.

But some people could consider this ‘wrong’.  You’re trusting the gem author to give you the right certs. Someone could theoretically man-in-the-middle your gem install, and insert their own hacked certs.  rubygems was just patched to fix a (long-standing?) bug making that kind of attack easier.

It is possible to clear HTTPClient’s default trusted cert store, and add your own. (Or just to add your own supplemental certs if you need to.) But my distros (all distros?) store their OS hosted trusted certs in a directory, not a single file. It’s not clear to me how to use add_trust_ca on a directory. What the heck is “a ‘c-rehash’ed directory name”? But I haven’t played around with it, and am not super familiar with OpenSSL.
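For what it’s worth, this is how the directory case looks when using ruby’s stdlib OpenSSL bindings directly, without HTTPClient in the picture. The calls below are plain `OpenSSL::X509::Store` methods. My reading is that add_trust_ca passes a directory along to the same kind of `add_path` call, but I haven’t verified that, so treat it as an assumption.

```ruby
require 'openssl'

store = OpenSSL::X509::Store.new

# Load whatever CA file and hashed CA directory this OpenSSL build
# considers its defaults (usually the OS-level trusted certs).
store.set_default_paths

# A "c_rehash'ed" directory holds one cert per file, each reachable under a
# name derived from a hash of its subject (something like 9d04f354.0), which
# is how OpenSSL looks certs up in a directory. add_path takes such a directory.
ca_dir = OpenSSL::X509::DEFAULT_CERT_DIR
store.add_path(ca_dir) if Dir.exist?(ca_dir)
```

Most distros already keep their system cert directory in that hashed layout, so pointing at it is usually all that’s needed. For a directory of your own certs, you’d run OpenSSL’s `c_rehash` tool over it first.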

It would be nice if there were a single API method you could use to tell HTTPClient to use your host-level standard OpenSSL trusted cert collection though, like most of the other HTTPS-knowing gems seem to do by default. It’s possible that actually exists, and just needs documentation.

You can also set your own client key when making an SSL request, if for instance you are using an HTTP api that will authenticate you by client key. That’s (IMO) a really nice way to do HTTPS api authentication, that hardly any APIs use, in part because few http client libraries make it easy to do.

Performance

I think some are (unnecessarily) worried about HTTPClient’s performance, as a result of Vincent Landgraf’s extraordinarily useful “HTTP Client performance shoot-out” a year ago. HTTPClient scored very poorly.

However, nahi, HTTPClient’s author, points out that the test creates an HTTPClient instance for each request, in its ‘inner loop’. It looks like it may have been an accident, since the code also creates an HTTPClient outside the inner loop! The other http clients tested that use the model of first creating a client and then using it were not treated the same way: for them, the test did create a single client outside the inner loop. So this was not an apples-to-apples test.

nahi did his own performance tests, which I can’t find now, showing HTTPClient did quite well. But I’ve also done my own performance benchmarks under a variety of scenarios, based on Vincent Landgraf’s code, and I too found that HTTPClient does quite well: in some usage scenarios better than its peers, and in all of them in the same vicinity, never the orders-of-magnitude problems Landgraf’s original test showed.

But my conclusion is that HTTPClient is quite sufficiently performant to use without worrying about it.  If you are making lots of HTTP requests, and care about performance, you do want to preserve and re-use a shared HTTPClient instance though. Since HTTPClient is thread-safe, you can even store an HTTPClient instance in a class-level global variable, regardless of your environment, without worry.

Conclusion

I suggest HTTPClient as a really solid http client solution that should be appropriate for a wide variety of usage scenarios. It’s got a good API to do almost everything anyone could need, it’s implemented well under the hood, and its performance is excellent. I need to do all sorts of diverse things with HTTP; almost all of my apps involve http client functionality, usually for a large variety of remote APIs. I don’t want to have to analyze my use case each time and pick a different client (switching if my use case changes slightly). I need a workhorse that will Just Work, for anything reasonable. HTTPClient looks like it, for me.

Most everything I’ve needed to do with an http client, HTTPClient does well. There are a few functions I could see myself needing (although I haven’t needed them yet) where it’s not entirely clear to me how or if HTTPClient supports them, including streaming and sending new non-standard HTTP methods. If I needed those functions, I’d still try to work with HTTPClient, making pull requests as needed (documentation or code), or just asking the authors or community for help if I couldn’t figure out how to do something. HTTPClient is good enough to stick with and fix if/when needed, rather than flitting around between http clients. Any client is going to have some bugs or missing features.

I think HTTPClient is worthy of becoming a de facto standard amongst the many, many, ruby http client options.  If you are writing a gem that needs to support http client requests itself, I’d encourage you to consider using HTTPClient under the hood.


3 Responses to Gem review: HTTPClient is a nice http client for ruby

  1. Pingback: ruby HTTP performance shootout redux | Bibliographic Wilderness

  2. Carson Cole says:

    Good piece, thanks. So, if using Threads to hold each request, should I spawn HTTPClient.new for each? When the thread completes, does this instance also disappear, freeing up any utilized memory? In my case, I am probing web services to see that they are operational, and I do a series of them at a time, and don’t want any one probe to hold back the checking of the others.

  3. jrochkind says:

    Paul, so you do NOT need to create a new HTTPClient instance for each thread. HTTPClient is thread-safe in such a way that different threads CAN share the same HTTPClient. Persistent connections are particular to a single HTTPClient instance, so if you want to share persistent HTTP connections across threads, you will want to re-use the same HTTPClient instance across threads. If you do NOT want to share persistent HTTP connections across threads for some reason, you would not want to share the HTTPClient instance.

    If a connection is in use by one thread, and another thread tries to use the same HTTPClient instance to connect to the same host — the second thread will NOT be held up, the HTTPClient instance will simply open a second connection to the host. (It will also not try to share the same http connection between threads, which is what some other ruby http libraries do, which is obviously not thread-safe, and leads to disaster).
