very rough benchmarking of Solr update batching performance characteristics

In figuring out how I want to integrate a synchronized Solr index into my Rails application, I am doing some very rough profiling/benchmarking of batching Solr adds vs not, just to get a general sense of it.

(These are all _very rough estimates_ and may depend a lot on your environment and Solr setup, including how many records you have in Solr, whether Solr is simultaneously being used for queries, etc.)

One thing Solr (or ElasticSearch) integration packages sometimes end up concentrating on is batching multiple index-change-needed events into fewer Solr update requests.

Based on my observations, I think it’s not actually the separate HTTP requests that are expensive (although I’m benchmarking against a Solr on localhost, where network latency is negligible).

But the commits are, if you are doing them. In my benchmarks reindexing a whole bunch of things, if I’m not doing any commits, whether or not I batch into fewer HTTP update requests to Solr has no appreciable effect on speed.

But sending a softCommit per record/update makes it around 2.5x slower than sending no commits at all.

Sending a (hard) commit per record makes it around 4x slower than sending no commits.
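To make the three cases concrete, here is a minimal sketch of how each kind of update request could be built with Ruby's standard library. This is not the code I benchmarked with: the core name `mycore`, the port, and the helper names `build_solr_add` and `send_solr_request` are all assumptions for illustration. The `commit` and `softCommit` URL parameters are the ones Solr's update handler accepts.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical local core; adjust host, port, and core name for your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"

# Build (but do not send) a POST that adds `docs` to Solr.
# `commit` selects the behavior compared above:
#   :none - no commit parameter, visibility is left to Solr's own settings
#   :soft - softCommit=true on the request
#   :hard - commit=true on the request
def build_solr_add(docs, commit: :none, url: SOLR_UPDATE_URL)
  uri = URI(url)
  params = case commit
           when :none then {}
           when :soft then { "softCommit" => "true" }
           when :hard then { "commit" => "true" }
           else raise ArgumentError, "unknown commit mode: #{commit.inspect}"
           end
  uri.query = URI.encode_www_form(params) unless params.empty?

  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(docs) # a JSON array of documents
  request
end

# Send a request built by build_solr_add to the host it was built for.
def send_solr_request(request)
  uri = request.uri
  Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
end
```

Passing one document per call versus many documents per call is the "batching" axis; the `commit:` argument is the axis that actually made a difference in my numbers.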

Even without explicit commit directives, if you have your Solr set up to autocommit (soft or hard), it may of course occasionally pause to do some commits, so your measured time may depend on whether you hit one of those.

So if you don’t care about realtime/near-realtime, you may not have to care about batching. I had already gotten the sense from Solr’s documentation that Solr really prefers that the client never send commits, and instead let Solr’s autoCommit/autoSoftCommit/commitWithin settings make sure updates become visible within a certain maximum amount of time. The reason to have the client send commits is generally that you need to guarantee the updates will be visible to queries as soon as your code doing the update is finished.
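As a rough illustration of the "let Solr decide" approach from the client side, an update request can carry a `commitWithin` parameter (in milliseconds) instead of an explicit commit. The helper name and the example URL below are assumptions, not something from my benchmark code.

```ruby
require "uri"

# Build an update URI that asks Solr to make the update visible within
# `max_delay_ms` milliseconds, rather than forcing an immediate commit.
# `base_url` is a hypothetical update endpoint for your core.
def solr_update_uri_with_commit_within(base_url, max_delay_ms)
  unless max_delay_ms.is_a?(Integer) && max_delay_ms.positive?
    raise ArgumentError, "max_delay_ms must be a positive integer"
  end

  uri = URI(base_url)
  uri.query = URI.encode_www_form("commitWithin" => max_delay_ms)
  uri
end
```

For example, `solr_update_uri_with_commit_within("http://localhost:8983/solr/mycore/update", 10_000)` would ask for visibility within ten seconds, which trades immediate visibility for fewer commits.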

The reason so many end up caring about batching updates might not be because individual HTTP requests to Solr are a problem, but because too many _commits_ are. So if for some reason it was more convenient, only sending a commit per X records might be just as good as actually batching HTTP requests, if you have to send commits from the client at all.
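A minimal sketch of the "commit per X records" idea, assuming a hypothetical `client` object that responds to `add(doc)` and `commit`. Neither method is taken from a real Solr library; substitute whatever your integration uses. Each record still goes out as its own update, and only the commits are throttled.

```ruby
# Sends every record as its own update, but commits only once per
# `commit_every` records. `client` is a hypothetical wrapper exposing
# `add(doc)` and `commit`.
class PeriodicCommitIndexer
  def initialize(client, commit_every:)
    raise ArgumentError, "commit_every must be positive" unless commit_every.positive?

    @client = client
    @commit_every = commit_every
    @pending = 0 # updates sent since the last commit
  end

  # Send one update, committing if the threshold has been reached.
  def add(doc)
    @client.add(doc)
    @pending += 1
    flush if @pending >= @commit_every
  end

  # Commit anything sent since the last commit. Call this once at the
  # end so a trailing partial group still becomes visible.
  def flush
    return if @pending.zero?

    @client.commit
    @pending = 0
  end
end
```

With `commit_every: 100`, indexing 1,000 records would send 1,000 update requests but only 10 commits. I have not benchmarked this exact arrangement.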
