Marc export speed-up by piping to avoid the file system

In order to index our Marc records in Blacklight, we need to export them from the ILS, and then send them to the indexer (currently SolrMarc) to map them to the Solr index.

We sometimes do this with our entire corpus. (Our actual production indexing strategy is still in formation.) The straightforward way to do this is to export a giant marc file to disk, and then sic SolrMarc on that file.
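
In rough outline, that two-step version looks something like this (a minimal rake-task sketch; the exporter and SolrMarc command lines here are placeholders, not our actual invocations):

    # Hypothetical two-step version: export everything to a file on disk,
    # then run SolrMarc over that file. Command names and flags are made up.
    desc "Export Marc from the ILS to disk, then index the file with SolrMarc"
    task :index_via_file do
      sh "ils_marc_export --public-only > /tmp/full_export.mrc"            # assumed exporter CLI
      sh "java -jar SolrMarc.jar config.properties /tmp/full_export.mrc"   # assumed SolrMarc invocation
      rm "/tmp/full_export.mrc"   # big intermediate file we have to clean up
    end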

But there are significant speed gains from piping the export directly to SolrMarc. The export and SolrMarc can run in parallel (probably the main advantage), and you also skip the disk writes and reads for that intermediate file you don't really need (possibly some perf advantage, plus you don't need a bunch of disk space or have to clean up after yourself). It's also just convenient to be able to run one command (a rake task I made) and essentially index our ILS without having to worry about intermediate files. (There are logs written for both the exporter and SolrMarc, which the automated process scans for error messages, alerting the operator. Or at least it will soon.)

SolrMarc can read from stdin, and our ILS Marc exporter can write to stdout, so it's easy to chain them together with a standard unix pipe.
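
So the piped version is just those same two commands joined with a pipe; here's a minimal sketch of the rake task, again with placeholder command lines:

    # Hypothetical piped version: the exporter writes Marc to stdout and
    # SolrMarc reads it from stdin, so no intermediate file ever hits disk.
    desc "Export Marc from the ILS and pipe it straight into SolrMarc"
    task :index_via_pipe do
      sh "ils_marc_export --public-only --stdout | " \
         "java -jar SolrMarc.jar config.properties"   # both command lines are assumptions
    end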

I did a test export of a 500k range of bibs in our ILS. There are at most 490756 records in that range, possibly fewer since we're only exporting records marked 'public', but I haven't counted the exact number; I'll use 490756 as my number to calculate estimated records-per-second stats.

  • Marc export to disk: 45 minutes (approx 182 records/second)
  • SolrMarc indexing of that marc file from disk: 71 minutes (approx 115 records/second)
  • Export to disk plus SolrMarc, sequentially: 45 + 71 = 116 minutes, 6960 seconds (approx 70 records/second)
  • Export piped directly to SolrMarc: 68 minutes (approx 120 records/second)
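
(Those per-second figures are just the maximum possible record count divided by elapsed seconds, e.g.:)

    records = 490756
    puts records / (45 * 60.0)          # export to disk:      ~181.8/sec
    puts records / (71 * 60.0)          # SolrMarc from disk:  ~115.2/sec
    puts records / ((45 + 71) * 60.0)   # two-step total:      ~70.5/sec
    puts records / (68 * 60.0)          # piped run:           ~120.3/sec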

Now, my Marc export isn't including item/copy/holdings information yet, which might slow the export down some once it does. But since the limiting factor in the pipe seems to be SolrMarc, that probably won't slow down the piped run, unless it makes the export even slower than SolrMarc!

The fact that the piped run is slightly faster than SolrMarc all by itself may indicate some benefit to not having to hit the disk, or it could just be sampling error; I only ran this thing once.

It's good that SolrMarc is the limiting factor, because there are various ways we can speed up our indexing (using multiple threads, optimizing the code generally, throwing more hardware at it), but there aren't very many good ways to speed up our ILS export. So that gives us some room for improvement, probably down to roughly the limit of the exporter process by itself, approx 182 records/second. Although that's just against our test ILS; we'll see if it gets any faster against production, but I doubt it, since I think most of the bottleneck in the exporter is actually creating the marc records, not the database calls. (The exporter is proprietary third-party code, so we can't optimize it as easily, although we DO have a source license. But I hate working with Java these days, personally.)

Wait, I could, on a multi-processor machine, partition the export of my corpus into separate processes, each taking a portion of it and piping to SolrMarc. That'd work, if the limiting factor in the export isn't the db, and if I ever have enough metal for that.
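
Something like this, maybe (a very rough sketch; the bib-range flag and command lines are invented for illustration):

    # Hypothetical partitioned indexing: split the bib range into slices and
    # run one exporter-to-SolrMarc pipe per slice, in parallel child processes.
    slices = [1..125_000, 125_001..250_000, 250_001..375_000, 375_001..500_000]

    pids = slices.map do |slice|
      # Each child process runs its own pipe over one slice of the corpus.
      Process.spawn("ils_marc_export --range #{slice.first}-#{slice.last} --stdout | " \
                    "java -jar SolrMarc.jar config.properties")
    end
    pids.each { |pid| Process.wait(pid) }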
