Updating SolrCloud configuration in ruby

We have an app that uses Solr. We currently run Solr in legacy “standalone” (not cloud) mode. Our Solr configuration directory lives on disk on the Solr server, and it’s up to our processes to get our desired configuration there, and to update it when it changes.

We are in the process of moving to Solr in “SolrCloud mode”, probably via the SearchStax managed Solr service. Our Solr “cloud” might only have one node, but SolrCloud mode gives us access to additional APIs for managing our Solr configuration, as opposed to writing it directly to disk (which may not be possible at all in SolrCloud mode, and certainly isn’t when using managed SearchStax).

Specifically, the Solr ConfigSets API, although you might also want to use a few pieces of the Collections API for associating a configset with a Solr collection.

Basically, you take your desired Solr config directory, zip it up, and upload it to Solr as a “config set” [or “configset”] with a certain name. Then you can create collections using this configset, or reassign which named configset an existing collection uses.
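To make that concrete, here is a minimal sketch of the HTTP endpoints involved. The `action=UPLOAD`/`LIST`/`DELETE` endpoints are from the standard Solr ConfigSets API; the `ConfigsetRequests` module name and the example base URL are just mine for illustration:

```ruby
require "uri"

# Sketch of the ConfigSets API as plain HTTP: each operation is just a URL,
# plus (for upload) the zipped config directory as the POST body.
module ConfigsetRequests
  module_function

  # POST this URI with the zip bytes as the request body
  # (Content-Type: application/octet-stream) to upload a configset.
  def upload_uri(solr_url, name)
    URI("#{solr_url}/admin/configs?action=UPLOAD&name=#{URI.encode_www_form_component(name)}")
  end

  # GET this URI to list all configset names.
  def list_uri(solr_url)
    URI("#{solr_url}/admin/configs?action=LIST")
  end

  # GET this URI to delete a named configset.
  def delete_uri(solr_url, name)
    URI("#{solr_url}/admin/configs?action=DELETE&name=#{URI.encode_www_form_component(name)}")
  end
end

puts ConfigsetRequests.upload_uri("https://example.com/solr", "myConfigset")
```

The updater object described below is essentially a wrapper around URLs like these, plus zipping and error handling.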

I wasn’t able to find any existing Ruby gems for interacting with these Solr APIs. RSolr is a “ruby client for interacting with solr”, but it was written before most of these administrative APIs existed, and doesn’t seem to have been updated to deal with them (unless I missed it). RSolr seems to be mostly about querying Solr, plus some limited indexing support.

But no worries: it’s not too hard to wrap the specific APIs I want to use in some Ruby, which seemed far better to me than hand-writing the HTTP requests each time (and making sure to deal with errors, etc.!). (And yes, I will share the code with you.)

I decided I wanted an object bound to a particular Solr collection at a particular Solr instance, backed by a particular local directory of Solr config. That worked well for my use case, and I wound up with an API that looks like this:

updater = SolrConfigsetUpdater.new(
  solr_url: "https://example.com/solr",
  conf_dir: "./solr/conf",
  collection_name: "myCollection"
)

# will zip up ./solr/conf and upload it as a configset named myConfigset:
updater.upload("myConfigset")

updater.list #=> ["oldConfigSet", "myConfigset"]
updater.config_name # what configset name is myCollection currently configured to use?
# => "oldConfigSet"

# what if we try to delete the one it's using?
updater.delete("oldConfigSet")
# => raises SolrConfigsetUpdater::SolrError with message:
# "Can not delete ConfigSet as it is currently being used by collection [myCollection]"

# okay let's change it to use the new one and delete the old one

updater.change_config_name("myConfigset")
# now myCollection uses this new configset, although we possibly
# need to reload the collection to make that so
updater.reload
# now let's delete the one we're not using
updater.delete("oldConfigSet")

OK, great. There were some tricks involved in catching the apparently multiple ways Solr can report different kinds of errors, to make sure Solr-reported errors turn into exceptions, ideally with good error messages.
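To give a flavor of what that normalization can look like, here is a hedged sketch (not the post’s actual code; the method and exception names are made up, and the exact response shapes vary by Solr API and version):

```ruby
# Hypothetical error normalization for parsed Solr JSON responses.
class SolrError < StandardError; end

# Given a parsed JSON response body (a Hash), try to dig an error message
# out of the several shapes Solr responses can use; returns nil if the
# response looks successful.
def extract_solr_error_message(body)
  return body.dig("error", "msg") || body["error"].to_s if body["error"]
  exception = body["exception"]
  return (exception.is_a?(Hash) ? exception["msg"] : exception.to_s) if exception
  # Some Collections API operations report per-node problems under "failure".
  return body["failure"].to_s if body["failure"]
  nil
end

body = {
  "responseHeader" => { "status" => 400 },
  "error" => { "msg" => "Can not delete ConfigSet as it is currently being used by collection [myCollection]", "code" => 400 }
}
message = extract_solr_error_message(body)
puts message # this is the message we'd wrap in a SolrError and raise
```

The real code raises a `SolrConfigsetUpdater::SolrError` whenever a message like this is found (or the HTTP status is not 200).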

Now, in addition to uploading a configset initially for a collection you are creating, the main use case I have is wanting to UPDATE the configuration of an existing collection to new values. Sure, this often requires a reindex afterwards.

If you have the recently released Solr 8.7, it will let you overwrite an existing configset, so this can be done pretty easily:

updater.upload(updater.config_name, overwrite: true)
updater.reload

But prior to Solr 8.7 you cannot overwrite an existing configset, and SearchStax doesn’t yet have Solr 8.7. So one way or another, we need to do a dance where we upload the configset under a new name, then switch the collection to use it.

Having this updater object that lets us easily execute the relevant Solr APIs makes it easy to experiment with different logic flows for this. For instance, in a Solr listserv thread, Alex Halovnic suggests a somewhat complicated 8-step workaround, which we can implement like so:

current_name = updater.config_name
temp_name = "#{current_name}_temp"

updater.create(from: current_name, to: temp_name)
updater.change_config_name(temp_name)
updater.reload
updater.delete(current_name)
updater.upload(current_name)
updater.change_config_name(current_name)
updater.reload
updater.delete(temp_name)

That works. But when I talked to Dann Bohn at Penn State University, he shared a different algorithm, which goes like this:

  • Make a cryptographic digest of the entire Solr config directory, which we’re going to use in the configset name.
  • Check if the collection is already using a configset named $name_$digest; if it already is, you’re done, no change needed.
  • Otherwise, upload the configset with the fingerprint-based name, switch the collection to use it, reload, and delete the configset the collection used to use.

At first this seemed like overkill to me, but after thinking about it and experimenting with it, I like it! It is really quick to make a digest of a handful of files; that’s not a big deal. (I use the first 7 chars of the hex SHA256.) And even if we had Solr 8.7, I like that we can avoid doing any operation on Solr at all if there have been no changes; I really want to use this operation much like a Rails db:migrate, running it on every deploy to make sure the Solr schema matches the one in the repo for that deploy.
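The fingerprint step itself is tiny. Here is a hedged sketch (the method name is mine, not from the actual code) using Ruby’s stdlib Digest, hashing each file’s relative path and contents in sorted order so the same content always yields the same digest:

```ruby
require "digest"

# Digest every file under the config directory, in a stable sorted order,
# including each file's relative name so renames change the fingerprint too.
def config_digest(conf_dir)
  digest = Digest::SHA256.new
  Dir.glob("#{conf_dir}/**/*").sort.each do |path|
    next unless File.file?(path)
    digest << path.sub("#{conf_dir}/", "") # relative file name...
    digest << File.read(path)              # ...and file contents
  end
  digest.hexdigest[0, 7] # first 7 hex chars, as mentioned above
end
```

The update then reduces to: if `updater.config_name` already ends with this digest, do nothing; otherwise upload under the new fingerprinted name, switch the collection to it, reload, and delete the old configset.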

Dann also shared his open source code with me, which was helpful for seeing how to make the digest, how to make a zip file in Ruby, etc. Thanks Dann!

Sharing my code

So I also wrote some methods to implement those variant updating strategies: Dann’s, Alex Halovnic’s from the list, etc.

I thought about wrapping this all up as a gem, but I didn’t really have the time to make it good enough for that. My API is a little bit janky; I didn’t spend the extra time thinking it out really well to minimize the need for future backwards-incompatible changes, like I would if it were a gem. I also couldn’t figure out a great way to write automated tests for this that I would find particularly useful; so in my code base it’s actually not currently test-covered (shhhhh), but in a gem I’d want to solve that somehow.

But I did try to write the code to be general-purpose and flexible so other people could use it for their own use cases; I tried to document it to my highest standards; and I put it all in one file, which might not be the best OO abstraction/design, but makes it easier for you to copy and paste the single file for your own use. :)

So you can find my code here; it is Apache-licensed; and you are welcome to copy and paste it and do whatever you like with it, including making a gem yourself if you want. Maybe I’ll get around to making it a gem in the future myself, I dunno; curious if there’s interest.

The SearchStax proprietary APIs

SearchStax has its own APIs that, I think, can be used for updating configsets and setting collections to use certain configsets, etc. When I started exploring them, I found they aren’t the worst vendor APIs I’ve seen, but they are a bit cumbersome to work with. The auth system involves a lot of steps (why can’t you just create an API key from the SearchStax web GUI?).

Overall I found them harder to use than just the standard SolrCloud APIs, which worked fine in the SearchStax deployment, and have the added bonus of being transferable to any SolrCloud deployment instead of being SearchStax-specific. While the SearchStax docs and support try to steer you to the SearchStax-specific APIs, I don’t think there’s really any good reason for this. (Perhaps the custom SearchStax APIs were written long ago, when Solr APIs weren’t as complete?)

SearchStax support suggested that the SearchStax APIs were somehow more secure; but my SearchStax Solr APIs are protected behind HTTP basic auth, and if I’ve created basic auth credentials (or an IP address allowlist), those APIs will be available to anyone with auth to access Solr, whether I use them or not! Support also suggested that SearchStax API use would be logged, whereas my direct Solr API use would not be. That seems to be true, at least in the default setup; I could probably configure Solr logging differently, but it just isn’t that important to me for these particular functions.

So after some initial exploration with the SearchStax APIs, I realized that the SolrCloud APIs (which I had never used before) could do everything I need and were more straightforward and transferable to use, and I’m happy with my decision to go with that.

3 thoughts on “Updating SolrCloud configuration in ruby”

  1. True, thanks Charlie! I’m not sure I’d call that “writing it directly to disk”; you are still interacting with a remote zookeeper, right? One of the options I am looking at is a managed Solr from SearchStax, and they don’t provide username/password-based auth to the zookeeper directly, although they do to the Solr cloud.

    I also didn’t like the idea of having to get the zookeeper command line on my CD machines or anywhere else that might want to do a config sync (say, a heroku release hook). The one in the solr docs you point to is actually `./bin/solr zk`, the entire `solr` command line needing access to solr.jar and maybe other things, which is a lot to get installed just to do an update. (Although there might be a way to arrange a lighter-weight zookeeper CLI than the one in the solr docs.)

    So the Solr Cloud API actually worked pretty nicely for my uses, where the zookeeper command line did not. (The zookeeper command line can talk to a remote host so must be using network communications of some kind, but it doesn’t seem to be via a public HTTP API?)

    I am fairly inexperienced with this; do you think there are any advantages to using the zookeeper command line? Or any other pros/cons in general we should be aware of?

  2. Yeah, I agree that the HTTP method has many advantages. Just thought I’d mention it because it’s close to just “writing it to disk.” You’re interacting with a remote zk, sure, but it doesn’t have to be remote; zookeeper can be run locally with your solr instance. I believe they call that embedded zookeeper (see https://lucene.apache.org/solr/guide/7_4/getting-started-with-solrcloud.html#solrcloud-example). It’s how we run our local solr instances for dev (cloud mode with embedded zk).

    No, I don’t think there are any specific advantages. In the beginning, when I began working more with Solr, that’s just what I used, probably because I found this method first. Later I figured out the HTTP stuff and use that for all deploy-related things, and will be using it for local dev to replace the CLI stuff soon too.

    I’m certainly not claiming to be a solr guru btw! :) Nice blog post.
