Thinking like Solr — it’s not an rdbms

I saw a great point made recently in this recorded lecture comparing rdbms (SQL) to NoSQL data stores.

Solr is NOT a “NoSQL data store”; it’s best not to think of Solr as a ‘store’ at all, but as an ‘index’. But Solr is definitely not SQL either, and it can’t do some things that would be easy in SQL. Many of us have been trained for so many years to think in terms of SQL that we sometimes don’t realize how to think in terms of Solr instead. So the approach in that lecture to thinking about rdbms vs NoSQL applies to Solr as well, and goes something like this…

In rdbms, you set up your schema thinking only about your data, and modelling your data as flexibly as possible. Then once you’ve done that, you can ask pretty much any well-specified question you want of your data, and get a correct and reasonably performant answer.   Maybe you’ll have to go add another index to your rdbms later for unanticipated questions, no problem.

In Solr, on the other hand, we set up our schemas to answer particular questions. You have to first figure out what kinds of questions you will want to ask Solr, what kinds of queries you’ll want to make, and only then can you figure out how to structure your data so those questions can be answered. Some questions are actually very hard to set up Solr to answer. In general, Solr is about setting up your data so whatever question you have can be reduced to asking “is token X in field Y”. If you can’t figure out a way to reduce a category of question to that, it’s going to be tricky to get Solr to do it.
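To make the “is token X in field Y” idea concrete, here is a toy inverted index in Python. It is only a sketch of the general idea, not how Solr is actually implemented; the documents, the field names, and the helpers `tokenize` and `docs_with` are all made up for illustration.

```python
from collections import defaultdict

# Toy documents; field names and values are invented for this example.
docs = {
    1: {"title": "Moby Dick", "author": "Herman Melville"},
    2: {"title": "Billy Budd", "author": "Herman Melville"},
    3: {"title": "Dick Tracy", "author": "Chester Gould"},
}

def tokenize(text):
    """Very naive analysis step: lowercase and split on whitespace."""
    return text.lower().split()

# index[field][token] -> set of ids of documents with that token in that field
index = defaultdict(lambda: defaultdict(set))
for doc_id, fields in docs.items():
    for field, value in fields.items():
        for token in tokenize(value):
            index[field][token].add(doc_id)

def docs_with(field, token):
    """The one question this structure answers: is token X in field Y?"""
    return set(index.get(field, {}).get(token.lower(), set()))

# A more complex query is just set operations over that one basic question.
dick_by_melville = docs_with("title", "dick") & docs_with("author", "melville")
```

Anything you want to ask has to be expressible as lookups like `docs_with(...)` combined with AND/OR/NOT. If the token you need was never put into a field at index time, there is nothing to look up at query time.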

This can be especially tricky in cases where you want to use a single Solr index to answer multiple questions, where the questions are such that you really need to set up your data differently to get Solr to optimally answer each question.

Hierarchy and Relations make Solr sad

‘Hierarchy’ or ‘relations’ in your data can easily lead to that sort of problem. There are several different use cases under the heading of ‘hierarchy’ or ‘relation’, and all of them can be painful, because they lead to situations where you want to set up your index like THIS in order to answer questions at one level of the hierarchy, but like THAT in order to answer questions at another. And you really do want to use one single index: even if you had the resources for two indexes, there are other sorts of questions you couldn’t answer once you’d split into two indexes!
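Here is a small Python sketch of that THIS-versus-THAT tension, using plain lists and dicts rather than a real Solr index. The book/chapter data and all the field names are hypothetical; the point is only that each way of flattening the hierarchy answers one level of question well and the other badly.

```python
# Hypothetical hierarchical data: books that contain chapters.
books = [
    {"id": "b1", "title": "Whales", "chapters": [
        {"title": "anatomy", "language": "en"},
        {"title": "migration", "language": "fr"},
    ]},
    {"id": "b2", "title": "Sharks", "chapters": [
        {"title": "anatomy", "language": "fr"},
    ]},
]

# Layout THIS: one flat document per book, chapter values in multi-valued fields.
per_book = [
    {"id": b["id"], "title": b["title"],
     "chapter_title": [c["title"] for c in b["chapters"]],
     "chapter_language": [c["language"] for c in b["chapters"]]}
    for b in books
]

# Layout THAT: one flat document per chapter, book values copied onto each one.
per_chapter = [
    {"id": f'{b["id"]}-{i}', "book_id": b["id"], "book_title": b["title"],
     "chapter_title": c["title"], "chapter_language": c["language"]}
    for b in books for i, c in enumerate(b["chapters"])
]

# Chapter-level question: which books have a French chapter titled "anatomy"?
# Per-book documents over-match: b1 has "anatomy" and "fr", but on
# different chapters, and the flat document can no longer tell.
hits_this = [d["id"] for d in per_book
             if "anatomy" in d["chapter_title"] and "fr" in d["chapter_language"]]

# Per-chapter documents keep title and language together, so only b2 matches.
hits_that = sorted({d["book_id"] for d in per_chapter
                    if d["chapter_title"] == "anatomy"
                    and d["chapter_language"] == "fr"})

# Book-level question: how many results are titled "Whales"?
# Per-chapter documents now count the same book once per chapter.
whale_hits_that = [d for d in per_chapter if d["book_title"] == "Whales"]
```

With the per-book layout the chapter-level question returns a false hit for `b1`; with the per-chapter layout the book-level hit count (and any facet counts built on it) is inflated by the number of chapters.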

There’s a reason you can’t spell ‘rdbms’ without ‘relational’, but you can spell Solr. Some things are hard to do with Solr. Certain features in Solr trunk/upcoming Solr 4.0 are meant to try to get at certain sorts of hierarchical or relational use cases, including ‘pivot’ faceting, “field collapse”, and limited ‘join’ functionality. But they’re definitely a bit squirrelly, and you’re definitely out of Solr’s sweet spot when you try to use them. The new features might make easier or possible what wasn’t easy or possible before, but not always, and still not always as easily or as performantly as you might like.

How to make friends with Solr

So you can’t just say “this is what my data looks like, how should I model it in Solr?” With an rdbms, you can totally do that. With Solr, though, you also have to say “and these are the kinds of questions I will want to ask Solr about my data, and the kinds of answers I want to get” before you can begin to work out how to set up your Solr schema. Unlike in an rdbms, you can’t just set up a general purpose flexible schema that will then let you answer unanticipated sorts of questions later.

SQL/rdbms is actually really awesomely powerful technology, and it’s totally focused on the idea that if you model your data right, you can then flexibly ask any sort of well structured question you want about it.

Solr is not a general purpose store like an rdbms, where you can set up your schema once in terms of your data and use it to answer nearly any conceivable well-specified question after that. Instead, Solr does things that an rdbms can’t do quickly, or can’t do easily, or can’t do at all. But you lose some things too. For almost any individual category of question, there’s probably a way to set up a Solr index to answer it. But when you start wanting a single Solr index to answer multiple categories of questions, especially questions at different levels of hierarchy, it can get very difficult or impossible to get Solr to do it well, or sometimes to do it at all.

Mainly, what Solr does is relevancy ranking. It also does ‘faceting’ pretty darn well, in ways that are hard to get other sorts of tools to do. If you don’t actually need either powerful relevancy ranking or faceting over a large result set, Solr may not be the right tool in the first place; an rdbms or NoSQL store may be better. But if you do… then you’ve just got to work around the trade-offs as best you can.
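For anyone who hasn’t run into the term, ‘faceting’ boils down to counting how the matching documents break down by the values of some field. A rough Python illustration follows, with made-up documents and a hypothetical `facet_counts` helper; Solr computes this from its index rather than by looping over documents like this.

```python
from collections import Counter

# Made-up documents; "format" and "language" are illustrative field names.
docs = [
    {"id": 1, "format": "book", "language": "en"},
    {"id": 2, "format": "book", "language": "fr"},
    {"id": 3, "format": "dvd", "language": "en"},
    {"id": 4, "format": "book", "language": "en"},
]

def facet_counts(result_set, field):
    """Count how many documents in the result set carry each value of a field."""
    return Counter(d[field] for d in result_set if field in d)

# Run a "query" (here just a filter), then facet the matches by format.
english = [d for d in docs if d["language"] == "en"]
format_facet = facet_counts(english, "format")
```

The counts are per document in the result set, which is why the question of what counts as one document (a book? a chapter?) matters so much for what the facets mean.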

Like I said, I think some of the new features in trunk start to make working around the trade-offs easier in certain use cases. If Solr developers (or patch contributors) keep trying to push Solr like this, and come up with solutions to make pushing it easier, then Solr will keep getting better at it.
