note to self: more ideas for browse search in solr

Mostly as a note to myself, but share it in case it makes any sense to anyone else.

In the back of my mind, I’m continually thinking of how to implement a traditional opac ‘browse search’ in solr. Solr isn’t really quite designed for this. Mostly the back of my mind has been trying to figure out how to do this with the solr features already there.

But late tonight now, I figured, eh, maybe I understand Solr enough to try and dive into the solr code, and get the back of my mind thinking about how to actually hack the feature into solr directly.

Traditional browse search lets you do a 'starts with' query on a list of "headings". Those 'headings' generally end up as facet values in most people's solr implementations.

Ideally, it would improve upon traditional browse search, in letting you do a browse search with “filters”, ie searching through the headings only including headings attached to bibs that have been filtered (bibs in a certain physical library, say).

So there are _several_ logic paths solr can take to do faceting, depending on which facet.method you choose, whether the field is multi-valued or single-valued, possibly your facet.sort, and maybe some other factors.

I figured I'd focus on the path I actually need: the 'fc' facet method, on multi-valued fields, with facet.sort=index, and with facet.limit set to a positive integer. (And NO facet.prefix set.)

The outcome I want? Well, start with the idea of the built-in facet.offset. I want to do something that's kind of like that, but I don't know the offset I want yet; I want solr to figure it out for me based on a prefix. Instead of facet.offset, I'm going to give, well, I'm making it up, so let's call it facet.offset_from_prefix. For facet.offset_from_prefix=X, I want solr to figure out the offset that would put the FIRST facet value beginning with X as the first value in the facet set — or if there is no facet value beginning with X, then whatever facet value is closest to X in alphabetic sort. Then I want solr to continue as if this was actually specified as a facet.offset, returning facet values starting from there. AND I want the eventual solr response to the client to _include_ this calculated offset (so the client can page forward and back if it wants).
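To make that spec concrete, here's a toy model in plain Java. Nothing here is actual Solr code — the sorted list stands in for the index-ordered term dictionary, and offsetForPrefix is a made-up name for the lookup I want solr to do:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the hypothetical facet.offset_from_prefix: given the
// index-sorted facet values, find the offset of the first value that
// sorts at or after the prefix.
public class OffsetFromPrefixDemo {
    static int offsetForPrefix(List<String> sortedValues, String prefix) {
        // binarySearch returns (-(insertionPoint) - 1) when the prefix
        // itself isn't a value, which gives exactly the "closest in
        // alphabetic sort" offset we want.
        int pos = Collections.binarySearch(sortedValues, prefix);
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        List<String> headings = Arrays.asList(
            "alchemy", "astronomy", "biology", "chemistry", "zoology");
        int offset = offsetForPrefix(headings, "bio");
        // "bio" lands on "biology" at offset 2; paging continues from there
        System.out.println(offset);                                    // 2
        System.out.println(headings.subList(offset, headings.size())); // [biology, chemistry, zoology]
    }
}
```

The real thing would search the term dictionary rather than a Java list, but the contract is the same: return an exact match if there is one, otherwise the insertion point.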

For the conditions we set above, I think the control-flow path will lead us to SimpleFacets#getTermCounts, which will get an UninvertedField for our facet field and then call UninvertedField#getCounts on it.

If we look at UninvertedField#getCounts, an interesting part is the logic for handling facet.prefix. Now, facet.prefix is not what we want, because it changes the overall set of facet values returned. We don't want to change the overall set, we just want to find the correct _offset_ for a prefix, within the overall unchanged set.

Okay, but look at what facet.prefix does: It FIRST _does_ find exactly the offset we want, by using NumberedTermEnum#skipTo/getTermNumber. Aha, this just showed us how to do what we want to do in solr. (We just don’t want to do the NEXT part of what the facet.prefix handling logic does, reset the overall facet value list’s “0” offset to this found offset).

So we just need to get UninvertedField#getCounts to accept a facet.offset_from_prefix param (and change everything up its calling chain so that's passed to it from the url params). And then, when such a thing is present, use that NumberedTermEnum logic to get the offset we want — and SET the variable that holds an explicit offset that would have been passed in by the user to this found offset — that's it, now let the rest of the Solr logic continue as normal. (Perhaps raise an exception if conflicting params were passed in — for instance, this facet.offset_from_prefix is kind of incompatible with an ordinary facet.prefix.)
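Here's the behavioral difference I'm after, again as a toy sketch rather than real Solr internals: facet.prefix cuts the value list down, while the made-up facet.offset_from_prefix keeps the whole list and only seeds the offset variable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy contrast (not Solr code): facet.prefix restricts the value set
// itself; facet.offset_from_prefix leaves the set alone and only
// computes where to start paging.
public class PrefixVsOffsetFromPrefix {
    static List<String> page(List<String> values, int offset, int limit) {
        int from = Math.min(offset, values.size());
        return values.subList(from, Math.min(from + limit, values.size()));
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("alchemy", "biology", "botany", "chemistry");

        // facet.prefix=b: only the b* values survive; no paging past them
        List<String> prefixed = new ArrayList<>();
        for (String v : all) if (v.startsWith("b")) prefixed.add(v);
        System.out.println(page(prefixed, 0, 10)); // [biology, botany]

        // facet.offset_from_prefix=b: full set kept, offset computed instead,
        // so the client can later page backward before "biology" too
        int pos = Collections.binarySearch(all, "b");
        int offset = pos >= 0 ? pos : -pos - 1;    // 1
        System.out.println(page(all, offset, 10)); // [biology, botany, chemistry]
    }
}
```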

Now the facet values returned will be right for our spec. The only thing that remains is figuring out how to _echo back_ the looked-up offset to the client, in the solr response. I have no idea how to do that, but trust there should be a not-too-hard way to modify SimpleFacets to include an extra xml element or attribute in its responses, which is I think what would need to be done.
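For the echo-back part, I imagine it's just one more entry in the per-field response list. A stand-in sketch (a LinkedHashMap playing the role of Solr's NamedList here; the "offset_from_prefix" key name is made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the response shape: Solr builds an ordered name/value
// list per facet field; an extra leading entry could carry the
// computed offset back to the client.
public class EchoOffsetDemo {
    static Map<String, Object> buildFieldFacet(int computedOffset) {
        Map<String, Object> fieldFacet = new LinkedHashMap<>();
        fieldFacet.put("offset_from_prefix", computedOffset); // made-up key
        fieldFacet.put("biology", 14);   // facet value -> count
        fieldFacet.put("chemistry", 9);
        return fieldFacet;
    }

    public static void main(String[] args) {
        System.out.println(buildFieldFacet(2));
        // {offset_from_prefix=2, biology=14, chemistry=9}
    }
}
```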

So… I totally don’t actually understand what I’m talking about… but I still think I’ve figured out a decent plan.

If anyone actually has any idea what I’m talking about (the intersection between people who understand the solr code, and people who read my blog, may be 0; and on top of that, talking about code in narrative is inevitably confusing, and I’m not sure if this post is actually comprehensible by anyone)…

Does this actually sound like it just might work?

Is there an obvious reason the performance of this will be crap? By basing it on logic already used by SimpleFacets depending on your arguments, I figure it should perform just as well as, well, the equivalent facet.prefix and/or facet.offset queries already would. But if someone who actually understands Solr sees an obvious performance problem, let me know.

While the amount of code that has to be changed is actually fairly minimal, it might affect a bunch of classes, since I need to get my new parameters passed all the way down the call chain to the right place, and then get the calculated offset passed all the way back up to make it into a response. Is this going to be a big pain in the butt custom fork/patch version that will be hard to maintain in parity with continuing Solr developments? (Certainly if the implementation of SimpleFacets#getTermCounts or UninvertedField#getCounts ever changes significantly, the patch would have to be entirely rewritten.)

Assuming it actually does work, I wonder if there's any chance of getting a patch like this into mainline Solr.

8 thoughts on "note to self: more ideas for browse search in solr"

  1. I'm not quite sure I understand what you're proposing, but would it be possible to use a range query with the upper endpoint open and the lower endpoint the field:headingvalue (e.g. title:Necronomicon)?

    You may need to set the rewrite method to avoid a BooleanQuery.TooManyClauses exception. (I think that future versions of Lucene have a cool compile-to-finite-state-automaton feature which will make such queries run better, but this should work.)

  2. I don't want to limit the document result set. And in fact, I don't even want to limit/restrict the list of facet values. Instead, from within a list of many many unique facet values (a hundred thousand; a million; more), I want to "skip" to the offset which has the facet value beginning with a certain prefix.

    It’s confusing to explain what I need to do, or why I’d ever want to do this. I should make a little screencast demo’ing the legacy library catalog feature I am trying to duplicate.

  3. I think this sounds great. I was thinking that I knew how to do something easier: be able to use a prefix search (facet or regular search) to hone a result set (“results that start with “B”). What you’re suggesting is an ability to do that while keeping the entire result set available — perfect.

  4. Naomi, if your idea does not require patching Solr java code, it might indeed be better. But I don’t entirely understand what you’re suggesting.

    One problem with us all cooperatively solving this problem is it’s kind of hard to talk about — we can’t even get Erik to understand our requirements in the first place, heh.

    I wish we could all get in a room together with Erik (or some other Solr experts) for 4 hours, make sure we all understand the requirements, and then brainstorm solutions.

    One thing I realized thinking more is that my solution gets us only so far — it gets us _exactly_ the traditional OPAC functional requirement of a "prefix" search. But _really_, on top of a prefix search, aren't we also going to want the option of a search that isn't left-anchored, or even a search (on headings/terms/facet-values) that can use full solr analysis to do stemming and synonyms and such? And my solution runs into a dead end there, while a solution to that broader problem will ALSO easily handle the left-anchored/prefix case.

    But when I try to think of solutions to the broader function — oh boy it gets even a lot harder. I start thinking about somehow doing a “join” in Solr between two cores (or just two document sets in one core?). First do the search against one document set, then take all the values from fieldX for that document set, and THEN use that set of values to constrain your search against document set 2. That’s really what we need, and all without leaving Solr because “all the values for fieldX” could be millions, so it somehow all needs to be done internally to solr without giving up internal lucene/solr objects that you can possibly use to do lucene/solr intersections and stuff, instead of refetching. Figuring out how to patch solr (or write a SimpleFacet clone replacement) to do THAT gets us into even trickier code, esp for me who understands lucene internal api very little.

  5. Oh, I understand what you’re saying now Naomi. Yeah, what you knew how to do IS easier, but without keeping the entire result set available we can’t duplicate traditional legacy OPAC functionality that lets you page forward and backwards PAST your prefix-limited-set in either direction.

    Now, I guess more investigation with users/stakeholders might be required to see if cloning that function from the traditional OPAC is _really_ necessary or not. But I kind of think it might be, in some uses of this underlying feature.

  6. Well, the paging forward and backward in the entire result set is exactly what the “browse by callnumber” does — it’s starting at a given point in the shelf keys and browsing backwards and forwards in that list. The technique can apply to any field: title, author, LCSH, whatever. The missing piece of what i did there is the ability to apply filters (e.g. facets) to the browse — it’s currently only able to scroll back and forth thru ALL the documents in the index.

    I think maybe defining our dream requirements is a better starting point to solving “the problem”. What would be really useful? It might be something much different than what OPACs provide now.

    Maybe you could throw your ideas of what you *want* into a blog post? with maybe a quickie fake screen shot ?

    Jennifer Vine has some pre-ideas about a title-author browse; our full call number browse is in the works now. We haven’t gotten up to authority files.

    But still, all of those things are tied to our current metadata. What do users want? Given a title, a single link to get to all works by the same authors? A single link to get to all similar works, for different definitions of similar? Start from an author, from a title, from a subject? From a call number, for the dinosaurs like us that are used to call numbers? From a keyword that isn’t an official subject? Etc. These are still paltry dreams, still too tied to what I know we’ve got in metadata and in current information retrieval and usage data mining.

  7. Yeah, there are a couple other weird things about your current implementation, as I understand it. While I think it can handle a document that has more than one call number, when you're actually paging through, can you know _which_ of the call numbers was actually the right one to display for the current page? I recall thinking you probably couldn't, from your explanation of your method at code4lib. But maybe I was wrong.

    The call number browse and the "traditional OPAC browse search" case are also slightly different in that you are paging through actual _documents_, although sorted by value (in a multi-value field). The "traditional OPAC browse search" is actually paging through the (facet) values themselves, with counts next to each one, not paging through documents.

    Our requirements are awfully confusing. In part because there’s this complicated gulf between “what we’re used to” and “what do we really need”. But also because “what we’re used to” actually solves several different use cases at once (in a not necessarily optimal way for any of them). And because our legacy data is set up with certain assumptions on how it’s going to be used. Phew.

    Figuring out our "utopian" requirements is approaching it from one direction; the other is just getting coding and doing some stuff, and seeing what we realize about "Oh, but wouldn't it be nice if it did X too" after that. It's probably necessary to approach from both directions, even though the second one will sometimes result in false starts where you (or I) end up throwing out what you spent a month working on.

