A post I made in conversation on the NGC4Lib list; it will be nothing new for programmers in the audience, but may be useful to the less technical among us.
I didn’t realize how much ‘magic’ search results ranking — whether on Google or in a library catalog — can seem to non-technical librarians. I think it’s crucial for all librarians — cataloging, research, or other — to develop a basic understanding of how such software can work. Software packages that search are the tools of our trade.
I’ll start with some comments on Google, and move on to some generalities about typical modern catalog-type software searches.
Ranking in search results anywhere, as with all software, is based on defining certain rules that will be followed to order the results.
Even in Google, citation (link) analysis is actually just one small part of the rules; it’s the one they are most famous for, because it’s the big innovation that let their search results satisfy users so much better than the competition when they first started. Even then it wasn’t the only component in the rules, and since then Google has added all sorts of other tweaks too. The details are carefully guarded by Google, but if you search around (on Google, heh) you can find various articles (from Google and from third-party observers) explaining some of the basic concepts. Commercial online businesses are of course especially interested in reverse engineering Google’s rules to get their pages to the top.
Even looking at the link-analysis aspect of Google, the key innovation there was not actually to put “popular” (oft-linked-to) sites first. They do that, but the actual genius element of Google’s citation/link analysis was recognizing that the words people use in a link pointing to a site are very useful metadata describing that site: the words many people use to describe a site will, on average, tend to match the words a searcher might use when looking for content like that site. That’s really the genius element there.
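To make the anchor-text idea concrete, here’s a minimal sketch in Python. The link data, URLs, and scoring are entirely made up for illustration (this is nothing like Google’s actual implementation); it just shows how the text of inbound links can be aggregated into extra searchable metadata for a page:

```python
from collections import defaultdict

# Hypothetical crawl data: (linking page, target URL, anchor text).
links = [
    ("pageA", "http://example.org/cats", "cute cat pictures"),
    ("pageB", "http://example.org/cats", "funny cat photos"),
    ("pageC", "http://example.org/cats", "cat pictures"),
]

# Aggregate the anchor text pointing at each target; the combined
# text becomes additional metadata describing that page.
anchor_index = defaultdict(list)
for _source, target, anchor in links:
    anchor_index[target].append(anchor)

def anchor_score(query, target):
    # Naive ranking signal: count query-word occurrences in the
    # accumulated anchor text for the target page.
    words = " ".join(anchor_index[target]).lower().split()
    return sum(words.count(q) for q in query.lower().split())

print(anchor_score("cat pictures", "http://example.org/cats"))  # → 5
```

The point is that the page itself contributes nothing here: other people’s descriptions of the page are doing the work.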
No matter what, this kind of ranking is an attempt to give users something that satisfies their needs when they enter some words in a search box. All you’ve got is the words, which are a kind of surrogate for what the user really wants, and you try to figure out how to make most of the users satisfied most of the time. The fact that we all use Google every single day and find it invaluable shows that this is possible. But it’s _definitely_ an ‘opinionated’ endeavor, trying to make your system satisfy as well as possible in as many searches as possible. It’s not as if there were some physical quantity, “satisfactoriness,” that just has to be measured; it’s software tricks to try to take a user’s query and, on average, in as many cases as you can, give the user what will satisfy them. [For that matter, it occurs to me now, this is philosophically very similar to the reference interview: the user _says_ something, and you need to get at what they really want/need. The difference is that in software it all needs to be encoded into guidelines/rules/instructions/heuristics for the software; you don’t get to have a human creatively having a conversation with another human.]
Some folks at the University of Wisconsin provided an accessible summary of how their Solr-based experimental catalog does results ordering, which you may find helpful. (Solr is open source software for textual search that many of us are using to develop library catalog search interfaces.) http://forward.library.wisconsin.edu/moving-forward/?p=713
There are some other things going on that that blog post doesn’t mention, too. In particular, one of the key algorithms in Solr (or really in the Lucene software Solr is based on) involves examining term frequency. If your query consists of several words, and one of them is very rare across the corpus, matches for that word will be boosted higher. Also, a document containing a search word many times will be boosted higher than a document containing that word just once. There is an actual, fairly simple mathematical formula behind this (commonly known as TF-IDF: term frequency times inverse document frequency) that has turned out, in general and on average, to be valuable in ranking search results to give users what they meant; it’s kind of the foundational algorithm in Solr/Lucene. But in a particular application or domain, additional tweaking is often required, and all that stuff mixes together, resulting in something that is described by a fairly complex mathematical formula and is not an exact science (as Google’s own ranking is not either!).
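As an illustration, here is a simplified TF-IDF calculation in Python. This is not Lucene’s exact scoring formula (Lucene adds dampening, length normalization, and other factors), and the toy corpus is invented, but it shows the core behavior: terms that are rare across the corpus, and terms repeated within a document, both raise the score.

```python
import math

# A toy corpus of three "documents".
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "quantum chromodynamics lattice",
]

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document.
    tf = doc.split().count(term)
    # Document frequency: how many documents in the corpus contain it.
    df = sum(1 for d in corpus if term in d.split())
    # Inverse document frequency: rarer terms get a bigger weight.
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf
```

Here a match on “quantum” (which appears in only one document) outscores a match on “cat” (which appears in two), and a document using a word three times outscores one using it once.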
Additionally, a Solr search engine, when given a multi-word query, might boost documents higher when those words are found in proximity to each other. Or it might allow results that don’t match all the words, but boost documents higher the more of the words they contain.
Anyhow, I’ve spent some time describing this because I firmly believe these are the tools of our trade now, whether you’re using a free web tool like Google, a licensed vendor database like EBSCO, or a locally installed piece of software (open source or proprietary) like a library catalog.