A post I made in conversation on the NGC4Lib list; it will be nothing new for programmers in the audience, but may be useful to the less technical among us.
I didn’t realize how much like ‘magic’ search results ranking, whether on Google or in a library catalog, can seem to non-technical librarians. I think it’s crucial for all librarians, whether in cataloging, research, or other roles, to develop a basic understanding of how such software can work. Software packages that search are the tools of our trade.
I’ll start with some comments on Google, and move on to some generalities about typical modern catalog-type software searches.
Ranking in search results anywhere, as with all software, is based on defining certain rules that will be followed to order the results.
Even in Google, citation (link) analysis is actually just one small part of the rules; it’s the one they are most famous for, because it’s the big innovation that let their search results satisfy users so much better than the competition when they first started. Even then it wasn’t the only component in the rules, and since then Google has added all sorts of other tweaks too. The details are carefully guarded by Google, but if you search around (on Google, heh) you can find various articles (from Google and from third-party observers) explaining some of the basic concepts. Commercial online businesses are of course especially interested in reverse-engineering Google’s rules to get their pages to the top.
Even looking at the link analysis aspect of Google, the key innovation there was not actually to put “popular” (oft linked-to) sites first. They do that, but the actual genius element of Google’s citation/link analysis was recognizing that the words people use in a link pointing to a site are very useful metadata describing that site: the words many people use to describe a site will, on average, often match the words a searcher might use when looking for content like that site. That’s really the genius element there.
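(To make that concrete, here’s a toy sketch in Python of the idea of treating link text as third-party keywords describing the page linked to. This is not how Google actually implements it; the pages and anchor texts below are invented purely for illustration.)

```python
# Toy illustration (not Google's actual code): treat the words people use
# in links pointing at a page as extra keywords describing that page.
from collections import defaultdict

# Hypothetical crawled links: (linking page, target URL, anchor text)
links = [
    ("blogA.example/post1", "cats.example/", "great site about cat care"),
    ("blogB.example/links", "cats.example/", "cat care tips"),
    ("blogC.example/",      "dogs.example/", "dog training advice"),
]

# Aggregate anchor text per target page; third parties are, in effect,
# assigning descriptive keywords to pages they link to.
anchor_keywords = defaultdict(list)
for _source, target, text in links:
    anchor_keywords[target].extend(text.lower().split())

def search(query):
    """Rank pages by how many query words appear in the anchor text
    other people used when linking to them."""
    terms = query.lower().split()
    scores = {
        page: sum(words.count(t) for t in terms)
        for page, words in anchor_keywords.items()
    }
    return sorted((p for p, s in scores.items() if s > 0),
                  key=lambda p: scores[p], reverse=True)

print(search("cat care"))   # ['cats.example/']
```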
No matter what, this kind of ranking is an attempt to give users something that satisfies their needs when they enter some words in a search box. All you’ve got is the words, which are kind of a surrogate for what the user really wants, and you try to figure out how to make most of the users satisfied most of the time. The fact that we all use Google every single day and find it invaluable shows that this is possible. But it’s _definitely_ an ‘opinionated’ endeavor, trying to make your system satisfy as well as possible in as many searches as possible. It’s not like there is some physical quantity “satisfactoriness” that just has to be measured; it’s software tricks to try and take a user’s query and, on average, in as many cases as you can, give the user what will satisfy them. [For that matter, it occurs to me now that this is philosophically very similar to the reference interview: the user _says_ something, and you need to get at what they really want/need. The difference is that in software it all needs to be encoded into guidelines/rules/instructions/heuristics for the software; you don’t get to have a human creatively having a conversation with another human.]
Some folks at the University of Wisconsin provided an accessible summary of how their Solr-based experimental catalog does results ordering, which you may find helpful. (Solr is open source software for textual search that many of us are using to develop library catalog search interfaces.) http://forward.library.wisconsin.edu/moving-forward/?p=713
There are some other things going on that that blog post doesn’t mention. In particular, one of the key algorithms in Solr (or really in the Lucene software Solr is based on) involves examining term frequency. If your query consists of several words and one of them is very rare across the corpus, matches for that word will be boosted higher. Also, a document containing that search word many times will be boosted higher than a document containing that word just once. There is an actual, fairly simple mathematical formula behind that, which has turned out, in general and on average, to be valuable in ranking search results to give users what they meant; it’s kind of the foundational algorithm in Solr/Lucene. But in a particular application/domain, additional tweaking is often required, and all that stuff mixes together, resulting in something described by a fairly complex mathematical formula, and it is not an exact science (as Google’s own ranking is not either!).
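If you’re curious what that looks like, here is a simplified sketch of the tf*idf idea. Lucene’s real scoring adds normalization and other factors; this only shows the two intuitions above (rare terms count for more, and more occurrences in a document count for more).

```python
# Simplified sketch of tf*idf, the idea behind classic Lucene/Solr scoring.
import math

def tf(term, document_terms):
    # More occurrences of the term in a document -> higher score,
    # with diminishing returns (hence the square root).
    return math.sqrt(document_terms.count(term))

def idf(term, corpus):
    # The rarer a term is across the whole corpus, the more weight
    # a match on it gets.
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1.0 + math.log(len(corpus) / (doc_freq + 1))

def score(query, document_terms, corpus):
    return sum(tf(t, document_terms) * idf(t, corpus)
               for t in query.split())

corpus = [
    "history of the united states".split(),
    "united states census data".split(),
    "history of ancient rome".split(),
]
# "rome" is rare in this tiny corpus, so the third document wins easily.
for doc in corpus:
    print(" ".join(doc), "->", round(score("history rome", doc, corpus), 2))
```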
Additionally, a Solr search engine, when given a multi-word query, might boost documents higher when those words are found in proximity to each other. Or it might allow results that don’t match all the words, but boost results higher the more of the words they match.
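In Solr’s dismax/edismax query parser, those behaviors come from parameters like pf (boost documents where the query words appear close together) and mm (how many of the words must match). Here is a hedged example; the field names and the Solr URL are made up for illustration.

```python
# Example (e)dismax parameters for the behaviors described above.
# Field names (title_t, author_t, subject_t) and the URL are hypothetical.
from urllib.parse import urlencode

params = {
    "defType": "edismax",
    "q": "civil war reconstruction",
    "qf": "title_t^5 author_t^2 subject_t",  # which fields to match, with per-field boosts
    "pf": "title_t^10 subject_t^3",          # extra boost when the words appear near each other
    "mm": "2<75%",                           # don't require every word; more matches rank higher
}
print("http://localhost:8983/solr/catalog/select?" + urlencode(params))
```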
Anyhow, I spent some time describing this because I firmly believe these are the tools of our trade now, whether you’re using a free web tool like Google, a licensed vendor database like EBSCO, or a locally installed piece of software (open source or proprietary) like a library catalog.
I agree, and that’s why I modded my “intro to libtech” syllabus to introduce TF/IDF, page ranking, and so on into the equation; see http://www.slideshare.net/cavlec/week8-5557551
I just bookmarked this post so that I can make future students read it, in fact. :)
A lot of people know Google takes links into account, but I think the “link text as metadata about the page linked to” part is a LOT more interesting/useful/innovative than the “pages with more links go higher” part, and most people only know/think about the latter. The former is the real genius innovation.
If you just want to boost popular items, libraries could do something like use circulation data, but I’m not sure how useful that is for our data and users. Just because it works for Google doesn’t mean it works for us; and again, popularity isn’t even the most clever part of “page rank,” which is really just one component of what Google does anyway.
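If someone did want to experiment with that, one plausible approach (just a sketch; the circ_count field is hypothetical and you’d have to populate it from your ILS) is a Solr boost function layered on top of the text relevance score:

```python
# Hypothetical sketch: boost by circulation data in Solr. A log() keeps
# heavily-circulated items from completely swamping the text relevance score.
from urllib.parse import urlencode

params = {
    "defType": "edismax",
    "q": "introduction to statistics",
    "qf": "title_t^5 subject_t",
    "bf": "log(sum(circ_count,1))",  # additive boost from a per-record circulation count
}
print("http://localhost:8983/solr/catalog/select?" + urlencode(params))
```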
If you want to try to analogize Google’s very clever “crowd effect metadata,” using link text as metadata keywords about the page linked to… we don’t really have a good way to do that at all. The genius thing is that everyone makes links for their _own_ purposes, and they don’t (or didn’t used to) even think about the link text being used for search results, but by making links they are effectively a third party adding keywords describing another page, which Google can use.
Couldn’t we do something with folksonomies, like those on LibraryThing?
You’re right about link text vs. popularity, and the next time I give that lecture, I will alter it accordingly.
I think TF/IDF is something that is standardly taught in LIS schools these days. I’m not sure, though, whether it’s used in catalogs like those by Innovative Interfaces. At least for the library record, there aren’t enough terms for TF/IDF to work. At least for our Encore, it’s pretty easy to work out how they rank the 5 levels (position of term in title, exact match, etc.).
But then again, the databases that libraries do subscribe to don’t really tell you exactly how relevance ranking is done; usually there’s just some hand-waving in the help files (if that).
I’m curious about the reasoning here: are they afraid competitors will copy their search algorithms? Or that academics will game the system so their articles will appear on top?