Information Retrieval and relevance ranking for librarians

I started out trying to write an essay explaining a bit more about how relevance ranking works, trying to be both accurate (if incomplete) and accessible to non-technical readers. The motivation was librarians who wanted to know more about how relevance ranking works, since it’s so crucial to understanding how Blacklight/Catalyst work. I think this was quite a justifiable thing to want to know; you’ve got to understand what’s going on to develop your search strategies in the system.

But then I also ended up including a basic introduction to the Information Retrieval concepts of “recall” and “precision”, in part because I wanted to explain why we think relevance ranking is useful: we’re doing this for a reason, not moving away from the traditional library headings-browse interface for no reason at all, or just because it’s trendy.

I am not sure if I came up with the idea that relevance ranking is, in part, a way out of the “recall vs. precision bind”, which you’ll find here. I may have read it somewhere, and I’m not sure whether it’s a conventional way of thinking about relevance ranking. But it does seem a true and useful way to look at the utility of relevance ranking, even more so after this coincidentally recent discussion on Ngc4Lib, whose resolution in my opinion hinges on the fact that relevancy ranking is called “ranking”, rather than “relevancy measurement”, for a reason: it’s a clever way to order documents without being able to assign an objective or context-free measurement to them. And this limited accomplishment is valuable precisely because of how it lets you do an end-run around recall vs. precision.

Anyhow, this essay has goals of both persuasion and education, without sacrificing any accuracy or honesty in the education for the persuasive aim. You may find it useful in talking about relevance ranking to people with decades of experience with traditional library headings browse, to explain the how and the why. “Catalyst” is the name of our new Blacklight/Solr based public catalog, while “HIP” is the brand name of our proprietary legacy OPAC.

Catalyst and Relevancy Ranking

One of our goals with Catalyst is to make it as simple and easy as possible to do common simple things, while still providing power tools for sophisticated use. This is motivated in part by the JHU persona study, which found that some types of users demand a very quick and simple interface (e.g., “He likes his research to work simply and take the least amount of time possible”), while others need sophisticated tools (e.g., “To support his research, he prefers to go about searching for information and materials in highly structured, often complex ways.”)

The same person may in fact be in different categories at various times. Generally studies of academic library use show that a known item search – a relatively simple kind of search that should work simply – is one of the most common uses of the catalog.

We want to meet both simple and sophisticated needs. A quote from computer scientist Alan Kay seems appropriate here: “Simple things should be simple, complex things should be possible.”

Our attempt in Catalyst to focus on simple keyword search with good results ranking is in part an attempt to make simple searches work simply and easily. We provide other tools to support more sophisticated searches, ideally allowing a user who starts out simply to fluidly and intuitively move into more sophisticated tools as needed. Examples of the more sophisticated tools presently in Catalyst include the left-hand Filters, fielded searching, and the advanced search screen. Additional sophisticated features may be developed, taking into account the resources we have available to develop new features, balanced against the user need and benefit for a particular feature.

Librarians are of course some of the most sophisticated searchers, with complex needs. Librarians are part of our user group too, and we want to meet those needs. We think Catalyst can in fact do pretty well at meeting them, in some cases even better than HIP.

It is a different kind of interface than traditional library headings browse, however, and you’ll have to fine-tune your search strategies to take best advantage of it; the details will depend on your personal preferences as well as the type of research questions you typically engage in. We know it can be uncomfortable to have to learn new strategies for a new environment, but we think in this case even sophisticated librarian searchers will ultimately find the new interface rewarding. And of course, most of our users don’t have decades of experience with traditional “headings browse” search, and we need to provide for them too.

Below, we’ve gone into some additional detail on how Catalyst does results ranking, for those of you who will find such detail helpful in developing your search strategies. We also provide some suggestions for ways to use Catalyst’s additional tools.

Recall, Precision, and “too many results”

A fundamental tension in “information retrieval” (see http://en.wikipedia.org/wiki/Information_retrieval) is between “recall” and “precision”. Recall measures how many of the appropriate documents in the collection actually show up in your results (the more the better). Precision is the quality of including only appropriate results, which could also be phrased as not including inappropriate results (the fewer inappropriate results the better).

Increasing recall generally decreases precision, and vice versa.
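To make the two measures concrete, here is a small sketch in Python. The document IDs and numbers are invented purely for illustration; nothing here comes from Catalyst itself.

```python
# Toy illustration of recall and precision (made-up document IDs, not real Catalyst data).
relevant = {"doc1", "doc2", "doc3", "doc4", "doc5"}   # everything in the collection that is actually on-topic
retrieved = {"doc1", "doc2", "doc6", "doc7"}          # what one particular search returned

true_hits = relevant & retrieved                      # on-topic documents the search actually found

recall = len(true_hits) / len(relevant)       # 2/5 = 0.4: many on-topic items were missed
precision = len(true_hits) / len(retrieved)   # 2/4 = 0.5: half the results were off-topic

print(recall, precision)   # 0.4 0.5
```

A search tuned to catch more of the five on-topic documents will usually pull in more off-topic ones too, and vice versa.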

Contemporary information retrieval techniques have found an interesting way to try and resolve this tension: Err on the side of recall (including many documents in the result set), but work to put the most “precise” documents FIRST in the result set.

This is what you see in Google, for instance: just about any search you do in Google will return hundreds of thousands of results (at least), but the ‘best’ results are first, and you page through the results, stopping when you’ve had enough, or when the results start getting less useful.

Catalyst attempts the same approach.

One might ask, well, can’t the computer just cut off the results when they get less useful? But this is really just saying “can’t you increase precision”. Increasing precision will be great for some searches, but for others will result in very few results even when there are many more that would be useful (decreased ‘recall’). The approach of erring on the side of recall while putting the most precise results first is an attempt to get out of the recall vs. precision bind.

We haven’t necessarily gotten it exactly right, and appreciate feedback on where things go wrong and you can’t find what you need – but it can be a delicate thing to try to get it exactly right. We also plan to do user testing to see how things work for a sample of non-librarian users with their own search approaches.

So one answer to “there are too many results, what do I do about it?” might be to simply look at the first few results, the first page, or the first few pages, until you stop getting useful results, much like you do on other contemporary search interfaces like Google.

However, there are times when you will want to increase precision in your search, because of ‘too many results’, or because you aren’t getting what you wanted near the top.

Some approaches for narrowing down searches

  • If you know you are interested in things from a specific time period, such as only recent results, use the “Limit by publication year” feature to exclude results you’re not interested in. This will often work better than attempting a ‘sort by date’, because of the recall/precision approach mentioned above – sort by date is going to put the most recent hits first even if they are low-precision hits. Note also that the ‘limit by publication year’ feature gives you a chart overview of the date distribution of your current result set.
    • However in cases where you’ve already limited your results to a very precise set, for instance using specific Limits, or when you simply don’t have very many results in your set for other reasons – ‘sort by date’ may work well.
  • Use double quotes to do exact phrase searches.
  • Switch from an ‘any field’ search to a Title, Author, or Subject search.
  • Use other limits from the left-hand sidebar to refine your initial search with more precision.

Note Well: All of these approaches, by increasing precision, may be excluding documents of interest to you. For instance, not every document that is about a given topic has the same LCSH heading (or any LCSH heading) assigned; not every document by a given author has a correctly controlled author heading assigned. We hope to make an initial ‘any fields’ search as useful as possible, putting the best documents at the top of the results, including documents that would be missed by more precise searching. We recommend that, at least as you’re getting a sense of the system, you try starting out with an ‘any fields’ search, and only narrow if needed.

More detail on ‘relevance ranking’

‘Relevance ranking’ is the name for the category of algorithms that try to do what we discussed earlier: put the ‘best’ documents first in the search results. Some people don’t like the word ‘relevance’, because how does a computer know what is relevant to a human? Of course it doesn’t, but this is the ‘term of art’ used in the field of information retrieval for the process of trying to match a user’s query to the set of documents in a way that will as often as possible match what typical users were intending to find.

The foundation of most contemporary relevance ranking, including the technology behind Catalyst, is a “term frequency – inverse document frequency” algorithm (see http://en.wikipedia.org/wiki/Tf%E2%80%93idf).

A TF-IDF algorithm basically means that the more times a particular search word appears in a document, the more ‘relevant’ that document is judged likely to be; and further, that in a particular multi-word search, more unusual words across the database as a whole will be weighted higher when they match.

TF-IDF algorithms have caught on because they generally do useful things for most people doing many searches. However, one of the downsides of this algorithm is that sometimes smaller records end up boosted higher, because the query matched a higher percentage of words in those records.
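For readers who like to see the arithmetic, here is a deliberately simplified TF-IDF sketch in Python. The records and numbers are invented, and Lucene/Solr (the technology behind Catalyst) uses a more elaborate formula than this; the sketch only shows the general shape of the calculation.

```python
import math

# Tiny made-up "database" of records, each reduced to a bag of words.
docs = {
    "rec1": "history of rome rome rome".split(),
    "rec2": "history of modern art".split(),
    "rec3": "rome travel guide".split(),
}

def idf(term):
    # Rarer terms across the whole database get a higher weight.
    containing = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / (1 + containing)) + 1

def score(query, doc_words):
    # More occurrences of a query term in a record means higher term frequency.
    # Dividing by record length is a simplified stand-in for length normalization,
    # and shows why short records can end up scoring surprisingly high.
    total = 0.0
    for term in query.split():
        tf = doc_words.count(term) / len(doc_words)
        total += tf * idf(term)
    return total

for rec_id, words in docs.items():
    print(rec_id, round(score("rome history", words), 3))   # rec1 scores highest
```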

We fine-tune the relevance ranking algorithm to try and work best for our particular type of database and the kinds of searches our users do. In addition, we add in several other significant factors that in some cases can actually have more effect on your search results than the base TF-IDF:

  • If a multi-term query appears as a contiguous phrase in a record, that’s boosted more than if the terms appear independently.
  • Terms appearing in areas of the record identified as Title, Author, or Subject are boosted much higher than terms appearing in other areas of records, generally in that order: Title, Author, Subject. The interaction of this rule with others means results might not be strictly in that order. (We also further fine-tune to boost some kinds of titles, authors, or subjects more than others. For instance, a match in a controlled field (100 etc.) is boosted higher than a match in a transcribed field (245$c). The sketch after this list illustrates how such boosts can combine.)
  • In some cases, matches may be on alternate forms of the words you entered (automatic stemming), but matches on exact forms of words entered will be boosted higher in the result set.
  • In some cases, we allow the result set to include results that didn’t match all of your search terms, but the more search terms matched, the higher the document will be in the result set. (This is actually fairly classic TF-IDF, not an addition).
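As a rough illustration of how these boosts can combine, here is a much-simplified sketch. The field names, weights, and phrase bonus are invented for illustration only; they are not Catalyst’s actual configuration, and in reality the boosting is part of Solr’s scoring formula rather than code like this.

```python
# Hypothetical field weights: a hit in Title counts for more than one in Author, and so on.
FIELD_WEIGHTS = {"title": 100, "author": 50, "subject": 25, "other": 1}
PHRASE_BONUS = 10   # extra credit when the query terms appear together as a contiguous phrase
# (Exact-form vs. stemmed-form boosting is left out here for brevity.)

def boost_score(query_terms, record):
    """record maps a field name to its (lowercased) list of words."""
    score = 0.0
    for field, words in record.items():
        weight = FIELD_WEIGHTS.get(field, 1)
        # Each query term found in this field contributes, weighted by the field.
        matched = [t for t in query_terms if t in words]
        score += weight * len(matched)
        # If the whole query appears as a contiguous phrase in this field, boost further.
        if " ".join(query_terms) in " ".join(words):
            score += weight * PHRASE_BONUS
    return score

record = {
    "title": "civil war narratives".split(),
    "subject": "united states history civil war".split(),
}
print(boost_score(["civil", "war"], record))   # phrase hits in Title and Subject dominate the score
```

The point is just that a match in a heavily weighted field, especially a phrase match, can contribute far more than the base TF-IDF score, which is why these factors can matter more than TF-IDF itself.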

There are some other fine-tunings as well. The interaction of all these rules (which are ultimately implemented as mathematical formulae) means it’s not necessarily easy to explain exactly why a given result set comes out the way it does. With patience, one could trace through all the math involved, but we rarely do that, and it might still leave one confused about exactly what’s going on.

But we’re still interested in searches that work poorly, to see if we can tune things better for those searches – without making things worse for other searches.

We think that, while it may need some tuning, the relevance ranking approach will prove to work well for the most popular kinds of searches.

Considering traditional headings browse in terms of recall and precision

Consider a traditional alphabetical heading browse of subject headings. This could be considered a very high precision search. Once you’ve identified a subject heading and ‘drill down’ on it, you’ll get only items that were posted to that heading by a cataloger.

As with the general relationship between recall and precision, this makes it a somewhat low recall search. There may very well be items in our catalog that are about your subject of interest, but are NOT posted to the particular heading you clicked on, and you’ll be missing those. These items may be missing because of: changes in subject cataloging practice; multiple subject vocabularies in use (LCSH vs MeSH); cataloger idiosyncrasy or mistake; LCSH cataloging principles controlling the level of specificity of headings; or records from third party vendors that do non-standard or poor subject cataloging, or no subject cataloging at all. (Many of our ebook records have non-LCSH subject headings, or none at all.)

This last issue of non-standard or poor authority control could be even more significant if in the future we load records from non-MARC sources into Catalyst, such as JScholarship or other local databases; such records will often be controlled with different vocabularies, if controlled at all. So the more of these we add, the lower the ‘recall’ of a very-high-precision headings-browse type search will be. We’re trying to create a basic foundation that can reasonably accommodate such diversity of metadata in a single search.

(But note well, this does not mean that controlled vocabulary is irrelevant. It is very valuable in adding terms that may not be in transcribed fields in a standard way, and Catalyst search takes advantage of this by indexing controlled vocabularies, and boosting matches on controlled vocabulary terms higher in the result set. The left-hand “limits” or “facets” also absolutely depend on standardized controlled vocabulary assignment by catalogers to achieve consistent groupings into, e.g., ‘topics’ or ‘authors’.)

Keyword search may in many cases provide a better starting point. Even an “all fields” search instead of a “subject” search may be better at the recall/precision balance, as it will include matches to your search terms in title, table of contents, or summary fields, in addition to subject headings fields. Of course, subject headings are still searched in an ‘all fields’ search, and hits on subject terms are boosted in the results.

If an ‘all fields’ search on your topical terms returns too many results, especially results not relevant to you (too low precision), switching to a Catalyst “Subject” field search will give you higher precision, while still matching multiple forms of subject headings containing your words (for instance, from different vocabularies). If that’s still not precise enough, you can increase precision yet further by choosing one of the topical limits on the left, to limit to just that LCSH subdivision (LCSH subdivisions are where the topical limits come from). You can also click on a subject heading displayed for a given record, to do a new search for the words and phrases in that subdivision. Exactly which of these techniques will be useful to you depends on the particular search you are doing, but these are tools to increase precision if your initial search is too high-recall/low-precision.

We think you can often find very relevant results using this method that you would have missed in a traditional library alphabetic headings browse. Give it a try, see if that is the case.

(An additional downside of the traditional headings browse is that it requires an extra step or steps (identifying subject headings from a list, and possibly switching back and forth between the headings list and the records list multiple times), which may not serve users well when they want a simple, basic search. “Simple things should be simple.”)

(Of course, there is still plenty of room for improvement in the Catalyst approach as well. For instance, authorities “see from” synonyms could be included in the keyword search; they are not at present. And there are other alternatives we could provide for sophisticated searching, again balancing our resources to develop new features against user need.)


6 Responses to Information Retrieval and relevance ranking for librarians

  1. I have written a few blog postings about TFIDF, specifically designed for librarians. Of particular interest may be “TFIDF In Libraries: Part I of III (For Librarians)”. –ELM

  2. Candy Schwartz says:

    This is perfect as an example I can share with my org students after giving them the basics of TFIDF. Thanks

  3. carolslib says:

    This is brilliant, thanks! As I read it I’d think “but what about…” and the next section would cover that area. I really like how you describe the TF-IDF, which is how the earliest search engines worked (or so I recall) and how it evolved to more weighted searching. Having both recall and precision is important, especially since it will often help the user narrow or expand based on results received.
    Thanks very much – this is a fantastic post!

  4. Dorothea says:

    Love this. It’ll end up in my syllabus for sure.

  5. Pingback: Lamenting the poor support of the expert user | Christina's LIS Rant

  6. Pingback: customizing Blacklight: “disable automatic stemming” | Bibliographic Wilderness
