Combining information from catalog and link resolver kb

Our link resolver, branded Find It, is powered by Umlaut and backed by SFX and other sources. It aims to comprehensively answer the question “what library services can be provided for this known citation?” One of the more important services is electronic full text access.

Because of the legacy data environment we’ve inherited, in order to advertise online full text access for a given citation (usually an article, but sometimes a journal or even a monograph), the software needs to combine information from the MARC records in our traditional ILS (via the 856 field) with data from the link resolver (SFX) knowledge base. Unlike our MARC data, the SFX kb is capable, when it works, of bringing the user to an article-level link for their article citation, and of knowing whether we have access to the particular article cited.

This is tricky to do without giving the user misleading, confusingly duplicated, or missing information.

Here is a description of the heuristic algorithm Umlaut uses to decide whether to include information from each source, and how. This explanation demonstrates how this is necessarily an imperfect and approximate process, due to our inherited legacy data environment, but I think what we’re doing now works reasonably well. Slightly edited from internal documentation I prepared for my coworkers.

Why combine?

Why not just use links from SFX? (link resolver)

  • Because there is information about full text links in Horizon that is not included in the SFX database.

Why not just use links from Horizon? (traditional MARC ILS)

  • Because SFX’s information is more powerful for supporting the user’s work. The majority of Find It use is for journal articles:
    • For a journal article, the SFX link should take the user directly to the article requested; the Horizon link will always take the user only to the journal title page and require additional navigation from the user.
    • The SFX kb includes machine-processable range-of-coverage information, so an SFX link will only be shown if the SFX kb thinks the article requested is within the range-of-coverage. Horizon links, when they have ranges of coverage at all, do not have machine-processable ranges of coverage, so Find It is unable to determine if the Horizon link is supposed to include the specific article cited or not.
  • There are some links in SFX that are not in Horizon, although this gap is growing smaller.
  • Horizon links historically were less reliable than SFX links (more likely to be broken), and much less likely to have even human-readable range-of-coverage information, although this gap is narrowing too.
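
To make the “machine-processable range-of-coverage” point concrete, here is a minimal sketch in Ruby. The Coverage struct, its field names, and the years are all hypothetical illustrations, not the SFX kb’s actual data model. The point is only that a structured range can be tested against a citation year, while a free-text note can be shown to the user but not reliably tested.

```ruby
# Hypothetical structure (not the SFX data model): a coverage range with
# optional start and end years. A nil bound means open-ended on that side.
Coverage = Struct.new(:from_year, :to_year) do
  # True if the given citation year falls inside the range.
  def covers?(year)
    (from_year.nil? || year >= from_year) &&
      (to_year.nil? || year <= to_year)
  end
end

# Illustrative data only, not real holdings.
kb_coverage = Coverage.new(2001, nil) # 2001 to present

kb_coverage.covers?(2005) # => true
kb_coverage.covers?(1995) # => false

# A catalog-style coverage note is free text. Software can display it,
# but cannot reliably decide from it whether a given article is covered.
catalog_note = "Full text available from 2001"
```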

Why not just combine simply?

The first approach might be simply displaying all the links Find It can find from Horizon together with all the links Find It can find from the SFX kb.

However, this would lead to some unfortunate consequences for the user, mostly related to the fact that some platform links exist in both Horizon and the SFX kb, but in different forms with different capabilities.

Double-listing

In many cases, under this simple approach, a display would include the same link twice, once from Horizon and once from SFX. The links would be labeled slightly differently in the display, as there’s different label information from each source.

While this alone would be unfortunately confusing to the user, what’s worse is that these links aren’t in fact exact duplicates: the Horizon link is never going to take the user right to the article requested. So including the duplicate but less functional Horizon link is really a disservice to the user.

Example:

(Temporary live link: http://findit.library.jhu.edu/findit-dev/go/16097)

Listing Horizon link despite lack of availability

Even worse, in other cases, the user may be asking for an article from a journal we have some coverage of, but not coverage including the specific article requested. The SFX kb is smart enough not to display the link in this case; but the Horizon db is not.

So in these cases, even though we have information (in SFX) that we do not in fact have availability for the requested article, and no SFX-derived link will be provided, a non-functioning Horizon-derived link would be provided anyway.

This leads to user frustration; a main goal of Find It is to, wherever possible, not give a user a link that won’t in fact work.

Example

SFX knows that the Cambridge UP range of coverage extends back only to 1999.


Temporary live link: http://findit.library.jhu.edu/findit-dev/go/16108

But on an article-level request for a 1995 article, the Horizon Cambridge link is still shown:


Temporary live link: http://findit.library.jhu.edu/findit-dev/go/16118

General concept of a solution

The general idea of a solution, then, is to show Horizon links only when they do not represent platforms that SFX already knows about. Since the SFX kb is more fully functional than the Horizon db in terms of links, we try to consider the Horizon db ‘supplemental’, only using it when it’s got information that is not also in SFX. That would prevent double-listing, and prevent listing of Horizon links when SFX knows that we don’t actually have access (to the specific requested article) after all.

However, it’s trickier than it sounds to implement this in software, because it’s not always entirely clear from a Horizon record whether a given link is something the SFX kb knows about or not.

All the Find It software has to go on is the URL in the Horizon record. From this URL, Find It needs to decide if it’s something the SFX kb knows about or not.

Find It tries to do that by keeping a list of URL hostnames that it believes SFX knows about, and ignoring Horizon URLs whose hostname matches one on the list. But this is subject to a couple of problems:

  1. It’s tricky to get an accurate list of ‘hostnames SFX knows about’; some things may be left off.
  2. Even if a Horizon URL matches a hostname SFX knows about, the particular Horizon link may represent a different platform on that hostname that SFX does not know about. The best example of this is with ebooks. For instance, SFX in general knows about “ebscohost.com”. However, we have e-book links from ebscohost.com, and SFX does not know about them.
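
The hostname check described above can be sketched roughly as follows. This is an illustration, not Umlaut’s actual code: the method name sfx_known_host?, the constant SFX_KNOWN_HOSTS, the sample list, and the sample URLs are all hypothetical, and matching subdomains by suffix is an assumption about what “matching a hostname” means here.

```ruby
require "uri"

# Hypothetical list for illustration. In Find It, the real list is
# derived from the SFX kb.
SFX_KNOWN_HOSTS = ["ebscohost.com", "example.com"].freeze

# True if the URL's host equals a listed hostname or is a subdomain of one.
# (Suffix matching is an assumption made for this sketch.)
def sfx_known_host?(url, known_hosts = SFX_KNOWN_HOSTS)
  host = URI.parse(url).host
  return false if host.nil?

  host = host.downcase
  known_hosts.any? { |known| host == known || host.end_with?(".#{known}") }
rescue URI::InvalidURIError
  false
end

sfx_known_host?("http://search.ebscohost.com/login.aspx?db=aph") # => true
sfx_known_host?("http://www.example.org/journal")                # => false
```

Note that a check like this cannot tell a journal platform from an e-book platform on the same host, which is exactly the second problem listed above: an e-book link on ebscohost.com would match the list even though SFX does not know about it.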

So Find It has developed an approximate ‘heuristic’ algorithm to do the best it can to display Horizon links only when they are likely to actually be supplemental to what’s in SFX, and not when they are likely to duplicate what’s in SFX. This is an imperfect compromise to try to give the user the best display we can; it will invariably make some mistakes in the directions of both inclusion and exclusion. But we’ve tweaked the algorithm from experience to try to get an optimal compromise.

The Find It Horizon inclusion algorithm

This is the compromise attempt at optimal rules that Find It implements as of now (Nov 2009). As we tweak the algorithm in response to problem cases, it may continue to evolve.

  1. Find It first assembles a list of bibs from Horizon that ‘match’ the Find It request. Then it looks for (MARC 856) links within these matches.
    • This is a somewhat imperfect process to begin with. Depending on what information was supplied to Find It with the request, and what information is in the Horizon records, Find It may miss some records (‘false negatives’) or get some incorrect ‘matches’ (‘false positives’) here.
    • However, if the Find It request originates from a catalog detail page, false negatives are unlikely, because Horizon bibID is used to make sure the originating record is considered a ‘match’.
  2. If a matching Horizon MARC record is not for a journal, then Find It displays its links no matter what. This is because the SFX kb generally does not include e-books, so we were incorrectly excluding too many Horizon e-book links otherwise.
  3. If the request is A) for a title-level citation (not an article-level citation, but a citation for a journal as a whole), and B) there are no links provided from SFX, then Horizon links are shown no matter what.
    • In this case, the Horizon link is highly likely to represent something SFX does not know about. Since it’s a title-level citation, there are no range-of-coverage issues. So if nothing is coming through from SFX, the Horizon links are highly unlikely to ‘duplicate’ anything from SFX, they are likely to be unique information we want to show the user.
  4. Otherwise, the request is for a specific article, and the Horizon record is for a journal. In these cases, Find It will display the link only if its hostname does not match the list of hostnames Find It believes SFX knows about.
    • Find It maintains a list of hostnames that it believes the SFX kb knows about. It maintains this list by an automatic extraction from the SFX kb. However, this list is supplemented by hand with URLs that could not be automatically extracted from the SFX kb. (For instance, automatic SFX extraction tells Find It that “ebscohost.com” is something SFX knows about. However, we have manually supplemented that with “epnet.net”, knowing this is really the same thing.)
      • Technical info: This manual list is specified in the Umlaut file /config/umlaut_config/initializers/umlaut/resolve_logic.rb, the variable “additional_sfx_controlled_urls”
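
Rules 2 through 4 above can be sketched as a single decision method. This is a simplified illustration under stated assumptions, not the actual Umlaut implementation: the Request and CatalogLink structs, all method and constant names, and the sample URLs are hypothetical, and the record matching from rule 1 is taken as already done. The sketch also assumes that a title-level request that does have SFX links falls through to the hostname check, which the rules above do not spell out.

```ruby
require "uri"

# Hypothetical stand-ins for what Find It knows about a request and about
# one 856 link from a matched catalog record.
Request = Struct.new(:title_level, :sfx_link_count)
CatalogLink = Struct.new(:url, :record_is_journal)

# Automatically extracted hostnames plus hand-maintained additions.
# Both lists here are illustrative only.
EXTRACTED_HOSTS = ["ebscohost.com"].freeze
ADDITIONAL_HOSTS = ["epnet.net"].freeze
KNOWN_HOSTS = (EXTRACTED_HOSTS + ADDITIONAL_HOSTS).freeze

# True if the URL's host is a listed hostname or a subdomain of one.
def known_to_sfx?(url)
  host = URI.parse(url).host.to_s.downcase
  KNOWN_HOSTS.any? { |known| host == known || host.end_with?(".#{known}") }
rescue URI::InvalidURIError
  false
end

def show_catalog_link?(request, link)
  # Rule 2: links from non-journal records are always shown.
  return true unless link.record_is_journal
  # Rule 3: title-level request with nothing from SFX, so always show.
  return true if request.title_level && request.sfx_link_count.zero?

  # Rule 4 (and, by assumption, any remaining case): show the link only
  # if its hostname is not one SFX is believed to know about.
  !known_to_sfx?(link.url)
end

article_request = Request.new(false, 1)
title_request   = Request.new(true, 0)

ebook   = CatalogLink.new("http://search.ebscohost.com/ebook/123", false)
journal = CatalogLink.new("http://search.epnet.net/journal/456", true)
other   = CatalogLink.new("http://journals.example.org/abc", true)

show_catalog_link?(article_request, ebook)   # => true  (rule 2)
show_catalog_link?(title_request, journal)   # => true  (rule 3)
show_catalog_link?(article_request, journal) # => false (rule 4, known host)
show_catalog_link?(article_request, other)   # => true  (rule 4, unknown host)
```

Ordering the rules as early returns keeps each exception visible on its own line, which matters for a heuristic that is expected to keep changing as problem cases turn up.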

4 Responses to Combining information from catalog and link resolver kb

  1. Where I work (Yale), we followed a different and simpler path to coordinated results from our link resolver and our catalog. We use Voyager for the catalog and SFX for the link resolver. Basically, we subordinated the catalog (for e-serials) to the SFX KB. The catalog has records only for e-serials that we have in the SFX KB (we use a MARC record service to get the records for the catalog and load updates monthly after the SFX KB is loaded). The records in our catalog use the SFX KB to link to all of the versions of that title we have access to. So our catalog depends on the SFX KB for the title-level MARC records, current holdings data (which reside in the SFX KB, not the catalog) and the access links to the e-serial (which reside in the SFX KB, not the catalog). Works for us.

  2. jrochkind says:

    Yes, if you can actually coordinate/rationalize your data stores, that’s definitely preferable.

    Here, the cataloging dept did not believe that was desirable/possible. We have things in the catalog that are in SFX, and things in the catalog that are not. And we have things in SFX that are in the catalog, and things that are not. And the cataloging dept believes this is a necessary state of affairs that nothing feasibly can be done about short-term.

    If the back-end data stores could be coordinated/rationalized instead, that would definitely be far superior.

  3. Well, it wasn’t easy to get people to see that this was desirable; and it isn’t easy to keep people seeing that this is desirable. It involves a lot of cross-unit coordination and requires that we think differently about the catalog and the SFX KB. The catalog (for e-serials) is secondary, a derivative of another database and process. The SFX KB isn’t simply a stand-alone tool that does link resolution and generates an A-Z list of titles: it is more the database of record for our e-serials than anything else, but it isn’t quite that either. It needs the catalog to complete the information we have about our e-serials: bibliographic and payment information, for instance. In its way, it is more complicated than your algorithm! Perhaps we can think of it as a social algorithm.

