persistent search urls; can be trickier than it seems

One simple thing everyone wants (or ought to want) in a ‘discovery layer’ these days is persistent URLs for nearly all pages, including both individual record pages and search results pages, as well as other ancillary pages.

By persistent urls, we mean a url which you can bookmark, or include in a blog post, or tweet, or send in an email — and when you or someone else later accesses the URL, it will still work, and still point to the same page it did before.

The uses in that list beyond ‘bookmark’ are probably more important than actual bookmarking. They’re what let our catalogs or discovery layers participate in the community, conversation, and information ecology of the web. For instance, librarians here at my place of work are already using persistent URLs to particular searches in our new Blacklight-based catalog to point to what we have on a certain topic, linking in blog or Facebook posts.

Many of our legacy OPACs failed at persistent URLs in the most basic way: all or most of their URLs not only had a session ID in them, but could not be interpreted by the software at all outside the context of that session.

Web software stopped being designed that way oh, about 10 years ago, when people started to realize that making URLs good web citizens was important for usability and power.

So our ‘next generation’ interfaces start out (rightfully) by avoiding this, and having persistent URLs that do not include session IDs.

And that’s pretty much it for your individual record URLs, which will then probably look something like “/records/12345”. But there are still a couple of tricks for reliably persistent search URLs.

Avoid exposing back-end implementation in your URLs

Because back-end implementation may change. Here’s a good example, assuming your discovery layer is based on Solr, which it probably is if you have any control over its URLs!

A very simple discovery layer might expose Solr field names in the URLs it presents to users. In a standard Solr dismax search, you tell Solr which fields you want searched, with what boosts, in a ‘qf’ parameter; an author search might be expressed to Solr with this parameter: “&qf=main_author^100,other_author^50,transcribed_author”.

The most basic approach to a discovery layer may put that same parameter in your application’s URLs, the ones your users see, too.

But then what happens if you later decide to change your Solr schema? Maybe you realize you want to boost the fields differently. Or maybe you realize you don’t need three separate author fields, and you’re going to combine them all into one. (Or that you actually need 6 instead of 3!)

When you make that change in your underlying Solr, if users have bookmarked URLs which actually specify  “&qf=main_author^100,other_author^50,transcribed_author”, in the best case they’ll no longer get the same ‘author search’ that fresh searches get, and in the worst case the search won’t work at all, or will even result in an error message.

So keep implementation details like that out of your user-facing URLs.  Your user-facing urls should just say something like “&search_type=author”, and then your application will translate that to what to send to Solr.
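Here’s a minimal sketch of that translation step in plain Ruby, not actual Blacklight code. The constant and method names are hypothetical, and the “all_fields” entry and its “text” field are made up for illustration; only the author ‘qf’ value comes from the example above.

```ruby
# Hypothetical mapping from user-facing search types to the Solr
# parameters they currently stand for. Only this table needs to change
# when the Solr schema or boosts change.
SEARCH_TYPES = {
  "all_fields" => { "qf" => "text" }, # assumed catch-all field
  "author"     => { "qf" => "main_author^100,other_author^50,transcribed_author" }
}.freeze

DEFAULT_SEARCH_TYPE = "all_fields"

# Translate the parameters from a user-facing URL into the parameters
# sent to Solr. An unknown search_type falls back to the default, so an
# old bookmark gets a sensible search rather than an error.
def solr_params_for(user_params)
  search_type = user_params.fetch("search_type", DEFAULT_SEARCH_TYPE)
  backend = SEARCH_TYPES.fetch(search_type) { SEARCH_TYPES.fetch(DEFAULT_SEARCH_TYPE) }
  { "q" => user_params["q"].to_s }.merge(backend)
end
```

A bookmarked “&search_type=author” keeps meaning ‘author search’, whatever the table maps it to this year.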

I forget if Blacklight ever exposed this level of Solr field name detail in its URLs; but it did initially put Solr request handler names in the user-facing URLs, with a different request handler for each search type. This was a similar problem: if you changed the organization or names of your request handlers, or changed your mind about mapping user search types one-to-one to Solr request handlers, previously saved URLs could break, and things were just tricky. So it’s no longer this way.

Blacklight does still expose back-end Solr field names in its “sort” parameter, which is in fact a bad design, and hopefully we’ll get around to fixing it at some point.
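One way the sort parameter could be handled, as a rough sketch rather than anything Blacklight actually does: the URL carries a short stable key, and the application looks up the Solr sort clause. The keys and Solr field names below are all hypothetical.

```ruby
# Hypothetical stable sort keys => Solr sort clauses. The field names
# (pub_date_sort, title_sort) are assumptions for illustration.
SORT_OPTIONS = {
  "relevance" => "score desc",
  "newest"    => "pub_date_sort desc",
  "title"     => "title_sort asc"
}.freeze

# Unknown or missing keys fall back to relevance instead of being
# passed through to Solr.
def solr_sort_for(sort_key)
  SORT_OPTIONS.fetch(sort_key.to_s, SORT_OPTIONS["relevance"])
end
```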

Keep back-end implementation details out of your user-facing URLs; instead your application needs to translate your user-facing URLs to the appropriate back-end implementation.  Then you can change your back-end implementation details, and previously stored URLs will continue working as you want them to.

Controlled vocabulary values

In general, we want a saved search results URL to resolve to results for the same search when later clicked. We realize that it won’t necessarily resolve to the same record list if the database has changed; that is generally the desired behavior.

If you search for barbadensis you might get two hits. If you bookmark that URL (or put it in a blog post), and come back to it 6 months later and click on it…. if those two records have been deleted from your catalog and no new records including that term have been added, you’ll get 0 hits.  This is fine and expected, it’s the kind of persistence we want out of our search results URLs.

But what about controlled vocabularies? Say you have specifically done a search for an author’s name, “Smith, John, 1930-”, to get search results with records attached to that specific controlled author’s name, and a death date is later added to the authorized heading. Then when you later repeat the search for “Smith, John, 1930-”, you might get 0 results (depending on how the software does searches), because there are no longer any records with “Smith, John, 1930-”; they’ve all got “Smith, John, 1930-2010” instead.

Or same thing with LCSH, perhaps an LCSH controlled heading is changed to a more modern term, and your saved search for “Cookery” gets you 0 results, because it’s all “Cooking” now.

We could hypothetically imagine a solution to this, where perhaps the software doesn’t put the actual user-presentable heading in the URL, but instead puts the LCCN for the authority record in the URL.  When the UI lets users choose limits from a controlled vocabulary (for example “facet” style), it’s relatively straightforward how this hypothetically could work.

Hypothetically this is the right solution, but in fact it’s not very feasible in our actually existing environment. Our actual corpuses probably include mix-and-matched vocabulary from several vocabularies (LCSH, NLM, locally created subjects, other stuff), which may or may not be clearly identified as to which vocabulary they come from. Each vocabulary may or may not be exposed in the updated machine-readable way (id.loc.gov is a great start) necessary to support software working this way. And even if they are, we don’t really have the tools (starting with our ILSs) to manage the data in this way.

Not even to get into changes that aren’t just a controlled heading being changed, but a previously existing LCSH being merged together with one or more others, etc. And free-floating subdivisions combined in not entirely machine-predictable ways with other headings, where multiple components might have an LCCN. And then trying to make it work for textually entered searches for “Cookery” too, which requires a different approach.

We just don’t have the tools and infrastructure to make it feasible to deal in persistent identifiers for terms in our large controlled vocabularies rather than the actual user-presented terms.

And a search returning 0 results because an NAF or LCSH heading has changed just isn’t a big enough problem to justify purely local investment into the non-trivial infrastructure to try to deal with it. So it goes, we can imagine a better infrastructure, maybe some day we’ll have it, but for now, we just call it ‘good enough’.

But other than these gigantic controlled vocabularies we use for name and subject headings, we have much smaller, sometimes local controlled vocabularies, that are amenable to a better approach. And sometimes require a better approach. Sometimes we don’t even think of these as ‘controlled vocabularies’. Let’s look at an example.

Format limits

Our discovery layer provides a ‘facet’ limit in the sidebar by ‘format’.  You can, for instance, choose to look at a search limited to just Videos. 

And here’s what the URL we hope to be persistent looks like for that search:

https://catalyst.library.jhu.edu/?
   f%5Bformat%5D%5B%5D=Video%2FFilm&
   q=nicaragua

It’s not as pretty as we might like, let’s ignore all that “%5b” stuff for now; but it does work as a persistent URL you can link to in a blog post or facebook status update, and people clicking on it later will get the same search.  But do take note that “=Video/Film”, the actual term we present to the user is in the URL. (That %2F is just a url-encoded ‘/’).
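If you want to see what all that “%5B” stuff amounts to, Ruby’s standard library will decode the query string for you. This is just an illustration using the query from the URL above; the variable names are mine.

```ruby
require "uri"

query = "f%5Bformat%5D%5B%5D=Video%2FFilm&q=nicaragua"

# decode_www_form returns an array of [name, value] pairs with the
# percent-encoding removed.
pairs = URI.decode_www_form(query)

pairs.each { |name, value| puts "#{name} = #{value}" }
# f[format][] = Video/Film
# q = nicaragua
```

Rails turns a parameter named f[format][] into a nested params[:f][:format] array, so the user-presented label “Video/Film” is exactly the value the application ends up matching against.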

We use basically our own list of “format” values; the formats a record belongs to are derived from the MARC record, but it’s not a one-to-one mapping. Most other people have had to do something similar. What’s actually in the MARC record is several different, sometimes inconsistent or conflicting, controlled vocabularies, none of which are really suitable for showing directly to the user; you’ve got to translate them into something that is. So we did, and we get a list including “Book”, “Video/Film”, “Online”, “Conference”, “Map/Globe”, etc. (Our list isn’t entirely intellectually consistent either; figuring out how to present form/format/genre to the user in a consistent and understandable way is a hard problem! So we just came up with a bucket of things that seem to work to help the user.)

Oh yeah, so in fact, this list of things we came up with is a ‘controlled vocabulary’ too, although it doesn’t much matter if you think of it that way.

So we have a persistent URL that works fine… but what if you realize, oh, “Video/Film” isn’t the best thing to call that category, let’s call it “Moving Pictures” instead? Okay, we wouldn’t do that, but what we did do was decide that what we had been calling “Serial” we wanted to call “Journal/Newspaper” instead.

The problem is that users might have a bookmarked URL that says “restrict to Format ‘Serial’.” And if we just go ahead and make this change, that bookmarked URL will result in 0 hits, because there are no longer any records with Format ‘Serial’; they’ve all got ‘Journal/Newspaper’ instead.

So what do we do? Now hypothetically, since this is a “controlled choice” facet-style limit and not free-entered text, we could use that method we talked about before, of putting some internal id in the URL, rather than the actual user-presented label. Maybe the url would say “format=5”, and the software would know that “5” is the limit formerly known as “Serial” now known as “Journal/Newspaper”. Problem with this approach is:

  • In actual real life, we already had deployed our software with the actual labels in there, so even if we changed it to use persistent identifiers/codes instead, we’d still be breaking any existing links.
  • You know, it’s just over-engineered. Yeah, it seems “correct”, and it’s really general purpose; that strategy could be used for all sorts of things. But it’s a pain to implement, and we just don’t need it yet. If someone wrote some general purpose controlled vocabulary handling feature for Blacklight that we could use for this too, then maybe it’d be worth using here, but it’s not worth me trying to write such a feature for this case; it’s just overkill.
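For what it’s worth, the identifier approach we decided against would look roughly like this. It’s a sketch only: the code “5” comes from the example above, while the “6” entry and the method names are invented for illustration.

```ruby
# Hypothetical stable codes for format values. The URL would carry the
# code; renaming a format only changes the label on the right.
FORMAT_LABELS = {
  "5" => "Journal/Newspaper", # formerly "Serial"
  "6" => "Video/Film"
}.freeze

# Label to show the user for a code taken from the URL (nil if unknown).
def format_label(code)
  FORMAT_LABELS[code.to_s]
end

# Code to put in the URL for a label chosen in the UI (nil if unknown).
def format_code(label)
  FORMAT_LABELS.key(label)
end
```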

So what’d I actually do?

  1. Re-index the database so both “Serial” and “Journal/Newspaper” show up in the facet list, containing the exact same records, they are synonyms. Wait overnight for the re-indexing to complete in the live production environment.
  2. Add a redirect to the application, so if a URL comes in asking for “Format ‘Serial’”, it will actually redirect the browser to the same URL asking for “Journal/Newspaper” instead.

    Now once this is done, even though both show up in the UI, if a user clicks on “Serial”, they’ll actually see “Journal/Newspaper” echo’d back to them instead. For instance if you try to follow this URL:

    https://catalyst.library.jhu.edu/?f%5Bformat%5D%5B%5D=Serial&q=nicaragua&search_field=all_fields

    (note it’s got “serial” in it), you’ll actually find yourself looking at this URL instead:

    https://catalyst.library.jhu.edu/?f%5Bformat%5D%5B%5D=Journal%2FNewspaper&q=nicaragua&search_field=all_fields

    (with “Journal/Newspaper” in it instead.) This UI effect isn’t actually the goal, but it’s not that bad; it’ll be temporary, because once this redirect is in effect:

  3. Re-index again removing “Serial” altogether, so now just “Journal/Newspaper” shows up. It’ll no longer be possible for anyone to choose “Serial” in the UI, since it won’t be presented. But if someone had a saved URL that asked for format=Serial and they click on it, it’ll still work as desired, redirecting them to format=Journal/Newspaper instead, hooray!

Not too hard, but at first I was about to just skip ahead to #3, replacing Serial with Journal/Newspaper in one fell swoop without the redirect; but I realized a more careful step by step approach was needed to avoid breaking saved URLs limiting to Serial. And limiting to serials/journals/whatever is a pretty popular thing to do, so odds are users would have noticed, perhaps even links in official library blog or facebook posts would have stopped working, etc. 
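The indexing side of steps 1 and 3 can be sketched like this. It assumes the indexer already produces a list of format labels per record; the method name and the keep_legacy flag are hypothetical, not part of our actual indexing code.

```ruby
# Renamed labels, old => new, taken from the rename described above.
RENAMED_FORMATS = { "Serial" => "Journal/Newspaper" }.freeze

# Step 1 (keep_legacy: true): index each record under both the old and
# the new label, so both facet values hold the same records.
# Step 3 (keep_legacy: false): index under the new label only.
def formats_to_index(formats, keep_legacy: true)
  formats.flat_map do |format|
    if RENAMED_FORMATS.key?(format)
      new_label = RENAMED_FORMATS[format]
      keep_legacy ? [format, new_label] : [new_label]
    else
      [format]
    end
  end.uniq
end
```

The flag only flips to false once the redirect from step 2 is live, which is the ordering that keeps saved URLs working throughout.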

Redirect implementation

In a Blacklight app, which is a Rails app, this kind of redirect is fairly straightforward to do with a Rails controller filter. Here’s mine, in CatalogController:

  before_filter :redirect_legacy_values, :only => :index

  def redirect_legacy_values
    should_redirect = false

    # Check for things we want to change; we mutate params in place,
    # because it shouldn't matter since we're only going to redirect
    # and stop further processing.

    if params[:f] && params[:f][:format] && (index = params[:f][:format].index("Serial"))
      params[:f][:format][index] = "Journal/Newspaper"

      should_redirect = true
    end

    redirect_to params if should_redirect
  end
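If more labels get renamed later, the same rewrite could be driven by a lookup table. Here is a sketch in plain Ruby that can be tried outside Rails; it assumes a plain nested Hash with symbol keys standing in for Rails params, and the constant and method names are hypothetical.

```ruby
# Legacy facet values => current values.
LEGACY_FORMATS = { "Serial" => "Journal/Newspaper" }.freeze

# Returns [params, changed]. Unlike the filter above, this does not
# mutate its input; it returns a rewritten copy when anything changed.
def rewrite_legacy_formats(params)
  f = params[:f]
  formats = f[:format] if f.is_a?(Hash)
  return [params, false] unless formats.is_a?(Array)

  rewritten = formats.map { |value| LEGACY_FORMATS.fetch(value, value) }
  return [params, false] if rewritten == formats

  [params.merge(f: f.merge(format: rewritten)), true]
end

new_params, changed =
  rewrite_legacy_formats({ q: "nicaragua", f: { format: ["Serial"] } })
```

In a controller, the filter would then redirect to new_params only when changed is true. Adapting this to the real Rails params object is left as an assumption here.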
