article search, and catalog search

While we in libraries typically spend more of our resources on ‘catalog’ search, in academic libraries a significant portion of users probably spend more of their time looking for articles instead. (Anyone have a cite to any research showing this?)

So many of us are trying to spend more resources on supporting article search in a way that is integrated into our web infrastructure, with a good user interface, instead of as a hacky afterthought.

I had been assuming that the way to do this was to, somehow, provide a single search interface that would search both over the traditional catalog/ILS database, and over vendor content including scholarly articles, in one merged result list. But this is a very tricky thing to do, and most of the feasible ways to do it seriously constrain our infrastructure choices to dependency on a single vendor’s stack. However, some interesting research from UVa suggests that users may not want actual merged search results, may indeed specifically prefer not to have it. This would actually open up our options some.

Catalog Discovery Options

As I’ve mentioned before, we undertook a survey of “next generation” “discovery service” options around two years ago, focused on replacing our traditional OPAC — that is, on search over the ILS/catalog database.

At that time, looking at our options, we decided that the various proprietary options available would take pretty much as much local programmer work to set up as an open source option; and they still wouldn’t give us a lot of the features we wanted, including some sophisticated search features, as well as seamless UI integration with ILS features like account functions (items out, making requests) and live shelf status (checked out, etc.).

So we decided to go with an open source solution that would (we thought) actually have a cheaper TCO, and give us the ability, if we could afford the development time, to do things just how we wanted. In retrospect, the open source solution took more time than we expected, but I do think we ended up with a better interface than we would have gotten from even today’s (two years later) proprietary products, with better UI integration with ILS functions. While we don’t have ‘browse search’ yet (and it’s non-trivial to add to Solr-based software), if we can afford the time to spend on it, I am confident I could code it up — instead of just waiting/hoping for a vendor to do so in a proprietary product. We have more control, and more options.

In the context of the options we evaluated two years ago, I think we made the right choice.

But the context has changed somewhat, which initially caused me to second guess a bit.

The Rise of Aggregated Index Offerings

Two years ago when we did our survey, it was before the rise of vendor-provided aggregated indexes. These are indexes of third party content — initially, and still mostly, composed of scholarly articles indexed at the article level — combined from various third party sources. Examples are Serials Solutions Summon, Ex Libris Primo with Primo Central, and OCLC WorldCat Local (which existed two years ago, but had not yet added significant third party article-level content, as it since has).

This class of product marks a new option for providing “article searching” support to our patrons. Historically, there have been several ways libraries tried to provide this support. One way is simply giving the user a list of licensed databases, and letting them go to a variety of third party vendor platforms to do these searches. While OpenURL allowed such an approach to still, hackily, integrate with library services and alternate full text destinations, this still required users to deal with a bunch of very different interfaces, fairly poorly integrated into the library web infrastructure.
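To make the OpenURL piece concrete: the standard (Z39.88-2004) just packs a citation into key/value pairs on a URL pointing at the library’s link resolver, which then works out local full text and services. A minimal sketch in Python — the resolver hostname and the citation values here are invented for illustration, though the key names are the standard KEV ‘journal’ format keys:

    from urllib.parse import urlencode

    # Hypothetical base URL for a library's OpenURL link resolver.
    RESOLVER = "http://resolver.example.edu/openurl"

    # Z39.88-2004 KEV keys for a journal-article ContextObject;
    # the citation values are made up for illustration.
    citation = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.atitle": "An Example Article",
        "rft.jtitle": "Journal of Examples",
        "rft.issn": "1234-5678",
        "rft.volume": "12",
        "rft.spage": "34",
        "rft.date": "2010",
    }

    link = RESOLVER + "?" + urlencode(citation)
    # This is the kind of link a vendor platform embeds next to each hit.

A vendor platform generating links like this lets the library’s resolver inject its own services — but the user still has to be on the vendor’s platform, in the vendor’s interface, to get that far.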

The next approach, at least 15 years old, was broadcast federated search, where products — such as Ex Libris Metalib or IndexData MasterKey — actually try to search a handful (or dozens, or even hundreds) of third party databases every time the user enters a query, and blend the results. This gives us a single interface for searching multiple vendors’ databases, and allows us to integrate that interface (to some extent) into a unified web presence… but it just doesn’t work very well, for somewhat insurmountable technical reasons.
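To see why, here’s a minimal sketch (in Python, with invented endpoint URLs — real connectors speak Z39.50 or scrape vendor HTML, which is even more fragile) of what a broadcast federated search engine has to do on every single query:

    from concurrent.futures import ThreadPoolExecutor, as_completed
    import requests

    # Invented endpoints standing in for remote vendor databases.
    SOURCES = {
        "vendor_a": "http://search.vendor-a.example/api",
        "vendor_b": "http://search.vendor-b.example/api",
        "vendor_c": "http://search.vendor-c.example/api",
    }

    def search_source(url, query):
        # Each remote source gets only a few seconds before we give up.
        resp = requests.get(url, params={"q": query}, timeout=3)
        resp.raise_for_status()
        return resp.json().get("results", [])

    def broadcast_search(query):
        merged = []
        with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
            futures = [pool.submit(search_source, url, query)
                       for url in SOURCES.values()]
            for future in as_completed(futures):
                try:
                    merged.extend(future.result())
                except Exception:
                    pass  # slow or broken sources silently drop out
        # No shared relevance score exists across sources, so any
        # "blended" ordering of `merged` is a guess.
        return merged

Every query pays the latency of the slowest source (or silently loses its results), and the ‘blending’ step has no comparable relevance scores to work with — which is why the results never feel like one coherent search.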

I wrote about this four-and-a-half years ago, and suggested the only way to get a good “meta-search” user experience was to harvest/collect metadata from these disparate third party databases, and combine them in a single aggregated index. (What I called a ‘local index’ in that article I really should have called an ‘aggregated index’ — the word ‘local’ is unclear, local to what?).

I still believe this to be true — and I think the rise of the aggregated index products bears this out — but I’ve come to realize it takes an enormous effort to do this well (including keeping your data up-to-date on an ongoing basis), and it’s infeasible for an individual library to do this on their own. It would take a large consortium, or a large vendor, to have the resources and economy of scale to pull this off.  Thus the rise of the vendor-provided aggregated index products.
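For a sense of what ‘doing this well’ involves, even the trivial core loop — harvest from each provider, normalize to a common schema, index — looks something like the sketch below (Python; the provider feed URL and record fields are invented, and I’m assuming a local Solr core named ‘articles’ with the JSON update handler). Everything this sketch leaves out — deduplicating across providers, merging near-identical records, re-harvesting updates forever — is exactly the part that takes consortium or vendor scale:

    import requests

    # Assumed local Solr core; a real pipeline would batch and schedule this.
    SOLR_UPDATE = "http://localhost:8983/solr/articles/update?commit=true"

    def harvest(feed_url):
        # Pretend each provider exposes records as a JSON list; in
        # practice it's OAI-PMH, FTP dumps of MARC or XML, etc.
        return requests.get(feed_url, timeout=30).json()

    def normalize(record, provider):
        # Map each provider's own schema onto our common index fields.
        return {
            "id": "%s:%s" % (provider, record["id"]),
            "title": record.get("title", ""),
            "issn": record.get("issn", ""),
            "provider": provider,
        }

    docs = [normalize(r, "provider_x")
            for r in harvest("http://provider-x.example/records.json")]
    requests.post(SOLR_UPDATE, json=docs, timeout=60)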

The Trade-Off

So Summon, Primo (with Primo Central), and WorldCat Local all include such an aggregated index, hosted on the vendor’s platform. They also all will give you a product that includes your local metadata (ie, what you control locally through your ‘catalog’, including your own physical holdings). This allows what many of us thought was the “holy grail” of library search — combining ‘catalog’ search with ‘article’ search in a single interface, allowing you to search both at once and returning a single merged, relevancy-ranked result set. (That last clause is key; we’ll return to it shortly.)

But it’s completely incompatible with an open source, we-control-the-software-and-can-add-features approach. Sure, some of these vendors give you limited APIs, allowing you to (at potentially great cost in development time) put your own ‘skin’ on their search, potentially integrating with local systems (for instance, patron account screens) better than the out-of-the-box product. (And some of them don’t give you sufficient APIs even for that.) But they’re all based on indexes that reside on a vendor’s server, and you don’t have access to change the underlying indexing routines, strategies, fields, etc. Nor will any of these vendors share their aggregated data with you. (Perhaps because of licensing agreements with the providers they harvest from; perhaps just because it would be too much trouble and expense for the limited number of libraries interested in such a service, and what those customers would be willing to pay.)

There’d be no way to add a browse search, or a novel timeline or map search results display, or a novel authorities browse, etc.  You’re stuck with what the vendor gives you, and stuck paying for it, and by combining your catalog search and your article search in one vendor product, you’re putting all your eggs in one basket, limiting future flexibility, increasing cost of switching, etc.

But what you get is an aggregated index. This seems like a trade-off without a best-of-both-worlds option. But in fact that’s only true if we assume that users want/need to be able to enter a query and get back both ‘catalog’ and ‘article’ results in a single merged and relevancy-ranked result list.

Recent communications from librarians at UVa give us cause to question this assumption.

Users actually may not want single search

Julie Meloni from the University of Virginia writes on the blacklight listserv:

  • Process: A/B testing (really A/B/C/D testing) of four interfaces that offered some sort of aggregated search (Stanford, Michigan, Villanova, and University of Central Florida (who is doing the blended results/relevancy rankings if anyone remembers that conversation from NGC4Lib ) if you’re wondering). From those results we determined two critical pieces of data (among several others): patrons come to Virgo knowing the _type_ of item they’re looking for (e.g. book or article), and too much info in search results is not desired.
  • Really important point that came out in user testing here (of our patrons and their needs, with all due respect to others) is that patrons _did not_ want blended results. At all. Across the board dissatisfaction with that approach. This was awesome for us to hear because it meant that we _didn’t_ have to come up with some intricate/ tricky/very fragile way of maintaining article metadata (that legally we couldn’t hold anyway) in our own Solr index such that everything could have our own relevancy rankings applied and so on.

This is in fact a hugely important re-evaluation of our assumptions about what users want. As Julie says, it opens up our options a whole lot.

If we don’t need to provide blended search results, then it may indeed be possible to use a vendor-provided aggregated index combined with a different product, such as an open source Solr-based product, to provide searches.

You’d still want to provide an integrated look-and-feel that makes it look to the user like it’s one “product”. But if you don’t need to blend the results, behind the scenes it could be consulting your own Solr index for ‘local’ (ie catalog) content, and the remote vendor-provided database for ‘article’ content. Which is exactly what UVa plans to do.
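A minimal sketch of that behind-the-scenes split might look like this (Python; the vendor API URL and its response shape are assumptions, and the Solr core name is made up — the point is just that the two backends are queried independently and never merged):

    from concurrent.futures import ThreadPoolExecutor
    import requests

    def search_catalog(query):
        # Our own Solr index of locally controlled ('catalog') metadata.
        resp = requests.get(
            "http://localhost:8983/solr/catalog/select",
            params={"q": query, "wt": "json", "rows": 10},
            timeout=5)
        return resp.json()["response"]["docs"]

    def search_articles(query):
        # The vendor-hosted aggregated index, via whatever API it offers
        # (this endpoint and response shape are invented).
        resp = requests.get(
            "http://api.aggregated-index.example/search",
            params={"q": query, "page_size": 10},
            timeout=5)
        return resp.json().get("documents", [])

    def two_pane_search(query):
        # One search box, two independent result lists; the UI shows
        # them in separate panes, so we never have to reconcile the
        # two engines' relevance scores.
        with ThreadPoolExecutor(max_workers=2) as pool:
            catalog = pool.submit(search_catalog, query)
            articles = pool.submit(search_articles, query)
            return {"catalog": catalog.result(),
                    "articles": articles.result()}

Because each pane is ranked by its own engine, the hard (maybe impossible) problem of cross-engine relevance merging simply disappears.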

It’s not necessarily easy; some libraries will, at the moment, still find it easier to just pay a vendor to provide a single solution that does it all. Certainly it’s possible for these all-in-one vendor aggregated index products to provide separate “local” vs “non-local” content searches, if that’s what the user really wants. But so long as you don’t have to blend the results, it’s feasible to combine a local open source ‘catalog’ index with a remote vendor-supplied ‘articles’ index, maintaining more control over your local search functions — and this can become even easier and cheaper as libraries with the interest and resources to work on it provide more open source tools.

What do you call these two types of search?

So UVa says that users didn’t want blended results. They wanted to keep these ‘two types’ of searches — catalog and ‘article’ — separate. But this makes me wonder: what words or concepts did users use to describe these ‘two types’, and how can the system describe them in a way that makes sense?

I don’t know if users really know what ‘catalog’ means, especially as contrasted to ‘articles’. Many users expect our existing ‘catalogs’ to include article-level citations, and are confused when they don’t. The ‘catalog’ isn’t just ‘books’ — from a back-end point of view, it’s our ‘locally controlled metadata’, but this obviously is not something the user cares about at all.

In fact the ‘catalog’ contains all of our ‘physical’ holdings — including books, videos, CDs, pamphlets, assorted ‘realia’, etc. And it includes journals we have physically, but only at the journal or volume/issue level, not article-level metadata. Oh, and then it includes some electronic content, like ebooks we’ve licensed, and electronic access to journals. But still no article-level metadata — which, again, is what many of our users spend much or most of their searching time trying to find.

And then there’s this ‘other stuff’, aggregated metadata from a variety of third party vendors. Which is mostly scholarly article citations (which often we’ll be able to provide electronic full text for), but may also contain some book citations, some audio/video citations, who knows what, whatever EBSCO or ISI or Scopus decided to index.

So apparently we know that users (or at least UVa users) want to keep these things separate — but what is it they think they’re keeping separate? (In fact, our inability to distinguish these two categories on any actual user-centered characteristic is what led me to assume they should be ‘blended’.)

I think some user-centered research on that would be interesting; perhaps if we learn more about what UVa did, we’ll find they explored that too. Joseph Gilbert from UVa says:

It’s still a bit earlier in our usability testing to know for sure, but our delineation of “Catalog + Articles”, “Catalog only”, and “Articles only” seems to resonate with our user population.

Villanova uses “Combined results” [“combined” is a side-by-side listing of two result sets, not a merged result set –jrochkind], “Books and more” and “Articles and more”, which our users found a bit confusing, partly for the reasons you mention (especially “books” as a stand in for everything in the local collection) and partly for UI reasons. Catalog seems to be a reasonable catch-all for all our videos, bound journals, books, etc., though we also highlight specialized sub collections like our video search and music search. We’ve found that sometimes users are unclear if “article search” means only online sources or not, but we have a significant design space dedicated to making this distinction clearer. I think any single-word label without contextual help is likely to be confusing in one way or another.

Cheaper source of aggregated index?

It occurred to me recently that in addition to the newfangled ‘next generation’ ‘discovery layer’ aggregated indexes, there are actually some aggregated-index-type products we’ve been paying for for quite a while.

Examples are Scopus and ISI Web of Knowledge. Both Scopus and ISI try to get as many scholarly citations as possible by aggregating from a variety of sources, similar to the newer aggregated index products. Many of our libraries are already paying for Scopus and/or ISI, and at significantly lower prices (is my impression) than vendors are charging for the newfangled aggregated index products.

So what is the difference between these ‘old’ aggregated index products and the newfangled ones?

Well, the new ones allow you to blend your local metadata (‘catalog’) with their aggregated index in a single relevance-ranked hit list. You know, the thing we’re saying maybe our users don’t actually want, and maybe we can’t feasibly do while maintaining control over our software stacks either. The new ones also give somewhat fancier, slicker interfaces than Scopus or ISI did last time I looked — interfaces that the UVa-type approach probably won’t be using anyway, instead using APIs to get results from aggregated index products and present them in local open source interfaces.

So I wonder if we could profitably use Scopus or ISI Web of Knowledge as our ‘article search’ source in this strategy — using it only via its API — at significant cost savings over buying one of the new ‘discovery layer’ aggregated index products.

It depends on the quality of the APIs and the quality of the search results from an ‘old school’ aggregated index product. I seem to recall Scopus has a pretty good API, but I haven’t looked at either one in a while.
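If the API turned out to be good enough, slotting it in would be the same shape as the vendor ‘articles’ backend sketched above. A purely hypothetical sketch — this endpoint, its parameters, and its response format are all assumptions, not Scopus’s or ISI’s actual API:

    import requests

    def search_articles_via_citation_index(query, api_key):
        # Invented endpoint standing in for an 'old school' citation
        # index's search API.
        resp = requests.get(
            "https://api.citation-index.example/search",
            params={"query": query, "count": 10},
            headers={"X-API-Key": api_key},
            timeout=5)
        resp.raise_for_status()
        return resp.json().get("entries", [])

As long as it returns titles, links, and enough metadata for OpenURL linking, it could drop into the ‘articles’ pane of the two-pane sketch above.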

I have a fantasy now of doing A/B(/C/D) style testing with users comparing our various ‘meta-search’ options. One or more of Summon, PrimoCentral, or WorldCat Local; compared to Scopus or ISI Web of Knowledge; perhaps compared to Metalib too. See how the actual results stack up; see if Scopus or ISI can provide ‘article search’ services just as well, at lower cost.

(I can’t keep track of vendor ownership these days. Are either or both of Scopus or ISI owned by a company which is in addition trying to sell you a more expensive ‘discovery layer’ service? That might affect how much their owners would want to meet such a use case.)

Hope to see more from UVa

So, I think UVa’s research in this area is really important — really enlightening, and challenging some assumptions many of us have about user desires and needs and how they should be translated into software to support them.

I hope we can soon see more complete write-ups from UVa on what they did and what they found, and updates as they continue to learn more through research and trial development.


18 Responses to article search, and catalog search

  1. Hi Jonathan,

    There are a couple of other benefits brought by the new discovery services. One is the ability to match the records in the aggregated index to your library’s holdings (electronic and print), so the user can choose to retrieve only what’s immediately available in full text. That’s a big improvement over having to follow OpenURL links to check for availability. I reckon we’ll see that feature coming to the “old” indexing products very soon though: the vendors offering to harvest your holdings. Ovid and PubGet do this (sort of), and Google Scholar of course (although you can’t limit search results, it’s halfway there by making it quicker to see which articles are available from your library’s holdings).

    The other is enrichment of book records. In Summon I believe you can search Syndetics content, like you can in Aquabrowser, so you can search the chapter titles of books. And now Serials Solutions is going to index HathiTrust metadata in Summon, including the in-copyright stuff, so you’ll be able to search the full text of books you have in your print holdings. That should be incredible, like having Google Books search for your library catalogue. I can’t see many libraries doing that with their local installations of VuFind or Blacklight.

    I’ve been checking out the new discovery services to see how they perform for finding “known items”: book titles, journal titles, authors and articles, and they’re very disappointing. For fun, try searching for “The C++ programming language”, or “Fyodor Dostoyevsky”. I think that’s why users don’t want blended results, because when they know what they’re looking for but they just want to find it, they don’t want it buried in thousands of records for newspaper articles and book reviews. Google Scholar and Google Books do much better, without the facility to browse headings, so I’m hopeful that will improve as these products mature.

    Cheers,
    Laurence Lockton
    Bath, UK

  2. jrochkind says:

    Good points, Laurence, thanks.

  3. Dorothea says:

    Guess what just got added to my lib-tech syllabus for ILS day? :) Thanks for the great summary.

  4. Karen Coyle says:

    I suspect that users are having a hard time knowing what ‘catalog’ means — more so today than ever. They learn about the catalog by using the catalog, but what they learn may not be 100% correct. Years ago in the U Calif catalog we had a small number of libraries that actually created records for journal articles. This, however, gave some users the impression that the catalog contained books and articles, when the latter was actually an extremely limited set. As users encounter catalogs with both, their concept of ‘catalog’ is modified. Since users move from one library to another, they often arrive at a catalog with expectations learned in a previous experience. To me the big question is how can we quickly and accurately let users know what the catalog contains, this particular catalog, so they don’t approach it with a mistaken view.

  5. Greg Gosselin says:

    Laurence, I suggest you take a look at Northwestern’s Primo discovery UI and give those *problematic* known-item searches a spin. While I agree most new discovery solutions struggle with known-item searching, others clearly do not.

    http://search.library.northwestern.edu/primo_library/libweb/action/search.do?dscnt=1&dstmp=1283878402046&vid=NULV&fromLogin=true

  6. jrochkind says:

    Greg, it looks to me like that Primo is set up to only search ‘catalog’ type resources?

    I think Laurence was suggesting that most of em, when configured to search both ‘catalog’ and ‘article’ type, generally do a poor job of known items from the ‘catalog’ portion. Is there a way to use that Primo to search over both ‘catalog’ and ‘article’ type results, with blended result set?

  7. Real data on real users is always good and sometimes surprising, like this one. I wonder if they could segment by age — I suspect younger people don’t see as much of a distinction between books and articles. Any insights from search log analytics?

  8. Aaron says:

    Interesting stuff. Somewhat related to this discussion about blended results is what to do with other content, e.g. results from LibGuides, FAQs, site search results, etc., and whether a tabbed search box is the way to go.

    http://musingsaboutlibrarianship.blogspot.com/2011/04/why-you-may-need-real-one-search-box.html

  9. Thanks for such a thoughtful review of the situation. It seems that many libraries are marching toward just one blended search box/set of results without weighing the impact this might have on users (good or bad). To me the revolution with the release of Summon 2 years ago is the ability to promote the use of articles. This is a huge step forward. The blending of the article and books metadata is a byproduct that tends to drown the latter. That said I do think that eventually there will not be the distinctions we see now. But we are in a transitional phase where users are aware of the differences and are making the distinction. We are offering users options in our interface and collecting statistics to see what they choose.

  10. Hi Greg,
    As I understand it, Primo “boosts” results for books, to make sure you get some before the large set of articles. But it appears that Primo can’t distinguish between C++, C and C#, so the fact that the first result is “The C++ programming language / Bjarne Stroustrup” is possibly more by luck than design!
    Cheers,
    Laurence

  11. Greg Gosselin says:

    Hi Laurence,
    Indeed, Primo provides a configurable mechanism to boost local collections (catalog, IR, websites) relative to external (article) content in a merged result set. This is an exceptional capability, the ability to customize the *blending* of local and remote results in a single result set ranked by relevance. Other Primo/Primo Central sites, e.g., Vanderbilt, Boston College, etc., configure the boosting differently – based on their institutional needs. This keeps us librarians engaged in the discovery process and ensures that local collections continue to be accessible.

    Best,
    Greg

  12. Greg Gosselin says:

    Jonathan, hi
    Primo at Northwestern is by default searching local collections (Voyager, Fedora) as well as the Primo Central index. Check out the “location” facets.

    Best,
    Greg

  13. Tito Sierra says:

    Great post Jonathan. The lack of a proper common vocabulary to differentiate between the various kinds of unified library search tools is a problem.

    In the academic library world there is a desire to deploy ‘single search’ solutions, but IMHO there is not much critical comparative evaluation of the various solutions in this space. The single search box itself is only a small part of the equation. The user experience can vary greatly across vended, open source, and homegrown solutions, so the single search user experience at Library A can be dramatically different from Library B even though they both technically provide a single search box solution. The implementation details matter, both in terms of user experience and product maintenance. For example, a single aggregated index with facets (‘blended’ search?) is very different from a solution that combines catalog, article, and other search results from multiple indexes in a single ‘bento box’ style interface. For examples of the latter see what U. of Michigan Libraries and NCSU Libraries have done with their default library search tools.

    Your observation about the trade-offs of going the vended aggregation index approach is spot on.

    You asked if anyone has research on whether academic library users spend more time looking for articles or catalog items. A few of us at NCSU are putting the finishing touches on a research article that provides a deep-dive analysis of two semesters’ worth of search log data at NCSU. By way of context, our library’s single search solution is a homegrown tool that combines results from multiple sources (‘Articles’ via Summon, ‘Books & More’ from our Endeca-powered catalog, ‘Journals’ and ‘Databases’ from our ERM, etc.) into a single search results interface that our library maintains locally. We do use Summon, but we use it as an article search solution, not a library search solution. Our single search box is prominently featured on our library website, and we average about 3000 searches per day during the Fall and Spring semesters. We track both search queries and clicks to the various ‘modules’ (e.g. ‘Articles’, ‘Books & More’) that we include in the combined search interface. So we know when someone clicks on a catalog title versus an article title.

    Quick findings: In the two-semester data set, we found that 41.5% of all clicks were for Articles results and 35.2% were for Catalog results. So the usage of articles was higher than the Catalog. This pattern is consistent from semester to semester. There is an interesting pattern of higher Catalog clicks at the start of the semester, with Article clicks growing as the semester progresses. The crossover point is at week 4 of the semester for both semesters. You will notice that about 23% of clicks are neither Article nor Catalog clicks. The remainder is a mix of journal titles, library website results, and some locally managed Best Bets we’ve created for really popular resources. Academic library search is more than just article and catalog search.

  14. Dale says:

    Many thanks for a useful synopsis of current options and possibilities. Working at a library that lacks (or has one that it hides and is going away soon anyway, but that’s another story) an article meta-search means that I can to some degree assess the amount of demand that arises from users in the absence of such a library offering. My conclusion: not that much. This reminds me, of course, of the amount of demand that existed for the first generation metasearch products (MetaLib et al.), which was of course near zero. Both generations have at their core the notion of presenting a solution to a problem perceived and described by librarians, yet not well researched nor based on concrete and incontrovertible user demand. From the library’s viewpoint, we have hundreds of database interfaces, but from a disciplinary viewpoint, it’s a handful (how many databases matter to any given person, in other words). To aggregate those together might be useful (which is what first gen metasearch products sort of did, where one could select a subset of targets, but it never worked for the reasons to which you alluded), but when you aggregate everything, then one just adds noise to the results, no matter how well faceted or sliced up.

    At any rate, given how quickly first gen metasearch rose and fell, I feel little pressure to jump on the gen two bandwagon. Clearly it has great momentum at the moment, but something tells me that these new tools are not the “oh my god gotta have it” tools for users. Can anyone show otherwise?

  15. jrochkind says:

    Dale: Well, Tito’s findings that his users click on article results (from a meta-search/aggregated index product) MORE than they click on catalog results seems, to me, to suggest that users want this and find it useful, no?

  16. Tito Sierra says:

    Dale and Jonathan: For what it’s worth, the NCSU campus split is 80% undergraduate and 20% graduate. It is also sci-tech oriented. I suspect that the demand for articles would be much higher at universities with a higher graduate population, higher for those with a stronger science orientation, and lower for liberal arts college libraries. I think one should not generalize too much across institutions.

  17. “users may not want actual merged search results, may indeed specifically prefer not to have it.” I suspect that this depends a *lot* on who the users are. For undergraduates (who usually haven’t learned how to look for articles yet) it would be useful for known title searches to return articles.

    Relevance ranking is so bad in most catalogs that reducing the amount of stuff returned is the best way to improve the user experience. So that factor probably dominates.

    In terms of finding a cite that “[in] academic libraries a significant portion of users probably spend more time looking for articles instead.” — STM uses articles and even undergraduates rarely need books (except textbooks, or maybe math and geology books). Contrast with the humanities where books are the most relevant, and articles less so.

  18. Andrew Nagy says:

    Great posting and great commentary — thanks Jonathan for beginning this conversation.
    It’s nice to see that our original assumptions in designing Summon are what is expected — by abstracting the user interface from the back-end search engine by way of an API, we allow our clients to use their own interface and minimize switching costs down the road (see: University of Michigan, NCSU, Villanova University, Brown University, Royal Holloway University, etc.). If the library doesn’t want a single search, well then they don’t need one — that’s the flexibility that has been designed into Summon from the ground up.

    I’d also like to comment on your discussion about using a citation database that covers a wide range of content to take the place of an aggregated index — I mean no disrespect to my fellow colleagues at Elsevier and Thomson Reuters — but the scope of these indexes doesn’t measure up to that of the aggregated indexes on the market today. Summon’s unified index is much more than just scholarly articles; it is a unification of the world of academic content, including video and digital images, open access content, company reports, subject encyclopedias, patents and so much more. Utilizing some of these highly regarded indexing databases will keep you with the same pain points that you had before — just another silo with slightly more content in it. We need to think about compiling all of the content relevant to a library patron into a single unified index, to elevate convenience and expose the data in a highly accessible manner.

    I’d also like to hit on a common misconception that was mentioned by Laurence about known-item searching. Discovery gets a bad rap for this, and from the early days of Summon and VuFind development, I can see why. But with the rapid development of these technologies, this is changing very quickly. Known-item searching is a core function with Summon and should be a part of any discovery product. Discovery products need to provide the user with highly relevant results, no matter what the context. That said, you can see from the numerous examples here that this is something that Summon nails — from the most difficult of searches to full citations, Summon presents relevant materials to the users.
    http://drexel.summon.serialssolutions.com/search?s.q=The+C%2B%2B+programming+language
    http://drexel.summon.serialssolutions.com/search?s.q=c%23
    http://duke.summon.serialssolutions.com/search?s.q=the+road
    http://uh.summon.serialssolutions.com/search?s.q=nature
    http://wm.summon.serialssolutions.com/search?s.q=hurricane irene

    In summary, what these aggregated indexes provide you is the ability to search across a large base of academic content. What makes the difference is the services that the vendor provides on top of this data — the comprehensiveness of the content, the metadata cleanup, the enrichment and dedup processes, the relevancy algorithms, etc. Moreover, the ability to search the full text. Without the full text, you are searching just cataloging metadata — not the real guts of the material, which is only found in the full text. Without full text we are back to the same problems we started with in the OPAC and A&Is — the lack of discovery. Imagine searching the full text of the books in your stacks, the full text of the newspaper articles published this morning, and the full text of scholarly journal articles, all in one search — amazing!
