Continued adventures in integrated article search

Way back in February 2013, we released an integrated article search service, built into our ‘catalog’ web application. (This may be the first time I’m mentioning that here on this blog.)

It is based on the EBSCOHost APIs we already had access to through our existing EBSCOHost database licenses.

In January 2013, we released this as an additional option, planning to spend some time during the semester on assessment, with what we learned guiding next steps; for this article search project, we’ve really been trying to do iterative development, with cycles of assessment and analysis.

We started with a live service based on the EBSCOHost APIs because we could do so without signing any contracts or making any expenditures, and our previous rounds of investigation had suggested that an EBSCOHost-based solution might indeed be good enough for the basic search cases we were focusing on. That investigation, reported in the Code4Lib Journal article linked above, was based on a blinded evaluation survey tool, as well as a limited amount of ‘expert’ non-blinded evaluation (our committee compared search results from different search products).

In both cases, we were trying to evaluate the tools’ suitability for a basic search function, without using advanced search features or limits, because the chief goal of our initiative was to improve our offerings for such basic search patterns; you can read more about why in my previous write-up.

We included a side-by-side, two-column articles+catalog results display, also part of our initial planning, for reasons you can see in that write-up above. (Can you still call it ‘bento style’ when there are only two compartments?):

[Screenshot: side-by-side catalog and article search results (“catalyst_bento”)]

You can also click through to see just article results, still inside our consistent integrated interface, with the familiar UI also used for Catalog results.

[Screenshot: article-only results in the same integrated interface (“catalyst_ebsco”)]

I’m showing you screenshots because the actual article search results require an authenticated login (or being on campus) to see: it’s an ordinary EBSCOHost license, so we can only show results to authenticated users.

All of this is powered by the ruby bento_search gem, which serves as a standardization layer on top of vendor search APIs. Among other things, part of the point of bento_search is to allow us to switch search vendors easily while maintaining a consistent user interface across vendors.
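To make that concrete, here’s a minimal sketch of the kind of engine registration and search call bento_search supports. The engine name, environment variable names, and EBSCOHost config keys below are illustrative, not copied from our app; check the gem’s README for each adapter’s exact configuration:

    # config/initializers/bento_search.rb -- illustrative configuration
    BentoSearch.register_engine("articles") do |conf|
      conf.engine           = "BentoSearch::EbscoHostEngine"
      conf.profile_id       = ENV["EBSCO_PROFILE_ID"]       # hypothetical credential names
      conf.profile_password = ENV["EBSCO_PROFILE_PASSWORD"]
      conf.databases        = %w[a9h eric cmedm]            # whichever EBSCOHost db codes you license
    end

    # Anywhere in the app, the calling convention is the same regardless of vendor:
    engine  = BentoSearch.get_engine("articles")
    results = engine.search("mitochondrial dna", per_page: 10)

    results.each do |item|
      # each item is a normalized BentoSearch::ResultItem
      puts "#{item.title} (#{item.year}), #{item.source_title}"
    end

The application code and view templates only ever see the normalized result items, which is what makes swapping the backend a configuration change rather than a rewrite.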

Assessment

After deploying this as an option at the beginning of the 2013 Spring semester, we undertook a multi-pronged assessment in the middle of the semester.

We designated a several-week period of assessment in March 2013 (covering a bit on both sides of the midterm period on our largest campus).

  • Looked at click analytics for the article search tool for this period (used Google Analytics on our web app, with some custom events).
  • Deployed a survey, linked to from the article search tool.
  • Solicited open-ended feedback from library research/reference staff.
  • Did some task-based participant observation and interviewing (i.e., ‘usability’ testing, more or less), where staff sat a couple dozen patrons down in front of article search, in person, and watched them do what they did.

Overall, we decided the assessment confirmed that we were on track with our article search improvement initiative: there was demand for an easier-to-use and more visible tool for simple search use cases, and our integrated article search tool based on EBSCOHost was effectively filling this role. (I’m sorry I’m not free to share details, or even a more detailed summary, of our assessment findings.)

So based on that, we decided to keep travelling down the path we were on, including decommissioning our Metalib/Xerxes-based federated search tool at the end of the Spring 2013 semester, which we have since done. The Metalib-based federated search tool’s goals had huge overlap with the goals of our new search tool, and we don’t have the resources to support both: you can’t just keep adding new things without ever removing things, both for our own ability to support things well and for our users’ comprehension of our offerings. (This was not without some controversy; any change is going to make some things worse even as it hopefully makes more things better, and any change is also likely to upset people used to the old system.)

But the mid-term assessment also gave us some suggestions for areas worth investigating for improvement, as part of our iterative development process for this article search tool.

To a limited extent from user feedback, and much more so from reference/research librarian feedback, two main themes of desired improvement emerged:

  • People worried that the coverage of the EBSCOHost tool was not sufficient.
    • It covered a lot, with our extensive EBSCOHost licenses, including EBSCOHost versions of open access databases like ERIC or Medline, as well as EBSCOHost’s absorption of the former Wilson broad-coverage databases. But of course it didn’t cover ‘everything’; the worry was that maybe it didn’t cover enough, that it would be leaving important things out of patrons’ searches, and that such an insufficiency might exist despite not being revealed by our previous analysis and assessment.
    • And of course, no tool will cover ‘everything’; but was there an alternative tool available that might cover more?
  • There was demand for more advanced search tools, especially more limiting and faceting options, including limiting based on content type, online availability, and disciplinary perspective. These sorts of limits/filters/facets were beyond what could be provided by the EBSCOHost API.

Looking more in-depth at Summon

Based on these discovered desires, it made sense to do another iteration of investigation, looking more in-depth at a possible purchase of a ‘discovery’ product with an API to power our article search, one that might give us these features.

Based on our existing evaluation and analysis, we specifically wanted to look more at Summon; it was the product we thought most promising, in terms of feature set and API quality, from our first round of comparison and evaluation.

We have trial/evaluation access to Summon (thanks, Serials Solutions). So I stood up a demo of our Catalyst tool with integrated article search, with the article results powered under the hood by Summon rather than EBSCOHost:

[Screenshot: the same integrated interface, with article results powered by Summon (“catalyst_summon”)]

If you compare that to the earlier screenshot of the EBSCOHost-based tool, you’ll see it looks exactly the same. Same limits (peer-reviewed-only and date range), same layout, same UI. Just a different set of results for the same query: the results returned by Summon instead of EBSCOHost.

This is exactly what bento_search can do for you: provide an abstraction layer that normalizes over different search results providers, and let you easily switch out one for another. The chief code difference between our Summon-based demo and our EBSCOHost-based app is basically a configuration file pointing at Summon; the switch only took a couple of days to implement (including fixing some bugs in the Summon adapter that I only noticed with this more extensive use).
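To give a rough sense of what that configuration-level swap looks like, here’s a sketch: register a different engine class under the same logical name, and the calling code and templates stay untouched (again, the config key names for the Summon adapter are approximate):

    # Same logical engine name, different adapter underneath.
    BentoSearch.register_engine("articles") do |conf|
      conf.engine     = "BentoSearch::SummonEngine"
      conf.access_id  = ENV["SUMMON_ACCESS_ID"]    # approximate config key names
      conf.secret_key = ENV["SUMMON_SECRET_KEY"]
    end

    # Controller/view code is unchanged from the EBSCOHost version:
    results = BentoSearch.get_engine("articles").search("mitochondrial dna", per_page: 10)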

I also took the bento_search demo app and customized it to stand up a side-by-side comparison of EBSCO results vs. Summon results for a given query, to help with evaluation. That took literally only minutes, using the existing bento_search tools:

[Screenshot: side-by-side EBSCOHost vs. Summon results from the customized bento_search demo app (“bento_compare”)]

For purposes of this look, we did automatically force exclusion of certain content types from the Summon results, mainly ebooks, since we were focusing on an article-type search (but we did leave in book chapters, considering them more article-like).
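For the curious, we did that exclusion at the engine-configuration level rather than in the UI, by having the adapter send fixed Summon API parameters with every query. A sketch of the idea follows; the config key (fixed_params), the Summon s.fvf filter syntax, and the ContentType value are from memory and may not be exactly right:

    BentoSearch.register_engine("articles") do |conf|
      conf.engine     = "BentoSearch::SummonEngine"
      conf.access_id  = ENV["SUMMON_ACCESS_ID"]
      conf.secret_key = ENV["SUMMON_SECRET_KEY"]

      # Summon's s.fvf parameter takes "Field,Value,negated" triples; a trailing
      # "true" excludes rather than includes. Excluding the eBook content type
      # still leaves "Book Chapter" (a separate ContentType value) in the results.
      conf.fixed_params = {
        "s.fvf" => ["ContentType,eBook,true"]
      }
    end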

Why just basic search?

If we were to actually purchase Summon, it would be in part for additional features it provides beyond “enter a query and get back results”, including advanced searching tools like facets and limits and subject/database recommender services.

But our initial demo/prototype/evaluation setup doesn’t include any of these. Why?

  • Trying to compare two tools that vary on many dimensions can be overwhelming for evaluators, and can also make it unclear on what basis evaluations were made: maybe someone prefers Tool 1 because its results are better, another prefers Tool 2 because its layout is prettier, and another prefers Tool 3 because it has a facet the others don’t. By controlling and fixing everything but the search results themselves, including keeping the layout identical with bento_search, we can know we’re comparing one thing at a time. The original plan was to start here, and then add in other features in subsequent development iterations.
  • Basic search, entering a query and getting results back, is an especially important use case; it was our primary motivating use case for this whole initiative. Facets and advanced search features may very well be important, but regardless of what advanced features the tool provides, it absolutely has to provide good results in the simple “enter a query, look at results” case, so it made sense to start by comparing and evaluating under that case. (There is mixed evidence on how often facet limits and other more advanced search tools are used. Serials Solutions’ own analysis of analytics from their out-of-the-box Summon found that facets were used commonly although not in the majority of searches, but this may be because many customers set up certain facets to be operative by default. They found that fielded searches and boolean connectors were used very rarely, in only about 5% of searches. Other studies of this sort, also based on analytics, report roughly compatible results.)
  • Also, by switching the underlying results provider without changing any other UI elements, we could consider making the switch in production in the middle of the semester, perhaps even without much notification or documentation to users, without causing any disruption, further serving our iterative development goals. It’s also possible such a switch would still be too disruptive; it’s just a possibility we haven’t gotten to the point of fully evaluating.

So what did things look like?

For this round of evaluation, we solicited feedback from expert (research/reference) librarians, both on our committee and from our library system at large.

As noted, we were specifically focusing on evaluating basic search at this point; I think most of the evaluators executed searches they deemed typical or representative or illustrative for their subject areas, and then compared the results.

Surprisingly to us here, the consensus seemed to be that the Summon-based results were not appreciably better than the EBSCOHost-based results. Perhaps this shouldn’t have been surprising, because the finding of our original ‘blinded’ experiment, remember, was that EBSCOHost was basically just as good. But there was a lot of local concern about insufficient coverage in the EBSCOHost-based solution, and about flaws in that solution that our original experiment may have failed to capture: we thought better coverage would give us better results. But at least for the sorts of experimental queries our expert evaluators were doing (granted, not so many known-item searches, it looks like), apparently not so much. Additionally, there seemed to be a general trend from our evaluators of grading the Summon-based results as actually worse than the EBSCOHost-based ones.

Some general trends from the evaluators’ feedback: Summon-based results sometimes included a higher percentage of older articles on the first page than EBSCOHost, which they didn’t like. Summon also seemed to include more duplicate results than the EBSCOHost-based results.

It may be that Summon’s increased coverage is actually related to the lower quality judgements it received, not independent of them. In some ways, the EBSCOHost corpus could be considered ‘artisanally curated’, both in terms of the quality of metadata and the choice of contents. Summon, by setting a goal of including as much of the scholarly universe as possible (and we agree with that goal!), necessarily includes less targeted contents, with less normalized and corrected metadata.

This still doesn’t necessarily determine our ultimate outcome; there are other reasons we were considering Summon besides the quality of result sets for basic search, or coverage in general:

  • Summon would give us advanced search tools that we know some users want some of the time.
  • Summon’s license would allow showing search results to off-campus users without authenticating.
  • Summon’s superior API and normalized vocabularies in facets would allow more innovative features in our suite of applications: for instance, a feature that linked you directly to book reviews of a book you are looking at in the catalog, or limiting by disciplinary focus (a highly sought feature).
  • Summon’s direct-linking features might lead to a lower failure rate in click-throughs actually getting to fulltext, a noted problem in our Spring 2013 assessment.
  • The ability to exclude newspaper articles (or search only newspaper articles), which Summon’s superior application of a controlled content-type vocabulary would provide, would be handy for our users.

But of course there’s only so much we would be willing to sacrifice in the basic search use case to get these other things; the basic search use case is really important.

Sorry, I’m not at liberty to share exactly what we have decided or are in the process of deciding in our continuing evaluation process, but I feel comfortable sharing what I am sharing. I think it’s important for libraries to compare notes and findings on this stuff, so I am trying to share what I think my local colleagues will be okay with sharing.

Reflections on the process

Okay, this starts to get rambling, and is also very much my own personal, tentative reflection. If you’re not interested in such, just skip the entire rest of this post!

Overall, this ‘article search improvement’ initiative here has ended up much lengthier, and much more controversial, than most of us here originally anticipated.

Here’s a blog post from librarian Aaron Tay trying to survey current findings and opinions about ‘discovery’ from libraries — I found it somewhat comforting to see from Tay’s summary that the sorts of conflicts and challenges we’ve been having here are pretty typical (see for instance point #3 in his blog post).  (Thanks Christina Pikas for the pointer to this blog post).  Of course, typical or not, the challenges and conflicts still remain.

strategic goals?

I remain personally convinced that focusing on a simple, integrated article search is a good forward-looking strategic direction for our library and for academic libraries in general. This recent blog post by Brian Matthews in the Chronicle perhaps provides some more anecdotal evidence towards this conclusion too: even sophisticated searchers are increasingly demanding simpler search tools supporting simple keyword searching use cases, and Matthews believes this will only continue to increase. Read the followup too. (Thanks to Robin Sinn for the pointer to these posts.) This adds to the existing evidence, anecdotal and otherwise, that I believe we already had towards this conclusion; but Matthews makes the case in terms of real stories of real people, which some people may find illustrative.

But there isn’t necessarily consensus on this locally: on what we should be focusing on, prioritizing, or working towards. There are lots of people who care deeply about these things (we have a very committed and passionate staff), but who have potentially radically different answers to these questions. These are times of many options and possibilities for libraries, which in some ways is exciting, but is also very challenging, especially in organizations that may lack much coordinated strategy-setting or decision-making to resolve differing analyses within the organization. I had hoped that by using a research-based (starting with the initial presentation of the plan and its justification), assessment-based, and iterative approach, we would help build consensus and shared understanding of our directions and steps; but that has not necessarily been so successful.

always include current tools in evaluation

I (and our committee) severely underestimated the local attachment to our previous/existing Metalib/Xerxes-based federated search tool, especially from librarians. In my recollection, when we rolled this tool out 4-5 years ago, there was huge librarian resistance to it, and the general consensus among librarians was that this was not a tool they would recommend to users. But either my memory was faulty, or the opinion of current librarians is not the same as the opinion of the librarians (not all the same people) of 4-5 years ago!

We discovered mid-evaluation that there was a lot of attachment to the current tool. We had not included the current tool in our first round of comparison evaluation (the blinded user study), because it was technically inconvenient to do so, and because we thought the current tool was so bad it must be worse than anything else, and that no local staff liked it anyway.

In retrospect, you should always include your current tool(s) in any comparison evaluations/assessments, even when there are significant development or other costs to doing so.

assessment is hard in libraries

We were trying to be quite a bit more ‘data driven’ in this project than, in my experience, we have historically been with software development here.

The trick here is getting data, whether quantitative or qualitative, that is actually valid. Any assessment, like just about any social science experiment, is always subject to questions and criticisms about “but does this really validly measure what we wanted it to measure?” And I think the critical culture of libraries (I do not except myself) means we can spend a lot of time trying to pick all the holes in our assessments, ultimately to show that our own assessments were useless!

I think there was some skepticism from local stakeholders that our original “blinded user-facing survey” was actually legitimate. On the one hand, skepticism that it accurately measured the collective evaluation of our users. But also a more fundamental skepticism, from some stakeholders, that it’s even appropriate to ask our users to evaluate for themselves the suitability or quality of our tools: some think our users are not qualified to determine for themselves which tools are better, and that only experts can determine this.

There’s also quite a bit of worry that aiming for the collective/aggregate user opinion — as most attempts at user-based assessment, especially quantitative ones, will end up doing — can be aiming at the ‘lowest common denominator.’

But from the other end, it may also be problematic to assume that individualized expert evaluation is a suitable proxy for what will really serve our users well. Experts (whether librarians, or especially librariphile faculty) may themselves behave differently than most of our users do most of the time; they may sometimes be good at predicting and representing how other types of users behave or think, but may other times be less successful. The only way to know for sure is to try consulting representative samples of users, right?

I think there may also be a “devil you know” bias: when people are asked to compare two things, especially when you are asking power/expert users of a current tool, there may be a very strong bias toward preferring the current tool. You can try to get around this with ‘blinded’ or otherwise standardized/controlled trials, but there may often be a good deal of skepticism toward the validity of such approaches too.

I think we’ve still got to keep doing assessment and data- and assessment-driven decision-making; I remain convinced it’s the best way to ensure we’re meeting the needs of our users.

One way to deal with potential validity flaws in various techniques of assessment is to use multiple techniques. We’ve got to keep doing user-based assessment, and keep trying to improve its validity, because in the end, if we don’t satisfy our users on their own terms, we will not survive as organizations. We’ve got to keep including expert evaluation (i.e., the reference librarians), because they have vital expertise we can’t afford to ignore, because they are an important constituency for the tools we create, and because politically, in most of our organizations, we can’t move forward without them on board.

But there is a cost to doing so much assessment. It takes time; time you could have spent actually developing things instead. There is such a thing as too much assessment, and my sense is that libraries, once they get over the hump of doing zero assessment, will rapidly move towards the other extreme of analysis paralysis, as we attempt to keep doing more and more assessment to compensate for possible flaws identified in previous assessment.

We did a lot of assessment in this project. I think it has resulted in us making much better user-centered decisions than we would have made without it (just my own opinion, and one that may be controversial, I’m not sure!). But it has also taken quite a bit of time. And while I expected that spending more time on assessment and assessment-focused decision-making would help us build consensus and feel good about our decisions, with data helping to resolve otherwise bitter arguments, I suspect it has not had that effect. There is still quite a bit of conflict and lack of consensus over the path we’ve followed.

iterative development is hard in libraries

I will suggest it’s currently a nearly unanimous opinion in the software engineering field that an iterative development approach is the best way to create high-quality software that serves your users’ needs while using development time most efficiently. Seriously, go read the article linked in the last sentence if you’re not sure what this might mean, or why people might believe it.

Basically, as opposed to planning everything up front, committing to a plan, and then doing exactly what was planned (over months or even years of implementation), iterative development means planning and implementing in shorter, iterative bursts, paired with some kind of assessment. Do something, see how well it worked and what it tells you about how best to meet stakeholder needs, decide what to do next based on that, and repeat. This kind of iterative development is at the core of the trendy ‘agile development’, although it’s possibly a broader idea than ‘agile’; it’s probably safe to say you can’t do ‘agile’ without iterative development.

We used a more iterative development process in our article search improvement strategy here than we probably ever have before for a development project, with multiple rounds of evaluation, assessment, and implementation, attempting to make each round informed by what we learned from the last.

And it has been a huge challenge all the way, and the challenges have not really lessened as we gained more experience with iterative development, as you might think they would.

I remain convinced that there’s no better way to produce high-quality software meeting the needs of users than an iterative development process.

But I think there may be some huge institutional barriers, structural and cultural, to using such a process in many academic libraries.

This may be partially because of the nature of the academic calendar and its effect on our work. Iterative development is going to be hard if it doesn’t also involve regular iterative releases, but there is pressure from some stakeholders to only do releases once a year, in the summer ‘slow period’, timed so that instructional materials can be updated to match the new releases.

I think it may also be, partially, because in most academic libraries there are so many people who need to be consulted or involved in any decision. For any major (or even not so major) decision here, there are usually several committees that need to sign off, as well as several different broad constituencies that realistically/politically need to be consulted in order for decisions to be made. There are many different perspectives represented in the library, important to solicit in order to make good decisions, with few people who can see the big picture and represent multiple perspectives.

But all of these people are also very busy. Any particular software development project (or even the aggregate of all software development projects a large academic library has in motion at any given time) is just one of many, many things that are part of their job, and in some cases may not be seen as a particularly important one.

Many of these people, quite legitimately, don’t want to be constantly coming back to review and decide on next steps for a project in an iterative fashion; they don’t have time for that. They would perhaps rather be given one big presentation at the beginning, give direction or advice right then, and then see how the project turned out at the end, many months later, on schedule, done as described to them at the beginning, and having the effects predicted at the beginning too, naturally!

That is, basically the ‘waterfall’ planning method.

These are just some guesses about structural reasons that might make a library organization prefer a ‘waterfall’ approach; I’m not sure they are right, and there might be others. But the sense I got after a year-plus of attempting an iterative development process in this initiative (and notwithstanding that we’ve been talking within our organization about ‘agile’ and ‘iterative’ for even longer) is that, for whatever possibly mysterious reasons, there are pressures and predilections in an academic library that really resist an iterative development approach.

Of course, even if ‘waterfall’ fits into people’s schedules or work-style preferences better, or is institutionally preferred for whatever other reasons, that still doesn’t mean it’s a good method for producing software that actually meets stakeholder needs. I am inclined to think that libraries are far from immune to all the reasons that ‘waterfall’ typically leads to disaster anywhere else.

(I am also much more sympathetic to academic libraries that choose to never develop in-house, but always purchase proprietary vendor software and use it as it comes out of the box with no local development. I don’t think the always-buy-never-build approach, as a general rule, leads to better outcomes for our users — but it may lead to happier staff, and does guarantee that internal departments never have to take the blame for any failures (actual or perceived), that blame can always be laid at the feet of a vendor.)

in conclusion

I have no idea what the conclusion is!


5 Responses to Continued adventures in integrated article search

  1. Thanks Jonathan for another interesting post – I especially like the ‘iterative development in libraries is hard’ section. Because it is, for exactly the reasons you say and in addition because sometimes people are making decisions about things they don’t really understand beyond the surface meaning.

    One thing that made me uneasy about your testing was the use of experts when you are developing for non-experts.

    I think your side by side comparison of the ebscohost and summon APIs is most revealing in the response generation times – something SerSol will have to keep working on. Were you using Summon 2.0 or ‘classic’?

  2. jrochkind says:

    Thanks Alan.

    I wouldn’t take anything from the one single set of timing values from one single search captured in a screenshot. When we did our original article search comparison, reported in the Code4Lib Journal article, and captured response times over several hundred queries on each service — Summon was actually by far the _fastest_ of the services, and EBSCOHost was one of the slowest.

    It does seem to me anecdotally that Summon has gotten a lot slower lately, but I wouldn’t draw any conclusions without collecting a decent sample size; I wouldn’t assume it’s something they need to work on based on one single screenshot, which could be an outlier.

  3. wrt expert evaluation – that’s actually a pretty typical method for usability/user research… to have an expert do a heuristic review. also pay attention to what we’re experts in. we’re not necessarily domain experts (in fact most of us aren’t), we’re expert librarians.

    thanks for this post Jonathan. I still want to blog something about this, too, perhaps reiterating some of what I told the committee. Honestly, we are always telling users how valuable human indexed a&i are (or really machine-aided indexed..), but it seems like we don’t believe it ourselves!

  4. Summon has been much slower lately, possibly due to supporting both the 1.0 and 2.0 versions; maybe that’s affecting the API as well?

    “It may be that Summon’s increased coverage is actually related to the lower quality judgements it received, not independent of them.”

    Yes, I’ve noticed this as well: even a subject ProQuest database that is totally included in Summon can sometimes outperform Summon by giving more relevant results.

    I think, in theory, “difficult” cases where there are only a small number of relevant articles would allow the tool with the much larger index to shine (in the limit, the case of a known article title search with only one relevant result), so there may be a trade-off here.

  5. jrochkind says:

    We no longer have trial access to Summon, but during the last month that we did, it definitely seemed to me too that Summon had gotten a lot slower since the days when I benchmarked it as the fastest among its peers.

    Interesting to see confirmation from you on the ‘increased coverage related to decreased relevance’ point, Aaron. I definitely agree that there’s a tradeoff, and increased coverage is definitely going to be good for some things; but it’s difficult to come up with research that will actually show which is better overall for our actual users, and then even more difficult to convince local reference librarian colleagues that the research actually did show that.

    But it’s another variation on the old precision vs. recall tradeoff. Sometimes you have to decide if you’d rather prioritize precision or prioritize recall. It’s pretty important for non-technical decision-making librarians to understand the theory of precision vs. recall. I have had limited success in getting this across.
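    For reference, the textbook definitions I try to get across, in terms of the set of results a search retrieves versus the set of documents actually relevant to the query:

        % precision: what fraction of what we retrieved is relevant
        % recall: what fraction of everything relevant we managed to retrieve
        \[
          \text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}
          \qquad
          \text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
        \]

    A much bigger index raises the ceiling on recall, but tends to make precision on the first page of results harder to maintain.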
