A method to map from query to broad topic, and associated resources

Short answer: Take advantage of a faceted response on a search against a corpus that has controlled classification data.

From user query to topic, to resources on that topic

Andrew Nagy tells me in direct email that one of the new features in the upcoming Summon 2.0 release is:

Topic Explorer – Over 50,000 english topics will be mapped to user queries on the fly and the API will deliver a “topic” that has an encyclopedia entry, recommended librarian, recommended subject guide, related topics, etc.

(This is described on the Summon 2.0 brochure webpage, although it wasn’t completely clear to me from the webpage that it took user queries as input to arrive at a topic).

This is along the lines of a feature that I’ve been thinking about for years: the ability to recommend appropriate subject resources (subject specialist librarians, subject guides, databases recommended on a particular subject) in response to a user-entered query in a catalog, articles, or other discovery search.

I’ve been thinking about it for years as a way to get users to our librarian-recommended resources, but it’s become even more desired by some local librarians in response to our recent move towards offering an integrated article search function in our local discovery UI. That article search, currently based on the EBSCOHost API, is offered as an alternative to going directly to individual licensed database platforms, and as a replacement for Metalib broadcast federated search.

So I think Summon is right on track here in their new feature development, which is nice to see, and less usual than it should be in the library proprietary software sector.  This is in some sense an expansion of the existing Summon feature to recommend subject-relevant licensed database platforms based on user-entered queries, expanding it to additional topic-specific resources.

While they say 50,000 topics, I assume there must be some hierarchy to their topic list, with things like institutionally specific librarians and subject pages assigned only to top-level broad topics. It would not be feasible to manually make specialist librarian assignments to 50,000 topics, of course.

So it’s really mapping to fairly broad high-level topics that matters for locally-assigned subject resources like librarians, subject pages, or subject-specific licensed database platforms.  (I’m guessing the SerSol feature may automatically map things like encyclopedia entries at the narrower, more specific elements of the 50k list, which might be neat, but is not what I’m choosing to focus on in this discussion).

The hard part about implementing a feature like this is mapping from arbitrary user query to topic (broad or otherwise).  Once you’ve done that, it’s of course an easy software problem to record URLs or other content that corresponds with each broad topic, and provide them to the user once a topic has been identified.

If you have Summon, and like its new “topic explorer” feature, great. We won’t know exactly how it’s implemented, but those of us with Summon licenses will be able to test it and see how effective we find it, once it’s released.

But what other options might there be for implementing such a feature yourself, at institutions that do some of their own development?

Spider the web, use text mining techniques? — NCSU

Way back in 2007, Tito Sierra then at NCSU presented at the Code4Lib conference on an NCSU project called Smart Subjects. 

As you can see in the slide show there, Smart Subjects was also an attempt to map from a user-entered search query to one or more library subjects.

It did so (and possibly still does so) in a creative way. Take existing bodies of text that can be easily classified by department (course catalogs, departmental lists of published articles), harvest all that text, keeping it classified by department, and index it in a text indexing engine. Information retrieval relevance ranking techniques can then take arbitrary phrases (user-entered queries) and determine which academic department’s harvested text corpus best matches the query.

I believe some years after this, Tito told me he wasn’t, in the end, necessarily super enthused with the quality of results attained by this method. In any event, it is a fairly heavy-weight method, with lots of moving parts to develop, maintain, and fine-tune.
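As a rough illustration of the general idea only (this is not NCSU’s code, and I’m swapping in a bare-bones TF-IDF score where a real system would use a proper indexing engine), the sketch below ranks departments by how well their harvested text matches a query. The department names and the sample text are made up.

```python
import math
import re
from collections import Counter

# Hypothetical harvested text, one blob per department (made-up examples).
DEPARTMENT_TEXT = {
    "Civil Engineering": "bridge design concrete structures project scheduling construction management",
    "Business": "marketing finance project management accounting organizational management",
    "History": "medieval europe archives colonial america historiography",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_departments(query, corpus=DEPARTMENT_TEXT):
    """Return (department, score) pairs, best match first, omitting zero scores."""
    term_counts = {dept: Counter(tokenize(text)) for dept, text in corpus.items()}
    n_docs = len(term_counts)
    scores = {}
    for dept, counts in term_counts.items():
        total = sum(counts.values())
        score = 0.0
        for term in tokenize(query):
            doc_freq = sum(1 for c in term_counts.values() if term in c)
            if doc_freq == 0 or counts[term] == 0:
                continue
            # Term frequency in this department, weighted by rarity across departments.
            idf = math.log((1 + n_docs) / doc_freq)
            score += (counts[term] / total) * idf
        if score > 0:
            scores[dept] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Calling rank_departments("project management") on the toy corpus puts Business ahead of Civil Engineering and leaves History out entirely. Even this toy version hints at the moving parts: harvesting, tokenizing, weighting, and tuning all have to be maintained.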

Tito has since left NCSU, but I believe it’s what is still powering the subject recommendations at the bottom left of their “QuickSearch” results, although it’s unclear if the corpus ever gets updated. I don’t know if they ever thought to use it to power recommendations of actual library staff too, although there is a library staff member highlighted on the QuickSearch results page.

(Looks like the NCSU SmartSearch tool began in 2005, 8 years ago!)

Using your catalog corpus, with classification data, as a classifier?

I’ve been thinking for a while of another approach, lighter weight and taking advantage of the extensive person-hours of work that goes into our cataloging metadata. Although I haven’t had a chance to prototype it yet, I’m going to tell you about it anyway.

We have these extensive library catalogs. What if each record in the catalog had broad subjects assigned to it? (They don’t, really, but bear with me, let’s start here).

And let’s say we exposed these broad subjects in a facet. Then for a given query, you’d get a count of how many items in your result set (matching that query) were posted to each broad subject.

Say you search for “project management techniques”, and get back, in the facet based off these broad subjects:

  • Engineering (56801)
  • Business (47920)
  • Computer Science (34000)
  • Health Science (24000)

That would potentially be a pretty good list of recommended subjects corresponding to the query entered, no?  Then, if you have subject pages, database lists, specialist librarians, etc., already categorized into these same subjects, you could recommend them to the user based on her query.
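Here is a minimal sketch of that last step, turning a facet response like the one above into recommendations. The resource table, its example.edu URLs and addresses, and the function name are all hypothetical placeholders, not anything we actually run.

```python
# Facet counts for one query, as (broad subject, number of matching records).
FACET_COUNTS = [
    ("Engineering", 56801),
    ("Business", 47920),
    ("Computer Science", 34000),
    ("Health Science", 24000),
]

# Hypothetical local assignments of resources to broad subjects.
SUBJECT_RESOURCES = {
    "Engineering": {
        "guide": "https://library.example.edu/guides/engineering",
        "librarian": "engineering-librarian@example.edu",
    },
    "Business": {
        "guide": "https://library.example.edu/guides/business",
        "librarian": "business-librarian@example.edu",
    },
}


def recommend(facet_counts, resources, max_subjects=2):
    """Return (subject, resources) for the best-represented subjects we have resources for."""
    ranked = sorted(facet_counts, key=lambda pair: pair[1], reverse=True)
    picks = [(subject, resources[subject]) for subject, _count in ranked
             if subject in resources]
    return picks[:max_subjects]
```

With the sample data, recommend(FACET_COUNTS, SUBJECT_RESOURCES) returns the Engineering and Business entries, in that order. How many subjects to show, and whether raw counts are the right ranking signal, are things that would need experimenting with.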

Now, our library catalogs do have classification data in them assigned to individual records, using vocabularies created and assigned through the hard work of many catalogers over many years. Is there a way to use this data for this purpose?

The common classification systems of Dewey and LCC both classify rather too finely for this use — we need to map to a vocabulary of dozens of topics/subjects/disciplines, so we can assign local resources to each one. Hundreds or thousands is too many.

But is there hierarchy in DDC or LCC that would let you “post up” from finer-grained specific classifications to broader classifications useful for our purpose here? DDC might have it, but I don’t have many DDC records in my local corpus, and haven’t spent much time with DDC. LCC is known to be less hierarchical than DDC, but there are still ways to get some broad classifications out of it, using the top-level schedules. But it’s tricky to make this work, and the broad categories you end up with aren’t necessarily as useful as we’d like. (See the “Discipline” facet in our own Solr-based catalog, which is constructed from LCC “posted up” into broad classifications. For “project management techniques”, the top Discipline facets are “Technology”, “Science”, and “Social Science”, which aren’t necessarily wrong, but are too broad and somewhat archaic to be as useful as we might like.)
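The crudest form of “posting up” is to keep only the first letter of an LCC call number, which identifies its top-level class. The sketch below does just that. It is a simplification for illustration, not the logic behind our Discipline facet, and the table lists only a few of the LCC top-level classes.

```python
import re
from collections import Counter

# Partial table of LCC top-level classes, keyed by the first letter of the call number.
LCC_TOP_LEVEL = {
    "H": "Social Sciences",
    "Q": "Science",
    "R": "Medicine",
    "T": "Technology",
}


def broad_class(call_number):
    """Return the top-level LCC class for a call number, or None if it is not in the table."""
    match = re.match(r"\s*([A-Za-z])", call_number)
    if not match:
        return None
    return LCC_TOP_LEVEL.get(match.group(1).upper())


def post_up(call_numbers):
    """Count how many call numbers fall into each broad class, like a facet would."""
    classes = (broad_class(cn) for cn in call_numbers)
    return Counter(c for c in classes if c is not None)
```

For example, post_up(["HD69.P75", "T56.8", "TA190"]) counts one record under Social Sciences and two under Technology. That also shows the problem: one letter gets you only about twenty very broad buckets, which is why a finer mapping from call number ranges to disciplines is more attractive.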

The University of Michigan High Level Browse classification

The University of Michigan has developed their own High-Level Browse (HLB) classification. One of the main uses for this classification is indeed a broad classification facet in their catalog search.

The U of M HLB is conveniently based on LCC, and U of M maintains mappings from LCC call numbers to their own HLB classes. That mapping is what makes the faceting work in the first place, for any corpus with LCC classifications on items.

They’ve developed their HLB based on their own schools, departments, and programs at U of M. You could try to develop the same locally, but it’d be a lot of work. And U of M awesomely shares their classification, with LCC mappings, in XML form. So you could just write software to download theirs, use it in indexing into your own Solr catalog index, and get U of M HLB facets in your catalog, which you could then also use to power a subject recommender.

Any large research university probably has academic classification needs roughly similar to U of M’s. There will certainly be special programs you wish were represented that aren’t (or that are in U of M’s, unnecessarily for you), but it will likely be good enough if you don’t have the resources/organization to develop and maintain your own local classification. (I’m amazed U of M even pulls it off, honestly.)

Let’s give it a try: go to U of M’s catalog, do a search, and check out the top categories represented in the “Academic Discipline” facet in the sidebar. For “project management techniques”, it’s Business, Management, Business (General), Social Sciences, and Engineering.

If a system made recommendations for subject guides, specialist librarians, subject-relevant databases, and other subject resources, based on those classifications… they’d be fairly relevant to the query, right?

Do some of your own queries. How well does it work?

(In the U of M HLB, there are still a few levels of hierarchy, all of which may be represented in the faceted result. For instance, “Business (General)” is a sub-category of the more general “Business”. Inter-mixing them both is probably appropriate for a facet response, but for making subject recommendations some experimentation is called for as to when to use the more specific class, when to use the more general one, and when to de-duplicate because a super- and sub-class are both represented in the potential ‘best subjects’.)
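One possible de-duplication policy, sketched under assumptions: given a table of which class is the parent of which, either keep only the most specific classes or keep only the most general ones. The parent table here contains just the one relationship mentioned above, and the function names and the "prefer" switch are my own inventions.

```python
# Parent of each sub-class. Only the one relationship named in the text is listed;
# a real table would be derived from the classification data itself.
PARENT = {"Business (General)": "Business"}


def ancestors(subject, parent=PARENT):
    """Return the set of all super-classes of a subject."""
    found = set()
    while subject in parent and parent[subject] not in found:
        subject = parent[subject]
        found.add(subject)
    return found


def dedupe(ranked_subjects, prefer="specific", parent=PARENT):
    """Remove super-/sub-class duplicates from a ranked list, keeping the original order."""
    present = set(ranked_subjects)
    if prefer == "specific":
        # Drop a class when one of its sub-classes is also present.
        covered = set()
        for subject in ranked_subjects:
            covered |= ancestors(subject, parent)
        return [s for s in ranked_subjects if s not in covered]
    # Otherwise drop a class when one of its super-classes is also present.
    return [s for s in ranked_subjects if not (ancestors(s, parent) & present)]
```

Applied to the facet list from the U of M example, prefer="specific" drops “Business” in favor of “Business (General)”, while prefer="general" does the reverse. Which choice gives better recommendations is exactly the experimentation I mean.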

Not just for catalog searches

The idea is to use the catalog as a classifier, but that doesn’t mean you can only use it for catalog searches.

For any search in any system you control enough to add custom features to, you could add a feature based on the catalog as a classifier. Even if users are searching in a non-catalog article discovery system, the software could still, behind the scenes, take the user’s query, execute its own under-the-hood query against the catalog, look at the faceted broad subject results, and use them to make subject recommendations.
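If the catalog is backed by Solr, the under-the-hood query could look something like the sketch below. The Solr URL and the facet field name "hlb_facet" are assumptions for illustration; the request parameters (rows, facet, facet.field, facet.limit, facet.mincount) are standard Solr faceting parameters, and the parsing relies on Solr’s default JSON layout, where facet values and counts alternate in one flat list.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_facet_url(solr_select_url, user_query, facet_field, limit=5, min_count=1):
    """Build a Solr request that returns no documents, only facet counts."""
    params = {
        "q": user_query,
        "rows": 0,                     # we only want the facet, not the records
        "wt": "json",
        "facet": "true",
        "facet.field": facet_field,    # hypothetical broad-subject field
        "facet.limit": limit,
        "facet.mincount": min_count,   # ignore subjects with too few matches
    }
    return solr_select_url + "?" + urlencode(params)


def parse_facet_counts(solr_response, facet_field):
    """Turn Solr's flat [value, count, value, count, ...] list into (value, count) pairs."""
    flat = solr_response["facet_counts"]["facet_fields"][facet_field]
    return list(zip(flat[0::2], flat[1::2]))


def classify_query(solr_select_url, user_query, facet_field="hlb_facet"):
    """Run the hidden catalog query and return the broad subjects with their counts."""
    url = build_facet_url(solr_select_url, user_query, facet_field)
    with urlopen(url) as response:
        return parse_facet_counts(json.load(response), facet_field)
```

A discovery UI could call something like classify_query("http://localhost:8983/solr/catalog/select", "project management techniques") alongside its real article search, and feed the result into whatever picks the subjects to recommend. Raising min_count is one cheap way to avoid recommending a subject on the strength of a handful of records.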

Not necessarily just with your own catalog

Likewise, there’s no reason you need to use your own local catalog as the classifier. Any catalog will do, provided it can deliver a faceted response of broad subject classification, has an API that lets you use it in this way, and its operators don’t mind you using their catalog in your service.

WorldCat would be great, if OCLC added a broad subject classification facet and an API to retrieve it. Umich’s catalog, already using their own in-house HLB classification, might be convenient too.

Of course, if you do add umich’s HLB broad subjects as a facet in your own local catalog, your users get the advantage of using that facet directly for their catalog searches too. (Assuming you have enough control of your local catalog to add such a thing, for instance because your catalog is based on Blacklight, VuFind, or another tool using a local Solr you control.)

Idea worth exploring?

I’m not sure when/if I’ll have time to investigate this idea, although I probably will eventually. But I absolutely don’t mind if someone else runs with it and beats me to it — as long as you share back your findings, how well it worked, etc.


8 Responses to A method to map from query to broad topic, and associated resources

  1. I’d like to add a little to this. U of M has been classifying search queries within the library website exactly as you suggest using our catalog data for I think about 3-4 years. We use the classification to suggest our site’s browse pages and to link to subject specialists who have self-identified with HLB categories. Example:

    http://www.lib.umich.edu/mlibrary/search/libguides%3Bwebsite%3Bdigitalcollections%3Bsearchtools%3Bmirlyn%3Bejournals/buddhism

    Unless I missed it, one piece we added beyond what you discussed was a threshold for a minimum number of items in the facet.

  2. jrochkind says:

    Thanks Albert! Interesting!

    How happy have you been with it?

    Yes, I didn’t mention minimum number of items in the facets, although I considered it. I suspect there may be some other tweaks to the logic of how you pick the actual subjects for recommendation from the facet response list.

    Do you guys do something with de-duplicating super- and sub-classes from the facets, like if #1 was Engineering (top-level), and #3 was “Electrical Engineering”, would you recommend both, or only the most specific one, or only the first more general one?

    If I ever get to this, I might contact you to ask for more details from your experience in general. (I’d also encourage you to consider writing it up as a possibly short Code4Lib Journal article!)

  3. jrochkind says:

    Albert, hmm, your tool doesn’t seem to work quite like I expected. You’re using the method discussed here for the “research help” section (librarians), but maybe NOT for the “Research Guides” section or “Databases” section?

    The “Research Guides” section is actually where I imagine it being MOST useful — but on http://www.lib.umich.edu, most of the searches I do give me zero hits in “Research Guides”.

    You say you’re also using it to “suggest our site’s browse pages”, but I’m not sure what section of the results that is, if any?

  4. We don’t de-dupe sub-/super- classes when identifying HLB categories for the website search results. I think that in the past we had plans to sort the list of matching categories so that second and third tiers would be presented before the first tier, but we aren’t currently doing that. Having said that, our health sciences subject specialists have only selected the more specific second and third tiers, so they don’t have this problem.

    You or any of your readers should feel free to contact me if you’d like more details about what we’re doing. I’ll definitely consider this for a Code4Lib Journal article as well.

  5. Correct; we do not use query classification for the Research Guides or Database sections of our search results. I’m not surprised you didn’t see the recommendation for our browse results, we moved them way at the bottom in the “Didn’t find it?” section. And it looks like we need to de-dupe the subjects in that section.

    Regarding the use of the category information in the Guides and Database sections, I expect we would improve our search results by using the categories we identify for query expansion; I’m adjusting how we search databases and journals in the upcoming days. I’ll be adding that into the tests I run.

  6. jrochkind says:

    Interesting stuff Albert, at some point I’ll prob contact you for a conversation if you have time! (Might not be for months though, when I have time, heh).

    Rather than ‘query expansion’, I was originally thinking of using the classification for Research Guides and Database recommendations by assigning Guides and Databases directly to classes. query -> classes -> Guides + Databases assigned to those classes. I think that might get much better results than just matching literal query terms to terms found in Guide or Database metadata. Unless someone is looking up a guide or database by name of course, your current implementation works great there. But in current implementation, I get 0 Guide or Database results for most subject/topic queries I enter.

    Matching guides and databases like this was in fact my own original imagined use case for the technique we’re discussing.

  7. dsalo says:

    UW-Madison’s Forward did some experimenting with this, later dropped when the tool went System-wide. Steve Meyer (now at OCLC) might have some useful tidbits for you.

  8. Cory Lown says:

    Jonathan, you’re correct that the subject recommendations on the lower left module in QuickSearch are powered by Smart Subjects. The data gets updated yearly, although, frankly, I think the data gets stale only very slowly. We’re using Solr to index this data now and the course catalog data and the faculty publications data are harvested via script, so it’s not labor intensive. It would be feasible to use this technique to power subject librarian recommendations, too.
