Faceting subjects across heterogeneous vocabularies?

I haven’t actually read it yet, but the abstract alone of this D-Lib article makes me think of a recurring problem I think about. If showing the user all the subjects that matched their query along with the hits is useful (we often describe this as a ‘faceted’ display, which I think is actually a misnomer), that might work well when you only have LCSH, but what the heck do you do when you have a corpus involving disparate controlled vocabularies?

Just listing all the controlled terms raw can easily mislead users in several ways, or just be plain confusing.

And what if some items in the corpus don’t have controlled subject/genre vocab at all?

So on reading that abstract I think, hmm, assuming LCSH is still the most common controlled vocab in your corpus, could you use automated clustering algorithms to map the other items to LCSH, and so actually provide a meaningful list of subjects across your whole corpus?
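
To make the idea concrete, here is a minimal sketch under some loud assumptions. It is not the method from the article (which I haven't read), and it is closer to nearest-profile classification than to clustering proper: build a bag-of-words profile for each LCSH heading from the records that already carry it, then suggest the closest headings for a record that has none. The function names, the sample records, and the `min_score` threshold are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profiles(labeled_records):
    """Sum the term counts of every record that carries each heading."""
    profiles = defaultdict(Counter)
    for text, headings in labeled_records:
        tokens = Counter(tokenize(text))
        for heading in headings:
            profiles[heading].update(tokens)
    return profiles


def suggest_headings(text, profiles, top_n=2, min_score=0.1):
    """Return up to top_n headings whose profiles best match the text."""
    vector = Counter(tokenize(text))
    scored = [(cosine(vector, profile), heading)
              for heading, profile in profiles.items()]
    # Drop weak matches: a wrong subject may be worse than no subject.
    scored = [(score, heading) for score, heading in scored
              if score >= min_score]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [heading for _, heading in scored[:top_n]]


# Hypothetical records: (title text, human-assigned headings).
labeled = [
    ("Field guide to birds of North America", ["Birds"]),
    ("Bird migration and nesting behavior", ["Birds"]),
    ("Impressionist painting in France", ["Painting, French"]),
    ("French painting of the nineteenth century", ["Painting, French"]),
]
profiles = build_profiles(labeled)

# A record with no controlled subject terms at all.
print(suggest_headings("Nesting birds of the coast", profiles, top_n=1))
```

A real attempt would at least need stop-word removal, term weighting such as TF-IDF, and far more text per record than a title. Machine-suggested headings should probably also be marked as such in the display rather than mixed silently with human-assigned ones.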


2 thoughts on “Faceting subjects across heterogeneous vocabularies?”

  1. So the new Smithsonian thing deals with multiple controlled vocabularies pretty elegantly – it lists in one place biological terms (genus-species, etc), in another grouping art terms, then also standard bibliographic things.

    Why LCSH? Why not use other sets of controlled vocabularies that are meaningful to the users?

  2. Only because we already have LCSH assigned (and continue to assign them) for the majority of our records.

    I am assuming that human-assigned vocabulary is always going to be more accurate than machine-computed vocabulary. Machine-computed vocabulary may just very well be better than nothing, though.

    There are indeed some serious problems with LCSH, which would take another lengthy essay to go into. There are also some significant, valuable things about it.

    The Smithsonian apparently has the benefit of having different (human-assigned) vocabularies that apply to different _sorts_ of things: biological taxonomy vs. ‘art terms’ (presumably from the AAT) vs. ‘everything else’. Or at least they’ve decided to deal with multiple vocabularies by making it look that way. We often have to deal with multiple vocabularies that have greatly overlapping domains. They ALL have subject and form/genre terms.
