Purposes/Functions of Controlled Vocabulary

Quite a while ago on December 14th, Jim Weinheimer wrote something on NGC4Lib that got me thinking about a paper I wrote back in school. Jim wrote:

I often see this sort of finding, but it always leaves me a bit confused. When they discuss “controlled vocabulary” and “clustering” I wonder what they mean. There are two primary purposes and functions of subjects and you can see them especially clearly in LC subject headings: there is the collation function, which brings together metadata records for resources with similar subjects, and then there is the “labelling” function, which provides an authorized term that describes the items that have been collated. The purpose of the labelling function is to make the items collated together findable by humans.

I actually wrote a paper in school on the purposes/functions of controlled vocabulary understood broadly (which in the paper for some reason I call ‘systems of knowledge organization’ rather than ‘controlled vocabulary’), in which I, via the literature and my own analysis, identify quite a bit more than two purposes or functions of controlled vocabulary.

I’m not sure if Jim is saying that ‘clustering’ doesn’t mean anything, is really something subsumed by one of his Big Two, or is just an illegitimate purpose you should never use a controlled vocabulary for? (Or just never use a subject vocabulary for? Or never use LCSH for?).

Regardless, I think there are clearly more than two, although figuring out exactly the right ‘taxonomy’ for purposes of controlled vocabulary isn’t necessarily straightforward. But I think there are more than two things controlled vocabulary can be useful for, more than two that it is useful for, and indeed more than two that the historical literature (including from Cutter and Dewey) suggested our legacy systems may have been designed to be useful for!

In my paper I identified 10, although they have some overlap. In retrospect, with more experience working with controlled vocabularies — and just as important, systems that try to make them serve various purposes for the user — I think I could collapse some of them.

Here’s the school paper itself, if you really want to read the whole thing:

Towards a Conceptual Model of Knowledge Organization Systems: 1. The Functions of Knowledge Organization Systems
by Jonathan Rochkind
12 December 2005

Here’s my summary of the functions I identified, with some retroactive re-thinking.

Functions of Controlled Vocabularies

1. Class Retrieval

Just plain identify a particular class or term, and then look up all documents with that term attached.  What most librarians probably think of first, and perhaps the same as what Jim calls the ‘collation’ function.  (All of these functions have been referred to by different names by different people; my paper identifies some of them in the literature, including writings by Dewey, Cutter, and the CRG).
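
To make that concrete, here’s a minimal Python sketch of class retrieval as an inverted lookup from term to documents. The record ids, the terms, and the function names are all made up for illustration; this is not how any particular catalog does it.

```python
from collections import defaultdict

# Hypothetical records: document id -> controlled terms assigned by a cataloger.
records = {
    "doc1": ["Cooking", "France"],
    "doc2": ["Cooking"],
    "doc3": ["World War, 1914-1918"],
}


def build_postings(records):
    """Invert the records so each term points at the documents carrying it."""
    postings = defaultdict(list)
    for doc_id, terms in records.items():
        for term in terms:
            postings[term].append(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


postings = build_postings(records)


def class_retrieval(term):
    """Return every document with exactly this term attached."""
    return postings.get(term, [])


print(class_retrieval("Cooking"))
```

Note that the lookup only works if the user already knows the exact term, which is part of why the other functions below matter.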

2. Browsing

It’s unclear exactly what “browsing” means, but lots of people talk about it. It is some sort of exploratory or investigatory, probably iterative, interaction with a corpus, to be contrasted with the more specifically directed aims of Class Retrieval.

If you were just to page through the LCSH Red Books looking for interesting topics, that might be a form of “browsing”. If you were to browse a physical shelf ordered by DDC or LCC, that’s another kind of browsing. Note that both of these activities rely on particular features of the controlled vocabulary to make browsing possible: for LCSH, human-readable terms applied to classes that, when filed alphabetically, put similar subjects near each other; for Dewey, numeric class numbers that, when filed in order, do something similar.

You don’t need anything like that purely for ‘class retrieval’, which is why it’s worth mentioning browsing as a separate function. But just as importantly, there are probably other ways to support browsing (exploratory investigation of a corpus), especially in the digital environment, that may not rely on display headings or displayable notation that can be filed in order! Can you think of any?

3. Relationship Navigation

Some controlled vocabularies let you navigate relationships between classes and terms. Others do not. Perhaps this could be thought of as a subset of Browsing, not sure, but it’s a particular type of Browsing if it is that.

While our previous examples, browsing a shelf arranged by Dewey or paging through the Red Books, may expose you to some hierarchy (because both systems intend to put subordinate classes “after” their “parent” in filing order), in fact, especially in the computer environment, it’s possible to do much more explicit relationship navigation, and to do it across more than one hierarchy or type of relationship.

Identify a class or term, maybe look at the documents attached to it, then realize that you really want documents on either a more general or more specific topic. Relationship navigation lets you find those documents. (Although purely post-coordinated combinations of terms may also allow you to do that, depending on how the system has been designed and applied).
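
Here’s a rough Python sketch of that kind of navigation, assuming a vocabulary that records one broader term per term. The terms, the `broader` map, and the postings are hypothetical; real vocabularies often allow several broader terms and other relationship types.

```python
# Hypothetical vocabulary: term -> its broader term (None at the top).
broader = {
    "Science": None,
    "Physics": "Science",
    "Optics": "Physics",
    "Chemistry": "Science",
}

# Hypothetical postings: term -> documents indexed directly under it.
postings = {
    "Physics": ["doc1"],
    "Optics": ["doc2", "doc3"],
    "Chemistry": ["doc4"],
}


def narrower(term):
    """Terms one step more specific than the given term."""
    return sorted(t for t, parent in broader.items() if parent == term)


def descendants(term):
    """All more specific terms, found by following narrower links down."""
    found = []
    for child in narrower(term):
        found.append(child)
        found.extend(descendants(child))
    return found


def documents_at_or_below(term):
    """Documents posted to the term itself or to anything narrower."""
    docs = set(postings.get(term, []))
    for t in descendants(term):
        docs.update(postings.get(t, []))
    return sorted(docs)


print(broader["Optics"])                 # step up to something more general
print(documents_at_or_below("Physics"))  # widen retrieval to narrower terms
```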

4. Identification

The Identification function is served by listing assigned class or term information on a record so that the user knows more about the nature of the document indicated.

Consider traditional LCSH “subject tracings”. The fact that the book is about Subject A may not be revealed by its author or title, but there you have a subject tracing to tell you that.

Not sure if this is what Jim means by “labelling” or he means something different.

5. Locating

Basically, just using a controlled vocabulary assignment as a ‘locator’ so you know where to find the document. That is, traditional shelf location done by a classification such as Dewey or LCC. (Although as some have recently noted on NGC4Lib, that’s not originally what Dewey invented DDC for!).

When students come up to the reference desk having written down only a Dewey or LCC shelfmark, they are also assuming it can be used for the ‘locating’ function, to refer to and find a unique item. In our library it doesn’t work so well for that, though; we often wish they had written down the author and title instead!

This isn’t a particularly interesting function, really, but it’s one we use, that some but not all controlled vocabularies are suitable for. (Basically, those that can be used to assign a unique notational string to an item, those we usually call ‘classifications’).

6. Ordering

Some controlled vocabularies allow you to place documents in some meaningful order (that you couldn’t do without the controlled vocabularies), others don’t.

Traditionally, this is used for shelf location, as with DDC and LCC. It could also possibly be used in an online display.

This isn’t really that interesting.

7. Surveying

To support the user in getting a general overview of the corpus. While similar to ‘browsing’, walking up and down the shelves isn’t going to be very good for getting a general overview, unless you have a very small corpus! But looking at Dewey classes arranged in hierarchical fashion, with each class having its human-readable display label assigned to it, and each class also having the number of documents posted to it listed: ah, now there’s a survey of the landscape!

Again, some controlled vocabularies can be used better for this function than others. Having a hierarchy is probably helpful, unless the vocabulary only has a small number of terms. Trying to use LCSH for a survey of a particularly large corpus would be tricky, since the hierarchy is so odd, but perhaps it could be done. NG faceted navigation systems often try to use either Dewey or LCC for this though.
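
A small Python sketch of what such a survey display might compute: an indented outline of classes, each with a label and a count rolled up from the classes beneath it. The notations are only loosely Dewey-like, and the labels, the parent links, and the counts are invented for the example.

```python
# Hypothetical classes: notation -> (display label, parent notation).
classes = {
    "500": ("Science", None),
    "510": ("Mathematics", "500"),
    "530": ("Physics", "500"),
    "535": ("Light", "530"),
}

# Hypothetical counts of documents posted directly to each class.
posted = {"500": 2, "510": 40, "530": 25, "535": 7}


def rolled_up(notation):
    """Documents posted to this class plus everything beneath it."""
    total = posted.get(notation, 0)
    for child, (_label, parent) in classes.items():
        if parent == notation:
            total += rolled_up(child)
    return total


def survey(parent=None, depth=0):
    """Build an indented outline: notation, label, rolled-up count."""
    lines = []
    for notation in sorted(classes):
        label, this_parent = classes[notation]
        if this_parent == parent:
            lines.append(f"{'  ' * depth}{notation} {label} ({rolled_up(notation)})")
            lines.extend(survey(notation, depth + 1))
    return lines


print("\n".join(survey()))
```

The explicit parent links are doing the work here; a vocabulary without a usable hierarchy would give you only a flat list of counts.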

8. Dealing with a Large Result Set

What a lame name for a purpose/function, eh?  By this, I really just mean the same thing as ‘surveying’, but applied to a result set, not to the entire corpus. Give you a summarized overview of what’s there when you type in “politics” and get 400,000 results. Very similar to Surveying, really.
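
As a sketch of the idea in Python: run a naive search, then count the subject terms across just the hits to summarize them. The records, the title-substring `search`, and the subject terms are hypothetical stand-ins; a real system would use a proper search engine for the first step.

```python
from collections import Counter

# Hypothetical records with a title and assigned subject terms.
records = {
    "doc1": {"title": "Politics of water", "subjects": ["Political science", "Water supply"]},
    "doc2": {"title": "Party politics in Europe", "subjects": ["Political science", "Political parties"]},
    "doc3": {"title": "Office politics", "subjects": ["Organizational behavior"]},
    "doc4": {"title": "Gardening basics", "subjects": ["Gardening"]},
}


def search(query):
    """Naive stand-in for a keyword search: substring match on the title."""
    q = query.lower()
    return sorted(doc_id for doc_id, rec in records.items() if q in rec["title"].lower())


def summarize(result_ids):
    """Count subject terms across the result set, most frequent first."""
    counts = Counter()
    for doc_id in result_ids:
        counts.update(records[doc_id]["subjects"])
    # Sort by descending count, then alphabetically, so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


hits = search("politics")
for term, count in summarize(hits):
    print(f"{term} ({count})")
```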

9. Keyword Match Enhancement

Even if you have full text, the term you are searching on may not appear in the full text of an item that is very much about that concept! The item may be in another language, or may not have written words at all. Or, to use another example recently raised on NGC4Lib, World War I wasn’t called World War I until there was a II, but a 1930 book can still be about World War I.

If you take the LCSH subject headings, and make sure they are in your textual indexing system too, you can increase the recall of the search.

If you first make users explicitly identify a class and then click on it, that’s #1, Class Retrieval. But if they just enter a search in a Google-style search box, and magically get their 1930 document on WWI because it had a subject assigned to it, and that subject included the heading or lead-in term “WWI”, then that’s Keyword Match Enhancement.

The more lead-in terms a vocabulary has (that you can also index for keyword match enhancement), the better it probably is for this. The better the headings or lead-in terms match the end users’ query vocabulary, the better too. Which might be useful for other functions but is obviously crucial for this one.
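
Here’s a minimal Python sketch of the difference, assuming a toy keyword index where every query word must match. The document texts and the list of lead-in terms are invented for the example; only the indexing choice (text alone versus text plus headings and lead-in terms) is the point.

```python
import re

# Hypothetical records: the text never uses the phrase "World War I".
records = {
    "doc1": {
        "text": "A history of the Great War and its aftermath",
        "headings": ["World War, 1914-1918"],
    },
    "doc2": {
        "text": "Gardening through the seasons",
        "headings": ["Gardening"],
    },
}

# Hypothetical lead-in (see-from) terms for each authorized heading.
lead_in_terms = {
    "World War, 1914-1918": ["World War I", "WWI", "First World War"],
}


def tokens(text):
    """Lowercased word tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def index_tokens(record, enhanced):
    """Words indexed for a record: the text, plus headings and lead-ins if enhanced."""
    words = tokens(record["text"])
    if enhanced:
        for heading in record["headings"]:
            words |= tokens(heading)
            for lead_in in lead_in_terms.get(heading, []):
                words |= tokens(lead_in)
    return words


def keyword_search(query, enhanced=True):
    """Return documents whose indexed words contain every query word."""
    wanted = tokens(query)
    return sorted(
        doc_id
        for doc_id, record in records.items()
        if wanted <= index_tokens(record, enhanced)
    )


print(keyword_search("WWI", enhanced=False))  # text only
print(keyword_search("WWI", enhanced=True))   # text plus headings and lead-ins
```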

10. Negotiation

Interacting with a controlled vocabulary in various ways can help the user come to a better understanding of what they’re actually looking for in the first place, by showing them what people call different things, and by showing them how different concepts can be related to one another.   The right kind of interaction with the right controlled vocabulary can sometimes do what a good reference librarian does (probably not as well, but open 24 hours).

This might sound odd, but I included this function because it was in fact mentioned in the literature, from Cutter to Vickery to Svenonius.

So what?

So there you go. I’m not sure those 10 are the right 10; they can probably be blended up a bit to get a better taxonomy. But I’m pretty sure there are more than just 2!

Part of what I talk about in the larger paper is that the online environment has made us try to get more functions out of controlled vocabulary, functions that certain vocabularies may not have been designed for — or that they may not have been used for during the previous 100 years, where the use of a particular vocabulary may in fact not be the same as what it was designed for! Different features of a controlled vocabulary (both in design and in application by an indexer/cataloger) can facilitate different functions.

For 100 years we didn’t need to think much about that, settling into “alphabetic subject vocabulary” used basically only for class retrieval with a card catalog, and “classification” used basically only for browsing via an order that also served as a locator. Interestingly, that’s not what “classification” was necessarily designed for at all! Back in the day, there was more confusion and less consensus about what these things were for, which my paper goes into a bit.

With the computer, we can try to do more than we could when the only tools we had for interacting with controlled vocabularies were card catalogs, printed catalogs, and actual shelves. (Because interacting with shelves arranged by LCC or DDC is interacting with a controlled vocabulary!) This is bringing us back to some of the chaos of figuring out “what are these things for anyway”, and “how do we design them to serve these functions well?”


3 Responses to Purposes/Functions of Controlled Vocabulary

  1. I haven’t read your entire post yet (I definitely disagree with some of your points, but that shouldn’t be surprising coming from me!), but I would just like to answer:
    “I’m not sure if Jim is saying that ‘clustering’ doesn’t mean anything, is really something subsumed by one of his Big Two, or is just an illegitimate purpose you should never use a controlled vocabulary for? (Or just never use a subject vocabulary for? Or never use LCSH for?).”

    There are basically three functions here: 1) a function that collates (clusters) related concepts, 2) a function that labels the collation points, and 3) a function that allows people to find the labels. They are all closely interconnected. In my view, the real advances are with the last two functions.

    The collating/clustering function is not only important, I believe it to be absolutely vital. People should be able to search for the concept of e.g. “Boris Yeltsin” and retrieve something more reliable than simply what comes up in an uncontrolled full-text search, no matter how clever the retrieval process may be. Humans are still needed to do this. The task of collating is not at all easy and demands a lot of experience.

    Next, just as a concept may not be expressed in the same words in the resources you work with, when people are searching for information, not everyone will *think* of the same concept using the same words: e.g. cooking/cookery, insane/mentally ill/Mentally disordered/Mental patients/Mental illness–Patients. Yet, it is still important for the collating function to produce a reliable result, so there needs to be some mechanism for people to find the points of collation we have so carefully made. In the past, this has been solved by choosing an authorized form, with see references from less favored ones. With modern tools, the single authorized form is no longer needed, since the display can vary with the language of the user, or the users can decide upon their own.

    So, we have the collation (clustering) points, the human-understandable labels attached to those points. Now comes the part that I did not deal with before and that I think you are discussing here. With tens of thousands or millions of these labels, how do we distinguish them? This has been the task of subject classification and alphabetical order, shown essentially by the arrangements of the cards in the card catalog, in the arrangement of the LCSH, and in the arrangement of books on the shelves.

    More than anything else, this is what I believe is breaking down now with the rise of full-text searching, “relevance” ranking, ranking by most popular, etc.

    Throughout all of these considerations, the one point that remains as solid as ever is the need for the collation/clustering points. (For this discussion, whether collating is done by physically bringing cards together, by using the same text in a computer catalog, or by using URIs, is not all that important because the final result is the same.) Concept clustering through automated means is nice and even useful, but it is not so reliable that the rest can be eliminated. Not yet.

    Sorry about going on, but I wanted to emphasize the different functions of all of this, plus I want to point out that I believe we may find that clustering by concepts will be the most important part of a catalog, more than description or maybe even locating the items themselves.

  2. Jens-Erik says:

    Jonathan: It is (still!) a fantastic paper!

  3. jrochkind says:

    Thanks Jens-Erik, it was a fantastic class as well! Nice to hear from you, hope all is well.
