The Purposes of ‘Subject’ Vocabularies

LCSH, LCC, DDC, Ulrich’s subject headings, BISAC, Ranganathan’s Colon Classification, Bliss Classification (2), Amazon’s subject headings: all are examples of ‘subject’ controlled vocabularies.

I put ‘subject’ in quotes because in reality most, if not all, of these examples include terms to capture ‘aboutness’ as well as terms to capture discipline (i.e., perspective) and genre (and in some cases form, format, and intended audience). (Yes, Dewey sometimes captures ‘aboutness’ and LCSH sometimes captures disciplinary perspective. Take a look.)

I have been interested for a while in exploring the purposes of these types of vocabulary. I think they are not as clear and simple as we might be used to assuming. I wrote a (too) long paper about it in library school, which I’ll attach here. I actually wrote this before I had seen NCSU’s Endeca implementation; I’d have written it differently after; but I think this discussion is very relevant to understanding effective use of controlled vocabularies in faceted navigation. Recent discussion on NGC4Lib regarding these types of vocabularies further emphasizes, to me, the importance of considering the functions.

In my paper, I argue that when we look at these vocabularies from the perspective of function or purpose, the traditional line between ‘classification’ and ‘subject vocabulary’ isn’t actually that clear. Instead we have a number of purposes (not just two), each of which a given vocabulary may serve better or worse.

The paper is awfully long, so I’ll also now summarize my suggestion as to an initial draft taxonomy of functions. (These functions admittedly overlap in some ways, but I still think ) (The next step, determining what features of a vocabulary fit what functions or purposes, is only touched upon in the paper.) Continue reading “The Purposes of ‘Subject’ Vocabularies”


Google’s algorithms

Very interesting article in today’s NYT Business section (Annoyingly, doesn’t let me put a COinS in my blog post! Argh! Sorry. June 3, 2007. New York Times. “Google Keeps Tweaking Its Search Engine” by Saul Hansell) about Google’s relevancy ranking algorithms.

This article has a sub-text (well, not too sub) about how insanely awesome Google is, how much further ahead than anyone else they are. No doubt getting press like that is part of the reason Google gave the reporter access to this department which is usually instead cloaked in trade-secrecy.

Still, that’s definitely part of the story. It’s important to remember/realize that Google’s relevancy ranking algorithms are very sophisticated and complex, and getting constantly more so, in order to give us the simplicity of the good results we see. Our simplistic conception of ‘page rank’ is just one increasingly small part of the whole set of algorithms. So, no, we can’t “just copy what Google does” (not least, but not only, because we are dealing with a different data domain than Google).

The solution to what we need isn’t just waiting out there in the open for us to copy. The solution(s) are waiting for us to discover and invent. On the other hand, of course we want to pay attention to what we can learn from Google and what Google does (in broad principles and–where we can figure them out–specific details) in figuring it out.

Some choice quotes: Continue reading “Google’s algorithms”

CCQ 43:3/4 on Semantic Web

I’m finally getting around to looking at the Cataloging and Classification Quarterly vol. 43 iss. 3/4. It’s a special issue on semantic web technologies for libraries. I think it could be really good background reading for the discussions some of us are trying to have.

It looks like it’s got some great stuff in it! I recommend everyone take a look. I am particularly excited to read “Library Cards for the 21st Century” by Charles McCathieNevile and Eva Méndez:

“This paper presents several reflections on the traditional card catalogues and RDF (Resource Description Framework), which is “the” standard for creating the Semantic Web… The central theme of the discussion is resource description as a discipline that could be based on RDF. RDF is explained as a very simple grammar, using metadata and ontologies for semantic search and access. RDF has the ability to enhance 21st century libraries and support metadata interoperability in digital libraries, while maintaining the expressive power that was available to librarians when catalogues were physical artefacts.”

Haven’t read it yet, but I’m looking forward to it. (I still say that almost all libraries in 2007 are ‘digital libraries’.)

My library has online access to CCQ via Haworth Press.

“Broken”, huh?

Irvin Flack asks:

Jonathan, you say “our current metadata environment is seriously and fundamentally broken in several ways”. What are the ways in which it is broken? I would say the cataloguing community have just been overtaken by a tsunami of change in the last ten years (mainly the shift to digital information) and is still working out how best to respond and adapt.

I suspected someone would ask that of me after the last post. A definitive argument/explanation for why/what is broken in our current environment has yet to be written, and is not an easy thing to do. All I can do is provide a sketch of some notes toward that thesis, which I’ll try to do here.

Continue reading ““Broken”, huh?”

My Cataloging/Metadata Credo

I think our current metadata environment is seriously and fundamentally broken in several ways.

I do NOT think the solution lies in getting rid of everything we’ve got, or in nothing but machine-analysis of full text. I think the solution requires the continual engagement of metadata professionals. We will always need catalogers, that is, metadata professionals involved in the generation and maintenance of metadata, because that’s what catalogers are and have always been. Continue reading “My Cataloging/Metadata Credo”

‘Access Points’ as Identifiers

An essay I originally posted to rda-l on 14 February 2007, and put here now mainly to have a persistent URL to easily access it. I made a few minor edits for clarity while I was at it. (So this is perhaps a new Expression of my essay, if you’re keeping track).

“Access points” as “Textual identifiers” ?

Continue reading “‘Access Points’ as Identifiers”

ruby trick question

Okay, back to nuts and bolts programming.

Can anyone explain exactly what’s going on when Ruby does something like “20.minutes.ago”? I mean, #minutes must be a method on numeric values, right? So why can’t I find it included in the rdoc for Integer or Numeric? And then #minutes returns some kind of object that has an #ago method. So, um, what kind? I don’t get it. I like to understand what’s going on.
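For what it's worth, here is a minimal sketch of the general mechanism, not the actual library source. Ruby classes are open, so a library (in Rails, that's ActiveSupport's core extensions) can reopen Numeric and add methods that never show up in the core rdoc. The names `sketch_minutes` and `SketchDuration` below are made up for illustration, and the real return type has varied between library versions.

```ruby
# Hypothetical stand-in for whatever object #minutes hands back.
# It just remembers a number of seconds.
class SketchDuration
  attr_reader :seconds

  def initialize(seconds)
    @seconds = seconds
  end

  # Subtracting a number of seconds from a Time yields an earlier Time.
  # The reference time is a parameter so the result is repeatable.
  def ago(now = Time.now)
    now - seconds
  end
end

# Reopening Numeric adds the method to every Integer and Float.
# Because this happens outside core Ruby, the core rdoc never lists it.
class Numeric
  def sketch_minutes
    SketchDuration.new(self * 60)
  end
end

now = Time.at(1_000_000)
earlier = 20.sketch_minutes.ago(now)
puts now - earlier # difference in seconds between the two times
```

So the chain is two ordinary method calls: the first on a number, the second on whatever the first returned. To find where a method like this really lives in your own environment, `20.method(:minutes).owner` and `20.minutes.class` in irb should tell you.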

Future of Bibliographic Control

I’m finding that the LC hearings of the Working Group on Bibliographic Control are producing some very valuable discussion. I hope that the report the working group ends up producing will be equally valuable, and I hope that somehow it can actually affect our discourse and practice, instead of just disappearing into a black hole as most similar contributions over the past 15 years seem to have.

In the meantime you can, and I highly encourage you to, read Mark Lindner’s notes on the meeting, as well as Diane Hillmann’s.

Serials Coverage: Z39.71 vs. ONIX Coverage

Serials Coverage

I have an issue I’d like to put on the radar of ILS developers generally, and open source ILS developers especially. It’s particularly apropos now, since the Evergreen Serials module is in the process of being developed.

When trying to integrate my Link Resolver with my ILS recently, I wanted to accomplish a task that seems like one you’d often want to accomplish: given a particular journal citation (say, ISSN, volume, and issue), identify whether we have it in print, and identify the particular ILS record(s) that correspond to that serial holding in print.

In our environment, this turned out not to be possible to do in a reasonably confident way. Part of the problem is the Z39.71 standard, which is used to express serials coverage/holdings in a human-readable format. While Z39.71 holdings statements are theoretically intended to be consistent and maybe even machine-processable, anyone who has tried to machine-process them will have discovered that they aren’t really suitable for recovering the sort of information needed to perform my task, for example.

On top of that, in many actual ILS environments, catalogers end up entering Z39.71 purely by hand. I don’t know if there is even a way to validate Z39.71 holdings statements automatically (I suspect there is not, an obvious problem in itself), but I’d guess that in a typical environment around half of the Z39.71 statements in a corpus are probably not strictly legal Z39.71. Whether that is through typos, cataloger misunderstanding of the standard, or simply lack of concern with following the standard, I don’t know; it is probably a different mix in different institutions.
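To make the task concrete, here is a rough sketch of how simple the lookup becomes if coverage is stored as structured data instead of a display string. Everything in it is an assumption for illustration: the `Holding` struct, its field names, the record ids, and the placeholder ISSN are all made up, and this is neither Z39.71 nor the ONIX coverage format.

```ruby
# Hypothetical structured holding: an ILS record id, an ISSN, and a
# coverage range given as [volume, issue] pairs. A nil end means the
# run is still open (currently received).
Holding = Struct.new(:record_id, :issn, :first, :last)

# Array#<=> compares element by element, so [12, 3] sorts after
# [12, 1] and before [20, 2].
def covers?(holding, issn, volume, issue)
  return false unless holding.issn == issn

  point = [volume, issue]
  after_start = (point <=> holding.first) >= 0
  before_end = holding.last.nil? || (point <=> holding.last) <= 0
  after_start && before_end
end

# Return the ids of every record whose coverage includes the citation.
def matching_record_ids(holdings, issn, volume, issue)
  holdings.select { |h| covers?(h, issn, volume, issue) }.map(&:record_id)
end

# Made-up sample data with a gap between volumes 10 and 12.
holdings = [
  Holding.new("rec-1", "0000-0000", [1, 1], [10, 4]),
  Holding.new("rec-2", "0000-0000", [12, 1], nil)
]

p matching_record_ids(holdings, "0000-0000", 12, 3)
p matching_record_ids(holdings, "0000-0000", 11, 1)
```

Real holdings are messier than this (supplements, chronology instead of enumeration, changes in numbering), so treat it only as an illustration of why machine-actionable coverage data matters for the link resolver case.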

Continue reading “Serials Coverage: Z39.71 vs. ONIX Coverage”