I entered library school a bit over four years ago, coming from a computer programming background. While I’ve ended up a librarian-programmer in a Systems Department, initially I actually went to library school wanting to get out of the programming business.
I didn’t know much about the library business, but I was fascinated by cataloging. Many of the programming problems I had worked on involved metadata and metadata control, and, hey, it’s just a fascinating problem in general. The kinds of thinking you need to bring to bear on cataloging and metadata problems seemed, and still seem, to me to be very similar to the kinds of thinking required by software engineering.
Without knowing much about how cataloging actually worked, I had this idea of catalogers as this esoteric society bringing sophisticated methods — developed over several generations of dealing with gigantic quantities of data — to bear on the problem of efficiently and effectively describing large quantities of stuff in information systems to aid users in navigating their way through a gigantic information universe.
Learning about what actually went on was kind of a rude awakening. At first I thought surely I must be misunderstanding, or surely my instructors themselves must be misunderstanding, what actually goes on. Eventually I had the good fortune to take a class from Allyson Carlyle, who, while she doesn’t agree with all (or perhaps even most) of my assessments of cataloging, has a brilliant grasp of the nature, problems, and history of cataloging. She confirmed that I did basically understand what was going on, helped me improve that understanding, and reassured me that, yes, indeed I wasn’t crazy for thinking the current state of affairs was awfully flawed.
So I haven’t ended up a cataloger, but virtually all of the software work I do is intimately intertangled with cataloging and metadata, and I still have a keen interest in cataloging and metadata control, for both those practical reasons as well as a personal fascination.
Exciting New Technological Tools
One of the things missing from our cataloging and metadata control arsenal is technology to support the efficient management and control of the giant universe of metadata in various formats we currently face. It's just one of the things missing, but it's a big one.
There are some recent technological tools and approaches that I think are very exciting, pointing the way to a modernized approach to metadata control on a quickly approaching horizon.
XC Metadata Toolkit
Jennifer Bowen has recently published an open access article on a Metadata Services Toolkit being developed as part of the XC project. (skip the cover pages, straight to the PDF). I find this particular aspect of XC to be very exciting.
I would summarize the Metadata Services Toolkit as being a package designed for the large-scale management of metadata in various formats in bulk. It supports automated rule-based transformations of metadata from one format to another, normalizations of metadata from various sources within a format, de-duplication/combination of multiple records representing the same titles, and aggregation of information about different aspects of the same works/items that comes from multiple sources.
Too much of the normalization and aggregation we currently do is based on painstaking individual manual work. The Metadata Services Toolkit offers the promise of rationalizing this into a coherent automated system, letting humans do what humans are good at and machines do what machines are good at. The Toolkit is also written from the start to "loosely couple" with other disparate systems in a larger infrastructure, being one flexible piece of a larger pie.
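To make the idea of rule-based bulk processing concrete, here's a minimal sketch of what automated normalization and de-duplication might look like. All of the names here are hypothetical illustrations for this essay, not the actual XC Metadata Services Toolkit API.

```python
def normalize_isbn(record):
    """Normalization rule: strip hyphens from the ISBN field."""
    if "isbn" in record:
        record["isbn"] = record["isbn"].replace("-", "")
    return record

def dedupe(records, key):
    """Combine records sharing the same value for `key`, merging
    their fields so information from each source is aggregated."""
    merged = {}
    for rec in records:
        k = rec.get(key)
        if k in merged:
            merged[k].update(rec)   # aggregate fields from duplicates
        else:
            merged[k] = dict(rec)
    return list(merged.values())

def run_pipeline(records, rules):
    """Apply each rule to every record in bulk -- the automated step
    that replaces painstaking one-at-a-time manual fixes."""
    for rule in rules:
        records = [rule(dict(r)) for r in records]
    return records

records = [
    {"isbn": "0-306-40615-2", "title": "Example"},
    {"isbn": "0306406152", "publisher": "Example Press"},
]
cleaned = run_pipeline(records, [normalize_isbn])
combined = dedupe(cleaned, "isbn")
# After normalization the two records share an ISBN and collapse into
# a single record carrying both the title and the publisher.
```

The point of the sketch is the shape of the system: rules are small, composable, and applied automatically across the whole corpus, while a human's job is to write and vet the rules.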
While the Toolkit claims to aim to be a supplement to, rather than a replacement for, traditional 'cataloging' tools, it has a clear and explicitly stated role in authority work, one function that is in fact currently done in more traditional cataloging tools. And in general, I'd argue that the true promise of the Toolkit lies in integrating it with tools for doing more traditional individual item-at-a-time metadata control, in a larger coherent rational package.
Very exciting stuff to me. This is part of what the future of cataloging should look like.
Biblios.net Shared Metadata Repository
I recently blogged about biblios.net, and the Biblios Web Services (announcement and web page forthcoming, but hints can be seen in the biblios.net web page, and my blog post). While the XC Metadata Services Toolkit is focused on the individual institution, the biblios.net repository is explicitly a shared metadata repository (and control system).
Catalogers have long realized that the effective control of the giant universe of metadata we've got relies on cooperation and sharing of metadata. But the practices and the technological infrastructure we have for doing this have become outdated and insufficient.
Too many catalogers are making the same changes and fixes in parallel, instead of efficiently sharing them. When it is desirable to share additions and fixes to metadata, too much human effort is currently required to publish and subscribe to these changes, a process that should instead be almost wholly automated.
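What "almost wholly automated" publish/subscribe might look like can be sketched in a few lines. This is a toy illustration with hypothetical names, not any real biblios.net API: each library registers a filter describing which changes it cares about, and a published fix then propagates with no human effort on the subscriber's side.

```python
class ChangeFeed:
    """A shared feed of metadata changes with rule-based subscriptions."""

    def __init__(self):
        self.subscribers = []   # (filter_fn, callback) pairs

    def subscribe(self, filter_fn, callback):
        """Register interest in changes matching filter_fn."""
        self.subscribers.append((filter_fn, callback))

    def publish(self, change):
        """Deliver a change to every subscriber whose filter matches."""
        for filter_fn, callback in self.subscribers:
            if filter_fn(change):
                callback(change)

feed = ChangeFeed()
received = []

# This library subscribes only to changes touching author headings.
feed.subscribe(lambda c: c["field"] == "author",
               lambda c: received.append(c))

feed.publish({"record": "rec1", "field": "author", "new": "Austen, Jane"})
feed.publish({"record": "rec2", "field": "title", "new": "Emma"})
# Only the author change reaches this subscriber; the title change
# flows past without any human having to notice or re-key it.
```

The design choice worth noticing is that the subscriber's policy is expressed as a rule, so "which fixes do we accept?" becomes configuration rather than labor.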
Biblios.net has the right infrastructure and business model to support the cheap, efficient, and automated sharing of metadata. It's also designed to support multiple formats (not just MARC), and is designed from the start with web service APIs to enable it to exist as one component in a larger environment, or many larger environments, which in turn collectively form our Metadata Universe. Each of our individual cataloging/metadata corpora is really just one slice of the entire Metadata Universe; it's time we started acting like it, with systems to support that. I don't actually think that requires us all sharing only one metadata store; instead it requires tools like those this essay discusses.
In my ideal cataloging/metadata future, I'd see multiple shared metadata repositories, which interact with each other and with each of our institutions' metadata infrastructures to form a larger whole. WorldCat will certainly continue to play a role in that. Ex Libris has discussed a similar vision of a modernized shared metadata repository as part of their URM Strategy, but they still don't have anything published on that today. It's LibLime's biblios.net that is making the first concrete steps toward this future.
Incidentally, LibLime’s Biblios cataloger’s editor also provides a good example of an evolution of the cataloger’s editor to provide machine help where it is needed (the fact that most cataloger’s editors require you to fill in MARC ‘fixed fields’ more or less manually is a crime!), while integrating with a next generation shared metadata repository.
Version Control of Metadata
Version control systems like CVS, Subversion, and their more advanced and recent progeny like Git and Bazaar were originally designed to help software developers keep track of their source code in projects worked on by multiple developers. Over the past 20 years, they have become increasingly sophisticated and mature, and are effectively used to support software development projects with thousands of geographically distributed developers and millions of lines of code.
In his Code4Lib Journal article Distributed Version Control and Library Metadata, Galen Charlton makes the case for the applicability of this technology to the shared cataloging environment.
A distributed version control system applied to our cooperative cataloging corpus could allow us to keep track of each individual change to a shared record, who made it, and when it was made. These systems could also allow us to choose, in a bulk, automated, and rule-based way, which sorts of changes from which authors we'd like to automatically incorporate into our systems and which we would not; to merge changes from different sources into the same records; and to support automated publishing of and subscribing to metadata changes in the larger universe in a controlled way.
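A rough sketch of that record-level change tracking might look like the following. This is a hypothetical illustration for this essay, not Git or any real distributed version control API: every edit carries its author and timestamp, and a local, rule-based policy decides which upstream changes to auto-incorporate.

```python
from datetime import datetime, timezone

class VersionedRecord:
    """A metadata record that remembers every change applied to it."""

    def __init__(self, data):
        self.data = dict(data)
        self.history = []   # full audit trail: who, what, when

    def apply(self, change):
        self.data[change["field"]] = change["value"]
        self.history.append(change)

def make_change(author, field, value):
    """Package an edit with its provenance."""
    return {"author": author, "field": field, "value": value,
            "when": datetime.now(timezone.utc)}

def sync(record, incoming, policy):
    """Incorporate only those upstream changes the local policy trusts."""
    accepted = [c for c in incoming if policy(c)]
    for change in accepted:
        record.apply(change)
    return accepted

rec = VersionedRecord({"title": "Emma"})
incoming = [
    make_change("trusted-library", "author", "Austen, Jane"),
    make_change("unknown-source", "title", "EMMA!!!"),
]
# Local policy: automatically accept changes only from sources we trust;
# everything else waits for human review.
sync(rec, incoming, policy=lambda c: c["author"] == "trusted-library")
```

A real system would need content merging, conflict resolution, and distribution, which is exactly where mature version control technology has two decades of answers waiting to be adapted.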
These kinds of features also provide the platform for large scale integration of separate shared cataloging/metadata repositories into one coherent universe.
Existing distributed version control systems, not designed for precisely this purpose, will take a bit of creative invention and work to fit this purpose. But the pieces are there waiting to be creatively put together.
Since Galen works for LibLime too, I hold out hope that this kind of technology may find a home supporting the biblios.net shared repository.
The whole is greater than the sum of its parts
The true promise of these tools will come when they are integrated together into a larger whole. Cataloger’s editors, toolkits for bulk management, shared metadata repositories, all working together to form the cataloging infrastructure of the future.
For this to happen, we need as few barriers as possible to as many people as possible (catalogers and developers both) investigating these tools, experimenting with these tools, and independently working on taking these tools to the next level(s). Everyone tries different stuff; we learn best practices and next steps from that. This requires both our software and our data to be designed for inter-operability, and open to such experimentation, not kept in proprietary walled gardens.
An honorary 'exciting tool' mention goes to the OCLC Seel project. Seel is a particularly elegant and useful approach to a reusable framework for controlling metadata in multiple formats. Unfortunately, it's proprietary technology unavailable for production use by anyone but OCLC.
The Role of the Cataloger
Changes are afoot, and our cataloging practices need to change to keep apace with new possibilities and requirements. Unfortunately, the de-professionalization and general decimation of most cataloging departments have left us seriously deprived of talent and time to work on updating and reinvigorating our practices.
In my mind, the future cataloger does not need to be a computer programmer. However, the future cataloger does need to have a good grasp of what some have called "computational thinking". This includes: an understanding of what things can be done well by computers, and what things are only done well by humans; and an understanding of algorithms and how to divide a problem into discrete chunks.
To me, ‘computational thinking’ seems not too big a jump from the kind of rule-based thinking a professional and theoretically grounded cataloger would already have/need. But that may be a prejudice of my own standpoint in my own brain.
Catalogers and Metadata Controllers (and I don’t think there is a difference between these two things) need to know what they should be expecting and demanding of computers, and what instead needs a human touch. They should be able to imagine (not implement, but imagine, and perhaps design) the software infrastructure both locally and cooperatively necessary to efficiently achieve their aims. They should be able to understand how to put together different technological tools to meet their aims.
Where are these future cataloging leaders (whom, in truth, we need Right Now, not in the future) going to come from? Where will they be trained (both those new to the field and old hands investigating new problems), where will they find employment, and what employers will give them environments where they are supported and expected to think creatively and collectively create the cataloging and Metadata Control of the future?
I wish I knew.
Standards and Practices
When innovations are suggested, the response from traditional cataloging departments is often: We can’t do that, it would violate AACR2 / WorldCat policies / ISBD / standard conventions.
And surely, shared standards and conventions are vital for supporting the shared metadata creation that is the only solution for our metadata challenges.
But the existing disparate and poorly integrated collections of legacy standards, rules, and unwritten conventions we have inherited are ill-suited to the tasks at hand. More people involved in standards creation need to understand that kind of "computational thinking" mentioned earlier, to understand how standards can be well or poorly suited to creating metadata that can be processed in a sophisticated way by software. (In fact, I would consider these standards to be a particular kind of "technology", not a separate thing from "technology".)
Because software is necessarily intimately involved in our metadata, from the infrastructures necessary to support the large-scale distributed collaboration we need to create and maintain our metadata, to the software that will provide patrons access to navigating, querying, and discovering this metadata.
RDA had the promising goals of creating a metadata creation standard that resulted in metadata well suited to machine manipulation and control; and that could be easily grasped by the newcomer (including those in other fields) well enough to at least create basic compliant metadata (and transform metadata to and from other standards in basic ways), even without the sophisticated details. It looks to me like it has failed on both counts. It is an unmanageable Frankenstein's monster, and not particularly suited to the 21st-century technological environment either, although it does make some significant steps and attempts.
(I am particularly excited by the DC-RDA effort to describe RDA's metadata schema(s), element(s), and vocabulary(ies) using modern machine-suitable methods of description. They are using Jon Phipps' Metadata Registry software to support this; another exciting tool that could have been mentioned in the Tools section of this essay. Diane Hillmann and Karen Coyle, among others, have put in some serious and well-considered work on this aspect of RDA.)
But if the RDA effort has expended this much time and effort, and still not succeeded — what else do we have? What can we do? Additionally, RDA's business model of a for-profit standard that one must pay to access is not well suited to its adoption and success, even if the standard itself did meet its goals.
One important part of developing standards suited to the current environment is making an explicit and formal description of our 'data domain model', our shared mental model for formally describing the resources we care about. This is something that previously has been described in several non-coordinated places (AACR2, MARC, ISBD), as well as left unstated and implicit, held only in catalogers' minds and shared understanding (thus subject to misunderstanding, and hard to communicate to newcomers).
FRBR and FRAD were the attempts to do this, and RDA wisely based itself upon them. FRBR is a good start, but only a start. In the 10 years since its publication, there has been very little investigation and further development of FRBR, which it so desperately needs. FRAD, on the other hand, in my opinion totally misses the boat: it does not provide an adequate conceptual model even for what our 'authority work' has been historically, let alone what it should be going forward.
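To show what "making the domain model explicit" can mean in practice, here is a minimal sketch using a drastically simplified version of FRBR's Group 1 entities. The field names are illustrative choices for this essay, not drawn from any actual schema; the point is that once the model is written down formally, software and newcomers alike can work with it directly instead of inferring it.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:                # a concrete copy on a shelf
    barcode: str

@dataclass
class Manifestation:       # a published embodiment (an edition)
    isbn: str
    items: List[Item] = field(default_factory=list)

@dataclass
class Expression:          # a realization of the work (e.g. a translation)
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)

@dataclass
class Work:                # the abstract intellectual creation itself
    title: str
    expressions: List[Expression] = field(default_factory=list)

emma = Work(
    title="Emma",
    expressions=[Expression(
        language="eng",
        manifestations=[Manifestation(
            isbn="0306406152",
            items=[Item(barcode="39001")],
        )],
    )],
)
```

An explicit model like this is exactly what makes the bulk transformations, de-duplication, and merging discussed earlier tractable for machines, because "same work, different manifestation" becomes a statement software can check rather than a judgment locked in a cataloger's head.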
So what is our hope to get what we need? I am not sure. I am somewhat depressed thinking about it. All I know is it's going to have to be one step at a time, one piece at a time, based on demonstrated best practices, not the monolithic all-at-once attempt that was RDA.
But doing things in this agile way is very difficult when we have so many fragilely inter-operating systems based on ill-defined conventions that we need to keep working throughout; and when there are so many cataloging departments staffed to run on autopilot without the resources to adapt.
OCLC and LC are well-positioned to exercise some leadership out of this logjam, but I am not optimistic about that happening from either, for various reasons.