I just finished watching an ALCTS presentation on RDA by Christine Oliver, entitled "Benefits for Users and Catalogers."
What the presentation actually focused on, though, was the goals and principles of RDA, which is a fine topic for a presentation. I thought it did a pretty good job of explaining what those goals are and why they matter, and got that across pretty well.
Basically: to let us record data that is machine-actionable; data elements that represent individual units of meaning (rather than phrases in a narrative description, as the ISBD perspective of AACR2 tends to encourage); data based on a clear conceptual (entity-relationship) model; and instructions about semantic data elements that don't assume a particular encoding/serialization (e.g. MARC) or a particular way the data will be presented.
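Just to make that concrete (this is my own toy illustration, not anything taken from RDA or MARC): here's the same publication information as a transcribed phrase vs. as discrete, individually-addressable data elements.

```python
# Hypothetical illustration (not actual RDA or MARC structures):
# the same publication information as one transcribed phrase vs.
# discrete data elements.

# ISBD/AACR2 style: a narrative string a human can read, but which
# software can only treat as an opaque blob of text.
transcribed = "2nd ed. -- New York : Example Press, 2009."

# Element-oriented style: each unit of meaning is its own field, so
# software can sort, filter, match, and merge on any of them.
elements = {
    "edition_statement": "2nd ed.",
    "place_of_publication": "New York",
    "publisher_name": "Example Press",
    "date_of_publication": "2009",
}

# With discrete elements, simple machine actions become trivial:
print(elements["date_of_publication"])   # -> "2009"
```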
All of that should make it more feasible for our data to interoperate with other, non-RDA data (often called "other communities' data", though even our own library metadata community has not been limited to AACR2 or MARC for some time). The basis in an explicit conceptual model should also allow the standard(s) to be extended later in a flexible way, without making a patchwork mess of things.
Additionally, the standard is meant to allow cataloger and community judgment about how best to combine the supplied data elements to meet user needs, rather than AACR2's style of very precise instructions for when to fill out each data element. If a data element applies, you can provide it. RDA also intends to make the instructions somewhat simpler: rather than whole separate chapters for different media types and so forth, it's just the data elements and how to fill them out, with more specific instructions for how a given element should be handled for certain kinds of materials following the general instructions.
So far, so good — these are the right things for a modern metadata standard to be trying to do. The presentation did a good job, I thought, of laying these things out and explaining why they matter, in a fairly short period of time and to a somewhat diverse cataloger audience.
I remain somewhat worried about whether RDA actually accomplishes these reasonable and appropriate goals. Last time I tried to read RDA, I found it to be a pretty impenetrable monster, and kind of gave up.
I worry that even if RDA succeeds in providing a proper focus on explicitly modelled semantic units (I'm not sure it does), shoe-horning that data into MARC-with-a-few-tweaks, as is currently planned, might wind up being the worst of both worlds. Catalogers will, I worry, have to spend more time, still creating the legacy elements which aren't covered by RDA and which in many cases are unnecessary duplication of RDA elements. And for that extra time, data created according to RDA but shoe-horned into MARC (with loss of granularity in several places) may not actually deliver the easier machine-actionability and interoperability it is designed for. Catalogers will, I worry, also remain in the somewhat confusing environment of having a bunch of different rule sets governing how to fill things out (RDA, then MARC; or only MARC for things not covered by RDA but which still need doing; etc.)
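To illustrate the granularity worry (again, a made-up sketch of my own, not an official RDA-to-MARC mapping): element-level data about a person gets flattened into formatted, MARC-style subfield strings that downstream software then has to parse apart again to recover what we already knew.

```python
# Hypothetical illustration of granularity loss (not an official mapping):
# distinct elements about a person, flattened into MARC-style subfields.

person_elements = {
    "surname": "Smith",
    "forename": "Jane",
    "date_of_birth": "1950",
    "date_of_death": None,          # still living; explicitly absent
}

# Squeezed into a MARC 100-style heading, the distinct elements become
# formatted strings; software downstream has to parse them back apart
# (and guess at punctuation conventions) to get the pieces back.
marc_100 = {
    "a": f"{person_elements['surname']}, {person_elements['forename']},",
    "d": f"{person_elements['date_of_birth']}-",
}

print(marc_100)   # {'a': 'Smith, Jane,', 'd': '1950-'}
```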
Data model vs. data creation rules
I also worry about the fact that RDA combines two things that should really be separate: the entity-relationship-attribute model on the one hand, and the rules for supplying data to attributes and making relationships on the other. These two things should really be de-coupled, with the latter depending on the former but the former being independent.
Of course, our legacy environment is even worse, with the 'data model' being supplied by an unholy combination of ISBD on the one hand (which AACR2 is theoretically based on, but which is a wholly inadequate model for the 21st-century computing environment) and MARC on the other (which was never really designed to be a 'data model', but which has come to fill that role for historical reasons, and which AACR2 pretends does not exist). The data model should not have to reference the rules for filling it out, but the rules for filling it out absolutely need to reference the data model — MARC/AACR2 has it backwards.
[The presentation suggested that RDA uses the FRBR data model. That isn't entirely accurate: RDA uses a data model based on FRBR, which is entirely appropriate, but it changes and fleshes FRBR out in a variety of ways; the FRBR document alone does not describe RDA's data model.]
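For anyone who wants a picture of what that FRBR-ish entity-relationship idea looks like, here's a toy sketch (my own simplification, not RDA's actual entity or attribute set) of the Work/Expression/Manifestation/Item chain with explicit relationships:

```python
# A toy sketch of the FRBR-style entity model RDA builds on
# (Work -> Expression -> Manifestation -> Item); my own simplification,
# not RDA's actual entities or attributes.
from dataclasses import dataclass

@dataclass
class Work:
    title: str

@dataclass
class Expression:
    realizes: Work
    language: str

@dataclass
class Manifestation:
    embodies: Expression
    publisher: str
    date: str

@dataclass
class Item:
    exemplifies: Manifestation
    barcode: str

# The relationships are explicit attributes, not implied by where a
# phrase happens to sit in a textual description.
moby = Work(title="Moby Dick")
english_text = Expression(realizes=moby, language="eng")
penguin_2003 = Manifestation(embodies=english_text,
                             publisher="Penguin", date="2003")
my_copy = Item(exemplifies=penguin_2003, barcode="39001001234567")
```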
So, you know, if RDA succeeds in presenting a coherent and modern data model, and in having rules that reference that data model — that's still a great first step; the two can be "de-tangled" appropriately later. [The work Diane Hillman, Karen Coyle, and others are doing on a formal expression of RDA's data model is a key step in this direction; more below.]
And here's what I worry about the most: having clearly defined models and rules at the right level of granularity only helps interoperability and uptake of your data by other communities if other people can actually see your model and rules! Having your "standards" available only behind an expensive paywall is the best way to ruin your data's chances of actually becoming interoperable and used by other communities.
Disagreements: on "hidden webs" and the proven value of "linked data"
There's a sort of idea that seems to have become very popular in talking about these issues. In this presentation it was expressed as something along the lines of: "Our AACR2-MARC data is in the 'hidden web'; we need to get it into the 'open web' by, oh, putting it in XML or something." (Heh, obviously a rough and unfair paraphrase.)
There’s an important truth at the basis of this, but the way it gets expressed corrupts it so much that I think it’s misleading, and ends up just waving buzzwords around instead of helping catalogers understand modern metadata practices.
First, it's not entirely clear what those buzzwords "hidden/deep/dark web" vs. "open/public/of-the-web" mean. Most commonly, the former seems to refer to things that aren't easily indexable by Google and similar search engines. The reasons our public-facing OPAC pages aren't easily indexed by Google et al. actually have little to do with MARC-AACR2; they have more to do with the early-1990s-style design of our OPAC software, which uses non-persistent, session-dependent URLs to deliver pages. If the software instead used more 21st-century web design approaches — RESTful URLs, standard methods to retrieve alternate format representations of a document, and, heck, throw in Google Sitemaps to boot — then our OPAC web pages would no longer be in the "hidden/deep/dark web". It actually has nothing to do with MARC-AACR2; it's a separate issue.
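As a rough sketch of what I mean (using Flask purely for illustration; the URL pattern, record IDs, and formats here are all invented), an OPAC could serve each record at one persistent, bookmarkable URL and hand back alternate representations via ordinary content negotiation:

```python
# Minimal sketch, not any real OPAC's implementation: one stable URL per
# record, with the representation chosen by standard content negotiation.
import json
from flask import Flask, request

app = Flask(__name__)

# Stand-in for the catalog; IDs and fields are invented.
RECORDS = {"b1234567": {"title": "Moby Dick", "date": "1851"}}

@app.route("/catalog/record/<record_id>")
def show_record(record_id):
    record = RECORDS.get(record_id)
    if record is None:
        return "Not found", 404
    # Same persistent URL every time; the format returned depends on the
    # Accept header, not on fragile session state.
    best = request.accept_mimetypes.best_match(
        ["text/html", "application/xml", "application/json"])
    if best == "application/json":
        return json.dumps(record), 200, {"Content-Type": "application/json"}
    if best == "application/xml":
        xml = f"<record><title>{record['title']}</title></record>"
        return xml, 200, {"Content-Type": "application/xml"}
    return f"<h1>{record['title']}</h1>", 200, {"Content-Type": "text/html"}
```

Pages served this way are trivially crawlable and linkable, regardless of what cataloging rules produced the underlying data.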
So what's the kernel of truth at the basis of these sorts of statements? The problem is not the "hidden web"; the problem is that our data is not very interoperable with any other data, and our data is very difficult (meaning expensive), or sometimes impossible, for software to make effective use of. Just "putting it in XML" won't solve this: MarcXML, or even MODS automatically transformed from MARC, are XML, but they don't solve this problem at all. What does solve this problem? Well, exactly the goals of RDA from the first section of this post, fairly clearly elucidated by Christine.
But trying to simplify this as being about the "hidden web" confuses more than it clarifies. It would be better to say — and this is what I think Christine and others really mean — that our current data is very difficult to make interoperable with other people's data and systems, and even difficult for our own systems to take full advantage of. To fix this problem we need data based on an explicit and clear data model, and data composed of individual semantic elements divorced from their serialization/encoding or presentation. I think Christine actually did a pretty good job of explaining that in the time she had; I just think this "hidden web" stuff does nothing but muddy the waters.
Or, similarly, talking about linked data instead
From another perspective, there's a different formulation of this sort of phrasing, one I think is still misleading, that Karen Coyle likes to use, and that I heard her use in a different presentation on RDA and Linked Data a couple weeks ago (I think it was part of this same ALCTS series). It goes something like (and I'm probably unfairly paraphrasing again): "Our data is in 'record' format, which is not easily usable on the web; making our data free-floating assertions instead will make it usable on the web."
Again, I think this is really about interoperability and machine-actionability — and I think that whether the data is chunked into records or not doesn't in fact matter that much. ALL a "record" is — if (and it's a big if) it's properly modelled — is a collection of "assertions" all about the same subject entity. That's it. There's no serious epistemic/metaphysical/ontological gap between "record-based metadata" and "assertion-based metadata"; record-based metadata is just an aggregation of assertions all about the same subject entity.
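A quick toy illustration of the point (identifiers and field names are my own inventions): the same information as a record and as free-standing assertions, with a mechanical conversion in both directions.

```python
# My own toy illustration: a "record" is just a bundle of assertions
# that share a subject. Identifiers and field names are invented.

# Record-style: one aggregate about one entity.
record = {
    "id": "work:42",
    "title": "Moby Dick",
    "creator": "Melville, Herman",
    "date": "1851",
}

# Assertion-style: the same information as free-standing
# (subject, predicate, object) statements.
assertions = [
    ("work:42", "title",   "Moby Dick"),
    ("work:42", "creator", "Melville, Herman"),
    ("work:42", "date",    "1851"),
]

# Going between the two is mechanical in both directions, provided the
# record was properly modelled in the first place.
as_assertions = [("work:42", k, v) for k, v in record.items() if k != "id"]
as_record = {"id": "work:42", **{p: o for s, p, o in assertions}}
```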
The RDF/Linked Data/Semantic Web "project" (in the sense of something worked on by a bunch of people with a common goal, not a centralized formal project) intends to accomplish certain things in the realm of data interoperability which, yes, depend on this kind of "free-floating assertion" — but I think the RDF project is, as Diane Hillman quotes Stu Weibel, "an aspirational target of great promise and unproven benefit." (Concisely stating something I've been trying to say for a while, without finding a lot of agreement amongst code4lib-type library coders. Assuming Stu is quoted accurately: thanks, Stu, maybe people will take it better coming from you than from me!) We shouldn't be putting all our eggs in the basket of the unproven benefit of RDF; I don't think it's a settled question whether free-floating "assertions" really will give us much benefit over "records".
Now, don't get me wrong, the work Karen and Diane Hillman are doing with DCMI/RDA on RDF modelling of RDA is really important — but it's not actually important because it's RDF, or because it's about "assertions not records" — it's important because it is doing the same thing we keep talking about here: creating the kind of explicit, formally defined model of our data, based on individual semantic data elements divorced from encoding or presentation. That they've chosen to do it within the framework of RDF/linked data is a fine choice — you've got to choose something to start with — and it will have value whether or not the overall RDF/linked-data project itself ends up showing value; its value is not really about "assertions instead of records". Again, all a properly modelled record is, is an aggregated set of assertions about the same subject entity. No big thing. A sloppily modelled record, on the other hand, is just a buncha text. RDF is a fine avenue to creating formal and explicit modelling, no problem. But I think it's overly confusing and misleading to suggest that "assertions vs. records" is a big deal, or a resolved argument.