ALCTS RDA presentation

I just finished viewing an ALCTS presentation on RDA by Christine Oliver, entitled “Benefits for Users and Catalogers.”

What the presentation actually was, though, was a focus on the goals and principles of RDA, which is a fine topic for a presentation. I thought it did a pretty good job of explaining what those goals are and why they matter; it got that across pretty well.

Basically: to let us record data that is machine-actionable, with data elements that represent individual units of meaning (rather than phrases in a narrative description, as the ISBD perspective of AACR2 tends to lead to); that is based on a clear conceptual (entity-relationship) model; and to have instructions about semantic data elements without assuming a particular encoding/serialization (e.g. MARC) or a particular way the data will be presented.
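
To make that a little more concrete, here’s a minimal sketch in Python of the difference (the element names are my own invention for illustration, not actual RDA vocabulary):

    # One ISBD-ish narrative string: several distinct facts fused into a
    # single punctuated phrase, usable by software only via fragile parsing.
    narrative = "London : Penguin, 1972. -- 320 p. ; 18 cm."

    # The same facts as individual, machine-actionable elements, recorded
    # without assuming any particular encoding (MARC or otherwise) or display.
    elements = {
        "place_of_publication": "London",
        "publisher_name": "Penguin",
        "date_of_publication": "1972",
        "extent": "320 pages",
        "dimensions": "18 cm",
    }

    # Software can act on a single unit of meaning directly:
    print(elements["publisher_name"])  # => Penguin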

All that should make it more feasible for our data to be made interoperable with other, non-RDA data (often called “other communities’ data”, though even our own library metadata community has not been limited to AACR2 or MARC for some time). The basis in an explicit conceptual model should also allow the standard(s) to be more easily extended in a flexible way later, without making a patchwork mess of things.

Additionally, the standard is meant to allow cataloger and community judgement about how best to combine the supplied data elements to meet user needs, rather than AACR2’s style of very precise instructions for when to fill out each data element: if a data element applies, you can provide it. And RDA intends to make its instructions somewhat simpler, without whole separate chapters for different media types or what have you; instead it’s just the data elements and how to fill them out, with more specific instructions for how a given element is handled for certain kinds of materials following the general instructions.

So far, so good — these are the right things for a modern metadata standard to be trying to do. The presentation was, I thought, good at laying these things out and explaining why they matter, in a fairly short period of time, to a somewhat diverse cataloger audience.

My Worries

I remain somewhat worried about whether RDA actually accomplishes these reasonable and appropriate goals.  Last time I tried to read RDA, I found it to be a pretty impenetrable monster, and kind of gave up.

I worry that even if RDA succeeds in providing a proper focus on explicitly modelled semantic units (I’m not sure it does), shoe-horning that data into MARC-with-a-few-tweaks, as is currently planned, might wind up being the worst of both worlds. Catalogers will, I worry, have to spend more time, still creating the legacy elements which aren’t covered by RDA and are in many cases unnecessary duplications of RDA elements. And for that extra time, data created according to RDA but shoe-horned into MARC (with loss of granularity in several places) may not actually provide the intended benefits of easier machine-actionability and interoperability. And catalogers will, I worry, remain in the somewhat confusing environment of having a bunch of different rule sets governing how to fill things out (RDA, then MARC; or only MARC for things not covered by RDA but which still need doing; etc.)
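
As a toy illustration of the granularity worry (the punctuation conventions here are a stand-in; real MARC does subfield some of this, but the pattern of fusing distinct facts into display strings recurs in plenty of places):

    # Discrete elements are easy to flatten into a punctuated display string...
    elements = {"place": "London", "publisher": "Penguin", "date": "1972"}
    flattened = "%(place)s : %(publisher)s, %(date)s." % elements
    print(flattened)  # => London : Penguin, 1972.

    # ...but going the other way means guessing at punctuation conventions,
    # which is exactly the fragile parsing we were trying to escape.
    recovered_place = flattened.split(" : ")[0]
    assert recovered_place == elements["place"]  # works here; breaks on messy data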

Data model vs. data creation rules

I also worry about the fact that RDA combines two things that should really be separate: the entity-relational-attribute model on the one hand, and the rules for supplying data to attributes and making relationships on the other. These are two things that should really be de-coupled, with the latter depending on the former but the former being independent.

Of course, our legacy environment is even worse, with the ‘data model’ being supplied by an unholy combination of ISBD on the one hand (which AACR2 is theoretically based on, but which is a wholly inadequate model for the 21st-century computing environment) and MARC on the other (which was never really designed to be a ‘data model’, but which has come to fill that role for historical reasons, and which AACR2 pretends does not exist). The data model should not have to reference the rules for filling it out, but the rules for filling it out absolutely need to reference the data model! MARC/AACR2 has it backwards.

[The presentation suggested that RDA uses the FRBR data model. This isn't entirely accurate: RDA uses a data model based on FRBR, which is entirely appropriate, but changes and fleshes it out in a variety of ways; the FRBR document alone does not describe RDA's data model.]

So, you know, if RDA succeeds in presenting a coherent and modern data model, and in having rules that reference that data model — I guess that’s still a great first step; the two can be “de-tangled” appropriately later. [The work Diane Hillman, Karen Coyle, and others are doing on a formal expression of RDA's data model is a key step in this direction; more below.]

And here is what I worry about the most: having clearly defined models and rules at the right level of granularity only helps interoperability and uptake of your data by other communities if other people can see your model and rules! Having your “standards” available only behind an expensive paywall is the best way to ruin your data’s chances of actually becoming interoperable and used by other communities.

Disagreements: on “hidden webs” and the proven value of “linked data”

There’s a sort of idea which seems to have become very popular in talking about these issues. In this presentation it was expressed as something along the lines of: “Our AACR2-MARC data is in the ‘hidden web’; we need to move it to the ‘open web’ by, oh, putting it in XML or something.” (Heh, obviously a rough, unfair paraphrase.)

There’s an important truth at the basis of this, but the way it gets expressed corrupts it so much that I think it’s misleading, and ends up just waving buzzwords around instead of helping catalogers understand modern metadata practices.

First, it’s not entirely clear what those buzzwords “hidden/deep/dark web” vs. “open/public/of-the-web” mean. Most commonly the former seems to refer to things that aren’t easily indexable by Google and other similar search engines. The reasons our public-facing OPAC pages aren’t easily indexed by Google et al. actually have little to do with MARC-AACR2; they have more to do with the early-1990s-style design of our OPAC software, which uses non-persistent, session-dependent URLs to deliver pages. If the software instead used more 21st-century web design approaches — RESTful URLs, standard methods to retrieve alternate format representations of a document, and, heck, Google Sitemaps thrown in to boot — then our OPAC web pages would no longer be in the “hidden/deep/dark web”. It actually has nothing to do with MARC-AACR2; it’s a separate issue.
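
For instance, once every record has one stable URL, producing a crawlable sitemap is nearly trivial. A sketch using only the Python standard library (the URL pattern and record ids are made up):

    from xml.etree import ElementTree as ET

    record_ids = ["bib1", "bib2", "bib3"]  # hypothetical persistent identifiers

    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for rid in record_ids:
        url = ET.SubElement(urlset, "url")
        # One stable, session-independent URL per record:
        ET.SubElement(url, "loc").text = "https://opac.example.edu/record/" + rid

    print(ET.tostring(urlset, encoding="unicode"))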

So what’s the kernel of truth at the basis of these sorts of statements? The problem is not the “hidden web”; the problem is that our data is not very interoperable with any other data, and that it is very difficult (meaning expensive), or sometimes impossible, for software to make effective use of. Just “putting it in XML” won’t solve this: MARCXML, or even MODS automatically transformed from MARC, is XML, but doesn’t solve this problem at all. What does solve this problem? Well, exactly the goals of RDA from the first section of this post, fairly clearly elucidated by Christine.

But trying to simplify this as being about the “hidden web” confuses more than it clarifies. It would be better to say — and I think this is what Christine and others really mean — that our current data is very difficult to make interoperable with other people’s data and systems, and even difficult for our own systems to take full advantage of. To fix this problem we need data based on an explicit and clear data model, composed of individual semantic elements divorced from their serialization/encoding or presentation. I think Christine actually did a pretty good job of explaining that in the time she had; I just think this “hidden web” stuff does nothing but muddy the waters.

Or, similarly, talking about linked data instead

From another perspective, there’s a different formulation of this sort of phrasing, which I think is still misleading, that Karen Coyle likes to use, and which I heard her use in a different presentation on RDA and Linked Data a couple weeks ago (I think it was part of this same ALCTS series). It went something along the lines of (and I’m probably unfairly paraphrasing again): “Our data is in ‘record’ format, which is not easily usable on the web; making our data instead free-floating assertions will make it usable on the web.”

Again, I think this is really about interoperability and machine-actionability — and I think that whether the data is chunked into records or not in fact doesn’t matter that much. ALL a “record” is — if (and it’s a big if) it’s properly modelled — is a collection of “assertions” all about the same subject entity. That’s it; there’s no serious epistemic/metaphysical/ontological gap between “record-based metadata” and “assertion-based metadata”. Record-based metadata is just an aggregation of assertions all about the same subject entity.
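
A little Python sketch of that point (the identifier and element names are invented): round-tripping between a “record” and a pile of assertions loses nothing, provided the record was properly modelled to begin with:

    # A record: a bundle of facts about one subject entity.
    record = {"id": "work:123",
              "title": "Moby Dick",
              "creator": "Melville, Herman"}

    # "Unbundle" it into free-floating (subject, predicate, object) assertions:
    assertions = [(record["id"], key, value)
                  for key, value in record.items() if key != "id"]

    # "Rebundle": gather every assertion about the same subject back up.
    rebundled = {"id": "work:123"}
    for subject, predicate, obj in assertions:
        if subject == "work:123":
            rebundled[predicate] = obj

    assert rebundled == record  # the same information, either way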

The RDF/Linked Data/Semantic Web “project” (in the sense of something worked on by a bunch of people with a common goal, not a centralized formal project) intends to accomplish certain things in the realm of data interoperability which, yes, depend on this kind of “free-floating assertion” — but I think the RDF project is, as Diane Hillman quotes Stu Weibel, “an aspirational target of great promise and unproven benefit.” (That concisely states something I’ve been trying to say for a while, without finding a lot of agreement amongst code4lib-type library coders. Assuming Stu is quoted accurately, thanks Stu; maybe people will take it better coming from you than from me!) We shouldn’t be putting all our eggs in the basket of the unproven benefit of RDF; I don’t think it’s a settled question whether free-floating “assertions” really will give us much benefit over “records”.

Now, don’t get me wrong: the work Karen and Diane Hillman are doing with DCMI/RDA on RDF modelling of RDA is really important — but it’s not actually important because it’s RDF, or because it’s about “assertions not records”. It’s important because it is doing the same thing we keep talking about here: creating the kind of explicit, formally defined model of our data, based on individual semantic data elements divorced from encoding or presentation. That they’ve chosen to do that within the framework of RDF/linked data is a fine choice — you’ve got to choose something to start with — and it will have value whether or not the overall RDF/linked-data project itself ends up showing value, and its value is not really about “assertions instead of records”. Again, all a properly modelled record is, is an aggregated set of assertions about the same subject entity. No big thing. A sloppily modelled record, on the other hand, is just a buncha text. RDF is a fine avenue to creating formal and explicit modelling, no problem. But I think it’s overly confusing and misleading to suggest that “assertions vs. records” is a big deal, or a resolved argument.


10 Responses to ALCTS RDA presentation

  1. “I also worry about the fact that RDA combines two things that should really be separate: the entity-relational-attribute model on the one hand, and the rules for supplying data to attributes and making relationships on the other. These are two things that should really be de-coupled, with the latter depending on the former but the former being independent.”

    Code4Libbers have been saying this for some time. Can we propose a model that would work, and be acceptable to catalogers? It’s worth trying!

  2. Bibliographic “records” don’t make sense outside of libraries, so I think Karen’s way of talking about this is helpful. One worry is that catalogers may think that we’re throwing the baby out with the bathwater: things are still bundled even when they’re in more interoperable formats.

  3. SemWeb is not about “free-floating” records: it’s about expressing rich metadata that others can
    1) at least parse
    2) depending on the vocabularies we use, pull into their own environments and reuse. And use as much or as little as they want, because they’re already “unbundled” as well as being “bundled”

    You usually have interesting objections, so I hope we get to talk more about them on #code4lib sometime soon!

  4. Jonathan, I think you’re absolutely right about what a record is (and isn’t), but Karen’s (and my) aim in talking about record-less ways of looking at our data is to try to change thinking, not necessarily to push a particular solution. Without this change in thinking, we get nowhere trying to talk to folks about why RDA can’t be thought of as new wine in old bottles (MARC), and why the old ‘hub-and-spoke’ distribution model with OCLC in the middle no longer works very well for us. In a more open world, using data outside our usual ‘silo’, it really does matter how we manage it, and in particular how we THINK about the data and the challenges we have in managing it over time. Records are fine for various kinds of exchange, but that’s output (and perhaps input), not necessarily how we think about the data itself, much less how we improve it using the right balance of humans and machines.

    And I don’t think anyone’s really talking about ‘free-floating assertions’. The fact is that there’s data out there we can use, and that others can use ours (once we get it out of the MARC straitjacket). All these assertions need to come with provenance, so we know who said it, and when.

    And yes, we know that what we’re doing with the RDA Vocabularies will work in a broad range of encodings and systems — we built it that way, knowing that significant numbers of folks will be in the XML world for some time to come. But if we hadn’t thought about RDF, and weren’t paying attention to that simple but powerful way of thinking about and building data, we would have failed miserably in our task.

    BTW, that quote from Stu was verbatim, and might even be on his slides, which are well worth looking at — but if you do look, don’t miss Mike Bergman’s either (he did the final keynote). A great set of bookends for a very good conference.

  5. jrochkind says:

    That’s actually the least of my concerns — if RDA is written well, the data model part can easily be extracted from the rules/guidance part at a later date. And if it’s not written well, then extracting it is a heck of a lot of work, including creative original work; work that is unlikely to catch on if it’s done by ‘code4libbers’ and presented to catalogers, and unlikely to be worth it anyway if RDA doesn’t end up being valuable because of other concerns.

    As far as “propose a model”, I think it’s just what’s been said: separate the model definition (with a ‘formal’, machine-oriented description) from the rules. This is theoretically the relationship AACR2 had to ISBD, so it should not be a shocking idea to catalogers. The difference is that ISBD was in fact not (at least not any longer, by the internet era) a sufficient/suitable model, and, well, that ISBD itself combined the model with presentation.

    The work Karen Coyle, Diane Hillman, et al. are doing on the DCMI-RDA vocabularies is the first step toward taking the model into a separate document, where it is formally defined. So the right things are happening already; while the issue is worth pointing out, to hopefully increase knowledge, approval, and support of their efforts, it is not the greatest of my concerns.

  6. jrochkind says:

    Thanks Diane. I think you and I are on the same page at least, and I’ve always thought that.

    By “free-floating assertions” I just meant the linked-data approach of individual atomic subject-predicate-object assertions as distinct from aggregated assertions in records. I didn’t mean to say assertions without provenance or what have you.

    All of us in this conversation agree that it’s important to facilitate education of “traditional” catalogers in thinking about metadata using a more ‘computer-scientific’ data modelling approach. I continue to think that implying (or outright saying) that the linked-data “atomic assertion” approach is the only right one is an oversimplification which hinders rather than helps this education. As is the implication that if data is “XML” it’s modelled properly, and if it’s not, it’s not. The important thing is the proper modelling, and the trick, we all know, is figuring out how to teach this to non-computer-scientists/programmers who are used to thinking of metadata in a different way.

  7. Melanie says:

    I think jargon may be getting in the way. (As a cataloger, I know an entirely different set of jargon than a computer person.) In reading your comments here, I think I get the point, but it’s still really amorphous and vague. Would it be possible for you to give some concrete examples of how these different elements can be separate and then work together? What information should RDA require?

  8. jrochkind says:

    Yeah, it’s hard to talk about this stuff because it’s so abstract, even within a single jargon community, heh.

    For the thing Jodi and I are talking about, de-coupling the two parts: I’m talking about one document that says “Here is the list of entities, attributes, and relationships” (i.e., “elements”, more or less), and another document that says “Here are the rules for what text you put in which element, and when you are required to fill out a given element.” Two separate ‘standards’: first, because the set of elements and what they mean (entities, relationships, attributes) can be used with several _different_ rules for how you fill them out; and additionally because someone wanting to use, but not produce, data created by others under RDA could theoretically just look at (or at least start with) the first ‘standard’ — the model, the list of entities/relationships/attributes — to understand the data, without having to dive into the rules for how to produce it.

    You’ve got the standard that describes your data model — what elements you have, what the entities, attributes, and relationships are. That’s the starting document for someone who wants to know how to understand or work with already-existing data. This is, theoretically, more or less the role that ISBD traditionally played: it defined a data model, a set of elements. In ISBD’s case the elements were grouped into “areas”, which are mostly presentation-oriented, and ISBD also prescribes all sorts of other presentation details for how you must display these elements to users (ISBD assumes you must decide this at the point you record the data, because ISBD was constructed in a world of card catalogs). ISBD also doesn’t do a very good job of describing the overall “model”; it’s just kind of a list of “elements” without talking about their “meaning” exactly, what they are meant to represent, how they map to the ‘real world’, perhaps because those writing ISBD saw it as self-evident. So ISBD isn’t successfully the sort of “data model” document we need now, but it’s useful to consider as a specific example of something that DID describe a set of elements in a separate document from the rules for “filling out” or “constructing” the data to “fit into” those elements. It’s also useful to consider how it _isn’t_ successful, to try to explain what would be.

    So on the one hand you’ve got the standard that describes your “data model” — your list of elements, your entities with attributes and relationships — which describes the “shape” of the data you use to represent the real world. And then you’ve got a different standard that describes how to “fill out” that shape: how to decide, for a specific real thing in front of you, what to put (if anything) in each “slot” defined by the “data model”.

    For example, the data model, in our case, says there is such a thing as a Work, and such a thing as a Manifestation, and that two Manifestations may or may not share the same Work. The rules, on the other hand, might give you guidance in deciding when two Manifestations share the same Work and when they don’t. The data model might say there is a Transcribed Title that we record for a Manifestation. The rules tell you how to decide what the Transcribed Title is in non-obvious cases, or maybe even tell you when not to bother recording a Transcribed Title.
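
    If a sketch helps, here’s that separation as toy Python (the entity and rule names are invented for illustration, not actual RDA): the “model” just declares what slots exist; the “rules” say how to fill them.

        from dataclasses import dataclass
        from typing import Optional

        # --- The "model" document: entities, attributes, relationships ---
        @dataclass
        class Work:
            preferred_title: str

        @dataclass
        class Manifestation:
            transcribed_title: Optional[str]  # the model only says this slot exists
            work: Optional[Work] = None       # relationship to a Work

        # --- The "rules" document: guidance for filling the model in ---
        def record_transcribed_title(title_on_piece: str) -> Optional[str]:
            # One possible rule: transcribe as found, skipping blank titles.
            return title_on_piece.strip() or None

        # A consumer of the data only ever needs the model; different
        # communities could share it while adopting different rules.
        m = Manifestation(record_transcribed_title("  Moby Dick  "))
        print(m.transcribed_title)  # => Moby Dick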

    So ISBD and AACR2 are the best example we can come up with from traditional/legacy practice. Does that make any sense?

    So that was my attempt to say that in several different ways. Did any of them ‘stick’? Again though, while I think this is a very important point to understand, it’s actually not my biggest concern about RDA; I think RDA is moving in this direction already.

  9. Pingback: Defining Metadata and Making Metadata Accessible | Disruptive Library Technology Jester
