serialization vs metadata schema/vocabulary

In a conversation that no doubt continues to be frustrating for all involved, because we all think we’ve had the exact same conversation a dozen times before, Bernhard Eversberg wrote:

No, to learn MARC does not consist of learning all the numbers and codes, that’s rather trivial. You have to learn the precise meanings and the concept, and that’s the same with verbal tags.

So, okay, here’s what this means to me:  You are saying that MARC serves  as our metadata _schema_ or _vocabulary_.  It is NOT just a  serialization format or an exchange format, it is in fact our schema, it  defines what elements are available and what they mean.

Now, to me, THAT is in fact the biggest problem with MARC.  We’ve taken what was originally designed as simply a transport format and turned it into a schema.  By having ONE standard that is BOTH our metadata schema and our serialization format, by entangling these two concepts, we make any kind of movement or inter-operability much more complicated.  It becomes nearly impossible to serialize our data in some _other_ serialization format in a ‘lossless’ way, because the serialization format and the schema are so entangled.

It makes our ‘content guidance’ like AACR2 _very_ difficult to understand in practice, because the only reasonable way to write content guidance like AACR2 is to refer to a metadata schema.  AACR2 refers to ISBD — which “officially” was designed as a metadata schema (although we/they didn’t use that term way back then, that’s what it was; library people were actually doing ‘metadata engineering’ FIRST).  But in _fact_, in most/all AACR2-using countries, it’s MARC21 that BECAME the true metadata schema. AACR2 keeping up the fiction that its schema is ISBD makes these various parts of our metadata control regime mesh like “broken gears”, making everything _much_ harder to understand for both library and non-library sector people (who might want to inter-operate with our data).

It makes it insanely complicated to make any changes to ANY of the parts  of the metadata regime, because the parts inter-relate in ill-defined  ways. If what you really need is a change in our ‘metadata schema’, does  that mean you need a change to ISBD, MARC, or AACR2?  Or all of the above?

RDA _theoretically_ uses FRBR (rather than ISBD) as the referenced ‘metadata schema’.  This, to my mind, is actually the _most important_ part of RDA.  The problem is that the RDA effort didn’t really realize how important and how challenging this was; they didn’t really realize what it entailed, and didn’t take it seriously — perhaps until fairly recently.  FRBR needs/needed some work to do the job, and it needs to affect the whole of how RDA is structured. Diane Hillman is waging an epic struggle to make RDA take seriously the idea that (a further formalization/specification of) the FRBR model is the metadata schema which RDA applies guidance to.  If she and RDA are successful, that will be the biggest contribution of RDA, and will make possible alternate serialization formats that are still “high fidelity”.
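For readers outside cataloging, a very rough sketch of what “the FRBR model as metadata schema” could look like as a data model. The entity names are FRBR’s Group 1 entities; the attributes are illustrative, not FRBR’s official attribute list.

```python
from dataclasses import dataclass

# FRBR Group 1 entities sketched as a data model. Content guidance
# (like RDA) would then say how to fill in these attributes; the
# serialization format would be a separate, independent choice.

@dataclass
class Work:                 # the abstract intellectual creation
    title: str

@dataclass
class Expression:           # a realization of a Work (e.g. a translation)
    work: Work
    language: str

@dataclass
class Manifestation:        # a physical embodiment (an edition)
    expression: Expression
    publisher: str
    year: int

@dataclass
class Item:                 # a single exemplar (a copy on a shelf)
    manifestation: Manifestation
    barcode: str

# Illustrative data, not drawn from any real record:
hamlet = Work(title="Hamlet")
german = Expression(work=hamlet, language="ger")
edition = Manifestation(expression=german, publisher="Reclam", year=1984)
copy1 = Item(manifestation=edition, barcode="39002001234567")
```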

Jim Weinheimer makes a followup post that, I think, is to him about why you can never move your data out of MARC “losslessly”, but to me is instead evidence of exactly the kind of problems you run into when you aren’t clear about your metadata schema/vocabulary as distinct from your serialization format. Jim says:

A few points.

Here is an example in the mapping from MARC21 to MODS for uniform titles from http://www.loc.gov/standards/mods/mods-mapping.html:
130, 240 $a$d$f$k$l$m$o$r$s
730 $a$d$f$k$l$m$o$r (if ind2 is not 2)
maps to: <title> with <titleInfo> type="uniform"

130, 240, 730 $n (and other subfields following as above)
maps to: <partNumber>

130, 240, 730 $p (and other subfields following as above)
maps to: <partName>

130, 240, 730 $0
add xlink="contents of $0" (as URI)

Now, compare this to the MARC Guidelines for the 240 field:
http://www.loc.gov/marc/bibliographic/bd240.html

and the LC Rule Interpretations (these are the additions to AACR2, not AACR2 itself) for uniform titles:
http://sites.google.com/site/opencatalogingrules/aacr2-chapter-25

Here is the Uniform title in UNIMARC:
http://archive.ifla.org/VI/3/p1996-1/uni5.htm

A non-cataloger will probably ask: What is a uniform title? And that would be the correct response because there are different types of uniform titles for different purposes and they are terribly complex. It should also be accepted that there are legitimate reasons for this complexity. The above rules work together intimately to ensure standards for both the coding and standards for the information. (Except for the UNIMARC, which has different standards)
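To make the quoted crosswalk concrete, here is a rough sketch in Python. The dict representation of a MARC field is invented for illustration, and the sketch glosses over the per-tag subfield differences and the $0/xlink handling.

```python
# A sketch of the quoted MARC -> MODS rule for uniform titles. The
# subfield list comes from the 130/240 row of the mapping above.

TITLE_SUBFIELDS = set("adfklmors")  # $a$d$f$k$l$m$o$r$s -> <title>

def map_uniform_title(field):
    """Map one 130/240/730 field to a MODS-like dict, per the crosswalk.

    `field` is a hypothetical representation: {"tag": ..., "ind2": ...,
    "subfields": [(code, value), ...] in order}.
    """
    if field["tag"] == "730" and field.get("ind2") == "2":
        return None  # the rule excludes 730 when ind2 is 2
    subs = field["subfields"]
    mods = {"titleInfo": {"type": "uniform"}}
    title_parts = [v for c, v in subs if c in TITLE_SUBFIELDS]
    if title_parts:
        mods["titleInfo"]["title"] = " ".join(title_parts)
    part_numbers = [v for c, v in subs if c == "n"]  # $n -> <partNumber>
    if part_numbers:
        mods["titleInfo"]["partNumber"] = part_numbers[0]
    part_names = [v for c, v in subs if c == "p"]    # $p -> <partName>
    if part_names:
        mods["titleInfo"]["partName"] = part_names[0]
    return mods

field = {"tag": "240", "ind2": " ",
         "subfields": [("a", "Hamlet."), ("l", "German")]}
result = map_uniform_title(field)
# -> {"titleInfo": {"type": "uniform", "title": "Hamlet. German"}}
```

Even this toy version shows the entanglement: the “schema” knowledge (what a uniform title is) lives partly in tag numbers, partly in indicator values, and partly in subfield conventions, so every crosswalk has to re-encode all of it.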

So here is where the rubber meets the road when it comes to talking about “computational thinking” and cataloging.  What we’re doing when we create standards for “bibliographic control” or “metadata engineering” is data modelling for a computer environment.  And there are 50 years of practice, experience, and theory on how to do data modelling for a computer environment.  And if you ignore all that… well, you’re trying to re-invent the wheel, and you’re probably not going to come up with a very good wheel.

Now, I think there ARE some things that aren’t entirely solved in data modelling practice; trying to do things on the web raises new issues that communities are trying to solve. RDF, and Entity-Attribute-Value modelling in general, is one approach to some of these issues, and it itself raises some questions that (in my opinion) are not entirely solved.  But these are things built upon 50 years of practice in data modelling for the computer environment.

At one point, library cataloging was ahead of everyone else in structured data modelling; we were kind of the only game in town. That point ended around 50 years ago.  And we’re still data modelling like computers don’t exist, let alone data modelling for the web in particular.  There are still challenges and unanswered questions; I don’t think every question is settled (some on code4lib might disagree).  But there ARE answered questions, and you can’t engage with this without understanding the lessons of 50 years of data modelling for the computer environment. Ignoring those lessons is what discussions on NGC4Lib and RDA-L often seem, to me, to be doing.

Jonathan


This entry was posted in General.

4 Responses to serialization vs metadata schema/vocabulary

  1. mj says:

    Jonathan, your input on this thread has been invaluable — and it *has* been frustrating for exactly the reasons you detail.

    One thing that keeps coming to my mind is how nebulous the specific functional requirements are among the cataloguing community. Getting data out of MARC *losslessly* is implied, but nowhere is it explicitly said that this is necessary or, if so, why. In fact, LoC is very clear that MARC-to-MODS is lossy.

    What would be the purpose of a lossless crosswalk? So that we can convert back to MARC at some point in the future? Hopefully not.

  2. Matthew says:

    Jonathan,

    Thank you for this post (and others). Your comments help clarify the issues around cataloging and metadata engineering.

    I wonder what you think would be the likeliest path or paths to actually creating an alternate serialization to replace MARC with?

    There is good work going on. RDA is problematic in many ways, but it is a step in the right direction. Diane Hillman is doing good work with RDA vocabularies, and Karen Coyle is doing good work thinking about a way to serialize RDA.

    But it strikes me that a non-MARC serialization format might be better based directly on FRBR than on RDA. And that serialization would have to interpret FRBR and extend it some.

    I know of no one who is doing that work. Just my ignorance, I suspect.

    Again, I thank you for writing clearly about an often confusing set of ideas and issues.

  3. jrochkind says:

    Matthew, I think the work Diane and Karen are doing with RDA _is_ meant to be the “interpret FRBR and extend it some”. They are doing that work not as a ‘serialization’ but as a ‘schema’ or ‘vocabulary’ (which is the approximate thing FRBR is meant to be too) — so it’s part of the “the RDA effort”, but that work is really “FRBR interpreted and extended and further formalized for the needs of RDA” as distinct from “RDA the set of content guidelines for filling in the blanks.”

    So in this case, I think “starting with RDA”, if it means Diane and Karen’s work, may be exactly what meets your suggestion after all.

    Thanks for the kind words.

  4. Karen Coyle says:

    Actually, I would love to have started with FRBR/FRAD + attributes, then applied RDA to that… but they weren’t available in that order, and I have no idea how closely RDA tried to follow FR attributes (I’m working on a comparison of RDA data elements and FR attributes, but FRAD is only now being added to the registry and isn’t complete). In my mind (and I think in Diane’s as well) RDA should be an Application Profile based on a declaration of the model (entities, relationships) and guidance rules. If, as I suspect, RDA and FR elements will not be the same, then it’s going to be hard to reconcile. It would be easier to reconcile if we had the FR –> RDA/AP, because we could add RDA extensions to FR in the RDA/AP. Trying to meld RDA and FR without recognizing that RDA must be based on FR, not applied to it after the fact, is going to create a mess.

    We are also hindered by the fact that we have no access to the RDA or FR process other than reading documents that are issued and commenting on them (usually too late to have any real effect on fundamentals). The process has built into it some of the same assumptions that have made our data problematic: that cataloging is a thought exercise that doesn’t need to adhere to data management principles. Or that data management can be applied after the fact.

    When I do get things like comparison tables done, I post them at http://kcoyle.net/rda/ — partly for my own convenience, but feel free to check in and see if you find anything useful for your own analyses.
