And more on software data formats

A post on the excellent Catalogablog makes me realize that the thread on NGC4Lib is part of a veritable epidemic of what is, in my opinion, irresponsible, trend-hopping hand-waving.

This business of “An entity-relational model is the wrong thing, now we’re in an RDF world!”.

I simply don’t think this makes any sense at all, and most of the people I hear advocating it are not in fact software engineers or computer science trained. Not that you need to be such to be educated, but we ARE talking about software systems here.

I disagree that an entity-relationship model is any less relevant or important for RDF-type data. RDF-type data still requires you to create what RDF calls “vocabularies”, which are specifications of WHAT you are talking about (entities), and WHAT you can say about them (attributes and relationships).
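
To make that concrete, here is a toy sketch in Python. It does not use any real RDF library, and every name in it (`VOCABULARY`, `TYPES`, `check_triple`, the `Book` and `Person` classes) is made up for illustration. The point is only that a vocabulary has to say which kinds of things exist and which properties can be stated about them, which is the same job an entity-relationship model does.

```python
# A toy "vocabulary": entity classes plus the properties you may state about them.
# All names are hypothetical; this is not a real RDF vocabulary or library.
VOCABULARY = {
    "classes": {"Book", "Person"},
    "properties": {
        # property: (class it applies to, kind of value it takes)
        "title": ("Book", "literal"),    # attribute
        "name": ("Person", "literal"),   # attribute
        "author": ("Book", "Person"),    # relationship between two entities
    },
}

# Which class each identified thing belongs to.
TYPES = {"book1": "Book", "person1": "Person"}


def check_triple(subject, predicate, obj):
    """Return True if (subject, predicate, obj) is allowed by the vocabulary."""
    if predicate not in VOCABULARY["properties"]:
        return False
    domain, value_kind = VOCABULARY["properties"][predicate]
    if TYPES.get(subject) != domain:
        return False
    if value_kind == "literal":
        # In this sketch a literal is just a plain string.
        return isinstance(obj, str)
    # Otherwise the value must be an entity of the expected class.
    return TYPES.get(obj) == value_kind


triples = [
    ("book1", "title", "Moby-Dick"),
    ("person1", "name", "Herman Melville"),
    ("book1", "author", "person1"),
    ("person1", "title", "Moby-Dick"),  # rejected: "title" applies to Book
]
results = [check_triple(*t) for t in triples]
print(results)
```

The data is stored as triples throughout, but the software can only tell the sensible statements from the nonsensical one because something spells out the entities and what can be said about them.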

I find all this “RDF changes everything” stuff to be over-blown, and have found few software engineers or computer science trained people in the “an E-R model isn’t the right thing for RDF” camp.

RDF or not, it doesn’t matter, we still need a formalized model of what we are talking about, and what we say about it.  You need this in order to make your data inter-operable with other systems, and to write reasonable software that can understand it.  This is just the way software engineering is done, and I think metadata engineering is a part of software engineering.

And the best, most time-tested way of formally describing what you are talking about and what you say about it continues to be an Entity-Relationship model. I have not seen a convincing argument that RDF changes that; I’ve just seen trendy buzzword waving.

Most of the cataloging world is still struggling to grasp basic principles of software engineering and computational thinking (which I think are _crucial_ for ‘cataloging’ or metadata engineering, for creating metadata for software use). Given that, I think this kind of trend-hopping buzzword-waving is very dangerous: it threatens to destroy the little consensus we have about actually moving forward into the computer world. (I could say “into the 20th century”, heh.) We already have enough people taking a “reactionary” position against change; we don’t need to add people who think they are taking a “visionary” position against the actual clear software-engineering-based ways forward, advocating instead a vague non-solution of “I don’t know what it is, but it’s got something to do with RDF!”

And I also agree with Shawne Miksa’s comment on the catalogablog post, although I’m not sure if Miksa means to be agreeing with me.

Miksa says: “I don’t feel we’ve taught catalogers to understand the catalog system in terms of a database–not truly, in any case.”

This is so true, and it is so much of a problem. In the contemporary environment, we create metadata, we do cataloging, for a computer environment. But we are still creating metadata as if it were destined for printed cards instead. Creating metadata for use in software systems is technology; catalogers need to be metadata engineers, need to be technologists. In fact, they already are, whether they like it or not, whether they’re good at it or not. They don’t need to be programmers, but they do need to be “computational thinkers”.

And having those kinds of computational thinking skills would give catalogers the ability to critically evaluate statements like “an entity-relationship model is so 20th century,” instead of simply thinking it sounds trendy and RDF is a nice buzzword, so it must be right. As Miksa says, it is VERY important that catalogers “understand the catalog system in terms of a database”. Or really, in case you think “database” is needlessly narrow, and think RDF is our savior and is something different than “a database”, we could say catalogers absolutely must “understand the catalog system in terms of a software system.”


8 thoughts on “And more on software data formats”

  1. Thanks for the thoughtful reply to my concern. It does put my mind at rest a bit.

    This is the kind of discussion, myth-busting, we need more of. Thanks.

  2. In one of my posts I state about RDF: “The “triple” model is of course exactly the same as the long standing methodology of Entity Relationship Diagrams (ERD)” (
    A triple is nothing more than a relationship between two entities (and with some creativity an entity’s attributes can be seen as entities in their own right: “book – has – colourBlue”).

    So you’re absolutely right! RDF=ERD!

  3. Now, if people want to say that FRBR isn’t the _right_ entity-relational model for our data… then there could be an argument there, but it has to be made, and not just by hand-waving about RDF. And ideally with some as-concrete-as-possible suggestions about what something better would look like.

    Personally, I think FRBR, based on years of development which was based on explicit formalization of a _century_ of practice, is as good a starting point as any. We won’t really know how it needs to be changed until we actually start trying to use it in a real way. We can tweak it then. But the hand-waving stuff only threatens to put off the already ponderous process of actually starting to test it in practice.

    (I don’t believe she means it this way, but when I read Karen Coyle’s comments on this issue, sometimes the only actionable conclusion I can draw is that she’s arguing we only need ONE bibliographic entity in our model, which would still be an E-R model, just a simple one! But she’s told me that’s not what she means after all. I think a “FRBR isn’t the right E-R model” argument needs some specifics about what the right E-R model would look like to be useful to us. And if it’s really a “we don’t need an E-R model at all!” argument, then see the above post: I just don’t think that’s consistent with any actual software/data/metadata engineering practice or theory, even in an RDF world.)

  4. I think perhaps we are agreeing, but approaching from different pathways. Metadata engineers is a good description, but I prefer to think of a cataloger as information translator or interpreter. (For some reason women weaving on looms always pops into my head and so sometimes I call them information weavers.)

    Database is a database is a database….”a collection of data arranged for ease and speed of retrieval, as by a computer” according to my beat-up American Heritage dictionary. Doesn’t have to be a computer. Now I must wrap my head around a “triple” model.

  5. I agree that we need a formal, logical model and I do think FRBR is a very good jumping off point. However, I ran into trouble with FRBR E-R modeling when I was trying to put together some sample moving image data for a model we’re building. Granted our model is not quite orthodox FRBR, but I think the fundamental problem remains.

    I’m not sure I can articulate the problem I think I’m having very well yet, but it has to do with parts of things (sometimes arbitrary parts; sometimes parts that are inherent in the entity and perhaps are sortable), differing degrees of description, recursive relationships, and gappy data (where you don’t have or don’t want to record all the logical levels of data). For the demo, I just excised all the messy cases or reduced them to a simpler model, but in the real world that’s not an option. Of course, this all may just be some failure of imagination or lack of understanding of E-R modeling on my part. But I do worry that some real world examples are much messier than the clear-cut ones that are usually used to talk about FRBR.

    I am intrigued by RDF because it seems like it might deal better with the gaps. However, I don’t know enough about RDF to say if it would solve this problem or just obscure it. If RDF is just another view of E-R then it seems it would make no difference. In the end, my problem is how to get data that will display in a coherent fashion while also having a model that deals with all these gaps and allows differing levels and degrees of description.

  6. I am not surprised you ran into trouble with FRBR; it needs testing like you’re doing to find the problems and gaps in it!

    But “RDF” is not an alternative to FRBR. FRBR can be expressed in RDF; that’s what Karen and Diane are in part working on, I think. FRBR lives at a higher level than RDF or the “Entity-Attribute-Value data model” that RDF is based on.

    RDF, or an EAV model, isn’t necessarily just another form of Entity-Relationship model. EAV modelling is, I’ve been told, in fact a superset of E-R modelling. So it’s possible EAV modelling allows more.

    But still, just saying “EAV” or “RDF” doesn’t solve the problems. To solve the problems in FRBR, you have to do exactly what you are doing: identify them through implementation, then take what you learned and go back to suggest modifications to the FRBR model to accommodate what you need. If it’s hard to figure out how to make those modifications in an E-R model, then it might suggest making FRBR an EAV model instead, or at least that’s what the EAV proponents would argue. I’m agnostic on it. But deciding “RDF must solve things” is putting the cart before the horse: FIRST, you’ve got to identify what’s wrong with FRBR, THEN you’ve got to figure out how to fix it, and in that process you might discover that fixing it is easier in an EAV model than an E-R model.

    At least I think so.

  7. Interesting points of view in this blog (although I am not really in the field, more like an interested amateur).

    The reason for my reply:
    “…catalogers need to be metadata engineers, need to be technologists….They don’t need to be programmers, but they do need to be “computational thinkers”.”

    That statement could be widened (at a more basic level) to almost everybody involved in using, storing or creating any form of sensitive/important/sensible (in whatever definition) information using technology.
    Even something like saving files on a corporate shared fileserver would benefit from a basic technical understanding of hierarchical filesystems and how a computer works with them. Most employees don’t have a clue, and therefore just obey the fixed constraints set by administrators on where and how to save files, which can easily end up in a badly performing (in terms of searching/retrieving) mess, possibly periodically archived to multiple buckets full of badly structured and hard-to-distinguish bulk data.
    …and the stereotypical dialog: “I am searching for an email, why can’t I find it?” “Have you tried searching for the subject of that email?” “Yes, but I think I didn’t type it into the subject field when I was writing it” also boils down to the same problem, at a more basic level.
    Just some unacademic parallels imho, of which there are many, and which to some degree counteract technological benefits in a lot of areas.

    Best regards
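
A footnote on the “gappy data” question raised in comments 5 and 6: the difference can be sketched in a few lines of Python. This is only an illustration under assumptions, not a FRBR or RDF implementation; the `Manifestation` record, the property names, and the `known_properties` helper are all invented for the example. A fixed record reserves a slot for every attribute whether you know the value or not, while a list of triples simply leaves out the statements you don’t have.

```python
from dataclasses import dataclass
from typing import Optional


# Fixed-shape record: every attribute has a slot, known or not.
# (Hypothetical fields, not taken from FRBR.)
@dataclass
class Manifestation:
    title: str
    year: Optional[int] = None
    duration_minutes: Optional[int] = None


record = Manifestation(title="Untitled newsreel")

# Slots that exist in the record but hold no value.
missing_in_record = [name for name, value in vars(record).items() if value is None]

# Triple-style data: statements you don't have are simply absent.
triples = [("item1", "title", "Untitled newsreel")]


def known_properties(subject, triples):
    """Collect the property/value pairs stated about one subject."""
    return {p: o for s, p, o in triples if s == subject}


print(missing_in_record)
print(known_properties("item1", triples))
```

Either way, something still has to define what `title` or `year` means and which things they can describe, so the gaps get easier to store but the modelling work does not go away.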
