Getting publication date out of Marc

The SolrMarc example/default configuration tries to get a publication date out of 260$c.

This is a tricky thing to do, because you’re trying to parse data that isn’t entirely coded. And on top of that, I just discovered that dates in other calendar systems can legally appear in 260$c, if that’s how they appear on the title page. A title page has Hebrew calendar 5750 on it? That’ll be in the 260$c. Oops.
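
To make the problem concrete, here is a minimal Python sketch of the naive approach: grab the first run of four digits in 260$c. This is not SolrMarc’s actual code; the function name and the sample strings are invented for illustration.

```python
import re
from typing import Optional

def naive_year_from_260c(subfield_c: str) -> Optional[int]:
    """Return the first run of four digits in a 260$c string, or None."""
    match = re.search(r"[0-9]{4}", subfield_c)
    return int(match.group()) if match else None

# Made-up sample values, not taken from real records.
print(naive_year_from_260c("c1990."))               # 1990
print(naive_year_from_260c("5750 [1989 or 1990]"))  # 5750, a Hebrew calendar year
print(naive_year_from_260c("[n.d.]"))               # None
```

The second call is the trap: the code happily returns 5750 as if it were a Gregorian year.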

So it’s probably better to try to get dates out of the 008 fixed field. One problem here is that it’s a lot more confusing: you’ve got to get ASCII decimal digits out of fixed byte positions (machine readable what?), and you really need to talk to a cataloger to get to the bottom of “date1” and “date2”, as well as the “date types” and what they mean.
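
The slicing itself is mechanical once you know the offsets. In MARC 21 bibliographic records, 008 position 06 holds the date type, positions 07-10 hold date1, and positions 11-14 hold date2 (counting from zero). A minimal Python sketch; the class and function names are hypothetical and the sample 008 value is made up:

```python
from typing import NamedTuple

class Dates008(NamedTuple):
    date_type: str  # 008/06
    date1: str      # 008/07-10
    date2: str      # 008/11-14

def parse_008_dates(field_008: str) -> Dates008:
    """Slice the date type, date1 and date2 out of a raw 008 string."""
    if len(field_008) < 15:
        raise ValueError("008 is too short to contain date1 and date2")
    return Dates008(field_008[6], field_008[7:11], field_008[11:15])

# Made-up 008 value: entered 850101, date type "s", date1 1984, date2 blank.
sample = "850101s1984    nyu           000 0 eng d"
print(parse_008_dates(sample))
```

This only gets the raw characters out. Interpreting them still depends on the date type.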

Beware of date type “q”, for “questionable date”, meaning that the publication date is somewhere in the range of date1 to date2. (These would seem, by the examples in the OCLC documentation, to be inclusive boundaries, although the documentation doesn’t actually say so explicitly.)

On top of that, dates in date1 and date2 can show up with “u”s in them for unknown digits. “19uu” means sometime in the 20th century (1900 through 1999).
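
One way to make such a value usable is to turn it into a range of years, filling the u’s with 0 for the lower bound and 9 for the upper bound. A minimal Python sketch; the function name is hypothetical, and it assumes u’s only appear as trailing digits and treats an all-u date as carrying no information:

```python
import re
from typing import Optional, Tuple

def year_bounds(date: str) -> Optional[Tuple[int, int]]:
    """Map a 4-character 008 date such as "19uu" to (earliest, latest) year."""
    # Assumption: at least one leading digit, then only trailing u's.
    if len(date) != 4 or not re.fullmatch(r"[0-9]{1,4}u{0,3}", date):
        return None
    return int(date.replace("u", "0")), int(date.replace("u", "9"))

print(year_bounds("19uu"))  # (1900, 1999)
print(year_bounds("198u"))  # (1980, 1989)
print(year_bounds("1984"))  # (1984, 1984)
print(year_bounds("uuuu"))  # None
```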

And as a final entry in the “this is really meant to be machine readable?” column: let’s say you know something was published in the 19th or 20th century. You might think you’d use the “q” date type and put date1=1800 and date2=1999; that would certainly express what you know. But no, the OCLC examples say to enter this as “q” date type with date1=18uu and date2=19uu. Huh?
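
Software can at least treat the two encodings as equivalent by taking the lowest year date1 could mean and the highest year date2 could mean. A minimal Python sketch, assuming the boundaries are inclusive (which, as noted, the documentation does not explicitly state); the function names are hypothetical:

```python
from typing import Optional, Tuple

def _fill(date: str, digit: str) -> Optional[int]:
    """Replace unknown 'u' digits with the given digit; None if not parseable."""
    filled = date.replace("u", digit)
    ok = len(filled) == 4 and filled.isascii() and filled.isdigit()
    return int(filled) if ok and date != "uuuu" else None

def questionable_range(date1: str, date2: str) -> Optional[Tuple[int, int]]:
    """Year span for date type "q": earliest date1 through latest date2."""
    low, high = _fill(date1, "0"), _fill(date2, "9")
    if low is None or high is None or low > high:
        return None
    return low, high

print(questionable_range("18uu", "19uu"))  # (1800, 1999)
print(questionable_range("1800", "1999"))  # (1800, 1999)
```

Both calls give the same span, so downstream code never has to care which style the cataloger used.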

The other problem with getting dates out of the 008 fixed bytes is that, since so many of our traditional ILSs completely ignore them, it’s not clear to me how correct they’ll be; a mistake didn’t matter much before. But in a testament to years of catalogers entering correct data even though their systems did nothing with it, the data seems on first analysis to be pretty good. I think it’s going to be better than trying to get a date from 260$c, especially with the “Hebrew date” issue.


7 thoughts on “Getting publication date out of Marc”

  1. You know, there’s a “MARC Foibles” book in all this somewhere. You should shop it around to publishers, if you think you could stand to write it without fatally headdesking yourself.

  2. Good to see someone actually trying to work with MARC.

    I’ve suspected that MARC is really not the tool we need to be using, but there are those defenders of MARC (and, to be fair, the status quo in general) who say it can do everything we need.

    This kind of example shows just how unworkable MARC really is for what we need in order to move forward with our metadata.

  3. Well, but to be fair — how do you deal with dates like this? You have something you *think* was created in the 1860s. What better way exists (currently) than how MARC does it?

    This isn’t meant as a defense of MARC as much as it’s dealing in a domain that most other standards aren’t. There’s the extended datetime format from LoC, but it’s not standardized yet (and it’s not used anywhere that I know of).

    I agree that the 008’s publish dates are maddening (like most of MARC), but I think that’s because reality is pretty messy and dealing with these messes in a machine readable way is sort of an n+1 problem and incredibly hard to model.

  4. Better ways would include (and Ross, I wrote half of this before I realized it was YOU writing this comment, I’m sure you’ll agree with the following):

    Don’t use ‘u’ in dates at all. Always provide ranges like the “q” date type. Don’t allow something weird like q 18uu 19uu (which is exactly an OCLC example!), that should be q 1800 1999.

    In non-q types currently, you have to use ‘u’ because you only get TWO dates, date1 and date2. There’s no reason to limit to two dates. If you have a serial that began some time between 1900 and 1950, and ended sometime between 1980 and 1990, there ought to be a way to encode that, not “19uu 19uu” or worse “19uu uuuu”. If you have a book with copyright 2010, but all you can say is it was published sometime around 2005-2015, you ought to be able to encode that, not “20uu 2010”.

    The arbitrary restriction to “you get only two dates, date 1 and date2, and you can express uncertainty only by powers-of-10-magnitudes by putting u’s in there” is just silly.

    There are other examples of really weird data because of the limitation to ‘date1’ and ‘date2’: data that is encoded at LESS specificity than the cataloger actually had to figure out or think about in order to encode it. Or data that is way harder for software to figure out, or for the software _engineer_ to figure out how to write the software to figure out, than it has to be.

    Certainly date data will always be complicated when we’re dealing with a world of uncertainty and complexity. The way it is recorded in MARC is way crappier than it needs to be though, without losing specificity, and in some cases even gaining it.

    Oh, and there should be a different way of encoding “we don’t know when or if this serial has ceased publication” compared with “this serial is still publishing”. It’s possible there IS such a way; the fact that I don’t know is a testament to the (unnecessary) over-complexity of the manner in which this stuff is encoded. We also have a LOT of data that doesn’t even follow ‘the rules’. At first I was going to say that is not the standard’s fault, but the standard is so confusing (as a result of shoehorning everything into ‘date type’, ‘date1’, and ‘date2’) that even catalogers frequently get confused, and that is the standard’s fault, as well as the fault of the incredibly poor tools catalogers have been given for data entry.

    (But I’ll add that Ross is right that it IS a somewhat hard problem, and MARC did a pretty good job for something designed 45 years ago, in a world where there were hardly any computers that WERE going to do anything with it. It’s 45 years later now, and we’ve learned how to do things a lot better since then. In 2010, it is not hard to improve on MARC in some obvious ways.)

  5. I definitely agree with your point (again — this is not a MARC apology!) and wasn’t actually trying to defend MARC as ‘doing it right’. I guess what I’m asking for is some exemplar to crib from. I fear we have too much legacy data to fumble around for the answer.

    I also fear that our legacy data and the fact that (I hope!) it will outnumber any new MARC records would render whatever new way of dealing with dates even more frustrating, since we’d have to deal with the AACR2 way /and/ the new way.

    But, yeah, I agree — improvements would be seriously welcome.

  6. There might not be any exemplar to crib from because, you’re right, our domain is particular in some ways. (I’ve always thought so, but am surprised that you agree for once.) We might indeed need to invent it ourselves. Are we capable of doing it well? I don’t know. History is no reason for optimism.
