Our legacy MARC

Xiaoming Liu talks about what the OCLC xID service does to translate from the WorldCat MARC to some simple metadata fields:

As of now, we make a rather subjective “educated guess” when we implement xISBN system. For the record, I would like to list the mapping we are doing in xISBN right now:

Author->245#c
City->260#a
Ed->250#a
Form->complex logic of pulling marc header, 008, 245#h, and applying a Bayesian trainer
Lang->008
Lccn->010
Oclcnum->001
OriginalLang->041#h
Publisher->260#b
Title->245#a,b
Url->context related (e.g. Ebook, wikipedia, hathitrust may have different URL)
Year->008

(From the wc-devnet-l listserv, which doesn’t seem to have publicly accessible archives. royt, you want another mission?)
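To make that mapping concrete, here is a minimal sketch of the kind of flattening Liu describes, written against the pymarc library. This is my own illustration, not xISBN’s code: the filename records.mrc is made up, and the 008 offsets are the standard MARC 21 fixed-field positions for Date 1 and Language.

```python
# Illustrative only -- not xISBN's code. Flattens a MARC record into
# the simple fields listed above, using pymarc. 'Form' is omitted:
# that's the part that needs the leader, 008, 245$h, and a classifier.
from pymarc import MARCReader

def flatten(record):
    def sub(tag, codes):
        fields = record.get_fields(tag)
        if not fields:
            return None
        return ' '.join(fields[0].get_subfields(*codes)) or None

    fixed = record.get_fields('008')
    fixed = fixed[0].data if fixed else ''
    ctrl = record.get_fields('001')

    return {
        'author':       sub('245', ['c']),
        'city':         sub('260', ['a']),
        'ed':           sub('250', ['a']),
        'lang':         fixed[35:38] or None,   # 008/35-37 = Language
        'year':         fixed[7:11] or None,    # 008/07-10 = Date 1
        'lccn':         sub('010', ['a']),
        'oclcnum':      ctrl[0].data if ctrl else None,
        'originallang': sub('041', ['h']),
        'publisher':    sub('260', ['b']),
        'title':        sub('245', ['a', 'b']),
    }

# 'records.mrc' is a placeholder for a local file of binary MARC records
with open('records.mrc', 'rb') as fh:
    for rec in MARCReader(fh):
        if rec:
            print(flatten(rec))
```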

Okay, who can guess what I’m going to rant about now?

He needs a freakin’ Bayesian trainer to figure out a best guess at form (book, audio, video, journal, etc.) from MARC? A Bayesian trainer is a machine-learning method (what people might once have called ‘artificial intelligence’) for making a statistical guess. That’s what our MARC requires of us?
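For the uninitiated, here’s roughly what such a trainer looks like in practice: a toy naive Bayes classifier sketched with scikit-learn. It has nothing to do with OCLC’s actual implementation; the training rows and the token scheme are invented for illustration. The idea is just that you hand it evidence pulled from the leader, 008, and 245$h, and it returns the statistically most likely form.

```python
# Purely illustrative sketch (not OCLC's implementation) of a "Bayesian
# trainer" for guessing form: a naive Bayes classifier over tokens drawn
# from the leader, 008, and 245$h. The training examples are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# each "document" is just the raw evidence fields concatenated into tokens
training_evidence = [
    "leader:am 008type:s 245h:",                          # printed book
    "leader:am 008type:s 245h:[electronic resource]",     # e-book
    "leader:jm 008type:s 245h:[sound recording]",         # audio
    "leader:gm 008type:s 245h:[videorecording]",          # video
    "leader:as 008type:p 245h:",                          # serial / journal
]
training_labels = ["book", "ebook", "audio", "video", "journal"]

model = make_pipeline(CountVectorizer(token_pattern=r"\S+"), MultinomialNB())
model.fit(training_evidence, training_labels)

# given the same kind of evidence for a new record, take the most likely form
print(model.predict(["leader:gm 008type:s 245h:[videorecording]"]))
```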

And from the other examples there, Liu obviously wasn’t making things any more complicated than necessary for a good-enough approximation. This topic came up when some people took issue with him looking only at 245$c, instead of at the several dozen MARC fields, in complicated combinations, that an author might appear in (or where what you find might not be an author at all). If he’s using machine learning there, I’d guess it’s because he couldn’t avoid it.

What’s wrong with us? And it’s not just legacy data; we’re still creating cataloging records that require this. How can anyone not see a problem here?

Nice quote, scary quote

Bryan, in a comment to the previous post, pointed out some remarks from a nice tribute to Lubetzky.

This paragraph from Martha Yee, made five years ago, struck me with the terror of truth:

Call me Cassandra, but the fact that we can’t carry out the objectives of the catalog so eloquently described and urged upon us by Lubetzky does not bode well for our future as a profession. The rest of the world has become enamored of Google. Google cannot carry out the objectives of the catalog either. But if our choice is between online public access catalogs that are expensive but cannot carry out the objectives of the catalog, and Google that is cheap and cannot carry out the objectives of the catalog, I know what the choice is likely to be. And when we try to argue for the continuing existence of our profession on the basis of our expertise in the organization of information, what scholar in the humanities is going to stand up for us, after spending a career trying to navigate the chaos we have created in our catalogs for searchers of known prolific works?

It’s certainly not just ‘searchers of known prolific works.’


2 Responses to Our legacy MARC

  1. Ryan Shaw says:

    I’m totally with you on the ridiculousness of MARC, but just to play devil’s advocate: I think we should embrace the “machine learning + metadata” approach, rather than lamenting the use of ML as some kind of failure to achieve perfect metadata. There seems to be a kind of turf battle between people designing metadata architectures and people advocating statistical ML, with Semantic Web evangelists in one corner and Google in the other. But these are complementary technologies: machine learning can partially automate metadata production and metadata in turn provides higher-quality features upon which machine learning can do its thing. I think we’ll see more and more metadata pipelines with ML components in them. We should be working to make ML tech so straightforward that no one bats an eye when they have to stick in a Bayesian classifier to get stuff to work better.

  2. jrochkind says:

    There is definitely room for machine learning in our systems. I agree that we will (hopefully) see various machine learning approaches more often in our metadata pipelines.

    But when we’re spending lots of expensive staff time to record metadata, and we need machine learning to even figure out whether to call a given item a ‘book’ or a ‘movie’ — something is not right. Our staff time is not being well spent.

    As you say, metadata in turn provides higher-quality features upon which machine learning can do its thing. But garbage in, garbage out.

    We have so much data at present that human effort alone is not going to be sufficient to describe it fully for our needs. Which is why some machine learning is going to be good. But it’s also why there’s no excuse for spending precious human judgement unwisely.
