of ISBD punctuation, string matching, identifiers

245 $a ISBD and Me $h [electronic resource] ; $b wherein jrochkind actually finds a use for ISBD punctuation, and then gets frustrated again / $c  by Jonathan Rochkind

So one of the most useful things the Umlaut link resolver does is look in the ILS for electronic and print copies available for the work (article, journal, monograph, whatever) in a citation sent to the link resolver.

In some cases the match can be made based on a numeric identifier (ISBN, ISSN, oclcnum, lccn): when an identifier is present in the citation given to the link resolver, AND our catalog record has that same identifier in it. In some cases this isn’t possible, and now that we have more e-book records loaded in our catalog and want Umlaut to successfully match them, these cases were proliferating. So some kind of matching on title/author strings is needed.

This is tricky. The citation sent to the link resolver could be from a ‘library-standard’ (we’ll see later why I couldn’t just say AACR(2)) MARC source (including our own catalog), or could be from a non-MARC source (Google Scholar, Amazon via LibX, potentially many things via LibX). But either way the title ends up as just a single title string when sent to the link resolver, no sub-fields or what have you.

For complicated technical reasons of efficiency, when Umlaut is looking at ILS records to compare them, it ALSO just has a title string; it’s lost the resolution of MARC. (This may have to be changed later, although I’m not sure it would provide much of an improvement.)

If the two title strings match, easy enough. But so often they represent the same thing (sometimes even, but not always, the exact same edition/manifestation) but don’t match exactly. This is especially true of old 18th century works, which tend to have annoyingly long titles by our modern standards, and which we have an e-book collection of. One record has a different sub-title than the other, or doesn’t include the subtitle at all. One record includes the author’s name as a possessive in the title, the other does not. One winds up including that “General Material Designation” (e.g. “[electronic resource]”) in the title string, the other does not. One has a different spelling of a word than another.

So in figuring out how to ‘normalize’ strings to compare despite these differences, one thing that occurs is removing ‘extra’ parts that may be present in one but not another. Different sub-titles, or one has a sub-title and the other doesn’t? REMOVE any sub-titles, compare again!  Hey, ISBD just came in handy, ISBD punctuation that ends up in the title string can help me tell what a sub-title is.  One has a GMD and the other does not (or has a different one)! Square brackets!
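A rough sketch of that idea in Python, under some loud assumptions: the GMD is taken to be any bracketed phrase, and a sub-title or statement of responsibility is taken to start at the first ISBD-style mark that has a space on both sides. The names `title_proper` and `titles_match` are made up for illustration and are not Umlaut's actual code.

```python
import re

# Assumption: a GMD shows up as a bracketed phrase such as "[electronic resource]".
GMD = re.compile(r"\[[^\]]*\]")
# Assumption: ISBD-style separators are set off by spaces (" : ", " ; ", " / ", " = ").
ISBD_SEPARATOR = re.compile(r"\s+[:;/=]\s+")


def title_proper(title: str) -> str:
    """Drop any GMD, cut at the first ISBD-style separator, then normalize."""
    without_gmd = GMD.sub(" ", title)
    main_part = ISBD_SEPARATOR.split(without_gmd, maxsplit=1)[0]
    # Lowercase, turn punctuation into spaces, and collapse runs of whitespace.
    return re.sub(r"[\W_]+", " ", main_part.lower()).strip()


def titles_match(a: str, b: str) -> bool:
    """Compare two title strings after stripping the 'extra' parts."""
    return title_proper(a) == title_proper(b)


print(title_proper("ISBD and Me [electronic resource] ; wherein jrochkind "
                   "actually finds a use for ISBD punctuation / by Jonathan Rochkind"))
```

A title string that writes the separator without the leading space ("Title: subtitle") is not split at all by this sketch, which is exactly the kind of inconsistency that makes this a heuristic rather than a parser.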

Wow, I was amazed to find I was, for the first time ever, finding ISBD punctuation at least moderately useful.

What ISBD is not, and Frustrations With Actually Existing Data

Now, ISBD punctuation alone does NOT really give us machine parseable data. It kind of looks like it ought to, I think it was intended to, some catalogers seem to believe it ought to, but it just isn’t so. It’s just not quite unambiguous enough, especially when you add in (frequent) cataloger error and ‘judgement’.

This is made no easier by the weird way in which ISBD punctuation is embedded in MARC. A punctuation mark really defines the element that comes after it, but winds up at the end of the MARC subfield before that element. Which means a ‘:’ introducing the sub-title is often at the end of an $a subfield. Unless there’s a GMD, in an $h subfield, in which case it’s found there. Unless some cataloger has put the GMD directly in the $a subfield (prior/different rules? Error? Who knows), in which case it’s there again.
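To make that off-by-one placement concrete, here is a small Python sketch that moves each trailing mark onto the subfield it actually introduces. The function `reattach_punctuation` and the list-of-tuples representation are hypothetical, invented for this illustration; they are not part of Umlaut or of any MARC library, and the set of marks recognized is an assumption.

```python
import re
from typing import List, Tuple

# Assumption: the ISBD mark is one of these characters at the very end of a subfield.
TRAILING_MARK = re.compile(r"\s*([:;/=])\s*$")


def reattach_punctuation(subfields: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (code, introducing_mark, text), moving each trailing mark to the next subfield."""
    result = []
    carried = ""  # mark found at the end of the previous subfield
    for code, value in subfields:
        match = TRAILING_MARK.search(value)
        text = value[:match.start()] if match else value
        result.append((code, carried, text.strip()))
        carried = match.group(1) if match else ""
    return result


field_245 = [
    ("a", "ISBD and Me"),
    ("h", "[electronic resource] ;"),
    ("b", "wherein jrochkind actually finds a use for ISBD punctuation /"),
    ("c", "by Jonathan Rochkind"),
]
for code, mark, text in reattach_punctuation(field_245):
    print(code, repr(mark), text)
```

Here the mark that tells you what $b is lives at the end of $h, so anything reading $b on its own has already lost the clue.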

But it does, at least in some cases, provide some useful not entirely trustworthy clues for software forced to make heuristic guesses. Better than if it weren’t there.

That is until I realized that even the ISBD punctuation, such as it was, wasn’t reliable in actually existing data. I have all sorts of weird data in my catalog. The punctuation separating the title from the sub-title is usually ‘:’, but look, sometimes it’s a period, or a comma, or a semi-colon instead. Sometimes it’s not there at all. Huh?

Some of these records may be AACR(1), some of them may be pre-AACR, some of them may be Just Plain Wrong, and as anarchivist pointed out to me in channel, especially dealing with these 18th century titles, some of them are cataloged according to rare book standards which are in fact entirely different from AACR(2) and ISBD in the first place! (And theoretically there’s a place in MARC to notate what standards were used; but in many of my records it’s blank; and even if it were there, trying to use it would add orders of magnitude to the complexity of my already fragile and complex heuristics.)

It’s pretty much entirely a spin of the roulette wheel what I’m going to get grabbing a record from the catalog. Who knows?

While catalogers might argue over how many ISBD punctuation marks fit on the point of a quill pen, it’s actually not of much concern to me whether these records are “wrong”, or are “right” (for the era/context in which they were created) in a way that still leads to this enormous mish-mash of data, infeasible to use in machine processing with any reliable granularity.

However we got here, contrary to popular belief, our library data is not very consistent. Or reliable. IF the problem is people Not Following The Rules, then after many decades of this, the lesson is not just trying harder to follow the rules, but that the rules need to be changed to make compliance more likely. IF the problem is that the rules even when followed lead to this mess, then even more obviously the rules must be changed.

We spend an inordinate amount of staff time trying to follow complex rules meant to ensure that our data is detailed and reliable — and it turns out to be neither, not detailed in the right way to be used by machine processing, not particularly reliable. Everything is Not Okay.

Our Forebears

I actually have a great deal of respect for our library geek forebears of 50 and more years ago, who came up with things like ISBD punctuation.

They were trying to build systems for a future they could barely imagine, trying at the very dawn of machine computation to create records that would live in a computational world of the future.  As of 2008, they didn’t do a great job.  But at least they were trying, and they did it intelligently and forward-thinking-ly.

Right now, are we trying to create record structures for the future, the present (which is a tiny slice of ‘now’ on top of a fast moving wave of change), or are we still creating record structures for the past?

I’ve said it before and I’ll say it again, where is our Lubetzky or Cutter or even our Henriette Avram of 2008? I can think of a few examples trying, but they often seem to me to be struggling against the stream of the cataloging community. Perhaps Lubetzky and Cutter felt similarly.

Algorithmic Research

One of Umlaut’s main tasks is trying to match records from disparate databases, to identify records in a foreign database that represent the same manifestation or an alternate manifestation of the same work (and ideally tell which) as an incoming OpenURL citation.

This is something we’re going to have to be doing more and more of in the future: given two strings representing a title, or author, make some measurement of how similar they are, how likely they are to represent the same thing.

I’ve started to realize that weird rule-based heuristics aren’t going to cut it, we need a more algorithmic “data mining” approach.  Machine learning is all the rage these days, but I don’t know much about it, and that’s not what I’ve been thinking of.

Instead, I’ve been thinking of various patterns and algorithmic classifications in computer science for analyzing strings.  You start looking into this, and you start seeing things about ‘edit distance’ and such.   It will become apparent that none of the simple algorithms you see as examples are up to this particular case.
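For anyone who hasn't run into it, the textbook version of ‘edit distance’ (Levenshtein distance) looks something like the Python below. This is the standard dynamic-programming formulation, included only to show why the simple examples fall short here; the sample strings are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("color", "colour"))                 # a spelling variant
print(edit_distance("tom jones", "tom jones a novel"))  # a missing sub-title
```

The spelling variant costs 1 edit and the missing sub-title costs 8, even though for matching purposes both pairs are "the same thing". A plain character count of edits has no idea which differences matter in bibliographic data.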

But string computation computer science has become much more sophisticated than that, although it is also often targeted at domains different enough that the solutions aren’t exactly applicable here (like DNA comparison). And sometimes it’s proprietary and unpublished (whatever Google is doing).

But when I found this paper on a set of algorithms for computing N-gram Similarity and Difference, it appeared useful to me. And after I read it like four times, I even mostly understood it. (It’s been a while since I had to read mathematical computer science papers in school).

(And thanks so much to the authors for putting the pre-print online open access. I probably never would have found it otherwise).
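For readers who haven't met n-grams before, the Python below shows the most basic way of comparing two strings by their n-grams: count the character bigrams each string contains and take the Dice coefficient of the two collections. To be clear, this is NOT the set of algorithms from the paper, which are more refined; it is only a sketch of the underlying idea, and the function names are invented for illustration.

```python
from collections import Counter


def ngrams(text: str, n: int = 2) -> Counter:
    """Count the overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram multisets: 2 * shared / total, from 0.0 to 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both strings are shorter than n, so fall back to exact comparison.
        return 1.0 if a == b else 0.0
    shared = sum((grams_a & grams_b).values())  # & keeps the minimum count of each n-gram
    return 2 * shared / total


print(dice_similarity("night", "nacht"))
```

Unlike a yes/no string comparison, this gives a score that a threshold can be tuned against, which is where the parameter tweaking for a particular domain comes in.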

Unfortunately, it would take more time than I currently have available to implement the algorithms from that paper, and then tweak their various possible parameters and permutations to see which, if any, work best in this “comparing bibliographic metadata” domain. I generally don’t have a lot of time for “R&D” type stuff like that which isn’t guaranteed to work at all.

But I have a good feeling that if some smart person did find the time, and implemented an algorithm in open source, it would be useful all over the place.

Remind me again why our field isn’t served by an academic sector that does such useful research for us?


A note for catalogers on paradigm shifts.

One obvious point here is how much better it is if our records do have identifiers that can be used instead of this kind of weird heuristic string comparison.

ISBN, LCCN, and OCLCnum all actually serve pretty well, in practice even if they have theoretical issues.  Some people like to bring up that LCCN and OCLCnum aren’t meant to identify a manifestation/edition, but just a particular record.  But in actual practice, they work out pretty well to identify a manifestation — the manifestation described by the record identified!  I have used them this way, and they work well — when the records have em.

Many of our records don’t have them, especially when they come from an external vendor.

Now, as a practical matter, if whoever compiled the records didn’t include an oclcnum or lccn, it’s not going to be there, and there isn’t much (easy) we can do about it. But when talking with catalogers, I often am told that not only are they not there as a practical matter, but it would be wrong if they were there.

This particular record did not come from OCLC, and is not linked to holdings in OCLC, so it can’t have an OCLC number in it, that would be wrong.

But how can something that is so useful (the OCLC number used effectively as a manifestation identifier) be so wrong?

Maybe a specific way needs to be found of marking up OCLCnums (and lccns) used not to indicate the origin of a record but instead only as a manifestation identifier. I dunno.   But I’d like to encourage catalogers to start thinking about the utility of OCLCnum and lccn used as a manifestation identifier. The more people who realize it’s not wrong but incredibly useful, the more often they’ll find their way into data sets.


4 thoughts on “of ISBD punctuation, string matching, identifiers”

  1. One thing I did on a project where we were able to be pretty fuzzy (that is, a title/author match was desired but there was a high tolerance for both false positives and missed items) was squish the contents of the 245 subfields together, lowercase it, strip out all the characters that weren’t a..z, and then take something like the first 15 characters. (Did some playing around to find a length that seemed to work pretty well.) Did something similar for author names. It actually worked pretty decently, if not great. Not sure how effective it would be on a large dataset that included things like the 18th century works.

  2. Yeah, that’s more or less what I’m doing now, except not limiting to the first 15 chars. You found the first 15 chars was pretty good at avoiding false positives?

    Oh, I guess you say you had a high tolerance for false positives. I’ve got a pretty low tolerance for false positives. Sadly, my colleagues have a low tolerance for false negatives too. And it’s a balance, hard to minimize both. But there are ways to try.

    Of course, the funny thing is when this STARTED people had a pretty high tolerance for false negatives. Because it was replacing… 100% false negatives, essentially. But then they get used to it, and they find it intolerable that it’s missing anything. :)

  3. Jonathan,

    I’m summarizing this for my list of library aphorisms as: “We spend an inordinate amount of staff time trying to follow complex rules meant to ensure that our data is detailed and reliable — and it turns out to be neither, not detailed in the right way to be used by machine processing, not particularly reliable. Everything is Not Okay.”

    Just keep saying that. :)

    Thanks for pointing out the N-gram paper. I’d like to talk some more about what sort of experimentation you think would be helpful here.

  4. Basically, the math of n-gram similarity and difference in that paper, while NOT that complicated, is a bit complicated for us quick and dirty programmers not used to dealing with complex algorithms.

    So the main thing is just implementing the algorithms there in a re-usable open source library. And seeing how it works. There are some possible parameters to the algorithms there, and it means tweaking them to see if any work markedly better than others in our domain. There are also some slightly different permutations I thought of while reading the paper, which I’ll have to read the paper another time (or another couple of times, probably, to understand it again :) ) to remember.
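The fuzzy match key described in the first comment (squish the 245 subfields together, lowercase, keep only a..z, truncate) might look something like this in Python. The name `match_key` is hypothetical, and the default of 15 characters is just the figure mentioned in that comment, not a tuned value.

```python
import re
from typing import Iterable


def match_key(subfields: Iterable[str], length: int = 15) -> str:
    """Build a crude comparison key from the text of a field's subfields."""
    squished = "".join(subfields).lower()
    # Keep only the letters a..z, then take the leading characters.
    return re.sub(r"[^a-z]", "", squished)[:length]


print(match_key(["ISBD and Me", "[electronic resource] ;", "wherein jrochkind"]))
```

Note that anything in the subfields, a GMD included, ends up in the key unless it is removed first.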
