245 $a ISBD and Me $h [electronic resource] : $b wherein jrochkind actually finds a use for ISBD punctuation, and then gets frustrated again / $c by Jonathan Rochkind
So one of the most useful things the Umlaut link resolver does is look in the ILS for electronic and print availability for a work (article, journal, monograph, whatever) whose citation is sent to the link resolver.
In some cases the match can be made based on a numeric identifier (ISBN, ISSN, oclcnum, lccn) — when an identifier is present in the citation given to the link resolver, AND our catalog record has that same identifier in it. In some cases this isn’t possible, and now that we have more e-book records loaded in our catalog and want Umlaut to successfully match them, these cases were proliferating. So some kind of matching on title/author strings is needed.
This is tricky. The citation sent to the link resolver could be from a ‘library-standard’ (we’ll see later why I couldn’t just say AACR(2)) MARC source (including our own catalog), or could be from a non-MARC source (Google Scholar, Amazon via LibX, potentially many things via LibX). But either way the title ends up as just a single title string when sent to the link resolver, no sub-fields or what have you.
For complicated technical reasons of efficiency, when Umlaut is looking at ILS records to compare them, it ALSO just has a title string; it’s lost the resolution of MARC. (This may have to be changed later, although I’m not sure it would provide much of an improvement.)
If the two title strings match, easy enough. But so often they represent the same thing (even if not always the exact same edition/manifestation), but don’t match exactly. Especially when looking at 18th century works, which tend to have annoyingly long titles by our modern standards, and of which we have an e-book collection. One record has a different sub-title than the other, or doesn’t include the sub-title at all. One record includes the author’s name as a possessive in the title, the other does not. One winds up including that “General Material Designation” (eg “[electronic resource]”) in the title string, the other does not. One has a different spelling of a word than another.
So in figuring out how to ‘normalize’ strings to compare despite these differences, one thing that occurs to me is removing ‘extra’ parts that may be present in one but not another. Different sub-titles, or one has a sub-title and the other doesn’t? REMOVE any sub-titles, compare again! Hey, ISBD just came in handy: ISBD punctuation that ends up in the title string can help me tell what a sub-title is. One has a GMD and the other does not (or has a different one)? Square brackets!
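The tricks above can be sketched as a little normalizer. This is my own hypothetical function for illustration, not Umlaut’s actual code, and it assumes the happy path: the GMD is bracketed and the sub-title separator is “ : ”.

```python
import re

def normalize_title(title):
    """Heuristically normalize a flat title string for comparison.

    Strips a bracketed GMD like "[electronic resource]", drops any
    sub-title after an ISBD " : " separator, drops a trailing
    statement of responsibility after " / ", lowercases, and
    collapses punctuation and whitespace. All heuristic.
    """
    # Remove a GMD: ISBD puts it in square brackets.
    title = re.sub(r"\[.*?\]", " ", title)
    # Drop everything after an ISBD sub-title separator " : ".
    title = title.split(" : ", 1)[0]
    # Drop a statement of responsibility introduced by " / ".
    title = title.split(" / ", 1)[0]
    # Lowercase and collapse remaining punctuation/whitespace noise.
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(title.split())

# Both of these normalize to "isbd and me":
normalize_title("ISBD and Me")
normalize_title("ISBD and me [electronic resource] : a subtitle / by J. Rochkind")
```

Of course, as we’ll see below, real catalog data doesn’t always use “ : ” for the sub-title, which is where this starts to fall apart.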
Wow, I was amazed to find I was, for the first time ever, finding ISBD punctuation at least moderately useful.
What ISBD Is Not, and Frustrations with Actually Existing Data
Now, ISBD punctuation alone does NOT really give us machine-parseable data. It kind of looks like it ought to, I think it was intended to, some catalogers seem to believe it ought to, but it just isn’t so. It’s just not quite unambiguous enough, especially when you add in (frequent) cataloger error and ‘judgement’.
This is made no easier by the weird way in which ISBD punctuation is embedded in MARC. A punctuation mark really defines the element that comes after it, but winds up at the end of the MARC subfield before that element. Which means a ‘:’ introducing the sub-title is often at the end of an $a subfield. Unless there’s a GMD, in an $h subfield, in which case it’s found there. Unless some cataloger has put the GMD directly in the $a subfield (prior/different rules? Error? Who knows), in which case it’s there again.
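To make that concrete, here’s a toy 245 field as subfield pairs (the values are invented for illustration), plus a hypothetical heuristic check for the dangling ‘:’ at the end of whichever subfield happens to precede $b:

```python
# A 245 field modeled as (subfield_code, value) pairs. Note that the
# ':' which logically introduces the $b sub-title sits at the END of
# the preceding subfield: $a normally, or $h when a GMD intervenes.
field_245 = [
    ("a", "ISBD and me"),
    ("h", "[electronic resource] :"),
    ("b", "a subtitle /"),
    ("c", "by J. Rochkind"),
]

def has_subtitle(subfields):
    """Guess whether a 245 carries a sub-title by looking for a ':'
    dangling at the end of the subfield just before $b.
    Purely heuristic; real data defeats this regularly."""
    for i, (code, value) in enumerate(subfields):
        if code == "b" and i > 0:
            return subfields[i - 1][1].rstrip().endswith(":")
    return False

has_subtitle(field_245)                      # True: the ':' ends $h
has_subtitle([("a", "A lone title.")])       # False: no $b at all
```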
But it does, at least in some cases, provide some useful, if not entirely trustworthy, clues for software forced to make heuristic guesses. Better than if it weren’t there.
That is, until I realized that even the ISBD punctuation as it was wasn’t reliable in actually existing data. I have all sorts of weird data in my catalog. The punctuation separating the title from the sub-title is usually ‘:’, but look, sometimes it’s a period, or a comma, or a semi-colon instead. Sometimes it’s not there at all. Huh?
Some of these records may be AACR(1), some of them may be pre-AACR, some of them may be Just Plain Wrong, and as anarchivist pointed out to me in channel, especially dealing with these 18th century titles, some of them are cataloged according to rare book standards which are in fact entirely different from AACR(2) and ISBD in the first place! (And theoretically there’s a place in MARC to notate what standards were used; but in many of my records it’s blank; and even if it were there, trying to use it would add orders of magnitude to the complexity of my already fragile and complex heuristics.)
It’s pretty much entirely a spin of the roulette wheel what I’m going to get grabbing a record from the catalog. Who knows?
While catalogers might argue over how many ISBD punctuation marks fit on the point of a quill pen, it’s actually not of much concern to me whether these records are “wrong” or “right” (for the era/context in which they were created), when even “right” still leads to this enormous mish-mash of data, infeasible to use in machine processing with any reliable granularity.
However we got here, contrary to popular belief, our library data is not very consistent. Or reliable. IF the problem is people Not Following The Rules, then after many decades of this, the lesson is not just to try harder to follow the rules, but that the rules need to be changed to make compliance more likely. IF the problem is that the rules, even when followed, lead to this mess, then even more obviously the rules must be changed.
We spend an inordinate amount of staff time trying to follow complex rules meant to ensure that our data is detailed and reliable — and it turns out to be neither, not detailed in the right way to be used by machine processing, not particularly reliable. Everything is Not Okay.
I actually have a great deal of respect for our library geek forebears of 50 and more years ago, who came up with things like ISBD punctuation.
They were trying to build systems for a future they could barely imagine, trying at the very dawn of machine computation to create records that would live in a computational world of the future. As of 2008, they didn’t do a great job. But at least they were trying, and they did it intelligently and forward-thinking-ly.
Right now, are we trying to create record structures for the future, the present (which is a tiny slice of ‘now’ on top of a fast moving wave of change), or are we still creating record structures for the past?
I’ve said it before and I’ll say it again, where is our Lubetzky or Cutter or even our Henrietta Avram of 2008? I can think of a few examples trying, but they often seem to me to be struggling against the stream of the cataloging community. Perhaps Lubetzky and Cutter felt similarly.
One of Umlaut’s main tasks is trying to match records from disparate databases, to identify records in a foreign database that represent the same manifestation or an alternate manifestation of the same work (and ideally tell which) as an incoming OpenURL citation.
This is something we’re going to have to be doing more and more of in the future. Given two strings representing a title or author, make some measurement of how similar they are, of how likely they are to represent the same thing.
I’ve started to realize that weird rule-based heuristics aren’t going to cut it; we need a more algorithmic “data mining” approach. Machine learning is all the rage these days, but I don’t know much about it, and that’s not what I’ve been thinking of.
Instead, I’ve been thinking of various patterns and algorithmic classifications in computer science for analyzing strings. You start looking into this, and you start seeing things about ‘edit distance’ and such. It quickly becomes apparent that none of the simple algorithms you see as examples are up to this particular case.
But string computation in computer science has become much more sophisticated than that, although it’s also often targeted at domains different enough that the solutions aren’t exactly applicable here (like DNA comparison). And sometimes proprietary and unpublished (whatever Google is doing).
But when I found this paper on a set of algorithms for computing N-gram similarity and distance, it appeared useful to me. And after I read it like four times, I even mostly understood it. (It’s been a while since I had to read mathematical computer science papers in school.)
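For flavor, here’s a much simpler n-gram measure than anything in the paper: the Dice coefficient over character bigrams. This is emphatically not the paper’s algorithm, just a crude cousin to show why n-grams are attractive here: a missing sub-title dilutes the score gradually instead of blowing it up one character at a time.

```python
from collections import Counter

def bigrams(s):
    """All overlapping two-character substrings, lowercased."""
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice_similarity(a, b):
    """Dice coefficient over character bigrams:
    2 * overlap / (count_a + count_b), in [0, 1]."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    overlap = sum((ca & cb).values())
    total = sum(ca.values()) + sum(cb.values())
    return 2 * overlap / total if total else 1.0

dice_similarity("night", "nacht")   # 0.25: one shared bigram, "ht"
dice_similarity("title", "title")   # 1.0
```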
(And thanks so much to the authors for putting the pre-print online open access. I probably never would have found it otherwise).
Unfortunately, it would take more time than I currently have available to implement that algorithm, and then tweak its various possible parameters and permutations to see which, if any, work best in this “comparing bibliographic metadata” domain. I generally don’t have a lot of time for “R&D” type stuff like that, which isn’t guaranteed to work at all.
But I have a good feeling that if some smart person did find the time, and implemented an algorithm in open source, it would be useful all over the place.
Remind me again why our field isn’t served by an academic sector that does such useful research for us?
A note for catalogers on paradigm shifts.
One obvious point here is how much better it is if our records do have identifiers that can be used instead of this kind of weird heuristic string comparison.
ISBN, LCCN, and OCLCnum all actually serve pretty well in practice, even if they have theoretical issues. Some people like to bring up that LCCN and OCLCnum aren’t meant to identify a manifestation/edition, but just a particular record. But in actual practice, they work out pretty well to identify a manifestation — the manifestation described by the record identified! I have used them this way, and they work well — when the records have ’em.
Many of our records don’t have them, especially when they come from an external vendor.
Now, as a practical matter, if whoever compiled the records didn’t include an oclcnum or lccn, it’s not going to be there, and there isn’t much (easy) we can do about it. But when talking with catalogers, I’m often told that not only are they not there as a practical matter, but that it would be wrong if they were there.
The argument goes: this particular record did not come from OCLC, and is not linked to holdings in OCLC, so it can’t have an OCLC number in it; that would be wrong.
But how can something that is so useful — the OCLC number used effectively as a manifestation identifier — be so wrong?
Maybe a specific way needs to be found of marking up OCLCnums (and lccns) used not to indicate the origin of a record but instead only as a manifestation identifier. I dunno. But I’d like to encourage catalogers to start thinking about the utility of OCLCnum and lccn used as a manifestation identifier. The more people who realize it’s not wrong but incredibly useful, the more often they’ll find their way into data sets.