A bit ago, I asked for advice on dealing with loading of the new provider-neutral-type e-book bibs. There was some interesting discussion that helped me figure out how to think about this, but I guess most people have very different uses for e-book bibs than we do, because there weren’t many specifics on how to deal with it (and although some people insisted it shouldn’t be a problem, none of them seemed to actually have a recipe for doing it). Perhaps most people don’t load nearly as many e-book records as we do? Or perhaps they’re all paying vendors to do this step for them, which we are trying to do ourselves? I really have no idea.
So, a current update on what we are doing and plan to do. It’s not entirely satisfactory, but I’m pretty sure nothing better is _possible_. If someone thinks it is, please give me details (not just insistence that if I do something or other, it’ll work better!).
Our workflow starts with a batch of bibs from some vendor that are supposed to represent all the titles in a certain package (and none that are not in that package). Let’s call that the “package-at-hand”. The problem, again, is that under the provider-neutral policy the 856 URLs on those bibs are NOT just for that package; they’re all over the place.
So the basic goal is, before loading such a set in the ILS, remove all 856’s from the bibs in this set that do not belong to the package-at-hand.
We use MarcEdit to do this, specifically its regular expression features.
The first problem is that there is in fact no reliable way to create a regexp that will preserve all-and-only the URLs from the package-at-hand.
First, we tried to filter on the 856 subfield $3, as it appeared at first that records would have the name of the platform/provider in the $3. (Which still seems to me a very odd and improper use of $3, but if it’s in the real-world data, we’ll take advantage of it.) But since this is a completely uncontrolled, cataloger-written narrative field, there turned out to be no way to do this reliably. Some records did not have the provider name in the $3 at all; others used differing, unpredictable language to mention it. So if you’ve got 10k or 100k records, and you try keeping only 856 URLs with a certain string (or set of strings) in the $3, there’s no real way to know which valid 856’s you may have missed, resulting in missing access URLs for users. So we gave up on the $3.
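To make the failure mode concrete, here’s a toy sketch (the $3 strings are hypothetical, but typical of what this kind of free-text field looks like). Even a fairly loose case-insensitive pattern misses any record whose cataloger simply didn’t mention the provider:

```python
import re

# Hypothetical $3 strings as they might appear in vendor records;
# all four are supposed to point at the same provider.
subfield_3s = [
    "Available via SomeProvider",
    "Full text from Some Provider platform",
    "SOMEPROVIDER e-book",
    "Online version",          # no provider name at all -- unmatchable
]

# A fairly forgiving pattern: optional whitespace, any case.
pattern = re.compile(r"some\s*provider", re.IGNORECASE)

matched = [s for s in subfield_3s if pattern.search(s)]
missed = [s for s in subfield_3s if not pattern.search(s)]
```

No amount of pattern-tuning rescues the last entry, and with 10k+ records you can’t even enumerate how many such entries exist.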
So what’s left is the $u, the URL itself. At first, this too seemed very promising. Hey, we want to keep all (and only) URLs from SomeProvider? No problem: filter on URLs with the hostname www.someprovider.com, or with one of a certain set of hostnames or other URL patterns known to be used by that provider.
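The core of that hostname filter is simple; here’s a minimal sketch (the hostnames are made up, and in practice we do this inside MarcEdit rather than in a script):

```python
from urllib.parse import urlparse

# Hypothetical set of hostnames known to be used by the provider-at-hand.
KEEP_HOSTS = {"www.someprovider.com", "ebooks.someprovider.com"}

def keep_url(url):
    """Return True if this 856 $u looks like it belongs to the package-at-hand."""
    return urlparse(url).hostname in KEEP_HOSTS

urls = [
    "http://www.someprovider.com/book/123",
    "http://www.otherplatform.com/id/456",
    "http://ebooks.someprovider.com/toc/789",
]
kept = [u for u in urls if keep_url(u)]
```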
That works out okay, except when you run into a provider that uses dx.doi.org (or another general-purpose redirector, like purl.org) for their URLs. Now you’ve got to include dx.doi.org in your ‘keep’ set. Which means you might also end up preserving other providers’ URLs, which you don’t in fact license, that also use that redirector. (You also need to do a bunch of research on the set you are trying to load, to figure out which general-purpose redirectors the provider-at-hand might be using.)
Still, that’s what we’re doing. So we know we’ll wind up with some records in our catalog that present users with URLs they won’t actually have access to once they click. But at least we’re minimizing it. It’s frustrating that I can’t figure out a feasible way to actually do what we want, though.
Same title from multiple providers
Okay, but now consider the case of a title that exists in multiple providers/packages.
Ideally, we’d want a single bib record that had a URL for each provider (that we license).
The work/data flow outlined above will never in fact result in that. Instead, it can result in one of two things. Imagine we first load bibs for Package A, then load bibs for Package B, which includes a bib representing a title we already loaded. In some cases, the new bib will just go in separately, and we’ll have two bibs in the ILS/catalog, one for each provider. Not a disaster. In other cases, the new bib will overlay the old bib, and only the last-loaded URL will end up in the catalog, so one of our licensed URLs for the content will be missing and not displayed to users. That’s kind of unfortunate.
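What we’d actually want the load to do is merge, not overlay. A minimal sketch of that merge logic, with records reduced to plain dicts (matching on OCLC number is an assumption here; real overlay profiles may match on other fields):

```python
def merge_load(catalog, incoming):
    """Load incoming records, unioning 856 URLs on a match
    instead of letting the last-loaded record win."""
    for rec in incoming:
        key = rec["oclc"]
        if key in catalog:
            existing = catalog[key]
            for url in rec["urls"]:
                if url not in existing["urls"]:
                    existing["urls"].append(url)
        else:
            catalog[key] = {"oclc": key, "urls": list(rec["urls"])}
    return catalog

catalog = {}
# Package A load, then Package B load containing the same title:
merge_load(catalog, [{"oclc": "123", "urls": ["http://www.providera.com/1"]}])
merge_load(catalog, [{"oclc": "123", "urls": ["http://www.providerb.com/1"]}])
```

With an overlay, only the providerb URL would survive; with this merge, the single bib ends up carrying both licensed URLs.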
Other fantasy ideas
Kind of a mess. If anyone has any better ideas, please let me know. I would think other people would be running into this too, but when I asked before on this blog, I didn’t successfully get any specific advice for this particular problem, just general advice like “Oh, just filter in MarcEdit.” Which is indeed what we’re trying to do.
Some people say their ERM just handles it. I’m not entirely sure how it does that; I guess perhaps a whole bunch of value-added manual work from the ERM vendor, which explains why it’s so expensive. Or perhaps, since provider-neutral e-book bibs are just starting to come up, it’s going to become a problem even for ERM-mediated loading, just one people haven’t run into yet. We actually do use the SerSol ERM. But we are not paying the (non-trivial) extra charge to get MARC bibs from SerSol for e-books, based on our ERM records. We can’t/don’t want to pay for it. I’m not entirely sure how SerSol would manage to deal with this problem either, other than a whole bunch of manual labor (which does not go back into the cooperative cataloging commons, of course).
Another idea would be some kind of local programming based on our SerSol ERM knowledge base, while avoiding paying anyone for records. If we don’t actually need good AACR2 copy, but would be happy with a minimal basic record (title and author, maybe an ISBN or two, maybe basic publisher information, definitely no LCSH), I could conceivably write a script that created minimal, basic, non-AACR2 MARC from our ERM knowledge base, adding the URLs the knowledge base provides, and only those URLs that the ERM says we have. I think the ERM knowledge base would be sufficient to provide this, but I’m not entirely sure its exports or APIs would give me enough data to write such a thing. But we’re already paying (a significant amount) for the ERM and its vendor knowledge base maintenance; we just can’t/don’t want to pay a supplementary significant amount for MARC record generation.
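A sketch of what such a script might emit, assuming a knowledge base export row with title/author/ISBN/publisher/URL fields (the field names here are hypothetical; a real SerSol export will differ). The output is in MarcEdit’s mnemonic (.mrk) text format, since MarcEdit can compile that to binary MARC:

```python
def kb_row_to_mrk(row):
    """Build a minimal, non-AACR2 record in MarcEdit mnemonic format
    from one knowledge base row. '\\' is the mnemonic-format blank indicator."""
    lines = ["=LDR  00000nam a2200000 u 4500"]
    if row.get("isbn"):
        lines.append(f"=020  \\\\$a{row['isbn']}")
    if row.get("author"):
        lines.append(f"=100  1\\$a{row['author']}")
    lines.append(f"=245  00$a{row['title']}")
    if row.get("publisher"):
        lines.append(f"=260  \\\\$b{row['publisher']}")
    # Only the URLs the ERM says we actually license:
    for url in row.get("urls", []):
        lines.append(f"=856  40$u{url}")
    return "\n".join(lines)

row = {
    "title": "Example Title",
    "author": "Doe, Jane",
    "isbn": "9780000000000",
    "urls": ["http://www.someprovider.com/book/1"],
}
mrk = kb_row_to_mrk(row)
```

Not AACR2, not pretty, but every 856 on it would be one we actually license.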
I suppose this script could be further enhanced to, instead of creating a basic non-AACR2 bib based on just the info in the knowledge base, actually try to fetch additional non-AACR2 data to flesh out the non-standard MARC (from Amazon?), or actually try to fetch MARC records from free sources or standard affordable sources (OCLC, OpenLibrary, the LC catalog, the British Library catalog, other catalogs), and then add on the right 856 URLs as per above. This would get tricky to do reliably too, but might be worth investigating.
Another idea, to deal with the dx.doi.org (and other general-purpose redirector) URLs, would be to run a script at pre-load time that actually loads each of these URLs, sees where it redirects to, and then determines whether the redirection target is in our desired list of URLs or not. That would be a pretty slow script to run (and might do odd things to your vendor-supplied usage statistics), but it’s theoretically possible.
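The decision logic might look like the sketch below (hostnames hypothetical). The resolver is passed in as a function, because the real one would hit the network for every redirector URL; something like `urllib.request.urlopen(url).geturl()` would return the final post-redirect URL, with exactly the slowness and usage-statistics side effects mentioned above. Here a lookup table stands in for it:

```python
from urllib.parse import urlparse

KEEP_HOSTS = {"www.someprovider.com"}          # hosts we license (assumed)
REDIRECTOR_HOSTS = {"dx.doi.org", "purl.org"}  # general-purpose redirectors

def classify(url, resolve):
    """Keep a direct provider URL outright; for a redirector URL,
    follow it via `resolve` and check where it actually lands."""
    host = urlparse(url).hostname
    if host in KEEP_HOSTS:
        return True
    if host in REDIRECTOR_HOSTS:
        final = resolve(url)
        return urlparse(final).hostname in KEEP_HOSTS
    return False

# Fake resolver standing in for the real HTTP redirect lookup:
fake_targets = {
    "http://dx.doi.org/10.0000/example": "http://www.someprovider.com/book/9",
    "http://dx.doi.org/10.0000/other": "http://www.unlicensed.com/book/9",
}
resolve = fake_targets.get
```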
But all/any of this would take a significant time and effort to develop. Which gets you thinking, gee, we’re already paying a lot for records (from OCLC and elsewhere), we’re already paying a lot for an ERM, we’re already paying a lot for an ILS, and then we have to spend a lot of (implicit staff time) money on actually getting this to work, what are we even paying for?
Theory and reality
I find myself thinking more and more that our cooperative cataloging commons is slowly dying. More and more of our stuff is going to be e-books, and this is just one example. In general, the amount of value you get from cooperative cataloging records (which, after all, you still need to pay OCLC for) is shrinking: to get your records to actually work for you, you need to spend a whole bunch more money on additional vendor services, in-house manual cataloging, or in-house software development. The cost-benefit of cooperative cataloging participation is getting lower and lower, which is a shame, because it really would be great if it worked.
But if we want our shared records to actually be useful, we’ve got to do things differently. The provider-neutral policy perhaps was an effort at doing things differently, but from my seat it’s caused as many problems as it’s solved — either resulting in an increase of cost or a decrease in service to users. [If you approached the design of how to catalog these things thinking about how they’d actually be USED, one idea might be to establish a controlled vocabulary for providers and/or platforms, and actually include it in the 856 in a specific subfield designed for that, to support reliable machine filtering by provider or platform. That would have some issues too; this stuff does indeed get tricky.]
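Just to show how much that would simplify things: imagine each 856 carried a controlled provider code in some designated subfield (I’ll call it $9 here purely for illustration; no such convention exists). Then the whole unreliable regexp exercise above collapses to a set lookup, with no redirector problem at all:

```python
# Entirely hypothetical: 856 fields reduced to dicts, with an imagined
# controlled provider code in a made-up "$9" subfield.
LICENSED = {"someprov", "otherprov"}  # codes for packages we license

fields_856 = [
    {"u": "http://dx.doi.org/10.0000/a", "9": "someprov"},
    {"u": "http://dx.doi.org/10.0000/b", "9": "unlicensed"},
]

# Keep exactly the URLs whose provider code we license -- even though
# both URLs go through the same general-purpose redirector.
kept = [f for f in fields_856 if f.get("9") in LICENSED]
```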