Further adventures in provider-neutral e-book bibs

A bit ago, I asked for advice on dealing with loading of the new provider-neutral-type e-book bibs. There was some interesting discussion there that helped me figure out how to think about this, but I guess most people have very different uses of e-book bibs than we do, because there weren’t many specifics on how I can deal with this (although there were some people insisting it shouldn’t be a problem, none of them seemed to actually have a recipe for doing it). Perhaps most people don’t load nearly as many e-book records as we do?  Or perhaps they’re all paying vendors to do this step for them, which we are trying to do ourselves? I really have no idea.

So, a current update on what we are doing/plan to do. It’s not really entirely satisfactory. But I’m pretty sure nothing better is _possible_; if someone thinks it is, please do give me details (not just insist that if I do something or other, it’ll work better!).

So our workflow involves starting with a batch of bibs from some vendor, that are supposed to represent all titles in a certain package (and none that are not in that package).  Let’s call that the “package-at-hand”. The problem, again, is that the 856 URLs on those bibs are NOT (post-provider-neutral) just for that package, but all over the place.

So the basic goal is, before loading such a set in the ILS, remove all 856’s from the bibs in this set that do not belong to the package-at-hand.

We use MarcEdit to do this, specifically its regular expression features.

Filtering logic

The first problem is that there is in fact no reliable way to create a regexp that will preserve all-and-only the URLs from the package-at-hand.

First, we tried to filter on the 856 subfield $3, as it appeared at first that records would have the name of the platform/provider in the $3. (Which still seems to me a very odd and improper use of $3, but if it’s in the real world data, we’ll take advantage of it).  But, since this is a completely uncontrolled cataloger-written narrative field, there turned out to be no way to do this reliably. Some records did not have the provider name in the $3, others used differing unpredictable language to mention it. So if you’ve got 10k or 100k records, and you try only keeping 856 urls with a certain string (or set of strings) in the $3, there’s no real way to know which valid 856’s you may have missed, resulting in missing access urls for users.  So we gave up on the $3.

So what’s left is the $u, the url itself.  At first, this too seemed very promising. Hey, we want to keep all (and only) URLs from SomeProvider, no problem, filter on urls that start with http://www.someprovider.com, or perhaps have one of a certain set of hostnames or other url patterns known to be used by that provider.

That works out okay, except when you run into a provider that uses dx.doi.org (or another general purpose redirector, like purl.org) for their URLs. Now you’ve got to include dx.doi.org in your ‘keep’ set.  Which means you might also end up preserving other providers’ URLs, that you don’t in fact license, which also use that redirector.  (You also need to do a bunch of research in the set you are trying to load, to figure out what general purpose redirectors the provider-at-hand might be using in there).

Still, that’s what we’re doing. So we know we’ll wind up with some records in our catalog that present the user with URLs that, once they click on them, they’re not going to have access to. But at least we’re minimizing it. Frustrating to me that I can’t figure out a feasible way to actually do what we want though.
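To make that filtering rule concrete, here is a minimal sketch in Python rather than in MarcEdit. It assumes the 856 $u values have already been pulled out of a record as plain strings. The hostnames, the set names and the function names are all made up for illustration, not taken from any real provider or tool.

```python
from urllib.parse import urlsplit

# Hypothetical hostnames used by the provider-at-hand; substitute the real ones.
PROVIDER_HOSTS = {"www.someprovider.com"}

# General purpose redirectors found in the batch. Keeping these is the
# compromise: other providers' URLs behind the same redirector survive too.
REDIRECTOR_HOSTS = {"dx.doi.org", "purl.org"}


def keep_url(url):
    """Return True if an 856 $u value should stay on the bib."""
    # URLs without a scheme have no parseable hostname and are dropped here.
    host = (urlsplit(url).hostname or "").lower()
    return host in PROVIDER_HOSTS or host in REDIRECTOR_HOSTS


def filter_856_urls(urls):
    """Keep only the URLs that pass keep_url, preserving their order."""
    return [url for url in urls if keep_url(url)]
```

Given a bib with one URL at www.someprovider.com, one at dx.doi.org and one at some other provider's hostname, this keeps the first two. The dx.doi.org one stays whether or not it actually leads to the package-at-hand, which is exactly the leak described above.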

Same title from multiple providers

Okay, but now consider the case of a title that exists in multiple providers/packages.

Ideally, we’d want a single bib record that had a URL for each provider (that we license).

The work/data flow outlined above will never in fact result in that.  Instead, it can result in one of two things. Imagine, first we load bibs for Package A.  Then we load bibs for Package B, which includes a bib representing a title we already loaded.  In some cases, the new bib will just go in separate, and we’ll have two bibs in the ILS/catalog, one for each provider.  Not a disaster.  In other cases, the new bib will overlay the old bib, and only the ‘last loaded’ URL will end up in the catalog, so one of our licensed urls for the content will be missing and not displayed to users. That’s kind of unfortunate.
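The difference between what an overlay does and what we would ideally want can be sketched in a few lines of Python. This is only an illustration of the two outcomes, assuming each bib is reduced to its list of 856 URLs. It is not how any particular ILS implements loading, and the function names are hypothetical.

```python
def overlay(existing_urls, incoming_urls):
    """Overlay: the incoming bib replaces the old one, URLs and all."""
    return list(incoming_urls)


def merge_urls(existing_urls, incoming_urls):
    """The ideal: one bib carrying every licensed URL, duplicates dropped."""
    merged = []
    for url in list(existing_urls) + list(incoming_urls):
        if url not in merged:
            merged.append(url)
    return merged
```

With a Package A URL already loaded and a Package B URL arriving for the same title, overlay leaves only the Package B URL, while merge_urls keeps both.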

Other fantasy ideas

Kind of a mess.  If anyone has any better ideas, please let me know.  I would think other people would be running into this too, but when I asked before on this blog, I didn’t successfully get any specific advice for this particular problem, just general advice like “Oh, just filter in MarcEdit.” Which is indeed what we’re trying to do.

Some people say their ERM just handles it. I’m not entirely sure how it does that, I guess perhaps a whole bunch of value-added manual work from the ERM vendor, which explains why it’s so expensive. Or perhaps since provider-neutral e-book bibs are just starting to come up, it’s going to become a problem even for ERM-mediated loading, just one people haven’t run into yet. We actually do use the SerSol ERM.   But we are not paying the (non-trivial) extra charge to get marc bibs from SerSol for e-books, based on our ERM records. We can’t/don’t want to pay for it.  I’m not entirely sure how SerSol would manage to deal with this problem either, other than a whole bunch of manual labor (which does not go back into the cooperative cataloging commons, of course).

Another idea would be some kind of local programming based on our SerSol ERM knowledge base, but avoiding paying anyone for records. If we don’t actually need good AACR2 copy, but would be happy with a minimal basic record (with title and author and maybe an ISBN or two, maybe basic publisher information, definitely no LCSH), I could conceivably write a script that created minimal basic non-AACR2 marc from our ERM knowledge base, adding URLs provided by the knowledge base, and adding only those URLs that the ERM says we have. I think the ERM knowledge base would be sufficient to provide this, but I’m not entirely sure its exports or APIs would give me sufficient data to write such a thing. But we’re already paying (a significant amount) for the ERM and its vendor knowledge base maintenance, we just can’t/don’t want to pay a supplementary significant amount for marc record generation.

I suppose this script could be further enhanced to, instead of creating a basic non-AACR2 bib based on just info in the knowledge base, actually try to fetch additional non-AACR2 data to flesh out the non-standard marc (from Amazon?), or actually try to fetch marc records from free sources, or standard affordable sources (OCLC, OpenLibrary, LC catalog, British Library catalog, other catalogs), and then add on the right 856 URLs as per above.  This would get tricky to do reliably too, but might be worth investigating.

Another idea to deal with the dx.doi.org (and other general purpose redirector) URLs would be, at pre-load time, to run a script that actually loads each of these URLs, sees where it redirects to, and then determines whether the redirection target is in our desired list of URLs or not. That would be a pretty slow script to run (and might do odd things to your vendor-supplied usage statistics), but it’s theoretically possible.
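A rough sketch of what such a redirect-checking script might look like, in Python with only the standard library. The hostnames are hypothetical, and so are the function names. Keeping a URL when the lookup fails is my own assumption (better a dead link than a missing licensed one), not something established above. Every call to resolve_final_url makes a real request, which is where the slowness and the usage-statistics worry come from.

```python
import urllib.request
from urllib.parse import urlsplit

# Hypothetical hostnames; substitute the real ones for the package-at-hand.
PROVIDER_HOSTS = {"www.someprovider.com"}
REDIRECTOR_HOSTS = {"dx.doi.org", "purl.org"}


def hostname_of(url):
    """Lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def resolve_final_url(url, timeout=10):
    """Fetch the URL, following redirects, and return where we end up."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def keep_after_resolving(url, resolve=resolve_final_url):
    """Decide whether an 856 $u stays, resolving redirector URLs first."""
    host = hostname_of(url)
    if host in PROVIDER_HOSTS:
        return True
    if host not in REDIRECTOR_HOSTS:
        return False
    try:
        target = resolve(url)
    except (OSError, ValueError):
        # Assumption: if the lookup fails, keep the URL rather than lose access.
        return True
    return hostname_of(target) in PROVIDER_HOSTS
```

The resolve parameter is there so the decision logic can be tried out against a canned table of redirects before pointing it at 100k live URLs.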

But all/any of this would take significant time and effort to develop.  Which gets you thinking, gee, we’re already paying a lot for records (from OCLC and elsewhere), we’re already paying a lot for an ERM, we’re already paying a lot for an ILS, and then we have to spend a lot of money (implicitly, in staff time) on actually getting this to work, so what are we even paying for?

Theory and reality

I am only thinking more and more that our cooperative cataloging commons is slowly dying. More and more of our stuff is going to be e-books, but this is just one example. In general, the amount of value you get from cooperative cataloging records (which after all, you still need to pay OCLC for) is shrinking: to get your records to actually work for you, you need to spend a whole bunch more money on either additional vendor services, in-house manual cataloging, or in-house software development.   The cost-benefit for the cooperative cataloging participation is getting worse and worse, which is a shame, because it really would be great if it worked.

But if we want our shared records to actually be useful, we’ve got to do things differently. The provider-neutral policy perhaps was an effort at doing things differently, but from my seat it’s caused as many problems as it’s solved, resulting in either an increase in cost or a decrease in service to users.  [If you approached the design of how to catalog these things thinking about how they’d actually be USED, one idea might be to establish a controlled vocabulary for providers and/or platforms, and actually include this in the 856 in a specific subfield designed for that, to support reliable machine filtering based on provider or platform. That would have some issues too, this stuff does indeed get tricky.]


6 thoughts on “Further adventures in provider-neutral e-book bibs”

  1. Regarding PURLs and DOIs: perhaps figure out which prefixes your providers use, and build regexes off those?

    Could still fail (e.g. through provider merger, meaning one provider owns more than one prefix), but might be workable.

  2. Jonathan, I sympathize with your troubles and have pretty much the same set of them myself, but some details in your two posts don’t fit my experience. When we get MARC records from a vendor (like ebrary or CRC or Springer), the records contain only the URL that pertains to the vendor’s product and our access. For example, here is an ebrary url in one of our ebrary-provided records:


    We often have a number of problems with records provided by vendors, but we never see the problem of many URLs in records provided by a vendor for a particular product. I can’t imagine I’d want records from a vendor that didn’t have the URLs we needed and only those URLs.

  3. Matthew, what local catalogers tell me is that the “provider neutral e-bib policy” has resulted in records, even those delivered by an individual content provider, increasingly coming with multi-provider URLs on them. They expect this to start going up even more. I asked the same thing you did: “Why would we want records from a vendor that had some other vendor’s URLs on them, and why would the vendor want to give those to us?” They told me “See, there’s this new provider-neutral policy for e-monograph bibs….”

    So, if I understand things right, and local catalogers telling me things understand things right, you’re going to start seeing more and more of these as I describe.

  4. Maybe, but I don’t think you are right about this vendor-neutral record affecting vendor-supplied records.

    Catalogers who are working with individual titles and OCLC do see records with many URLs in them, and that is, at least in part, a result of the vendor-neutral record. However, the problem the vendor-neutral record mitigates is the proliferation of unique bib records for every URL (i.e. every vendor’s product). It is much better to have the vendor-neutral record.

    The vendor-neutral ebook record is following a path set by the vendor-neutral e-serial record, and ebook cataloging will follow the path set by e-serial cataloging. Libraries will increasingly depend on a 3rd party (like SerSol or Ex Libris) to provide ebook metadata to a knowledge base and MARC records for those ebooks. It works pretty well for e-serials, and we’ll find out how well or how poorly it works for ebooks.

    I do think you are right about the negative implications for cooperative cataloging and shared catalogs like OCLC.

  5. Some of our records are coming from a partnership with vendors and OCLC, from… a web page I can’t find anymore. But it’s got a bunch of downloads for a buncha different packages, for free, from an oclc.org url.

    We get a bunch of records from there, and they definitely have the vendor-neutral ‘problem’. My local catalogers suggest that records we get directly from vendors are ALSO starting to be like this, but perhaps I misunderstood.

    Having to pay a lot more money to a third-party vendor to do post-processing that we didn’t need to pay for before doesn’t sound like an improvement to me, but hey, it’s not like the status quo was great either.
