Any public data is better than none, but…

So the British Library will apparently be releasing their complete bibliographic corpus publicly, in RDF. It must first be said that this is a very welcome precedent, one that will hopefully encourage others to do the same, and more.

The RDF dump contains far less information than you'd find in a MARC record; only certain kinds of data are present. Additionally, the data that is present is shoehorned into a naive, simple Dublin Core skeleton, with much of the semantics lost. Here's some BL-provided sample data.
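To make the discussion concrete, here is a minimal sketch of pulling Dublin Core values out of an RDF/XML record with Python's standard library. The record structure and values below are invented for illustration; the actual BL dump may nest things differently, though the `rdf:` and `dc:` namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for RDF/XML and Dublin Core elements; the exact
# record structure below is a made-up example, not the real BL sample.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/bib/1">
    <dc:title>An Example Record</dc:title>
    <dc:subject>025.431</dc:subject>
    <dc:subject>Classification, Dewey decimal</dc:subject>
  </rdf:Description>
</rdf:RDF>"""

def dc_values(xml_text, element):
    """Collect every value of one dc: element across all records."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.iter("{%s}%s" % (NS["dc"], element))]

print(dc_values(SAMPLE, "subject"))
# → ['025.431', 'Classification, Dewey decimal']
```

Note that both a Dewey class number and an LCSH-style heading come back from the same `dc:subject` element, which is exactly the problem discussed next.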

For many things I'd want to do with it, the data isn't really clear enough. For example, the dc:subject element contains both LCSH (or LCSH-style) subjects and Dewey (or Dewey-style) class numbers. For many uses, I need to know which is which. Dewey numbers might go in a shelf browse, but LCSH subjects don't. LCSH subjects might go in a subject search or subject heading display, but Dewey numbers probably don't.

My own consuming software can use heuristics to try to determine which is which (not that hard in this case), but that increases my barriers and costs, adds possible bugs, and is a chunk of work that different consumers each need to duplicate, because the provider, although it surely knew at the source which value was an LCSH heading and which a DDC number, dumped them into the same data element where they aren't clearly disambiguated.
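The kind of heuristic I mean might look like this: a hypothetical classifier that treats anything shaped like a Dewey class number as DDC and everything else as an LCSH-style heading. It works on clean input, but it is exactly the duplicated, fragile guesswork that proper data modelling at the source would make unnecessary.

```python
import re

# Hypothetical heuristic: a value of three digits with an optional
# decimal extension (e.g. "025.431") is assumed to be a Dewey class
# number; anything else is assumed to be an LCSH-style heading.
# Real-world data would need more care than this.
DDC_PATTERN = re.compile(r"^\d{3}(\.\d+)?$")

def classify_subject(value):
    return "ddc" if DDC_PATTERN.match(value.strip()) else "lcsh"

print(classify_subject("025.431"))         # → ddc
print(classify_subject("World Wide Web"))  # → lcsh
```

Easy enough here, but every consumer has to write (and debug) their own version of it, and edge cases such as prefixed or truncated class numbers will break the pattern silently.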

An even less tractable example is the dc:relation element. I'm not even sure what they're putting in there; it looks like some kind of controlled headings, but perhaps several kinds mixed together. Some of them look like they might be series titles. Some look like they might be subject terms, but maybe subject terms beginning with names go in dc:relation instead of dc:subject? I'm mostly just guessing. This is data in the BL RDF I couldn't do much with at all, because it seems to be several different kinds of things with no way to tell which is which, or what any of it is exactly.

So, making any data public is a start. There are certainly some things you could do with this data. You could feed it all to some machine-learning automated clustering algorithm and generate clusters of "similar" bibs; such algorithms just work on text tokens and don't have to care that they don't know what the heck a dc:relation actually is. Although really, even for such algorithms, the more specific the data you have, the better they can do. At the source, the BL certainly knew which strings were DDC and which were LCSH, and what was what among the things they jammed into dc:relation.
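The token-based approach can be sketched very simply: flatten each record's field values into a bag of tokens and compare records with a set-overlap measure such as Jaccard similarity. This is my own illustrative sketch, not anything the BL proposes; the record values are invented.

```python
def tokens(record_fields):
    """Flatten a record's field values into a set of lowercase tokens.
    Note the algorithm neither knows nor cares which field (dc:subject,
    dc:relation, ...) a token came from."""
    return {t for field in record_fields for t in field.lower().split()}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two made-up records, compared purely on shared tokens.
rec1 = tokens(["Semantic web", "Metadata standards"])
rec2 = tokens(["Semantic web services", "Ontologies"])
print(round(jaccard(rec1, rec2), 2))
# → 0.33
```

This works without knowing what any field means, which is the point: it is about the only style of use the undifferentiated data fully supports, and even it would do better with field-aware weighting.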

The more specificity you are able to provide, the more use cases you're going to cover. So if you put your data out there and find yourself frustrated ("Hey, everyone said they wanted our data, and now nobody is using it"), one reason could actually be that it's too hard (or impossible) to use for the use cases people want, due to the way your data is structured.

Data modelling is actually kind of hard. So just making the data available isn’t always enough, if it’s not enough data, or not modelled well enough.

Now, this seems to be a work in progress. I don't think the complete RDF dump is even available yet? Or it's only available if you email them and they send you a zip? Perhaps it will improve. I don't know if the BL has a business like LC's in selling cataloguing records, and might intentionally want to decrease the 'resolution' of this data to avoid cutting into that other business. But their free data services web page cited above suggests they encourage you to use at least their current Z39.50 API for "cataloguing", so apparently they don't mind? Their Z39.50 interface probably returns MARC?

So one thing I'd suggest is providing an easy bulk download of the MARC records too, if they really want to share the data. I am no MARC lover, believe me. But data modelling is hard: it is in fact very important that we work on modelling our data better than MARC using modern technologies, but in the meantime, and in parallel, why not share the MARC too, to allow people to use it when your other data formats lose the 'resolution' required to make someone's particular use case possible or easier?


One thought on "Any public data is better than none, but…"

  1. As you mention, the RDF/DC structured data is currently a work in progress. As we receive feedback we will continue to revise the file format with the aim of making it as useful as possible to researchers. Since the MARC format is only widely used in the library world and requires specific translation tools, we have initially selected RDF/DC in XML in an attempt to make the data more widely usable. Inevitably, use of basic DC can give rise to the suggestion that it is a 'lowest common denominator'. However, we believe it offers a good starting point from which to begin structuring the data in a non-proprietary format that has more general appeal.

    We hope to make the next version of the sample file available via the BL’s website by September 6th and will continue to update the format in line with comments received. We also intend to release a version with embedded URIs for linked data experimentation later in the year.

    Having looked at the samples, researchers wishing to use copies of the full database can send any enquiries to .
