We’ve all got a lot of data in MARC (the fact that that statement makes sense shows that MARC is effectively a data vocabulary, not just a transmission standard, but anyway, moving on), and we need to sling it around between applications, including, for many of us, “next generation” discovery tools that need to index it.
Binary Marc21 is the ‘native’ MARC transmission format for our data. It’s got some benefits: it’s a ‘lowest common denominator’ that the systems we work with are most likely to produce and consume, and it’s fairly fast to de-serialize (I was going to say ‘parse’, but ‘de-serialize’ is probably more accurate for a format like Marc21).
However, binary Marc21 has some significant problems too:
- If your programming language of choice doesn’t already have a robust, well-performing, free library for serializing/de-serializing Marc21, it’s kind of a bear to write one. It’s a very weird format in some ways (offset data encoded as ASCII numerals?), and overly complex by contemporary standards. And just because you think you have a library available doesn’t necessarily mean that open source library is as robust or well-performing as you might hope.
- Just because an existing system (like an ILS) says it outputs Marc21 doesn’t necessarily mean it outputs legal Marc21. If some records are structurally illegal in certain ways, they may not be de-serializable on the other end, or may require more complex and less-well-performing de-serialization code there. The weirdness and complexity of the Marc21 format (see above) contributes to the prevalence of this non-compliant output.
- Perhaps most significantly, binary Marc21 has a maximum record length. A legal binary Marc21 record can’t be any larger than 99999 bytes (just under 100K), because the record length is stored as five ASCII digits in the leader, as the sketch below illustrates. While this must have seemed larger than you’d ever want in the 1960s, it’s often not large enough for us today, especially when you try to include ‘item’ information in a MARC bib record (which isn’t standard, but is often done for various reasons).
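To make the ‘weird format’ complaint concrete, here’s a minimal sketch of what de-serializing a single binary Marc21 record involves. This is illustrative rather than production code: it assumes a structurally legal, UTF-8-encoded record, which, as noted above, you can’t count on.

```python
# Minimal sketch of de-serializing one binary Marc21 record.
# Assumes a structurally legal, UTF-8-encoded record; real-world
# code has to cope with records that violate both assumptions.

FIELD_TERMINATOR = b"\x1e"
SUBFIELD_DELIMITER = b"\x1f"
RECORD_TERMINATOR = b"\x1d"

def parse_marc21(raw: bytes):
    leader = raw[:24]
    record_length = int(leader[0:5])    # record length stored as ASCII numerals
    base_address = int(leader[12:17])   # offset to field data, also ASCII numerals
    assert raw[record_length - 1:record_length] == RECORD_TERMINATOR

    fields = []
    directory = raw[24:base_address - 1]      # 12-byte entries, minus terminator
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])              # still more ASCII numerals
        start = int(entry[7:12])
        data = raw[base_address + start:base_address + start + length]
        data = data.rstrip(FIELD_TERMINATOR)
        if tag < "010":                       # control field: no indicators/subfields
            fields.append((tag, data.decode("utf-8")))
        else:
            indicators = data[:2].decode("utf-8")
            subfields = [s.decode("utf-8")
                         for s in data[2:].split(SUBFIELD_DELIMITER) if s]
            fields.append((tag, indicators, subfields))
    return leader.decode("ascii"), fields
```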
To get around these problems, many people choose to work with MarcXML instead of binary Marc21 when they can. And MarcXML does get around the problems listed above pretty well, but it involves a couple of trade-offs that don’t matter in some circumstances, but do in others:
- A MarcXML file generally has a much larger file size than its equivalent Marc21.
- A MarcXML file is often significantly slower to de-serialize than its equivalent Marc21.
In many cases, those issues don’t matter at all. But in some cases, they are unfortunate. (Like when you are exporting, re-indexing, and re-storing your entire multi-million-record Marc corpus).
So some people came up with the idea of MARC in JSON. If you can serialize MARC in XML, why not do something very similar to serialize MARC in JSON in a standard way? JSON is much more compact than XML and typically faster to parse, while still being a standard beyond the library world (meaning there are tools to support it and validate it), and without the issues of binary Marc21, including its length limit.
In fact, I know of a couple of people who independently had this marc-json idea, but Bill Dueber wrote a little proto- mini- spec for a standard way to do MARC in JSON, so that tools written by different people can inter-operate.
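To give a rough idea of what such a serialization might look like, here’s a sketch of one plausible lossless shape for a record, with the caveat that the key names are my own illustration, not quoted from Bill’s proto-spec:

```python
# Illustrative only: one plausible lossless, round-trippable JSON shape
# for a MARC record. Key names are my sketch, not quoted from the spec.
import json

record = {
    "leader": "00714cam a2200205 a 4500",
    "fields": [
        {"tag": "001", "data": "12883376"},
        {"tag": "245", "ind1": "1", "ind2": "0",
         "subfields": [{"a": "Summerland /"}, {"c": "Michael Chabon."}]},
    ],
}
print(json.dumps(record))  # nothing from the MARC record is lost in the trip
```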
I encourage anyone dealing with these issues to consider marc-json per Bill’s proto-mini-spec. I plan/hope to!
Conceptually this sounds nice, but it’s an awkward trade-off between JSON and XML. JSON is great at being open and accessible, faster and simpler, but it lacks tools like XSLT and XPath that XML provides. Yes, I cringe a bit as I write this.
You don’t need to pick just one or the other; you can use both! The right tool for the right job. Different use cases need different things.
Adam — the problem with marc-xml, at least in my mind, is that none of the technologies you list are worth a damn with the format, because it’s so brain-dead. The reality of MARC is that you’re not going to get much useful out of the structure, because there is so little structure; you’re going to have to muck with the actual strings.
If you’re doing something simple — get a list of all the titles (for a very narrowly-defined meaning of ‘title’), say — then XPath/XSLT would make for light work. For most stuff, though, many of us are more comfortable working in a language with side-effects :-)
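As a concrete sketch of that kind of ‘light work’ (assuming Python with lxml, and a MARCXML file whose name here is made up):

```python
# Quick sketch: pull the 245 $a 'titles' out of a MARCXML file with one XPath.
# "records.xml" is a placeholder filename; the namespace is MARCXML's own.
from lxml import etree

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
tree = etree.parse("records.xml")
titles = tree.xpath(
    "//marc:datafield[@tag='245']/marc:subfield[@code='a']/text()",
    namespaces=MARC_NS,
)
print(titles)
```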
Point taken, though — often the solution to “we need another format” is “learn how to use the tools that work well with the existing format.”
I’ve played with taking bib data (in MARC format specifically) and outputting it as JSON, with some success.
Specifically, the [php-based] deserializer is used to create an outward-facing-yet-undocumented-to-the-public API for our catalog:
http://www.cantonpl.org/apis/cat.php?q=awesome
The API only works with MARC at the single-bib level (so an ISBN or bib number query), but returns full-featured results that most any language can then easily ingest.
Some of the applications for this method can be found at http://www.cantonpl.org/tools, along with the search suggestions on the main search bar and the added-info tooltips on bib record links.
Hope this helps.
Brad, that’s interesting to look at, but it’s important to point out that what you seem to have done is a transformation of MARC to semantic tags, sort of MODS-like, though I think you’ve used a custom mapping rather than MODS? You’ve sort of created your own schema/vocabulary/element set. Perhaps useful, but a different sort of thing than a ‘lossless’, ‘round-trippable’ representation of MARC in JSON. (It would be hard to go back from your custom schema to an internal representation of a MARC record, or to a legal MARC serialization.)
What I’m talking about is just an additional serialization of MARC, lossless and round-trippable, just like MARCXML, but more compact in json.
Somewhere recently (something by Karen Coyle?) I saw an article that provided a taxonomy of metadata ‘layers’, but now I can’t find it. I wish I could, so I could refer to it here; it was helpful to have a common vocabulary to talk about this stuff. But, basically, I think MARC serves as a schema/element-set, and one thing you can do is translate or ‘cross-walk’ MARC records to a different schema/element-set (which may or may not be a reversible operation, depending on the destination schema). What we’re discussing here, though, is keeping the MARC schema but just providing a different serialization/transmission-format for it.
Thanks for bringing this up. Last year at code4lib ’09 I made a start at supporting JSON round-tripping in pymarc; this will prompt me to find that, update it, and publish it as a fork.
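A minimal sketch of what that round-tripping looks like from the pymarc side, assuming a pymarc version that provides Record.as_json() (later releases do ship this):

```python
# Sketch: emit each record of a binary MARC file as JSON, assuming a
# pymarc version with Record.as_json(). "records.mrc" is a placeholder.
from pymarc import MARCReader

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        print(record.as_json())
```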
http://www.loc.gov/pictures/item/2008660390/marc?fo=json&at=marc
It is hard to find information about this topic. Can anyone post an update on whether there is already a standard .xsl file to get JSON from MARC-21-XML, or another way to transform MARC-21 to JSON? Is the posted link from thatcher just an appetizer? Thank you very much for your amazing blog (at least for me :DD), by the way!
FWIW – LibreCat.org – Catmandu Perl scripts, including export to ElasticSearch. I love conversion, although I haven’t gotten good JSON output yet. It’s easy to convert MARC to MARC-XML, JSON, or YAML, including with different Fix options. Currently the best I’ve got works in an awkward way (due to my current knowledge): MARC > JSON > into ElasticSearch > from ElasticSearch back to JSON. The resulting data mapping is very good, at least for my needs.