This is not your typical ‘why MARC must die’ post. It’s instead about very low level structural problems in a Marc21 binary file that my ILS outputs. It’s not about the semantics of MARC at all, it’s about the structural features of the Marc21 format.
I never had to know much about low-level Marc21 format details before, and wish I still didn’t, but I had to learn because my ILS (Horizon) is outputting certain bibs as MARC that the Marc4J Java library used by SolrMarc refused to read, claiming they were structurally invalid in various ways. (I never would have figured this stuff out without the invaluable help of sesuncedu, robcaSSon, and others in #code4lib.)
But this may help someone else figure out why Marc4J can’t read their MARC.
1. Invalid leader bytes
In the leader of a Marc21 record, byte 10 is always ascii ‘2’ (the indicator count), byte 11 is always ‘2’ as well (the subfield code count), and bytes 20-23 are always ‘4500’ (the entry map). At least they’re supposed to be. Theoretically these bytes allow a record to specify details about the nature of its binary format — but these details are fixed in Marc21, and as far as I know this flexibility was rarely or never taken advantage of in any other Marc variant either.
However, Horizon actually stores most of its leader bytes in a db column. And if the leader bytes in that db column are something other than these invariants, Horizon’s marc export will include those leader bytes — even if they are invalid, even if they do _not_ accurately describe the Marc record they are attached to (which wouldn’t be a valid Marc21 record if they were true).
Since these values are invariant in Marc21 and most (all) other Marc formats, most Marc parsers ignore them.
However, Marc4J doesn’t — it actually treats them as gospel. So if those bytes are wrong, Marc4J will try to read the record improperly. And if those bytes aren’t ascii decimal digits at all, Marc4J will claim it can’t read the leader.
So I just had to fix those in our production ILS. And figure out where they’re coming from, and try to stop them from coming in again. Really, I blame our ILS here for even allowing such completely wrong bytes into its internal db.
(A perhaps better solution from the other end is fixing the Marc4J PermissiveReader to not pay attention to those bad bytes, assuming the invariant values. sesuncedu has prepared a patch doing some of that for Marc4J, hopefully it’ll get in there.)
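To make the fix-from-the-parsing-end concrete, here’s a minimal sketch (my own hypothetical code, not part of Marc4J or Horizon) of forcing the invariant leader bytes before a strict parser ever sees the record:

```python
def fix_leader(record: bytes) -> bytes:
    """Force the invariant MARC21 leader bytes on a raw binary record.

    Per the MARC21 spec: byte 10 = '2' (indicator count),
    byte 11 = '2' (subfield code count), bytes 20-23 = '4500' (entry map).
    """
    leader = bytearray(record[:24])
    leader[10:12] = b"22"
    leader[20:24] = b"4500"
    return bytes(leader) + record[24:]
```

Run over a file of records before indexing, this neutralizes the bad-leader problem regardless of what garbage the ILS let into its db.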
2. Bib Records Too Long For Marc
Because of the nature of MARC21’s ‘directory’ structure, there is a maximum length that a MARC record can be: 99999 bytes, since the leader records the total length in five ascii decimal digits. If a record is above this length, the MARC directory doesn’t have enough bytes in it to describe where the fields beyond this length are in the record, and the MARC record is unreadable. (Incidentally, it’s very odd that MARC records internal byte offsets as ascii decimal chars, rather than ordinary binary data. If it used a more typical simple binary encoding of integers for byte offsets, the maximum length of a MARC record would be quite a bit larger. But it doesn’t. Oh well.)
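You can see where the limits come from just by decoding one directory entry — each is 12 bytes, all ascii digits apart from the tag. A sketch (my own illustration, not any particular library’s API):

```python
def parse_directory_entry(entry: bytes):
    """Decode one 12-byte MARC directory entry into (tag, length, offset).

    The field length is 4 ascii digits (so max 9999) and the starting
    offset is 5 ascii digits (so max 99999) -- which, together with the
    5-digit record length in leader bytes 0-4, caps the whole record
    at 99999 bytes.
    """
    assert len(entry) == 12
    tag = entry[0:3].decode("ascii")
    length = int(entry[3:7])
    offset = int(entry[7:12])
    return tag, length, offset
```

Once a record needs an offset bigger than 99999, there is simply no legal way to write its directory.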
So what does Horizon marc export do when a record has too much data and goes over the maximum record length in MARC? It outputs it anyway. But the marc record it outputs is seriously messed up: it’s got a MARC directory which may be entirely illegal (a length not a multiple of 12), a wrong ‘length’ in leader bytes 0-4, and possibly other problems. Depending on the individual record and exactly how Horizon ended up outputting it, Marc4J might just skip it as a bad record and go on. That’s the best that could be expected.
However, more often, Marc4J gets entirely confused because of the bad leader bytes 0-4 length, and doesn’t understand where the subsequent record in the marc file actually begins. So every other record after this too long one in the marc file is a loss to Marc4J/SolrMarc indexing. Either every subsequent record can’t be indexed at all, or even worse, every subsequent record is indexed by Marc4J/SolrMarc, but completely wrong, because Marc4j/SolrMarc got the wrong data.
I need to work out a patch for the Marc4J PermissiveReader so that when it encounters such a record, Marc4J can at least recover by properly finding the beginning of the NEXT record, using the Marc Record Terminator character.
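The recovery idea is simple enough to sketch (hypothetical code, not the actual Marc4J patch): when the leader length is garbage, scan forward for the Record Terminator byte (hex 1D) and resume parsing just past it, instead of trusting the bad length.

```python
RECORD_TERMINATOR = b"\x1d"  # MARC Record Terminator, hex 1D / decimal 29

def next_record_start(data: bytes, pos: int) -> int:
    """Return the offset just past the next Record Terminator at or after
    pos, or len(data) if there is no further terminator."""
    rt = data.find(RECORD_TERMINATOR, pos)
    return len(data) if rt == -1 else rt + 1
```

One bad record is still lost, but every subsequent record in the file becomes readable again, which is the whole point.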
3. Blank/null tags
This one might be Horizon-specific. Horizon allows the operator to accidentally add a tag to a record that has a null tag value. Not 100, 245, or something else, but just null. This accident could have been made manually, or could have been made by some sort of automated import script when we batch loaded records into Horizon. When the Horizon marc exporter encounters such a record, it does output marc21 for it, but completely invalid and wrong marc21.
I blame Horizon for this: it ought not to allow null tag values to even exist in the db, and if they do, they ought to be ignored on export, not produce an invalid marc record on export.
This is another problem that often results in Marc4J getting completely confused about where one record ends and the next starts, making the entire rest of the Marc file after such a record unreadable. Probably because of a bad length value in leader bytes 0-4, so perhaps if I can work out the patch above, it will at least result in Marc4J successfully skipping such a record and going on to the rest of the file.
4. Illegal chars in Marc values?
This one I haven’t completely gotten to the bottom of yet, because I made the mistake of fixing the couple examples I found in the Horizon Staff Client, where it didn’t really show me exactly what was going on.
But I think some Marc control characters (Field Terminator or Record Terminator) wound up in some of my record values in the db. (No doubt as the result of an import gone wrong at some point in the past.) The Horizon marc exporter simply included them unescaped in its marc output, resulting in special marc control characters in illegal places in the marc file, or in places where they don’t mean what they’re supposed to mean. This also messed up Marc4J something awful.
Again, I kind of blame Horizon here, for allowing bad data in its internal store, and then for writing bad data out in marc export when such bad data is in its internal store.
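A quick check in the same spirit, on the field-value level (my own sketch, assuming you can get at individual raw field values): flag any MARC control bytes that appear *inside* a value rather than as a single legitimate trailing Field Terminator.

```python
def has_stray_control_chars(field_value: bytes) -> bool:
    """True if a Field Terminator (0x1E) or Record Terminator (0x1D)
    appears anywhere in the value other than as a trailing FT --
    exactly the bytes that scrambled my export."""
    body = field_value.rstrip(b"\x1e")  # a trailing FT is legitimate
    return b"\x1e" in body or b"\x1d" in body
```

Anything this flags is going to come out of a naive exporter as an unparseable record.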
NEW! 4 Feb 2010:
Marc control character in internal data value.
I’ll describe this one in Horizon-specific terminology, because it’s clearer that way.
The horizon “bib” table holds an individual marc field in the ‘text’ column. Every ‘text’ value ENDS in the Marc Field Terminator character (decimal 30, hex 1E, sometimes displayed as “^^”).
However, some of our values have that Marc Field Terminator character _not_ as the last character, but internally. This creates problems in marc export, where the marc created by marcout is invalid unparseable marc. (as it includes marc Field Terminator control character in illegal position).
This problem is not visible in Horizon Staff Client, the control character is not shown. But it’s hiding there in the database anyway. If you open an individual record in Horizon Staff Client and then simply re-save it, it SEEMS to fix the problem in at least some cases (not sure about all), but probably makes more sense to fix it in bulk through an automated process anyway.
As a technical note: I used this SQL against our hzdev db to find the number of bibs which contain char(30) as some char OTHER than the last in dbo.bib. It takes quite a while to run. This would have to be re-done for dbo.bib_longtext.longtext, another table that data destined for marc export can hide in. You could base an automated fix off of this SQL technique.
select count(distinct bib#) from dbo.bib where (charindex(char(30), text) != char_length(text))
Correction: that SQL will also find values that do not end in the FT at all. While Horizon ordinarily does write the trailing FT, so a missing one is sort of an error, it doesn’t cause any problems. Here’s one to find only values with an internal FT after all, not including ones with no FT whatsoever:
select count(distinct bib#) from dbo.bib where (charindex(char(30), text) not in(0, char_length(text)))
New! 9 March 2010: 5. Illegal Marc8 encoding
Records in our catalog are not in UTF-8 but in the library “Marc8” encoding. However, if there’s weird data in the db, our catalog will happily export marc that actually includes bytes which are not proper Marc8. For instance, in the situation I ran into, two “ESC” characters in a row. Fortunately, I have not run into very many of these, since I have no idea how (or if it’s possible) to automate discovery (let alone fixing) of all possible types of bad Marc8 encoding output.
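The one automated check I could come up with for this case is a sketch like the following (my own code, targeting only the specific malformation I actually hit — two ESC bytes in a row — not bad Marc8 in general):

```python
def find_double_escapes(data: bytes):
    """Yield every offset in a raw MARC8-encoded export where two
    consecutive ESC bytes (hex 1B) appear -- never legal in MARC-8,
    whose escape sequences are always ESC followed by non-ESC bytes."""
    pos = data.find(b"\x1b\x1b")
    while pos != -1:
        yield pos
        pos = data.find(b"\x1b\x1b", pos + 1)
```

This obviously won’t catch every possible kind of bad Marc8 output, just this one.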
A note on MARC control character terminology
One confusing thing in dealing with this stuff that took me a while to figure out is how MARC uses its own special weird names for certain control characters.
MARC has a “Field Terminator” (which is sometimes called ‘field separator’ in marc docs instead of ‘terminator’) and a “Record Terminator” (also sometimes called ‘record separator’ in docs instead of ‘terminator’).
But the ascii values used for these special MARC control codes already had names in ascii, and they are confusingly similar but not the same names! This certainly leads to confusion.
Marc “Field Terminator” == Hex 1E == Decimal 30 == Ascii “Record Separator” == “control-^” or “^^”, which is how stock vim will display it.
Marc “Record Terminator” == Hex 1D == Decimal 29 == Ascii “Group Separator” == “control-]” or “^]” which is how stock vim will display it.
(Correction 3 Feb, this next is also part of the marc standard):
Marc “Subfield Delimiter” == Hex 1F == Decimal 31 == Ascii “Unit Separator” == “control-_” or “^_” which is how it will show up in vim.
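The same cheat sheet as Python constants, in case it’s handy to paste into a script (the constant names are my own, not from any spec):

```python
# MARC control bytes and their (confusingly different) ascii names:
MARC_FIELD_TERMINATOR = 0x1E    # ascii "Record Separator" (RS); vim shows ^^
MARC_RECORD_TERMINATOR = 0x1D   # ascii "Group Separator" (GS); vim shows ^]
MARC_SUBFIELD_DELIMITER = 0x1F  # ascii "Unit Separator" (US); vim shows ^_
```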
Also an update 3 Feb 2010: I made this little sign and now keep it on my wall next to my desk, so I can refer to it, ha.