structural marc problems you may encounter

This is not your typical ‘why MARC must die’ post. It’s instead about very low-level structural problems in the Marc21 binary files my ILS outputs: not the semantics of MARC at all, just the structural features of the Marc21 format.

I never had to know much about low-level Marc21 format details before, and wish I still didn’t, but I had to learn because my ILS (Horizon) outputs certain bibs as MARC that the Marc4J Java library used by SolrMarc refuses to read, claiming they are structurally invalid in various ways. (I never would have figured this stuff out without the invaluable help of sesuncedu, robcaSSon, and others in #code4Lib.)

But this may help someone else figure out why Marc4J can’t read their MARC.

1. Invalid leader bytes

In the leader of a Marc21 record, byte 10 is always ascii ‘2’, byte 11 is always ‘2’ as well, and bytes 20-23 are always ‘4500’.  At least they’re supposed to be.  Theoretically these bytes allow a record to specify details about the nature of its binary format, but these details are fixed in Marc21 and in every other Marc variant I know of; it’s a flexibility that was rarely if ever taken advantage of in any Marc format.

However, Horizon actually stores most of its leader bytes in a db column.  And if the leader bytes in that db column are something other than these invariants, Horizon’s marc export will include them anyway, even if they are invalid, even if they do _not_ accurately describe the Marc record they are attached to (and if they did accurately describe it, it wouldn’t be a valid Marc21 record).

Since these values are invariant in Marc21 and most (if not all) other Marc formats, most Marc parsers simply ignore them.

However, Marc4J doesn’t; it actually treats them as gospel.  So if those bytes are wrong, Marc4J will try to read the record improperly. And if they aren’t ascii decimal digits at all, Marc4J will claim it can’t read the leader.

So I just had to fix those in our production ILS.  And figure out where they’re coming from, and try to stop them from coming in again? Really, I blame our ILS here for even allowing such completely wrong bytes into its internal db.

(A perhaps better solution from the other end is fixing the Marc4J PermissiveReader to not pay attention to those bad bytes, assuming the invariant values. sesuncedu has prepared a patch doing some of that for Marc4J, hopefully it’ll get in there.)
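(For illustration, here’s a minimal sketch of that workaround idea in Java. This is hypothetical code, not sesuncedu’s actual patch: walk the raw file record by record and force the invariant leader bytes before handing the data to a strict parser. Class and file names are mine.)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: force the Marc21 leader invariants in a raw .mrc file before
// parsing. Assumes records are delimited by the Record Terminator (hex 1D).
public class FixLeaderInvariants {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int recStart = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0x1D) { // Record Terminator: end of one record
                if (i - recStart >= 24) { // leader is the first 24 bytes
                    data[recStart + 10] = (byte) '2'; // indicator count
                    data[recStart + 11] = (byte) '2'; // subfield code length
                    data[recStart + 20] = (byte) '4'; // entry map: '4500'
                    data[recStart + 21] = (byte) '5';
                    data[recStart + 22] = (byte) '0';
                    data[recStart + 23] = (byte) '0';
                }
                recStart = i + 1;
            }
        }
        Files.write(Paths.get(args[0] + ".fixed"), data);
    }
}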

2. Bib Records Too Long For Marc

Because of the nature of MARC21’s ‘directory’ structure, there is a maximum length a MARC record can be: 99999 bytes, since lengths and offsets are stored as five ascii decimal digits. If a record is longer than that, the MARC directory doesn’t have enough bytes in it to describe where the fields beyond that length sit in the record, and the MARC record is unreadable. (Incidentally, it’s very odd that MARC records internal byte offsets as ascii decimal chars rather than ordinary binary data.  If it used a more typical simple binary encoding of integers for byte offsets, the maximum length of a MARC record would be quite a bit larger.  But it doesn’t. Oh well.)
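(To make the arithmetic concrete, a sketch in Java of how those ascii-decimal lengths work; nothing here is Marc4J code:)

import java.nio.charset.StandardCharsets;

// Sketch of the arithmetic behind the Marc21 size limit: the record length
// and all directory offsets are fixed-width ascii decimal digits.
public class MarcLengthLimit {
    // Leader bytes 0-4 hold the record length as five ascii digits,
    // e.g. "03462" means a 3462-byte record.
    static int recordLength(byte[] record) {
        return Integer.parseInt(
            new String(record, 0, 5, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        // Each 12-byte directory entry = 3-char tag + 4-digit field length
        // + 5-digit starting offset, so no field can start more than
        // 99999 bytes past the data base address.
        System.out.println("max marc record length = 99999 bytes");
        // A plain 4-byte binary integer could address about 4GB instead.
    }
}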

So what does Horizon marc export do when it has a record with too much data, one that will go over the maximum record length in MARC?  It outputs it anyway. But the marc record it outputs is seriously messed up: it’s got a MARC directory whose length may be entirely illegal (not a multiple of 12), a wrong ‘length’ in leader bytes 0-4, and possibly other problems.  Depending on the individual record and exactly how Horizon ended up outputting it, Marc4J might just skip it as a bad record and go on. That’s the best that could be expected.

However, more often, Marc4J gets entirely confused by the bad length in leader bytes 0-4, and doesn’t understand where the subsequent record in the marc file actually begins.  So every record after the too-long one in the marc file is a loss to Marc4J/SolrMarc indexing. Either every subsequent record can’t be indexed at all, or even worse, every subsequent record is indexed, but completely wrong, because Marc4J/SolrMarc got the wrong data.

I need to work out a patch for the Marc4J PermissiveReader so that when it encounters such a record, Marc4J can at least recover by properly finding the beginning of the NEXT record, using the Marc Record Terminator character.
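(The recovery idea, sketched in Java; this is hypothetical code, not the actual PermissiveReader patch: when the stated length can’t be trusted, ignore it and scan forward for the Record Terminator byte to find where the next record starts.)

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Sketch: resynchronize a marc stream after a corrupt record by scanning
// for the Record Terminator (hex 1D) instead of trusting leader bytes 0-4.
public class MarcResync {
    static final int RECORD_TERMINATOR = 0x1D;

    // Consume bytes up to and including the next Record Terminator,
    // leaving the stream positioned at the start of the next record.
    static void skipToNextRecord(InputStream in) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            if (b == RECORD_TERMINATOR) {
                return;
            }
        }
        throw new EOFException("no Record Terminator before end of file");
    }
}

(Of course this only works if the bad record doesn’t itself contain stray Record Terminators in its data, which, per problem 4 below, is not guaranteed.)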

3. Blank/null tags

This one might be Horizon-specific. Horizon allows the operator to accidentally add a tag to a record that has a null tag value. Not 100, 245, or something else, but just null.  This accident could have happened manually, or via some automated import script when we batch loaded records into Horizon.  When the Horizon marc exporter encounters such a record, it outputs marc21 for it anyway, but completely invalid marc21.

I blame Horizon for this; it ought not to allow null tag values to even exist in the db, and if they do, they ought to be ignored on export, not turned into an invalid marc record.

This is another problem that often leaves Marc4J completely confused about where one record ends and the next starts, making the entire rest of the Marc file after such a record unreadable.  Probably because of a bad length value in leader bytes 0-4, so perhaps if I can work out the patch mentioned above, it will at least let Marc4J successfully skip such a record and go on to the rest of the file.
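(In the meantime, here’s a sketch in Java of how you might at least detect such a record in an exported file. It assumes the usual layout of a 24-byte leader followed by 12-byte directory entries up to a Field Terminator, and flags any directory tag that isn’t three alphanumeric characters. Names are mine, not from any library.)

import java.nio.charset.StandardCharsets;

// Sketch: scan a single raw marc record's directory for blank/null tags.
// Assumes the usual layout: 24-byte leader, then 12-byte directory entries
// terminated by a Field Terminator (hex 1E).
public class NullTagCheck {
    static final int FIELD_TERMINATOR = 0x1E;

    static boolean hasBadTag(byte[] record) {
        int pos = 24; // directory starts right after the leader
        while (pos + 12 <= record.length && record[pos] != FIELD_TERMINATOR) {
            String tag = new String(record, pos, 3, StandardCharsets.US_ASCII);
            for (char c : tag.toCharArray()) {
                // legal tags are alphanumeric; nulls or blanks mean trouble
                if (!Character.isLetterOrDigit(c)) {
                    return true;
                }
            }
            pos += 12;
        }
        return false;
    }
}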

4. Illegal chars in Marc values?

This one I haven’t completely gotten to the bottom of yet, because I made the mistake of fixing the couple examples I found in the Horizon Staff Client, which didn’t really show me exactly what was going on.

But I think some Marc control characters (Field Terminator or Record Terminator) wound up in some of my record values in the db (no doubt as the result of an import gone wrong at some point in the past).  The Horizon marc exporter simply included them unescaped in its marc output, resulting in marc control characters in illegal places, or in places where they don’t mean what a parser will take them to mean. This also messed up Marc4J something awful.

Again, I kind of blame Horizon here, for allowing bad data into its internal store, and then for writing that bad data out in marc export.
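(One cheap structural check, sketched in Java under the assumption that the directory itself is intact: in a well-formed record the Field Terminator count equals the number of directory entries plus one, because one FT ends the directory and one ends each field. Extra FTs mean a control character leaked into a field value.)

// Sketch: sanity-check a raw marc record by counting Field Terminators.
// In a clean record: one FT ends the directory, one FT ends each field,
// so the FT count should equal (number of directory entries) + 1.
public class StrayTerminatorCheck {
    static final int FIELD_TERMINATOR = 0x1E;

    static boolean hasStrayFieldTerminators(byte[] record) {
        int dirEntries = 0;
        int pos = 24; // directory starts right after the leader
        while (pos + 12 <= record.length && record[pos] != FIELD_TERMINATOR) {
            dirEntries++;
            pos += 12;
        }
        int ftCount = 0;
        for (byte b : record) {
            if (b == FIELD_TERMINATOR) ftCount++;
        }
        return ftCount != dirEntries + 1;
    }
}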

NEW! 4 Feb 2010:

Marc control character in internal data value.

I’ll describe this one in Horizon-specific terminology, because it’s clearer.

The Horizon “bib” table holds an individual marc field in the ‘text’ column.  Every ‘text’ value ENDS in the Marc Field Terminator character (decimal 30, hex 1E, sometimes displayed as “^^”).

However, some of our values have that Marc Field Terminator character _not_ as the last character, but internally.  This creates problems in marc export: the marc created by marcout is invalid, unparseable marc, as it includes the Field Terminator control character in an illegal position.

This problem is not visible in the Horizon Staff Client; the control character is not shown. But it’s hiding there in the database anyway.  If you open an individual record in the Horizon Staff Client and then simply re-save it, that SEEMS to fix the problem in at least some cases (not sure about all), but it probably makes more sense to fix it in bulk through an automated process anyway.

As a technical note:  I used this SQL against our hzdev db to find the number of bibs that contain char(30) as some char OTHER than the last in dbo.bib.  It takes quite a while to run. This would have to be re-done for dbo.bib_longtext.longtext, another table where data destined for marc export can hide. You could base an automated fix off of this SQL technique.

select count(distinct bib#) from dbo.bib where (charindex(char(30), text) != char_length(text))

Correction:  that SQL will also find values that do not end in the FT at all.  While Horizon ordinarily ends every value with the FT, so a missing one is sort of an error, it doesn’t cause any problems. Here’s one to find only values with an internal FT after all, not including ones with no FT whatsoever:

select count(distinct bib#) from dbo.bib where (charindex(char(30), text) not in (0, char_length(text)))
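(And a sketch of an automated fix built on that technique, as hypothetical JDBC code. The connection URL is a placeholder, and I’m assuming a tagord column as part of the bib row key; check your actual schema, and try it against hzdev before production.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: bulk-fix internal Field Terminators in dbo.bib.text over JDBC.
// Assumes a (bib#, tagord) row key, which may not match your schema.
public class FixInternalFieldTerminators {
    public static void main(String[] args) throws Exception {
        String ft = String.valueOf((char) 30); // Marc Field Terminator
        try (Connection conn = DriverManager.getConnection(args[0]);
             Statement select = conn.createStatement();
             PreparedStatement update = conn.prepareStatement(
                 "update dbo.bib set text = ? where bib# = ? and tagord = ?")) {
            ResultSet rs = select.executeQuery(
                "select bib#, tagord, text from dbo.bib "
                + "where charindex(char(30), text) not in (0, char_length(text))");
            while (rs.next()) {
                // strip every FT out of the value, then restore exactly one
                // at the very end, where Horizon expects it
                String fixed = rs.getString("text").replace(ft, "") + ft;
                update.setString(1, fixed);
                update.setInt(2, rs.getInt("bib#"));
                update.setInt(3, rs.getInt("tagord"));
                update.executeUpdate();
            }
        }
    }
}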

New! 9 March 2010: 5. Illegal Marc8 encoding

Records in our catalog are not in UTF-8 but in the library-world “Marc8” encoding. However, if there’s weird data in the db, our catalog will happily export marc that actually includes bytes which are not proper Marc8. For instance, in the situation I ran into, two “ESC” characters in a row.  Fortunately, I have not run into very many of these, since I have no idea how (or whether it’s possible) to automate discovery (let alone fixing) of all possible types of bad Marc8 encoding output.
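(For the one concrete case I did run into, two ESC bytes in a row, detection at least is easy to sketch; this is a hypothetical standalone check, emphatically not a general Marc8 validator:)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: flag the one bad-Marc8 pattern described above, two consecutive
// ESC bytes (hex 1B). Not a general Marc8 validator by any stretch.
public class DoubleEscapeCheck {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == 0x1B && data[i + 1] == 0x1B) {
                System.out.println("double ESC at byte offset " + i);
            }
        }
    }
}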

A note on MARC control character terminology

One confusing thing in dealing with this stuff, which took me a while to figure out, is that MARC uses its own special weird names for certain control characters.

MARC has a “Field Terminator” (which is sometimes called ‘field separator’ in marc docs instead of ‘terminator’) and a “Record Terminator” (also sometimes called ‘record separator’ in docs instead of ‘terminator’).

But the ascii values used for these special MARC control codes already had names in ascii, and the ascii names are confusingly similar to the MARC names without being the same!

Marc “Field Terminator” == Hex 1E == Decimal 30 == Ascii “Record Separator” == “control-^” or “^^”, which is how stock vim will display it.

Marc “Record Terminator” == Hex 1D == Decimal 29 == Ascii “Group Separator” == “control-]” or “^]” which is how stock vim will display it.

(Correction 3 Feb, this next is also part of the marc standard):

Marc “Subfield Delimiter” ==  Hex 1F == Decimal 31 ==  Ascii “Unit Separator” == “control-_” or “^_” which is how it will show up in vim.
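(If it helps, the same cheat sheet as Java constants; the names are mine, not from any particular library:)

// The three marc control characters, with their confusing ascii aliases.
public final class MarcControlChars {
    public static final char FIELD_TERMINATOR   = 0x1E; // ascii "Record Separator", ^^ in vim
    public static final char RECORD_TERMINATOR  = 0x1D; // ascii "Group Separator",  ^] in vim
    public static final char SUBFIELD_DELIMITER = 0x1F; // ascii "Unit Separator",   ^_ in vim

    private MarcControlChars() {} // constants only, no instances
}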

Also, update 3 Feb 2010: I made this little sign and now keep it on my wall next to my desk, so I can refer to it, ha.


8 thoughts on “structural marc problems you may encounter”

  1. You are awesome. Thanks for writing this up. I’m working with some Horizon MARC output and I have a feeling this post will save me a lot of pain and frustration.

  2. Just to clarify on Galen’s point, look in the glossary at http://www.loc.gov/marc/specifications/specrecstruc.html. The $, | etc. symbols you see are actually just mapped visualizations of that 1F character.

    This is combined with the data element identifier to make the subfield code. (The data element identifier can pretty much be any character except blank. So it would be legal, at least as far as raw marc is concerned, to have something like $$, where the delimiter only appears to be $, but the data element identifier is actually $.)

    A variation on the above I’ve actually seen in the wild: someone without a lot of experience essentially did a character set conversion using a tool intended for text/xml files. This led to three issues:

    1) the leader/09 indicating the character set was wrong, still being a.
    2) The byte offsets were all off in the marc directory, because some characters got mapped from one byte in Marc-8 to multiple bytes in the converted file; however, none of the leaders of the marc records in that collection were updated by the tool.
    3) Since I suspect the tool mis-identified the source encoding, some of the replacement Unicode characters were wrong.

    I told the person to use yaz, or to convert the file into MARCXML and then convert it back, to see if that works. This highlights why it’s difficult to work with binary MARC compared to some more recent metadata standards: everything relies on a small set of software, or you have to roll your own.

  3. To add another example to Jon’s excellent comment, occasionally I run into binary MARC records that try to use the literal $ or ‡ characters as the subfield delimiter instead of hex 1F.

    To compound the general problem of determining a MARC record’s current character encoding (alas, no, the Leader/09 cannot be trusted), I’ve also run into records that use more than one character encoding, e.g., where parts use MARC-8, others ISO-8859-1, others a random Windows codepage, etc.

  4. I have to edit 46,357 Netlibrary ebook bib records in the 856 field so that they will work with ez-proxy authentication. I have tried exporting a few, editing them using MarcEdit, and then importing them back into Horizon. I keep getting this error message in the data import error log:

    Application  Batch No  Error No.  Date      Time     Error
    -----------  --------  ---------  --------  -------  -----
    Marc Import  11929     1          06/16/10  09:18AM  FilePos: 0. Database Error| MARC record does not end with
    Marc Import  11929     2          06/16/10  09:18AM  FilePos: 1451. MARC record length field non-numeric.

    I thought it might be MarcEdit, so I decided to try editing in notepad, and I got the same message. All I am doing in editing is changing the URL in the 856 field. Why can I not import after doing this?

    I’m assuming from reading your post here that something has happened to my record terminator in these records. How do I fix it?

    Perhaps you all know some other way of editing this field in batch (other than SQL run on the database–we are on SAAS now.) If so, please let me know.

    Thanks so much.
