spec for a better ILS marc exporter

Here is a draft (in progress) spec for a better Marc exporter from Horizon. We’ve had some problems with the out-of-the-box SirsiDynix “marcout” functionality, and are considering either writing a better marc exporter in-house or contracting out the work. Since Horizon keeps all its data in an ordinary rdbms in fairly well-normalized form, writing a custom marc exporter is quite feasible (although still potentially a significant amount of grunt work).

While we haven’t finalized our spec or decided how or if we’re going to move forward with it, I thought it might be useful to others to see what problems you can have with a vendor marc exporter, and what my experience shows me I’d want in an ‘ideal’ marc exporter.

Selected issues with vendor/current solution

  • It runs only on Windows. This will complicate efforts to provide routine automation for Horizon/Solr-Blacklight sync.
  • Marc records to export must be specified in a database table; there is no facility for ‘export all’, ‘export all public’, or ‘export these specific bibs’.
  • It does not allow us sufficient control over copy/item holdings information attached to the export. We can include all items, but not a selection of items (for instance, only ‘public’ ones), and cannot include copies. Sometimes information we need at indexing-time (call number, location) is contained only in copy-level information.
  • There are certain bibs the exporter crashes on while trying to include their information; these need to be manually excluded.
  • Marc21 has a maximum record length. Some of our records, with attached item information, are longer than this limit. The SD exporter in some of these cases produces corrupt records that cannot be read by software – in other cases they can be read only via strange heuristic workarounds.
    • Exporting in either Marc-XML or Marc-in-Json would potentially work around the Marc21 maximum record size, but the SD tool does not have these capabilities.

Requirements for a better Horizon Marc exporter

This work could be completed in-house, or could be contracted out to Alpha-G (or someone else?). These requirements represent a ‘best case’, to get a flexible exporter that will meet current and future needs, regardless of our choices about formats, index-time vs. display-time lookup, etc. Some of those decisions are difficult to make in advance without being able to test out different options’ performance in production. However, if these requirements end up too expensive, they could be narrowed down based on pre-selected specific choices for use.

  • Execution
    • Should be runnable in batch mode on a standard unix OS. (Java is fine.)
    • Should be invokable with “export all bibs”, “export all not-staff-only bibs”, or “export specific list of bibs”. The specific list of bibs can come from a database table (SD-export-style), a flat file argument list of bib#s, or bib#s supplied on the command line. (A rough sketch of what such a command-line interface could look like follows this requirements list.)
    • Should optionally output to stdout instead of a file (for piping directly to indexer stdin, to skip the performance penalty of a disk bottleneck).
    • An error log file should be provided, listing the error conditions encountered (as described below) and the bib# of the record in error where available.
    • Should be no slower than SD ‘marcout’, and ideally a bit faster.
  • Copy/Item-level information
    • Both copy and item level information should be includable in the Marc in custom 9xx tags of choice (for instance, copy in 991, item in 992; exact fields specifiable in config or arguments, not hard-coded).
    • Any data attached to a Horizon copy should be configurably included in a Copy Marc Field, in separate addressable subfields, with subfields repeating if necessary for repeated values. Exact subfields used for each data element should be configurable. (A rough sketch of this mapping follows the requirements list.) Data elements include:
      • copyID
      • location code
      • user-displayable location name
      • collection code
      • user-displayable collection name
      • call number (copy-level)
      • call number type code (copy-level)
      • copy statement
      • note (copy level, repeatable)
      • staffOnly boolean
      • Run Statements (main, supplement, index): unsure how to encode these in a marc bib, but it would be nice to include them somehow, with both the Run Statement and its Note(s). I can think of some hacky ways to do this; this needs to be fleshed out, or stripped from the requirements.
    • Similarly, any data attached to a Horizon item row should also be included in separate addressable subfields (repeatable, exact subfields configurable), including:
      • itemID
      • copyID (if applicable)
      • location code
      • user-displayable location name
      • collection code
      • user-displayable collection name
      • staff-only boolean
      • call-number (item-level)
      • call-number type (item-level)
      • copy statement
      • note(s) (repeatable, item level)
    • Should be able to specify at run-time the breadth of item/copy info desired for inclusion:
      • All items and all copies
      • Only copies/items not marked staff-only. (This would mean a copy would be included only if not marked staff-only; an item would be included only if not marked staff-only AND it does not belong to a staff-only copy.)
      • Special “one level holdings” output – include copies, include items only if they are “top level” and do NOT belong to a copy. Should be combinable with “not staff only” mode, to impose both restrictions.
  • Tolerance of bad Horizon data integrity
    • Certain kinds of data integrity seem not to be enforced in Horizon, and can cause other exporters to crash when encountered. The exporter should not crash; it should either skip the record (with an error log of records skipped) or compensate and still output valid marc, when encountering:
      • Item records with invalid/non-existent copyIDs.
      • dbo.bib rows for partial records: missing tag=’000’ row, a “bib” that has ONLY a 999, dbo.bib rows with null ‘text’ columns, etc.
      • dbo.bib or dbo.bib_longtext rows with null ‘tag’ column.
  • Marc8 encoding issues
    • (Optional, but would be nice) Ability to translate a Marc8 encoded db to UTF-8 on output (marcxml, json, marc21; with proper leader byte set for UTF-8 if that is what’s being produced).
    • Illegal Marc8 encoding in the db, like two ‘ESC’ characters in a row or an unterminated ‘ESC’ block, should be fixed on output using a reasonable guess, or the record should be skipped with an error message.
  • Bad data in Horizon cleanup on output: All formats
    • Leader/directory
      • Even if the record is too long, leader and directory bytes should be at their proper byte offsets. (I.e., do not let leader bytes 0-4 expand into byte 5 and ruin the byte offsets of the other leader bytes; same for the directory.)
      • Leader byte 09 (“Character coding scheme”) should be the appropriate value for the actual data in Horizon – blank (‘#’) if the Horizon data is Marc8, ‘a’ if it is UTF-8 – even if the leader data in the Horizon rdbms is incorrect/corrupt.
    • Marc control characters: the control character 0x1E/30, the Marc “Field Terminator”, can sometimes appear in the middle of a dbo.bib.text or dbo.bib_longtext.text column. If passed through to output untouched, it results in illegal Marc, so it should be stripped before output. A 0x1E Field Terminator appearing at the END of a column should also be stripped from Json/XML output – and in Marc21, every field should end with one (not two) 0x1E bytes, regardless of whether the db row ends in one. If the Marc Record Terminator (0x1D/29) appears in a db row, it should always be stripped before output as well. These control characters should likewise be stripped from item and copy information before inclusion in the Marc record in any format. (The cleanup sketch after this list illustrates the intent.)
  • Bad data cleanup: Marc21 specific (if we decide we’re not going to use Marc21, but instead use MarcXML or Marc-in-json exclusively, this could be omitted).
    • Leader
      • leader bytes 10 and 11 should be fixed to the marc21-required ’22’, even if wrong in the Horizon dbo.bib tag=’000’ row.
      • leader bytes 20-22 should be fixed to the marc21-required ’450’, even if wrong in the Horizon db.
    • Too-long records.
      • A record (with item/copy info) that is too long for marc21, because it exceeds the number of digits provided for lengths/offsets in the leader/directory, should be output with ‘99999’/’9999’ or ‘00000’/’0000’ in the relevant length/offset areas of the Marc21, with all other byte offsets remaining correct and legal.
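
To make the “Execution” requirements concrete, here is a minimal sketch of what the exporter’s command-line interface could look like. Nothing below exists yet – the program name and every flag are hypothetical placeholders for whatever an implementation would settle on.

    import argparse
    import sys

    def parse_args(argv):
        # Hypothetical CLI for the exporter spec'd above; all names are illustrative.
        p = argparse.ArgumentParser(prog="horizon-marc-export")
        scope = p.add_mutually_exclusive_group(required=True)
        scope.add_argument("--all", action="store_true", help="export all bibs")
        scope.add_argument("--all-public", action="store_true", help="export all not-staff-only bibs")
        scope.add_argument("--bibs", nargs="+", type=int, help="specific bib#s on the command line")
        scope.add_argument("--bib-file", help="flat file with one bib# per line")
        scope.add_argument("--bib-table", help="db table holding bib#s, SD-export-style")
        p.add_argument("--format", choices=["marc21", "marcxml", "json"], default="marc21")
        p.add_argument("--holdings", choices=["all", "public", "one-level", "one-level-public"], default="all")
        p.add_argument("--copy-tag", default="991", help="9xx tag for copy-level data")
        p.add_argument("--item-tag", default="992", help="9xx tag for item-level data")
        p.add_argument("--out", default="-", help="output file; '-' means stdout, for piping to an indexer")
        p.add_argument("--error-log", default="export-errors.log")
        return p.parse_args(argv)

    if __name__ == "__main__":
        args = parse_args(sys.argv[1:])
        out = sys.stdout.buffer if args.out == "-" else open(args.out, "wb")
        # ... query Horizon, build records, serialize to `out`, log problems to args.error_log ...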
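
For the copy/item-level requirements, here is a rough sketch (plain Python dicts and tuples, deliberately not tied to any particular MARC library) of how a configurable subfield map and the holdings-breadth filtering might work. The subfield codes, dict keys, and the ‘991’ default are all invented for the example – they are not Horizon column names or settled choices.

    # Hypothetical copy-level mapping: data element -> subfield code.
    # In a real exporter both the 9xx tag and this map would come from configuration.
    COPY_SUBFIELDS = {
        "copy_id":          "a",
        "location_code":    "b",
        "location_name":    "c",
        "collection_code":  "d",
        "collection_name":  "e",
        "call_number":      "h",
        "call_number_type": "i",
        "copy_statement":   "s",
        "staff_only":       "x",
        "note":             "z",   # repeatable: one subfield per note
    }

    def copy_to_field(copy, tag="991", subfield_map=COPY_SUBFIELDS):
        """Turn one copy row (a plain dict here) into (tag, indicators, subfields).
        List values (like notes) become repeated subfields."""
        subfields = []
        for key, code in subfield_map.items():
            value = copy.get(key)
            if value is None:
                continue
            for v in (value if isinstance(value, list) else [value]):
                subfields.append((code, str(v)))
        return (tag, (" ", " "), subfields)

    def include_holding(row, is_item, public_only=False, one_level=False, copies_by_id=None):
        """Holdings-breadth filter: public_only drops staff-only copies/items (and items
        under staff-only copies); one_level drops items that belong to a copy.
        The two restrictions can be combined, per the spec."""
        if one_level and is_item and row.get("copy_id"):
            return False
        if public_only:
            if row.get("staff_only"):
                return False
            if is_item and row.get("copy_id") and copies_by_id:
                parent = copies_by_id.get(row["copy_id"])
                if parent is not None and parent.get("staff_only"):
                    return False
        return True

    # Illustrative data only:
    example_copy = {"copy_id": 12345, "location_code": "main", "location_name": "Main Library",
                    "call_number": "QR 234.45 F245", "staff_only": False, "note": ["v.1 missing"]}
    print(copy_to_field(example_copy))

An analogous map and field builder would cover the item-level elements (itemID, copyID, locations, call number, notes, etc.) in the item tag (992 in the example).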
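
Finally, the byte-level cleanup rules (stray field/record terminators, and keeping leader/directory offsets aligned for over-long records) amount to only a few lines each. Again, this is only a sketch of the intent, not a finished Marc21 serializer.

    FIELD_TERMINATOR  = b"\x1e"   # decimal 30: ends every variable field in Marc21
    RECORD_TERMINATOR = b"\x1d"   # decimal 29: ends the whole record

    def clean_field_text(raw: bytes) -> bytes:
        """Strip stray terminators found inside dbo.bib / dbo.bib_longtext text columns;
        the serializer appends exactly one field terminator per field itself."""
        return raw.replace(FIELD_TERMINATOR, b"").replace(RECORD_TERMINATOR, b"")

    def fixed_width_number(n: int, width: int) -> bytes:
        """Render a leader/directory length or offset in exactly `width` digits.
        If the value will not fit, emit all 9s rather than letting the number spill
        over and shift every later byte position in the leader or directory."""
        return (str(n).zfill(width) if n < 10 ** width else "9" * width).encode("ascii")

    # Record length (leader bytes 0-4) is 5 digits; a directory field length is 4:
    assert fixed_width_number(1234, 5)   == b"01234"
    assert fixed_width_number(123456, 5) == b"99999"   # too long, but later offsets stay aligned
    assert fixed_width_number(12345, 4)  == b"9999"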

Contractor sourcing

Things to consider if we purchase this code from a third party:

  • Can we/do we want to purchase the source, so we can make changes or bug fixes in the future using internal resources if necessary?
  • Are there other institutions who may be interested in this functionality? Can we split the price with another institution (if the vendor will sell licensing like that)?
  • Even if we purchase it by ourselves, can the contractor agree to a future ‘commodity’ price for the already-written code funded by us, so other Horizon users and partners can then purchase the code at a lower price than the original development cost? Giving back to the community.

10 Responses to spec for a better ILS marc exporter

  1. Run Statements (main, supplement, index): unsure how to encode these in a marc bib, but it would be nice to include them somehow, with both the Run Statement and its Note(s). I can think of some hacky ways to do this; this needs to be fleshed out, or stripped from the requirements.

    Not sure if I’m grokking the Horizonism correctly, but if a “Run Statement” is a summary serials holdings statement, the canonical way in MARC21 to do this is to use tag 866 for the main statements, 867 for the supplement statements, and 868 for the index statements. The statement itself would go into subfield $a and the note in $x (if it’s a staff note) or $z (if it’s a public note).

  2. I agree with Galen. Since coding is arbitrary, I recommend using MARC Format for Holdings Data for both serials summary statements and item data that is defined by that format.

    I do not see any benefit to the proposal for too long records because your specification generates a corrupt record. Moreover, with an incorrect length, it may not always be possible to meet the requirement that “all other byte offsets remaining correct and legal,” especially since UTF-8 could cause invalid field lengths.

    For too long records, my gut reaction is to have a MARCXML out option so you can deal with the problem based on the individual circumstances. This solution will be painfully slow, but at least it will work. My own experience with working with records with too many items attached is that the easiest thing to do is just duplicate the bibs and spread the items over the records. That will make the programming a bit strange.

    Just as an observation, outputting item level call numbers is essential. However, if no item level call number is present and it is inheriting it from the bib, I recommend that the bib level call number be output in the item call number field. Otherwise, migrations and analysis get much harder, because many libraries retain multiple call number fields in the bib and there’s often no programmatic way to determine which is used.

  3. Galen: But that’s in an MFHD, right? I’m trying to pack it into a Marc Bib.

    I am packing _multiple_ “holdings” into a single Marc Bib. That’s fine until we run into hierarchical information like run statements and notes.

    Each “holding” gets a single 99x in my marc bib. And has a subfield for location, call number, etc. But each holding may have from 0 to _many_ “run statements”, and each “run statement” can have, if I understand it right, from 0 to many “notes”.

    holding
      -> run statement
           -> note
           -> note
      -> run statement
           -> note

    I can’t figure out how to pack that into a marc bib (NOT an mfhd). If you have any ideas, please do share!

    Kyle: The way it works in our system is that bib level call numbers are (currently) completely ignored. They aren’t used at all. We _always_ have “holding” level call numbers. But Horizon has two levels of holdings, which it calls “copy” and “item”. Some bibs have only “items”. Some bibs have “copies” which themselves include “items”. The call numbers on “copies” and “items” are both relevant — except if it is a bib with “copies”, then _only_ the “copy” call number is used, not the (subservient) “item” call number.

    Confused enough yet?

    As far as too-long marc: I discovered an amazing thing. If you have too-long marc, as long as your exporter does not upset the _byte positions_ in the leader/directory, there is still enough semantic info there to read the marc record, and most language marc libraries CAN read it. If the length is over 99999, then if you just make the first five bytes “99999” or “00000”, then ruby-marc can still read it, pymarc can still read it, and subsequent to a set of patches me and Bob made to Marc4J a couple weeks ago, Marc4J can still read it.

    So that’s the benefit to my “too long record” stuff — it _actually works_.

    I don’t follow what you mean about “especially since UTF-8 could cause invalid field lengths” though. Can you expand on that? A Marc record _can_ legally be UTF-8, we know that, right. So what’s the problem with field lengths and UTF-8 Marc records? I haven’t dealt with UTF8 marc yet, my ILS puts out solely MARC8, but as you can see I’m contemplating having the exporter translate to UTF8 on the way out. What should I be worried about here regarding leader and directory?

  4. The 85x and 86x fields are not restricted to MFHDs – it’s perfectly legal to embed them in bibliographic records.

  5. Let’s say you have the “Journal of Feline Psychology”, with a run of main holdings. Assume that the journal is classified and that a couple volumes are barcoded and thus have item records in Horizon. Using the 85x and 86x fields, the summary holdings and item information could be represented along these lines; note that the subfield $8 links related fields:

    245 .. $aJournal of feline psychology
    852 0| $81 $aJHU$bOdd Serials Department $hQR 234.45 F245 $zSome copies damaged by mice.
    866 || $81 $av. 1 (1901) - v. 99 (1999)
    876 || $8.1 $p 31234000000123 $3 v.45 (1945)
    876 || $8.2 $p 31234000000456 $3 v.57 (1957)
    

    Because the 876 tag doesn’t contain equivalents for all of the fields in the Horizon item record, including the item-level call number, you may prefer to define a 9xx field instead and use the subfield $8 to link the item field to the 852 and 866 fields representing the applicable serial run. Note that for regular monographic holdings, the 866/7/8 fields would generally not be applicable.

  6. I don’t follow what you mean about “especially since UTF-8 could cause invalid field lengths” though. Can you expand on that? A Marc record _can_ legally be UTF-8, we know that, right. So what’s the problem with field lengths and UTF-8 Marc records?

    Currently, your data is in an RDBMS that probably doesn’t care about field lengths the same way that the MARC format does. In MARC, the maximum field length is 9999 bytes. This is not the same as characters because one character in UTF-8 can take one, two, three, or four bytes. In plain English, this means that an extended summary (e.g. 505, 520, or other field that is often lengthy) could theoretically only need to be 2500 characters long to bust the limit.

    This will occur only rarely, but when it does, it will be a PITA. The general effect of UTF-8 on MARC is to shorten both the maximum length of the record as well as the fields while complicating parsing because the parser must know how many bytes each character takes.

  7. jrochkind says:

    Kyle, I get it thanks. Actually, my current rdbms doesn’t care about field lengths AT ALL, it’s perfectly happy to contain records and fields way too long for marc binary (which is probably a good thing, since marc binary isn’t the only possible output format!).

    The field lengths will just need to be set correctly on output, which shouldn’t be a problem if the outputter (what I’m spec’ing here) knows that byte counts aren’t the same as char counts in UTF-8 – not too hard to deal with. (There’s a tiny illustration of this at the end of the comments.)

    Again, I was quite pleasantly surprised that ‘too long’ Marc records (both in terms of total record length and field lengths) _can_ be read by a variety of language marc libraries, so long as the leader and directory have byte positions right – even if the total length, offsets, and field lengths are totally wrong, it can still be parsed using the field and record delimiters, and ruby-marc and pymarc both WILL parse it, in my limited tests. And so will Marc4J now.

  8. What should I be worried about here regarding leader and directory?

    Just reading over what I’ve written, I see I’m making things clear as mud.

    In a completely incoherent way, I was trying to say that the leader and directory entries depend on the exact characters needed in the MARC record.

    If you break or build a MARC record the old fashioned way, all you need to do is know the number of characters, and the calculations are straightforward. But when UTF-8 is involved, the number of bytes per character varies, so the directory entries have to be based on the exact number of bytes actually used.

    To complicate matters, your MARC data will contain sequences such as {grave}A, {tilde}n, etc (certain foreign characters) and the like where the preceding diacritic looks like the beginning of a 2 byte utf-8 character, but in reality is a one byte diacritic which must be combined with the following character to display a certain letter.

    This means a single subfield can contain single and multibyte UTF-8 characters as well as single byte high ASCII characters (which look like the first byte of a 2 byte UTF-8 character) preceding other characters which need to be combined to form a single letter. As crazy as this all sounds, this is common. Naturally, there is no reliable way to discern what is what other than through heuristics.

    The guys that write the MARC parsers know this stuff far better than me, but my concern would be that even good programmers could get thrown for a loop if they’re not in tune with the type of data you’ll be dealing with – you want MARC geeks who have a good grip on the byte twiddling.

  9. jrochkind says:

    I think I understand you. But counting bytes doesn’t seem like a difficult thing to me? You take the string buffer that’s going to become part of the marc record, and you count the bytes in it. Counting bytes is not a hard thing to do in any language I know of, as long as you remember that it’s bytes you want to be counting, not characters. But maybe I’m wrong and will discover it’s trickier than I think if I try.

    You’re right though about Marc8 being a pain with its combining diacritics. Not in terms of byte counts at all, but just in terms of conversion to UTF-8. There _are_ combining diacritics in UTF-8 too (or any unicode encoding for that matter), as an alternative way to represent certain characters with diacritics (and as the only way to represent certain characters that don’t have their own unicode codepoints).

    Marc4J has some code in it to convert Marc8 to UTF8. I _believe_ that it converts Marc8 diacritics to UTF-8 combining diacritics — which would make sense as the easiest way to do the conversion, since it’s analogous to how it’s represented in Marc8 in the first place. But I believe these UTF-8 combining diacritics are the explanation for why certain characters are displaying really _oddly_ in my Rails app, with the diacritics not matching up quite properly with the characters they’re supposed to be over (or under). Either Marc4j isn’t doing the Marc->UTF8 translation quite right, or Firefox has trouble displaying unicode with combining diacritics sometimes, is my guess. But I haven’t had the chance to investigate exactly what’s going on.

  10. jrochkind says:

    Ah wait, NOW I think I get you though.

    This means a single subfield can contain single and multibyte UTF-8 characters as well as single byte high ASCII characters (which look like the first byte of a 2 byte UTF-8 character) preceding other characters which need to be combined to form a single letter. As crazy as this all sounds, this is common. Naturally, there is no reliable way to discern what is what other than through heuristics.

    You’re talking about having a record that combines Marc8 encoding and UTF-8 encoding in the very same record, indeed in the very same subfield? I have no doubt this is common ‘in the wild’, but it’s completely illegal, and you’re absolutely right that there’s no way to know what the bytes are ACTUALLY supposed to represent except through heuristics.

    I am not trying to deal with that. If the record claims to be Marc8, and my software treats it as Marc8, I consider that software to work properly. If some chars don’t display ‘as intended’ because they were entirely illegally and incorrectly encoded as UTF-8 inside that record that claims to be Marc8, they will not be treated properly by my software, and this is to spec. The records have to be fixed (probably manually) at the source.

    This does not seem to be a _huge_ problem in our actual corpus (although I have no doubt it exists in our records), perhaps because our corpus has always been Marc8-only. Sometimes UTF-8 encoded chars do get in there anyway, but my software will not handle them correctly (and I don’t try to make it do so), and my local cataloging department agrees that those are errors which should be fixed at the source, and not something software should be expected to handle.
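
To illustrate the byte-versus-character point from the exchange above: directory and leader lengths have to be computed from the encoded bytes, not the character count. A throwaway example (the string itself is arbitrary):

    s = "Ψυχολογία της γάτας"        # any non-ASCII field content will do
    print(len(s))                     # 19 characters
    print(len(s.encode("utf-8")))     # 36 bytes -- this is the number the leader/directory needs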
