a note on MARC8 to UTF8 transcoding: Character references

Do you sometimes deal with MARC in the MARC8 character encoding? Do you deal with software that converts from MARC8 to UTF8?

Maybe you’ve sometimes seen weird escape sequences that look like HTML or XML “character references”, like, say, “&#x200F;”.

You, like me, might wonder what the heck that is about. Is it cataloger error, where a cataloger manually entered this by mistake? Is it a software error, where some software accidentally stuck this in at some point in the pipeline?

You can’t, after all, just put HTML/XML character references wherever you want. There’s no reason “&#x200F;” would mean anything other than &, #, x, 2, etc., when embedded in MARC ISO 2709 binary, right?

Wrong, it turns out!

There is actually a standard that says you _can_ embed XML/HTML-style character references in MARC8, for characters that can’t otherwise be represented in MARC8. The MARC specification calls this “Lossless conversion [from unicode] to MARC-8 encoding.”

http://www.loc.gov/marc/specifications/speccharconversion.html#lossless  (thanks to dan scott for finding that part of the marc spec and figuring that out!)

Phew, who knew?!

Software that converts from MARC8 to UTF-8 may or may not properly un-escape these character references, though. For instance, the Marc4J “AnselToUnicode” class, which converts from Marc8 to UTF8 (or other unicode serializations), won’t touch these “lossless conversions” (ie, HTML/XML character references). It leaves them alone in the output, as is.

yaz-marcdump also will NOT un-escape these entities when converting from Marc8 to UTF8.

So the system you then import your UTF8 records into will just display the literal HTML/XML-style character reference. It won’t know to un-escape them either, since those literals in UTF8 really _do_ just mean & followed by a # followed by an x, etc. The sequence only means something special in HTML, or in XML, or, it turns out, in MARC8, as a ‘lossless character conversion’.

So, for instance, in my own Traject software that uses Marc4J to convert from Marc8 to UTF8, I’m going to have to add another pass that converts HTML/XML-style character references to the actual UTF8 characters. Phew.
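Here is a rough sketch of what such an extra pass might look like, in Python rather than in Traject's own code. The function name `unescape_marc8_ncrs` is made up for illustration, and accepting 4 to 6 hex digits is my assumption about what producers emit; check the LoC spec linked above before relying on it. It deliberately handles only the hexadecimal `&#x...;` form, and leaves named entities like `&amp;` alone, since those have no special meaning in MARC8.

```python
import re

# Assumption: lossless-conversion references look like "&#x" + 4-6 hex digits + ";".
NCR_PATTERN = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def unescape_marc8_ncrs(text: str) -> str:
    """Replace hex character references left over after MARC8 -> UTF8 transcoding."""

    def replace(match: re.Match) -> str:
        codepoint = int(match.group(1), 16)
        # Leave anything that is not a valid Unicode scalar value untouched
        # (out of range, or a surrogate) rather than raising or emitting junk.
        if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)
        return chr(codepoint)

    return NCR_PATTERN.sub(replace, text)


if __name__ == "__main__":
    # The reference is replaced by U+200F (right-to-left mark); "&amp;" is kept.
    print(repr(unescape_marc8_ncrs("Title &#x200F;/ Author &amp; Co.")))
```

You would run this on each field value after the MARC8 to UTF8 conversion, not before, so that the replacement characters never have to pass through the MARC8 transcoder.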

So be warned, you may need to add this to your software too.

Some lessons

We can’t get rid of MARC8 soon enough.  Dealing with global alphabets is confusing enough when you are just dealing with unicode. It’s already inherently really confusing, just because it’s actually a very complicated matter to have computers doing the right thing. When you add MARC8 transcodings into it, it becomes a royal mess, with many possible places for things to get messed up.

Library standards are really confusing.  There are so many places to look. Standards, and standard practices that aren’t actually written down in standards, are recorded in so many places, and so often take their own path rather than doing what has become standard best practice in the rest of the computer world outside libraries. I’m not sure how Dan Scott actually tracked down the relevant part of the MARC spec. But meanwhile, in our Horizon db, other unicode chars are ‘losslessly’ stored in MARC as “<U+nnnn>”, where nnnn is a unicode hex codepoint, instead of as “&#xNNNN;”.  I assumed the “<U+nnnn>” thing was a custom proprietary Horizon thing… but another local colleague insists she saw a different standard somewhere else that says to do things that way!
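If your data has the “<U+nnnn>” form in it, the same kind of clean-up pass applies, just with a different pattern. This is only a sketch in Python: the function name `unescape_angle_form` is hypothetical, and I am assuming exactly four hex digits, since that is all I have seen described. I don't know of a spec to check it against.

```python
import re

# Assumption: the Horizon-style escape is "<U+" + exactly 4 hex digits + ">".
ANGLE_PATTERN = re.compile(r"<U\+([0-9A-Fa-f]{4})>")


def unescape_angle_form(text: str) -> str:
    """Replace "<U+nnnn>" escapes with the character they stand for."""

    def replace(match: re.Match) -> str:
        codepoint = int(match.group(1), 16)
        # Surrogate codepoints are not characters; leave those escapes as-is.
        if 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)
        return chr(codepoint)

    return ANGLE_PATTERN.sub(replace, text)


if __name__ == "__main__":
    print(repr(unescape_angle_form("Title <U+200F>/ Author")))
```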

Catalogers and metadata experts have to start learning more about character encoding.  When you are dealing with textual data that goes beyond ascii, that deals with world alphabets:  There is no way to enter data correctly, there is no way to troubleshoot data issues, there is no way to do quality assurance on your data, without starting to learn about character encodings. And also about world alphabet issues that go beyond encodings, like the unicode bidirectional algorithm, unicode normalization, and issues in sorting world alphabets.  It’s technical stuff, and it’s confusing and challenging material even for the technically-minded, because the problems are inherently complex. But if you want to be a cataloging/metadata expert in a computer world, there’s no way around it.  You don’t need to learn to write PHP to be a metadata wrangler, but you do need to learn about the software engineering issues around world alphabets.

Bi-directional text is really confusing. So the original HTML-style character reference I found in my own data?  Was “&#x200F;”, the escaped version of the unicode right-to-left mark.  Even trying to explain what this mark is gets us into a world of confusingness.  But it’s basically used to provide some explicit display instructions for text that combines right-to-left scripts and left-to-right scripts, when the default unicode algorithm doesn’t get display right. But the ironic thing is, as far as I can tell, in the particular records we are downloading from OCLC that already have a “&#x200F;” in them, the right-to-left mark is not actually improving display, but actually messing up display that would have been right using the standard unicode bidi algorithm without it. So it is a sort of cataloger error, but a different sort.   In trying to figure out what was going on here though, I broached the topic with local catalogers, who now mention that they know of a bunch of cases where directionality of text is getting messed up in display. Figuring out what’s really going on in each of these cases is really really complicated. (Does the text need a right-to-left or left-to-right mark hint to display right? Has the text been entered improperly in the original record? Is it, possibly, actually a bug in browsers not implementing the unicode bidi algorithm properly?)   Trying to explain what’s going on to catalogers… well, see above point.

The remaining mystery.  What if you actually wanted to literally put a “&#x1234;” in a MARC8 encoded record? You know, my new memoir, “Unicode Headaches: From &#x0001; to &#xFFEE / Jonathan Rochkind.”  How would you do it, without those literals being “un-escaped” by things prepared to deal with “lossless conversions to MARC8”? Hell if I know. Apparently those who wrote the MARC8 spec for this ‘lossless conversion’ never considered it. Oops.
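To make the ambiguity concrete, here is a toy illustration in Python. The function `escape_non_ascii` is a made-up stand-in for the unicode to MARC8 direction: it pretends MARC8 can only hold ASCII, which is a big simplification (real MARC8 covers far more), but it is enough to show two different titles colliding once they are escaped.

```python
def escape_non_ascii(text: str) -> str:
    """Toy stand-in for 'lossless conversion': escape everything outside ASCII."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )


# A title that literally contains the characters &, #, x, 1, 2, 3, 4, ;
literal_title = "Unicode Headaches: From &#x1234;"
# A title that contains the single character U+1234
real_char_title = "Unicode Headaches: From \u1234"

# The two titles are different strings going in...
print(literal_title == real_char_title)
# ...but come out of the escaping step identical, so no later
# un-escaping pass can tell which one was meant.
print(escape_non_ascii(literal_title) == escape_non_ascii(real_char_title))
```

The usual fix in other formats is to also escape the escape character itself (HTML writes a literal & as `&amp;`), but as far as I can tell nothing like that is defined here.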
