what every cataloger or metadata technician needs to know about character encoding

Okay, I tricked you, I’m not going to tell you myself, I’m going to direct you to this very useful blog post:

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

Guess what, even if you’re not a programmer, if you:

  • A) Manage metadata that includes diacritics, non-Latin characters, or anything else that’s not ASCII (if you don’t know what ASCII is, read the post! and see the tiny illustration just below); and/or…
  • B) Have to deal with MARC-21 records in the MARC-8 encoding…

You need to know about character encoding too. This is a great example from the “not every cataloger needs to know how to program, but every cataloger needs to know this technical topic anyway” pile.
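
To make bullet A concrete, here’s a tiny illustration in Python (the name is just an arbitrary example containing diacritics):

    # ASCII covers only 128 characters; anything with a diacritic falls outside it.
    "Dvorak".encode("ascii")       # works fine
    "Dvořák".encode("utf-8")       # also fine: b'Dvo\xc5\x99\xc3\xa1k'

    try:
        "Dvořák".encode("ascii")   # ř and á simply have no ASCII representation
    except UnicodeEncodeError as err:
        print(err)                 # 'ascii' codec can't encode character '\u0159' ...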

The author of the “what every programmer” post even goes beyond programmers in his suggestion:

If you are dealing with text in a computer, you need to know about encodings. Period. Yes, even if you are just sending emails.

As a professional metadata manager, oh yeah, you’ve gone well beyond just sending emails: data stored in a computer is your business, and you’ve got to learn about encodings.

It is a bit confusing, especially if you’re not a programmer. (Believe me, it is confusing even for programmers! The concepts to some extent, but even more so debugging it in practice. Dealing with corrupted character encodings is the most confusing thing that’s regularly part of my work. Which is why it’s important for metadata managers to know the basics too, and to try to keep our character encodings clean…)

But it’s a great topic to apply yourself to. If you don’t get it at first, think about what you do get, what you don’t get, and what you might need to research, practice, or learn about in order to figure out what you don’t get.

Sadly, the North American use of the legacy MARC-8 encoding, which isn’t used by anyone but North American libraries and isn’t supported by most tools, makes character encoding even more confusing for catalogers. To begin with, all you need to know is that MARC-8 is yet another encoding, like the ones discussed in that article, but a different one. And, importantly:

If a document has been misinterpreted and converted to a different encoding, it’s broken. Trying to “repair” it may or may not be successful, usually it isn’t. Any manual bit-shifting or other encoding voodoo is mostly that, voodoo. It’s trying to fix the symptoms after the patient has already died.

That is, let’s say you take a MARC record that is really MARC-8, but you add it to an ILS that thinks it’s UTF-8 and misinterprets it as such (or vice versa). Once you’ve taken that step, it may or may not be possible to ‘rescue’ it. This is why you need a basic understanding of encodings.
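
To make that concrete, here’s a tiny sketch in Python. Python has no built-in MARC-8 codec, so the MARC-8 bytes below are written out by hand from my reading of the ANSEL tables; treat the exact values as an illustration, not gospel.

    # The same letter "é", as bytes, in three different encodings:
    utf8_bytes   = "é".encode("utf-8")    # b'\xc3\xa9'
    latin1_bytes = "é".encode("latin-1")  # b'\xe9'
    marc8_bytes  = b"\xe2e"               # ANSEL: combining acute (0xE2) comes *before* the 'e'

    # Misinterpretation in action: hand the UTF-8 bytes to a Latin-1 decoder
    # and you silently get mojibake instead of an error.
    print(utf8_bytes.decode("latin-1"))   # prints 'Ã©'

    # Hand the hand-built MARC-8 bytes to a strict UTF-8 decoder and it blows up,
    # because 0xE2 promises two continuation bytes that never arrive.
    try:
        marc8_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)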

(To be fair, if that’s all you did, and you knew exactly which records you had done it to, it should be possible to rescue them. The real problems start when a record is misinterpreted as the wrong encoding and then further edited by a tool operating under that misinterpretation. At that point, all bets are off.

And in reality, the problem is that once you’ve crossed the encoding-misinterpretation threshold, it can be infeasible to figure out, in any automated bulk fashion, which records are really rescuable and which are not, or even which records need rescuing and which don’t.

And it could take many hours of error-prone work to even try to do it manually. It ends up being like, um… trying to take a bunch of needles and pins you’ve dumped into a haystack, except some of those pins were the wrong-sized pins, and you’ve got to find the wrong-sized pins in the haystack and replace them with other pins… while looking at the haystack only through binoculars, and only touching the hay or needles or pins with tweezers. That made no sense, but it’s my best analogy for what trying to debug messed-up character encodings feels like. The amount of time and pain it takes is just not worth it to correct corrupted data, although the pain is inevitable when you need to debug and fix software that may be creating the corrupted data in the first place. But the only way to win this game is not to create the corrupted data at all; fixing it after the fact is a losing battle.)
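
If you want to see the difference between the rescuable case and the hopeless one in miniature, here’s a sketch using UTF-8 and Latin-1 as stand-ins, since plain Python doesn’t speak MARC-8:

    original = "café"
    stored = original.encode("utf-8")        # the bytes actually sitting on disk

    # One clean misinterpretation: a tool reads the UTF-8 bytes as Latin-1.
    mangled = stored.decode("latin-1")       # 'cafÃ©' -- mojibake, but reversible
    rescued = mangled.encode("latin-1").decode("utf-8")
    assert rescued == original               # if you know exactly what happened, you can undo it

    # Now someone "helpfully" edits the record while it's misinterpreted...
    edited = mangled.replace("Ã©", "e") + ", 2nd ed."
    # ...and the mechanical round trip can no longer recover the original.
    # Nothing left in the bytes remembers what was really there.
    print(edited.encode("latin-1").decode("utf-8"))  # 'cafe, 2nd ed.'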

5 thoughts on “what every cataloger or metadata technician needs to know about character encoding”

  1. Sadly, MARC-8 is used pretty frequently outside of North America as well. However, what’s more fun: many North American libraries will utilize MARC8 for their bibliographic character set, but will utilize ISO-8859-1 when exporting data via Z39.50. This is particularly true of many III libraries if they haven’t specifically defined their output. Personally, what I find more problematic is that MARC21 doesn’t have an easy way to declare the character encoding. You basically can say MARC8 or Unicode via byte 9 (if it’s even set). This causes me all kinds of problems when working with folks in Asian countries that use the Big5 format, since at a byte level it can look a lot like MARC8.

    –tr

  2. yep, indeed it’s a mess. Terry, those Asian countries use MARC21 (with its “MARC8” or “Unicode” as the only designated encodings), but fill it with Big5? Argh.

    And yeah, software that ignores (both on read and write) the MARC8-vs-“Unicode” designation that MARC21 _does_ have is more common than not. (Also, have you ever tried to read the specs to figure out whether “Unicode” means UTF-8 or UTF-16 or what? It’s very unclear; I think maybe it was written in the Unicode ‘1.0’ days, when UCS-2 was the only Unicode encoding, and never updated. UCS-2 is sort of like UTF-16, but not actually UTF-16. Our standards bodies are not very good at keeping their standards up to date with the actual world.)

    MARC21, and the infrastructure that supports it, is definitely showing its age. It’s a mess out there.

    Oh, and that software you say uses MARC8 internally but ISO-8859-1 over Z39.50? I bet much of it messes up the transcoding too.

  3. I can’t blame software that ignores byte 9 too much, because very often the value isn’t set, or isn’t set correctly, in the record. MarcEdit, for one, just uses the byte as a signal, but evaluates the character set of the data using an algorithm derived from the Unicode Consortium to determine whether data is ASCII, UTF-8, UTF-16, or UTF-32. I found very early that the byte cannot be trusted, and that a more heuristic approach was required to handle character sets correctly. Just yesterday I was working with data coming from a Swedish library where the data was in UTF-8 but byte 9 was set as MARC8. MarcEdit will fix the byte as it reads the data, but that is certainly an issue out there as well.
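
    To give a flavor of the kind of check involved (a rough sketch only, nothing like MarcEdit’s actual routine):

        # Rough sketch only -- not MarcEdit's algorithm. Leader/09 is 'a' for
        # Unicode and blank for MARC-8, but since that byte can't be trusted,
        # also test whether the bytes really decode as UTF-8.
        def sniff_marc_encoding(raw_record: bytes) -> str:
            declared = "utf-8" if chr(raw_record[9]) == "a" else "marc-8"
            try:
                raw_record.decode("utf-8")
                valid_utf8 = True
            except UnicodeDecodeError:
                valid_utf8 = False
            if declared == "utf-8" and not valid_utf8:
                return "leader says UTF-8, bytes say otherwise -- suspicious"
            if declared == "marc-8" and valid_utf8 and not raw_record.isascii():
                return "leader says MARC-8, but the bytes look like UTF-8"
            return declared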

    –tr

  4. Even when it is set correctly, some routines will escape characters as hexadecimal (fine) on the way through MARC-8, but not reconvert them to UTF-8 (not fine), depending on the range of Unicode the character belongs to.
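
    (Assuming those hexadecimal escapes are the usual &#xXXXX; numeric character references, turning them back into real characters is mechanically simple; the hard part is knowing which fields were left escaped. A rough sketch in Python:)

        import re

        # Rough sketch: convert &#xXXXX;-style escapes back into real characters.
        def unescape_ncrs(text):
            return re.sub(r"&#x([0-9A-Fa-f]+);",
                          lambda m: chr(int(m.group(1), 16)),
                          text)

        unescape_ncrs("Ti&#x1EBF;ng Vi&#x1EC7;t")   # -> 'Tiếng Việt'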

    The Africana Librarians Council & the Committee on Cataloging Asian and African Materials are both working more closely with OCLC to prioritize scripts for support. More cooperation would be welcome from other corners, as long as the issue is on the radar. Are there good points of contact at LITA, NISO, or PCC?
