Okay, I tricked you, I’m not going to tell you myself, I’m going to direct you to this very useful blog post:
Guess what, even if you’re not a programmer, if you:
- A) Manage metadata that includes diacritics, non-Latin characters, or anything else that’s not “ASCII” (if you don’t know what ASCII is, read the post!); and/or…
- B) Have to deal with MARC-21 records in the MARC-8 encoding…
You need to know about character encoding too. This is a great example from the “not every cataloger needs to know how to program, but every cataloger needs to know this technical topic anyway” pile.
The author of the “what every programmer” post even goes beyond programmers in his suggestion:
If you are dealing with text in a computer, you need to know about encodings. Period. Yes, even if you are just sending emails.
Being a professional metadata manager, oh yeah, you’ve gone beyond just sending emails, data stored in a computer is your business, and you’ve got to learn about encodings.
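If you want to see what “encoding” actually means in practice, here’s a minimal sketch in Python (the name is just an example I picked; any text with diacritics works the same way):

```python
# An encoding is a rule for mapping characters to bytes. The same text
# produces different bytes under different encodings, so you cannot read
# bytes back into text without knowing which rule wrote them.
text = "Dvořák"  # a name with diacritics, the kind catalogers handle daily

utf8_bytes = text.encode("utf-8")         # the Unicode encoding most tools use
latin2_bytes = text.encode("iso-8859-2")  # a legacy Central European encoding

print(utf8_bytes)    # b'Dvo\xc5\x99\xc3\xa1k' -- two bytes each for ř and á
print(latin2_bytes)  # b'Dvo\xf8\xe1k'         -- one byte each
```

Same six characters, two completely different byte sequences. The bytes alone don’t tell you which rule was used; something outside the bytes has to.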
It is a bit confusing, especially if you’re not a programmer. (Believe me, it is confusing even for programmers! The concepts are confusing to some extent, but debugging encoding problems in practice is even more so. Dealing with corrupted character encodings is the most confusing thing that’s regularly part of my work. Which is why it’s important for metadata managers to know the basics too and try to keep our character encodings clean…)
But it’s a great topic to apply yourself to. If you don’t get it at first, think about what you do get, what you don’t get, and what you might need to research or practice or learn about in order to figure out what you don’t get.
Sadly, the North American use of the legacy MARC-8 encoding, which isn’t used by anyone but North American libraries and isn’t supported by most tools, makes character encoding even more confusing for catalogers. To begin with, all you need to know is that MARC-8 is yet another encoding, like the ones discussed in that article, but a different one. And, importantly:
If a document has been misinterpreted and converted to a different encoding, it’s broken. Trying to “repair” it may or may not be successful, usually it isn’t. Any manual bit-shifting or other encoding voodoo is mostly that, voodoo. It’s trying to fix the symptoms after the patient has already died.
That is, let’s say you take a MARC record that is really MARC-8, but you add it to your ILS, which thinks it’s UTF-8 and misinterprets it as such (or vice versa). Once you’ve taken that step, it may or may not be possible to ‘rescue’ it. This is why you need a basic understanding of encodings.
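Here’s a sketch of what that misinterpretation looks like. Python’s standard library doesn’t include MARC-8, so Windows-1252 stands in below for “some legacy encoding”; the failure mode is the same:

```python
# Bytes written as UTF-8...
original = "Dvořák"
raw = original.encode("utf-8")

# ...read by a tool that assumes a legacy single-byte encoding
# (Windows-1252 here, standing in for MARC-8, which Python doesn't ship).
# Note the decode "succeeds" -- no error is raised, you just get mojibake.
misread = raw.decode("cp1252")
print(misread)  # DvoÅ™Ã¡k

# If that single misinterpretation is the *only* thing that happened,
# the damage is reversible: undo the wrong decode, then decode correctly.
rescued = misread.encode("cp1252").decode("utf-8")
print(rescued)  # Dvořák
```

The scary part is that the wrong decode raises no error at all; the record just silently becomes garbage.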
(To be fair, if that’s all you did, and you knew exactly which records you had done it to, it should be possible to rescue them. The real problems start when a record that was misinterpreted in the wrong encoding is then further edited by a tool operating under that misinterpretation. At that point, all bets are off.
And in reality, the problem is that once you’ve crossed the encoding-misinterpretation threshold, it can be infeasible to figure out, in any automated bulk fashion, which records are really rescuable and which are not, or even which records need rescuing and which don’t.
And it could take many hours of error-prone work even to try to do it manually. It ends up being like, um… trying to take a bunch of needles and pins you’ve dumped into a haystack, except some of those pins were the wrong-sized pins, and you’ve got to find the wrong-sized pins in the haystack and replace them with other pins… while looking at the haystack only through binoculars, and touching the hay or needles or pins only with tweezers. That made no sense, but it’s my best analogy for what trying to debug messed-up character encodings feels like. The amount of time and pain it takes is just not worth it to correct corrupted data — although the pain is unavoidable when you need to debug and fix software that may be creating the corrupted data. The only way to win this game is to not create the corrupted data in the first place; fixing it after the fact is a losing battle.)
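To make the triage problem concrete, here’s a sketch (again with Windows-1252 standing in for the legacy encoding) of why the obvious automated check can’t sort the rescuable records from the rest:

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """The obvious bulk triage check: is this record even valid UTF-8?"""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A record that was misread as cp1252 and then saved back out as UTF-8
# is thoroughly corrupted -- but it is *valid* UTF-8, so the check
# happily waves it through. No error means nothing to flag or search for.
corrupted = "Dvořák".encode("utf-8").decode("cp1252").encode("utf-8")
print(decodes_as_utf8(corrupted))  # True -- looks fine, is garbage
print(corrupted.decode("utf-8"))   # DvoÅ™Ã¡k

# Worse: a record that some later tool *edited* under the misinterpretation
# (here, "cleaning up" an odd-looking character) may no longer round-trip
# at all -- the rescue step itself now raises UnicodeDecodeError.
mangled = "Dvořák".encode("utf-8").decode("cp1252").replace("™", "?")
# mangled.encode("cp1252").decode("utf-8")  # raises UnicodeDecodeError
```

The corrupted-but-valid case is exactly why finding the wrong-sized pins in the haystack can’t be automated away: to a validity check, they look like perfectly good pins.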