Unicode normalization forms

So I didn’t know anything about Unicode normalization before I had to learn it to debug my RefWorks issues, but it turns out to matter for a whole bunch of other things not related to RefWorks. It’s an esoteric issue, but one that actually pays to know about.

You can check out the official Unicode documentation on Unicode Normalization Forms.

Basically, in any given unicode encoding, say UTF-8 (but equally true for any encoding), there can be several ways to encode any given glyph on the screen.

For instance, a lowercase e with an acute accent can be encoded as a single Unicode codepoint (the precomposed lowercase e with acute accent), or as two codepoints: a lowercase e followed by a combining acute accent. It gets even more complicated than this: since a single Latin character can theoretically have multiple diacritics applied to it, there can in fact be more than two ways to encode some glyphs. And then we get to non-Latin alphabets, which have their own “composed” (as few Unicode codepoints as possible) or “decomposed” (multiple codepoints in a row) alternatives, which I didn’t even know about until I read the report above.
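
To make that concrete, here is a minimal Java sketch (just an illustration, using only the standard library) that prints the codepoints behind a composed and a decomposed “é”, which render as the same glyph:

    public class ComposedVsDecomposed {
        public static void main(String[] args) {
            String composed   = "\u00E9";   // one codepoint: LATIN SMALL LETTER E WITH ACUTE
            String decomposed = "e\u0301";  // two codepoints: 'e' + COMBINING ACUTE ACCENT

            // Same glyph on screen, different codepoint sequences underneath
            System.out.println(toCodepoints(composed));   // U+00E9
            System.out.println(toCodepoints(decomposed)); // U+0065 U+0301
        }

        private static String toCodepoints(String s) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
                sb.append(String.format("U+%04X ", s.codePointAt(i)));
            }
            return sb.toString().trim();
        }
    }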

Why does it matter?

So anyway, the most obvious place this matters is when you’re comparing two Unicode strings to see if they are “the same”. The Unicode Normalization Forms report above is written mainly in terms of that use case. And that use case matters, for instance, if you are indexing Unicode in Solr and you want a string in one form to match a string in the index that is really the ‘same thing’ in another form. There are a variety of possible approaches to doing that.
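
(Just to make the comparison case concrete before moving on, here is a minimal sketch, assuming Java 6+ and its built-in java.text.Normalizer: a plain string comparison of the two forms fails until you normalize both sides.)

    import java.text.Normalizer;

    public class NormalizedComparison {
        public static void main(String[] args) {
            String composed   = "caf\u00E9";   // "café" with a precomposed é
            String decomposed = "cafe\u0301";  // "café" as e + combining acute accent

            // Codepoint-for-codepoint comparison says they are different
            System.out.println(composed.equals(decomposed)); // false

            // Normalize both sides to NFC first and they compare equal
            String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
            String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(a.equals(b)); // true
        }
    }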

But that’s not what I’m going to talk about.

It turns out that Unicode normalization forms seem to matter for display too.  I have found that both Firefox and IE on Windows (at least) will end up displaying decomposed Unicode, well, screwily.  For instance, many decomposed forms, if you try to put them in a browser title bar with an HTML <title>, seemed to end up being displayed just as blocks rather than their proper characters.  In the browser window itself, decomposed Unicode forms fared better, but still often seemed to be displayed in a variety of kind of screwy, messy ways (diacritics not lining up properly with the letters they apply to, etc.).

Making sure all the unicode was in NFC (“composed”) form before displaying it in the browser seemed to result in significantly better display.

One example

Here’s an example in FF3 on Windows, in some particular font; yeah, it might be different with different fonts and sizes. I have no idea if this is in fact a correct way to write this word according to any system, but it’s how it is in my database: it’s got an “i” which is supposed to have both a horizontal bar AND an acute accent over it, somehow.

NFC form:

Non-normalized decomposed form:

Actually, here they are right in the browser as text. How does your browser display these? Is one better than the other?

NFC form: Sharīʻat and ambiguity in South Asian Islam

Non-normalized decomposed: Sharīʻat and ambiguity in South Asian Islam

(Sometimes it gets worse than this too; this is just an example I had at hand. Also, in this case BOTH ways do NOT display correctly in the Firefox title bar when I try to put the UTF-8 in an HTML title, although they display wrong in different ways! Oh well. I guess window title bars have additional limitations or bugs, perhaps OS-level? I have seen other cases where NFC displays correctly in the title bar but non-normalized decomposed does not. And in this case, BOTH display fine in the Firefox tab title, even though not in the browser window title bar! Go figure.)

W3C Recommendation

And indeed that Unicode report above suggests that:

The W3C Character Model for the World Wide Web, Part II: Normalization [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition) recommend using Normalization Form C for all content, because this form avoids potential interoperability problems arising from the use of canonically equivalent, yet different, character sequences in document formats on the Web. See the W3C Requirements for String Identity, Matching, and String Indexing [CharReq] for more background.

When to do it?

If you are starting from Marc records in Marc8 and using Marc4J to convert them to UTF-8, they will NOT wind up in NFC by default; they’ll wind up in a decomposed form (which may or may not be strictly Normalization Form D; I’m not sure it always is, but it generally is).   If there’s any other Marc8-to-UTF-8 converter around besides the Java one Marc4J uses, I wouldn’t be surprised if it does something similar; this is the most obvious (and only reliably round-trippable) way to convert from Marc8 to UTF-8, since Marc8’s method of representing non-ASCII characters is analogous to “decomposed” Unicode.
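
In code, dealing with that just means normalizing right after the conversion. Here is a minimal sketch of what I mean; it assumes Marc4J’s AnselToUnicode converter and Java 6’s built-in java.text.Normalizer (icu4j would do the same job on older JVMs):

    import java.text.Normalizer;
    import org.marc4j.converter.impl.AnselToUnicode;

    public class Marc8ToNfcUtf8 {
        // Convert a Marc8 field value to Unicode, then fold it into NFC
        public static String toNfcUnicode(String marc8Value) {
            String decomposed = new AnselToUnicode().convert(marc8Value);
            return Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        }
    }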

So, in a Solr discovery-layer type application, there are a variety of places you could do the normalization. You could do it at indexing time, before anything is added to the index, so everything that goes into the index is in NFC. Or you could do it in your app, after you pull things out of the index but before you send them across the HTTP wire to a browser. And there are probably a few different control points in your application, at different levels, where you could do this.

I decided just doing it as early in the data chain as possible made sense: get it done at the root and don’t worry about it again. So that’s at the indexing stage.

There’s probably some way to get Solr itself to do this, regardless of what unicode you throw at it. But you’d probably have to make sure you configure every single Solr field in your schema to do that, and you might want to do it differently for indexed vs stored fields (maybe NFKC for indexed vs NFC for stored), and I haven’t quite figured out how that works at the Solr end yet.

How to do it?

Or you can just have your indexing application do it before it feeds things to Solr. If you’re using SolrMarc, then the SolrMarc 2.1.1 release (currently a tag in the svn repo, but not yet a downloadable binary release) offers this option:

marc.unicode_normalize = C

In my case, my incoming Marc is in Marc8, and I’m having SolrMarc translate it to UTF-8 (via Marc4J), and this flag tells it to also apply NFC normalization when it does the translation. I’m not entirely sure whether that config would still be used by SolrMarc if your incoming Marc were in UTF-8 to begin with but you still wanted to make sure to NFC-normalize it before adding it to the index.

If you find yourself having to write your own code to do this normalization, it can be done pretty easily in most languages. In modern Java versions, there is a built-in class (java.text.Normalizer).  If you are stuck in Java 1.4, as I am for a certain application, there’s icu4j, which I used with no problem in my Java 1.4 app (there are also C/C++ libraries available there). In Ruby, there’s the ruby unicode gem (which is a C-compiled gem; I’m not sure if it’s based on the ICU libraries or not), which I am also using with no problem in a Rails app. (For some reason the simple methods I’m using don’t show up in the unicode gem’s API docs: Unicode.normalize_C, Unicode.normalize_KC, etc.)
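
For what it’s worth, here is roughly what the Java side looks like; a minimal sketch, assuming Java 6+ for the built-in java.text.Normalizer, and icu4j’s com.ibm.icu.text.Normalizer for the Java 1.4 case (the two methods are alternatives, not meant to live in the same app):

    import java.text.Normalizer;

    public class NfcNormalize {

        // Java 6+: the built-in normalizer
        public static String nfc(String s) {
            return Normalizer.normalize(s, Normalizer.Form.NFC);
        }

        // Java 1.4: icu4j's normalizer (fully qualified to avoid the name clash)
        public static String nfcWithIcu4j(String s) {
            return com.ibm.icu.text.Normalizer.normalize(s, com.ibm.icu.text.Normalizer.NFC);
        }
    }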


5 Responses to Unicode normalization forms

  1. Vojtech Sefler says:

    Hi, I checked the icu4j as you suggest for Java 1.4 applications, but I am not able to figure out how to normalize and denormalize characters from Unicode to iso-8859-2 and back.
    For example:
    input:á > normalization > output: \u00E1
    and back
    input:\u00E1 > normalization > output: á

    All I’ve found is ICU C documentation, but not the Java one.

  2. Vojtech Sefler says:

    correction:
    input:\u00E1 > denormalization > output: á

  3. jrochkind says:

    Trans-coding from a Unicode encoding to ISO-8859-2 and back is not in fact the kind of normalizing I’m talking about here. The kind of normalizing I’m talking about is about different valid forms of the same display text within Unicode (within any of the several Unicode encodings, such as UTF-8 or UTF-16). What you’re talking about is not generally called “normalization”, but “conversion” or “transcoding”.

    The Java API (javadoc) documentation for ICU4J can be found here: http://icu-project.org/apiref/icu4j/ But it probably doesn’t serve as a very good tutorial. The way I approach things is first to understand the basic concepts of character encoding in general and Unicode in particular; then, once I understand what I need to get the software to do, go about figuring out how to get it to do so. Perhaps this web page can help get you oriented: http://www.joelonsoftware.com/articles/Unicode.html

  4. Pingback: rubyists who do Unicode: use TwitterCldr | Bibliographic Wilderness

  5. Pingback: Benchmarking ruby Unicode normalization alternatives | Bibliographic Wilderness
