unicode normalization in ruby 2.2

Ruby 2.2 finally introduces a #unicode_normalize method on strings. It defaults to :nfc, but you can also normalize to the other unicode normalization forms: :nfd, :nfkc, and :nfkd.

some_string.unicode_normalize(:nfc)
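
To make the four forms concrete, here's a minimal sketch, assuming ruby 2.2 or later and UTF-8 strings. The variable names are just for illustration; it uses "é", which can be written either as one precomposed codepoint or as "e" plus a combining accent, and the "ﬁ" ligature, which only the compatibility forms (:nfkc, :nfkd) break apart.

```ruby
# "é" written two ways: precomposed (U+00E9) vs. "e" + combining acute (U+0301).
composed   = "\u00E9"
decomposed = "e\u0301"

# The two look identical on screen but are different strings.
same_before = (composed == decomposed)        # false

# With no argument, unicode_normalize uses :nfc (composed form).
nfc = decomposed.unicode_normalize
# :nfd goes the other way and decomposes.
nfd = composed.unicode_normalize(:nfd)

# The compatibility forms also replace characters like the "fi" ligature (U+FB01).
ligature = "\uFB01"
nfkc = ligature.unicode_normalize(:nfkc)      # plain "fi"
nfc_ligature = ligature.unicode_normalize     # ligature is left alone under :nfc
```

Normalizing both sides to the same form before comparing, or before using strings as hash keys, is the usual reason to reach for this.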

Unicode normalization is something you often have to do when dealing with unicode, whether you knew it or not. Prior to ruby 2.2, you had to install a third-party gem to do this, adding another gem dependency. Of the gems available, some monkey-patched String in ways I wouldn’t have preferred, some worked only on MRI and not jruby, some had unpleasant performance characteristics, etc. Here are some benchmarks I ran a while ago on the available unicode normalization gems and their performance, although since I did those benchmarks new options have appeared and performance characteristics have changed. But now we don’t need to deal with any of that, just use the stdlib.

One thing I can’t explain is that the only ruby stdlib documentation I can find on this suggests the method should be called just `normalize`. But nope, it’s actually `unicode_normalize`. Okay. Can anyone explain what’s going on here?

`unicode_normalized?` (not just `normalized?`) is also available, also taking a normalization form argument.
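
A short sketch of the predicate, again assuming ruby 2.2 or later; the variable names are hypothetical. It also uses `unicode_normalize!`, the in-place variant that ships alongside the other two methods.

```ruby
decomposed = "e\u0301"   # "e" + combining acute accent

# With no argument the check is against :nfc, same default as unicode_normalize.
is_nfc = decomposed.unicode_normalized?          # false
is_nfd = decomposed.unicode_normalized?(:nfd)    # true

# Only do the normalization work when it's actually needed.
value = decomposed.dup
value.unicode_normalize! unless value.unicode_normalized?
now_nfc = value.unicode_normalized?              # true
```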

The next major release of Rails, Rails 5, is planned to require ruby 2.2. I think a lot of other open source will follow that lead. I’m considering switching some of my projects over to require ruby 2.2 as well, to take advantage of new stdlib features like this, although I’d probably wait until JRuby 9k comes out, which is planned to support the 2.2 stdlib and other changes. Hopefully soon.

In the meantime, I might write some code that uses #unicode_normalize when it’s present, and otherwise monkey-patches in a #unicode_normalize method implemented with some other gem, although that still requires making the other gem a dependency. I’ll admit I have some projects that really should be unicode normalizing in some places, but where I could just barely get away without it, and I skipped it because I didn’t want to deal with the dependency. Or I could require MRI 2.2 or the latest jruby, and just monkey-patch a simple pure-java #unicode_normalize if running on JRuby and `String.instance_methods.include? :unicode_normalize` is false.
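
That last idea might look roughly like the sketch below. It is an assumption-heavy illustration, not tested on an older JRuby: it relies on the JDK's `java.text.Normalizer` class (reached through JRuby's `Java::JavaText` namespace) and only defines the method when String doesn't already have it. On MRI 2.2+ or any ruby that already ships #unicode_normalize, the whole block is a no-op.

```ruby
# Sketch only: add String#unicode_normalize where the stdlib doesn't provide it.
unless String.method_defined?(:unicode_normalize)
  if RUBY_ENGINE == "jruby"
    require "java"

    class String
      # Fallback built on the JDK's java.text.Normalizer (an assumption of this
      # sketch, not something the stdlib does). Accepts :nfc, :nfd, :nfkc, :nfkd.
      def unicode_normalize(form = :nfc)
        java_form = Java::JavaText::Normalizer::Form.valueOf(form.to_s.upcase)
        Java::JavaText::Normalizer.normalize(self, java_form).to_s
      end
    end
  else
    # On an older MRI you'd have to fall back to a third-party gem here,
    # which brings the dependency back; left out of this sketch.
  end
end

# Callers can then use the same method name on either platform.
result = "e\u0301".unicode_normalize
```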


3 Responses to unicode normalization in ruby 2.2

  1. Matz preferred unicode_normalize over normalize (https://bugs.ruby-lang.org/issues/10084#note-7). The UnicodeNormalize module just uses normalize.

    If the docs still say String#normalize, then they need to be corrected.

  2. jrochkind says:

    Thanks Steven. The only docs I can find at all are the docs I link to for UnicodeNormalize module, which specify just `normalize`. The ruby 2.2 docs for String at http://ruby-doc.org/core-2.2.0/String.html do not mention either a normalize method or a unicode_normalize method. The new, very useful, functionality seems to be undocumented. But the String#unicode_normalize method seems to work as the UnicodeNormalize#normalize method is documented to.

    I suppose I could file a ticket saying that the new #unicode_normalize method does not appear in the API docs. I’m not sure if that will result in anything, but I’ll try it!

  3. Actually, String#unicode_normalize is part of the standard library, so the documentation for it is here:
    http://ruby-doc.org/stdlib-2.2.0/libdoc/unicode_normalize/rdoc/String.html

    It seems like it’s the first bit of String functionality that is in the standard library and not in ruby core.
