If you ever write code to deal with non-ascii unicode text, you probably need to deal with unicode normalization. If you ever need to compare two non-ascii unicode strings for equality, you definitely need to deal with unicode normalization.
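To make the equality problem concrete, here is a minimal sketch in plain Ruby (no gems, nothing normalized yet). The two variable names are made up for illustration; the point is only that the same visible text can be spelled with different codepoint sequences, and `==` compares the sequences.

```ruby
# Two spellings of the same visible text, "é":
composed   = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE (one codepoint)
decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT (two codepoints)

puts composed == decomposed             # false: the codepoint sequences differ
puts composed.codepoints.to_a.inspect   # [233]
puts decomposed.codepoints.to_a.inspect # [101, 769]
```

Normalizing both strings to the same form (NFC, NFD, etc.) before comparing is what makes them come out equal.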
So anyways. ruby 1.9+ is generally so good at dealing with unicode that it’s a surprising omission that unicode normalization is not included in the stdlib. I expected maybe to see it in MRI 2.1, but I don’t think it’s been added. (We do get the very exciting String#scrub in ruby 2.1, but that’s a different story. Yes, that’s right, I said: Very. Exciting.)
So you’ve got to look around for some other gem to do it. And there are a bunch. With potentially different performance characteristics. As part of gems that may include more or less other functionality, unicode or otherwise. Which may be in pure ruby, or may be using C extensions — which probably won’t work on jruby. In Java, normalization is built into the stdlib, so in jruby using the Java stdlib is probably going to be highest performance — some gems might know that and use it, others might not. Etc.
Ruby Unicode Normalization Alternatives
Here are the ones I know about:
|gem|implementation architecture|other features and notes|
|---|---|---|
|unicode|C extension, works on MRI only.|Normalization, and other unicode utilities like upcase/downcase, as well as codepoint attributes.|
|unicode_utils|Pure ruby|normalization, case change, codepoint attributes, etc. At one point, `unicode` gem didn’t work on ruby 1.9.3, and this one got a lot of attention. This one also works on jruby, although there might be better Java stdlib alternatives for some/all func.|
|activesupport|I honestly have no idea if ActiveSupport’s unicode normalization code is pure ruby, or if it varies depending on ruby engine platform. It does work on both jruby and MRI.|Yep, turns out ActiveSupport includes unicode normalization too. ActiveSupport does all sorts of other things too of course, beyond just unicode. You know what ActiveSupport is.|
|twitter_cldr|pure ruby|A really cool gem, that does all sorts of unicode things in pure ruby, especially locale-aware things, many of which as far as I know are not available from any other source in ruby. Including locale-aware collation, date and time formatting, pluralization, etc. It also does unicode normalization (because it needs to in order to do the other things).|
|unf|MRI version using C extension, Java version that just passes through to Java stdlib|I never found this gem in my previous googling, until Bill Dueber happened to point it out to me today. It just does unicode normalization. That’s it.|
Okay, let’s benchmark em!
I’m only benchmarking in 1.9.3, cause that’s what I care about. If you care about unicode and are still using 1.8.7, stop. I would be surprised if MRI 2.0 or 2.1 result in different relative performances here, but who knows. I am testing in both MRI and jruby, because I do care about that. (Please do, really, feel free to clone and/or fork my repo and test in other things! Please don’t feel free to ask me to do additional tests because you are interested in them.)
The benchmark script uses a file I assembled of a bunch of random text in a bunch of alphabets. It’s definitely heavy on the non-Latin. Then it takes that file, line by line, and runs all the normalization transformations on it. I honestly am not sure if my test data or operations are properly ‘representative’, or even what that means, I just took a bunch of text and did a bunch of normalizations to it. Using each alternative.
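Roughly, that kind of benchmark has the shape sketched below. This is not my actual script, just a self-contained approximation under a few loudly stated assumptions: the sample lines are made up (the real script reads a file), and the only candidate is `String#unicode_normalize`, which exists in Ruby 2.2 and later and is used here purely as a stand-in so the sketch runs without installing any gems. To compare gems, you would add one callable per gem to the hypothetical `candidates` hash.

```ruby
require 'benchmark'

# Hypothetical sample data standing in for a multi-alphabet text file.
lines = ["Ame\u0301lie", "\u00C5ngstr\u00F6m", "\uFB01n", "\u30AB\u3099"] * 500

# The four standard normalization forms.
forms = [:nfc, :nfd, :nfkc, :nfkd]

# name => callable taking (string, form). The single entry is a stand-in
# (Ruby 2.2+ only); add one entry per gem you want to compare.
candidates = {
  "stdlib_stand_in" => lambda { |str, form| str.unicode_normalize(form) }
}

# bmbm runs a rehearsal pass first, then the measured pass.
results = Benchmark.bmbm do |x|
  candidates.each do |name, normalizer|
    x.report(name) do
      lines.each do |line|
        forms.each { |form| normalizer.call(line, form) }
      end
    end
  end
end
```

`Benchmark.bmbm` is why the output has a "Rehearsal" section before the real numbers: the first pass warms things up, which matters especially on jruby.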
    $ ruby -v
    jruby 1.7.6 (1.9.3p392) 2013-10-22 6004147 on Java HotSpot(TM) 64-Bit Server VM 1.6.0_51-b11-457-11M4509 [darwin-x86_64]
    $ ruby benchmark.rb
    Rehearsal --------------------------------------------------
    unicode_utils    4.850000   0.080000   4.930000 (  2.267000)
    active_support   3.700000   0.060000   3.760000 (  2.239000)
    twitter_cldr   104.480000   1.800000 106.280000 ( 98.554000)
    unf              0.740000   0.010000   0.750000 (  0.411000)
    --------------------------------------- total: 115.720000sec

                         user     system      total        real
    unicode_utils    1.150000   0.020000   1.170000 (  1.130000)
    active_support   1.520000   0.010000   1.530000 (  1.507000)
    twitter_cldr    93.580000   1.580000  95.160000 ( 92.664000)
    unf              0.150000   0.000000   0.150000 (  0.151000)

    $ ruby -v
    ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-darwin12.4.0]
    $ ruby benchmark.rb
    benchmark.rb:9: warning: already initialized constant Unicode
    Rehearsal --------------------------------------------------
    unicode_utils    1.320000   0.000000   1.320000 (  1.326081)
    active_support   2.050000   0.020000   2.070000 (  2.072979)
    twitter_cldr    94.980000   0.380000  95.360000 ( 96.081615)
    unf              0.060000   0.000000   0.060000 (  0.076245)
    unicode_gem      0.390000   0.000000   0.390000 (  0.395390)
    ---------------------------------------- total: 99.200000sec

                         user     system      total        real
    unicode_utils    1.360000   0.000000   1.360000 (  1.354697)
    active_support   1.920000   0.000000   1.920000 (  1.923265)
    twitter_cldr    94.330000   0.350000  94.680000 ( 95.179466)
    unf              0.060000   0.000000   0.060000 (  0.057290)
    unicode_gem      0.390000   0.000000   0.390000 (  0.387942)
The unf gem is the one you are least likely to know about; it doesn’t really turn up when googling. But it’s the one you want!
Just use the `unf` gem! Look how damn fast it is! So much damn faster than anything else! It works on MRI or jruby, and it’s fast on both. (No surprise it’s the fastest on jruby, as it uses the Java stdlib on jruby, and I’m not sure anything else does. It uses a compiled C extension on MRI; so does the `unicode` gem, but `unf` is still a lot faster, and `unicode` is MRI-only.)
It’s so fast, and it just does unicode normalization; all the other alternatives tend to do more than just normalization. I’d strongly suggest all of them consider ditching their own unicode normalization algorithms, and just make `unf` a dependency and use its normalization algorithm, which is so much faster than yours. (And unf appears to work just fine on ruby 1.8.7, if you need to support 1.8.7 too; not all the alternatives do. Hooray!) (The only downside of unf is that it monkey-patches String, bah! I wish it would stop, but I’m going to live with it, and just not use the monkey-patched API myself.)
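For what it’s worth, using unf without touching the String methods it adds looks something like the sketch below. It assumes the `UNF::Normalizer.normalize(string, form)` class method described in the unf README. The local name `normalize` is made up for this example, and the `rescue LoadError` fallback to `String#unicode_normalize` (Ruby 2.2+) is only there so the sketch still runs on a machine where the gem isn’t installed; it isn’t part of unf.

```ruby
begin
  require 'unf'
  # Call the normalizer class directly instead of the String methods unf adds.
  normalize = lambda { |str, form| UNF::Normalizer.normalize(str, form) }
rescue LoadError
  # Stand-in when the unf gem is not installed (requires Ruby 2.2+).
  normalize = lambda { |str, form| str.unicode_normalize(form) }
end

composed   = "\u00E9"   # "é" as a single codepoint
decomposed = "e\u0301"  # "e" plus a combining acute accent

puts composed == decomposed                                              # false
puts normalize.call(composed, :nfc) == normalize.call(decomposed, :nfc)  # true
puts normalize.call(composed, :nfd) == decomposed                        # true
```

Wrapping the call in one place like this also means a library only has one spot to change if it ever swaps normalization backends.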
twitter_cldr is a really neat gem, and it does a lot of useful things nobody else in rubyland does. If you are using it for some of those things, you might be tempted to think “Hey, I’ve already got twitter_cldr as a dependency, and it does normalization, I’ll just use its normalization instead of adding another dependency.” (I was!) Don’t do that, because it’s two to three orders of magnitude slower than unf (roughly 600 times slower on jruby and 1,600 times slower on MRI in the runs above). I’d suggest the twitter_cldr devs consider just adding `unf` as a dependency, and using it for their normalization needs. Since some of the other algorithms twitter_cldr supports require normalization as a constituent step, using the ultra-fast unf gem might speed up other dependent parts of twitter_cldr too.
Although, as always with this kind of micro-benchmarking, it’s worth pointing out that, depending on how much of your processing time is spent on normalization, it simply might not matter. Who cares if you optimize your unicode normalization time from 0.004% of total time to 0.001% of total time, right? Maybe. Who knows. But if you’ve got alternatives, why not pick the faster one, when there’s no other reason not to? unf doesn’t do anything but unicode normalization, and it does it fast, so just use it, says me.