TwitterCldr is an awesome gem from the Twitter folks for doing all sorts of things you need to do with Unicode. It is basically a pure-Ruby re-implementation of significant portions of the Unicode algorithms contained in the ICU library.
This includes normalization, which you almost definitely need to be doing if you deal with Unicode, whether you’ve realized it yet or not, but which was missing from the Ruby stdlib until String#unicode_normalize arrived in Ruby 2.2. (There are a variety of other gems which will do just the normalization part; I’m not sure which ones are pure Ruby, what the working-on-JRuby status of the others is, etc. TwitterCldr seems recent, robust, and mature.)
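To see why normalization matters, here is a minimal sketch: the same visible character can be written as two different code point sequences, and the two only compare equal once both are normalized to the same form. It assumes `TwitterCldr::Normalization.normalize` with a `using:` option, which is how I understand the gem's README. If the gem isn't installed, it falls back to `String#unicode_normalize`, which Ruby 2.2 and later provide. The `normalize_nfc` lambda is just a hypothetical helper for this example, not part of any library.

```ruby
# Two ways to write "é": one precomposed code point (U+00E9)
# versus a plain "e" followed by a combining acute accent (U+0301).
composed   = "\u00E9"
decomposed = "e\u0301"

# Assumption: TwitterCldr::Normalization.normalize(str, using: form) is the
# gem's entry point. If the gem is not installed, fall back to the
# String#unicode_normalize method available in Ruby 2.2+.
begin
  require 'twitter_cldr'
  normalize_nfc = ->(s) { TwitterCldr::Normalization.normalize(s, using: :NFC) }
rescue LoadError
  normalize_nfc = ->(s) { s.unicode_normalize(:nfc) }
end

# They render identically but are different strings until normalized.
puts composed == decomposed                                         # false
puts normalize_nfc.call(composed) == normalize_nfc.call(decomposed) # true
```

If you compare, index, or de-duplicate user-supplied text without a step like this, the two spellings are treated as different values.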
It also includes some fancier ICU stuff, including locale-specific sorting (and production of locale-specific collation keys to store in your RDBMS or Solr or other store, for locale-specific sorting), locale-specific number and date formatting, names of languages in different languages (e.g. “español”), etc.
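Here is a rough sketch of the sorting and language-name features. It assumes the twitter_cldr gem is installed, and that `TwitterCldr::Collation::Collator` (with `sort` and `get_sort_key`) and `Symbol#localize(...).as_language_code` behave as described in the gem's README. I haven't checked these against every version, so treat the calls as assumptions. The word list is made up for illustration.

```ruby
# Assumes the twitter_cldr gem is installed (gem install twitter_cldr).
begin
  require 'twitter_cldr'
rescue LoadError
  warn 'twitter_cldr is not installed; skipping the locale-aware part'
end

words = ["Art", "Wasa", "Älg", "Ved"]

# Plain Ruby sorts by code point, so "Älg" lands after "Wasa".
naive_sorted = words.sort
puts naive_sorted.inspect

if defined?(TwitterCldr)
  # Assumed API: a collator built for one locale (German here).
  collator = TwitterCldr::Collation::Collator.new(:de)

  # Locale-aware sort: German treats "Ä" as a variant of "A".
  german_sorted = collator.sort(words)
  puts german_sorted.inspect

  # A sort key is an array of bytes; store it next to the record and
  # order by it to get locale-specific sorting out of a plain byte sort.
  sort_key = collator.get_sort_key("Älg")
  puts sort_key.inspect

  # The name of a language, written in a chosen locale (Spanish in Spanish).
  spanish_name = :es.localize(:es).as_language_code
  puts spanish_name
end
```

The sort key is the part that helps with a database or Solr: compute it once per locale when you write the record, then let the store do an ordinary byte-wise sort on that column.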
Incidentally, the more I work with Unicode, the more impressed I am with it as a standard. It’s solid. And my being impressed with Unicode isn’t just about the Unicode character set and encodings, but all the supporting algorithms the Unicode folks have created (and made sure Unicode supports) to do things you want to do with text, which turn out to be difficult to do locale-independently: sorting, comparing, up/downcasing, identifying whitespace/punctuation, etc. To support these algorithms, each Unicode codepoint has a number of metadata attributes set about that codepoint, and TwitterCldr gives you direct access to those too. TwitterCldr really does a bunch of neat stuff.