A while ago, I was looking for a way in ruby to turn text with diacritics (é) and ligatures (Æ) and other such things into straight ascii (e ; AE).
I found various gems that said they could do such things, but they all had problems. In part, that’s because the ‘right’ way to do this is really unclear in the general case: there are all sorts of edge cases and locale-dependent choices, and the giant universe of unicode to deal with. In fact, it looks like at one point the Unicode/CLDR suite included such an algorithm, but it kind of looks like it’s been abandoned and is no longer supported. There are no notes as to why, but I suspect the problem proved intractable. (Some unicode libraries currently support it anyway, and part of Solr actually does in one place; communication about these things seems to travel slowly.)
For what I was working on before, I realized that “transliterating to ascii” wasn’t the right solution after all. Instead, what I wanted was the Unicode Collation Algorithm, which you can use to produce a collation string, such that for instance “é” will transform to the same collation string as “e”, and “Æ” to the same collation string as “AE”. That collation string isn’t meant to be user-displayable, though; it won’t necessarily actually be “e” or “AE”. It can still be used for sorting or comparing in a “down-sampled to ascii” invariant way. And, like most of the Unicode suite, it’s pretty well-thought-through and robust to many edge cases.
For that particular case of sorting or comparing in a “down-sampled to ascii” invariant way, you want to create a Unicode collation sort key, for the :en locale, with “maximum level” set to 1. And it works swimmingly. In ruby, you can do that with the awesome twitter_cldr gem. I contributed a patch to support maximum_level, which I think has made it into the latest version.
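To make the “compare by key, don’t display the key” idea concrete, here is a rough stdlib-only sketch. It is an assumption-laden stand-in, not the Unicode Collation Algorithm and not twitter_cldr’s API: the helper name rough_compare_key is made up for illustration, and all it does is decompose the text, drop combining marks, and downcase.

```ruby
# Hypothetical helper, not part of twitter_cldr or any other gem.
# Builds a crude comparison key: NFKD-decompose, drop combining marks, downcase.
# This is NOT the Unicode Collation Algorithm, only an illustration of the idea.
def rough_compare_key(str)
  str.unicode_normalize(:nfkd) # "é" becomes "e" followed by a combining acute accent
     .gsub(/\p{Mn}/, "")       # remove the combining (nonspacing) marks
     .downcase
end

rough_compare_key("é") == rough_compare_key("e")   # => true
rough_compare_key("Æ") == rough_compare_key("AE")  # => false, Æ has no decomposition
```

The second comparison is the interesting one: the naive approach falls over on the ligature, which is the sort of edge case a real collation sort key is meant to take care of.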
Anyway, after that lengthy preface explaining why you probably don’t really want to “transliterate to ascii” exactly, and it’s doomed to be imperfect and incomplete…
…I recently noticed that the ruby i18n gem, as used in Rails, actually has a transliterate-to-ascii feature built in, with some support for localization of transliteration rules that I don’t entirely understand. But anyhow, if I ever wanted this function in the future, knowing it’s going to be imperfect and incomplete, I’d use the one from I18n rather than go hunting for the function in some probably less maintained gem.
I guess you might want to do this for creating ‘slugs’ in URL paths, because non-ascii in URLs ends up being such a mess. It would probably mostly work well enough for an app which really is mostly English, but if you’re really dealing heavily in non-ascii and especially non-roman text, it’s going to get more complicated than this, fast. Anyway.
I18n.transliterate("Ærøskøbing")
# => "AEroskobing"

# When it can't handle it, you get ? marks.
I18n.transliterate("日本語")
# => "???"
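For the slug use case, the step after transliteration might look something like the sketch below. The helper name slug_from_ascii is hypothetical (it is not part of I18n or Rails), and it assumes its input has already been transliterated to ascii, for example by I18n.transliterate.

```ruby
# Hypothetical helper, not part of I18n or Rails.
# Expects text that has already been transliterated to ASCII.
def slug_from_ascii(ascii_text)
  ascii_text
    .downcase
    .gsub(/[^a-z0-9]+/, "-") # anything else, including "?" placeholders, becomes a hyphen
    .gsub(/\A-+|-+\z/, "")   # trim hyphens left over at either end
end

slug_from_ascii("AEroskobing Havn") # => "aeroskobing-havn"
slug_from_ascii("???")              # => ""
```

The empty result for "???" is the non-roman problem in miniature: text that transliterates to nothing but ? marks leaves you with no slug at all.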
Still haven’t figured out: How to get the ruby irb/pry/debugger console on my OSX workstation to let me input UTF8, which would make playing around with stuff like this and figuring things out a lot easier! Last time I tried to figure it out, I got lost in many layers of yak shaving involving homebrew, readline libraries, rebuilding ruby from source… and eventually gave up. I am curious if every ruby developer on OSX has this problem, or if I’ve somehow wound up unique.