Be careful of regexes in a unicode world

Check out the following, which I wrote some time ago:

    # remove non-alphanumeric, excluding apostrophe; replace with space
    title.gsub!(/[^\w\s\']/, ' ') 

See any problem with that? What are \w and \s again? The ruby docs helpfully explain:

/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - A whitespace character: /[ \t\r\n\f]/

See the problem yet?

    "el revolución".gsub(/[^\w\s\']/, ' ')
    # => "el revoluci n"

Oops. ó is not in the class [a-zA-Z0-9_]. \w doesn’t actually mean “a word character” at all, unless your input is only ascii. The docs probably really should warn you about this, describing the class as “an ascii word character”, and warning you to use other metacharacters if you aren’t just dealing with ascii.
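To see just how narrow those two classes are, here is a minimal sketch that probes them with one ascii and one non-ascii character each. The helper names `matches_w?` and `matches_s?` are made up for this illustration, and I'm assuming a ruby recent enough to treat source files and string literals as UTF-8:

```ruby
# Hypothetical helpers for illustration only.

# True if the string contains at least one character matched by \w.
def matches_w?(str)
  !(str =~ /\w/).nil?
end

# True if the string contains at least one character matched by \s.
def matches_s?(str)
  !(str =~ /\s/).nil?
end

puts matches_w?("o")       # plain ascii letter      => true
puts matches_w?("ó")       # accented letter         => false
puts matches_s?(" ")       # plain ascii space       => true
puts matches_s?("\u00A0")  # NO-BREAK SPACE (U+00A0) => false
```

So it's not only accented letters that slip through: a non-breaking space pasted in from a web page isn't "whitespace" to `\s` either.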

Fortunately, ruby also provides some unicode-aware regex character classes, but they’re a lot harder to remember and longer to type. Here is the corrected version, which also swaps `\s` for a unicode-aware whitespace class:

    "el: revolución".gsub(/[^[[:alnum:]][[:space:]]\']/, ' ')
    # => "el  revolución"

Yep, that’s what we wanted. There are several other unicode-aware character classes, apparently defined by POSIX. The docs also say there are a couple of non-POSIX ones, including:

/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation

I wasn’t able to make that work, it didn’t seem to be recognized in my ruby. I am not sure why, and didn’t bother finding out. What works is good enough for me.
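If you want to find out whether your own ruby recognizes it, a small probe along these lines should do. This is only a sketch: `word_class_supported?` is a name I made up, and the assumption is that an unsupported class either raises a `RegexpError` when the pattern is compiled or simply fails to match the accented letter.

```ruby
# Hypothetical probe for illustration only.
# Returns true when [[:word:]] compiles and matches a non-ascii letter,
# false when it raises on compilation or does not match.
def word_class_supported?
  !(Regexp.new("[[:word:]]") =~ "ó").nil?
rescue RegexpError
  false
end

puts word_class_supported?
```

The pattern is built with `Regexp.new` rather than written as a literal so that a ruby that rejects it fails at run time, where the `rescue` can catch it, instead of when the file is parsed.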

But in a non-ascii world, it turns out, you almost never actually want to use those traditional regex character class metacharacters that many of us have been using for decades. \w and \s, no way. \d is less risky since you probably really do mean 0-9 and not digits from some other script, but that better be what you mean.
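Here is a quick sketch of that `\d` caveat, using an Arabic-Indic digit as the "digit from some other script". The constant and helper names are invented for this example:

```ruby
# Hypothetical names for illustration only.
ARABIC_INDIC_THREE = "\u0663" # "٣", a decimal digit outside ascii

# True if the whole string is a single character matched by \d.
def ascii_digit?(ch)
  !(ch =~ /\A\d\z/).nil?
end

# True if the whole string is a single character matched by [[:digit:]].
def unicode_digit?(ch)
  !(ch =~ /\A[[:digit:]]\z/).nil?
end

puts ascii_digit?("3")                   # => true
puts ascii_digit?(ARABIC_INDIC_THREE)    # => false
puts unicode_digit?(ARABIC_INDIC_THREE)  # => true
```

If you go on to call `to_i` on whatever matched, the ascii-only behaviour of `\d` is probably exactly what you want. If you are only stripping or classifying characters, it may not be.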

