Be careful of regexes in a unicode world

Check out the following, which I wrote some time ago:

    # remove non-alphanumeric, excluding apostrophe; replace with space
    title.gsub!(/[^\w\s\']/, ' ') 

See any problem with that? What are \w and \s again? The ruby docs helpfully explain:

/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - A whitespace character: /[ \t\r\n\f]/

See the problem yet?

    "el revolución".gsub(/[^\w\s\']/, ' ')
    # => "el revoluci n"

Oops. ó is not in the class [a-zA-Z0-9_]. \w doesn’t actually mean “a word character” at all, unless your input is only ascii. The docs probably really should warn you about this, describing the class as “an ascii word character”, and warning you to use other metacharacters if you aren’t just dealing with ascii.

Fortunately, ruby also provides some unicode-aware regex character classes, but they’re a lot harder to remember and longer to type. Here it is done right; let’s use unicode-aware spacing instead of `\s` too:

    "el: revolución".gsub(/[^[[:alnum:]][[:space:]]\']/, ' ')
    # => "el  revolución"

Yep, that’s what we wanted. There are several other unicode-aware character classes, apparently defined by POSIX. The docs also say there are a couple of non-POSIX ones, including:

/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation

I wasn’t able to make that work; it didn’t seem to be recognized in my ruby. I am not sure why, and didn’t bother finding out. What works is good enough for me.
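
If you want to see the difference for yourself, here’s a minimal sketch that compares the shorthand classes with their POSIX bracket counterparts, one character at a time. It assumes UTF-8 strings, and `class_matches` is a made-up helper name, not anything from the ruby docs.

```ruby
# Hypothetical helper for illustration: reports which of the four
# character classes match the given single-character string.
def class_matches(ch)
  {
    w:     !!(ch =~ /\w/),
    alnum: !!(ch =~ /[[:alnum:]]/),
    s:     !!(ch =~ /\s/),
    space: !!(ch =~ /[[:space:]]/),
  }
end

p class_matches("a")       # plain ASCII letter
p class_matches("\u00F3")  # o with acute accent, as in "revolucion"
p class_matches("_")       # underscore
p class_matches("\u00A0")  # no-break space
```

On a ruby with unicode-aware POSIX brackets, the accented letter should match `[[:alnum:]]` but not `\w`, the underscore should do the opposite, and the no-break space should match `[[:space:]]` but not `\s`.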

But in a non-ascii world, it turns out, you almost never actually want to use those traditional regex character class metacharacters that many of us have been using for decades. \w and \s, no way. \d is less risky since you probably really do mean 0-9 and not digits from some other script, but that better be what you mean.
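
To check what `\d` really means, here’s a quick sketch, again assuming UTF-8 strings. The constant and method names are made up for illustration.

```ruby
# Illustrative sample data (not from the original post).
ASCII_YEAR  = "2024"
ARABIC_YEAR = "\u0662\u0660\u0662\u0664"  # the same year in Arabic-Indic digits

# True when the whole string consists of characters matched by \d.
def all_backslash_d?(str)
  !!(str =~ /\A\d+\z/)
end

# True when the whole string consists of characters matched by [[:digit:]].
def all_posix_digit?(str)
  !!(str =~ /\A[[:digit:]]+\z/)
end

p all_backslash_d?(ASCII_YEAR)
p all_backslash_d?(ARABIC_YEAR)
p all_posix_digit?(ASCII_YEAR)
p all_posix_digit?(ARABIC_YEAR)
```

`\d` should accept only the ASCII string, while `[[:digit:]]` should accept both, so pick whichever one matches what you actually mean.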

One Response to Be careful of regexes in a unicode world

  1. If you want to stick with shorter-and-more-cryptic, ruby also supports \p and \P (as detailed [in the docs](http://www.ruby-doc.org/core-1.9.3/Regexp.html#class-Regexp-label-Character+Properties)):

    ~~~
    "el revolución".gsub(/[^\p{L}\p{Z}\']/, ' ')
    ~~~

    The only ones I ever remember off the top of my head are `\p{L}` for letters, `\p{Z}` for whitespace, and `\p{P}` for punctuation. The rest I always need to look up :-)
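
One caveat worth sketching: `\p{Z}` is the Unicode Separator category, which covers spaces and line/paragraph separators but not control characters such as tab or newline. The comparison below assumes UTF-8 input; both method names are hypothetical, and the second one simply swaps `[[:space:]]` in for `\p{Z}`.

```ruby
# Hypothetical helpers for illustration only.

# Keeps letters, Unicode separators, and apostrophes; blanks out the rest.
def strip_with_z(str)
  str.gsub(/[^\p{L}\p{Z}']/, ' ')
end

# Same idea, but keeps anything [[:space:]] matches (tabs included).
def strip_with_space(str)
  str.gsub(/[^\p{L}[[:space:]]']/, ' ')
end

title = "el:\trevoluci\u00F3n"  # colon, then a tab, then the accented word

p strip_with_z(title)      # the tab is replaced by a space
p strip_with_space(title)  # the tab survives
```

Whether the tab should survive depends on what you are cleaning up, so this is a judgment call rather than a fix. Note also that `\p{L}` alone drops digits; adding `\p{N}` to the class would keep them.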
