removing illegal bytes for encoding in ruby 1.9+ strings

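Update 2014: See ruby 2.1 String#scrub, or my scrub_rb gem for a pure-ruby ‘polyfill’ in other ruby versions. Also note that, as the comments below show, the “binary” trick described in this post does not behave as described: it replaces every non-ASCII byte, including the bytes of perfectly valid characters.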


So it turns out you can have ruby strip illegal bytes for any arbitrary encoding (like UTF-8), or replace them with “?” or the unicode replacement char “�”.

You’ve got to use the second argument to String#encode, “source encoding”, and pass “binary” there.

# replace any bad bytes in `str` with unicode replacement
# char
str = str.encode( "UTF-8", "binary", :invalid => :replace, :undef => :replace)

# or without assuming our string is UTF-8, just replace
# bad bytes in the string regardless of its encoding:
str = str.encode( str.encoding, "binary", :invalid => :replace, :undef => :replace)

# or of course the in-place mutating version
str.encode!( str.encoding, "binary", :invalid => :replace, :undef => :replace)

Which actually doesn’t make a lot of sense. “binary”, also called “ASCII-8BIT”, is essentially the “null encoding”: it means “no encoding at all, just bytes”. So that call would seem to say “transcode from ‘raw bytes’ to UTF8”, which of course doesn’t mean anything; there is no such transformation defined.

But apparently what it means to ruby is “don’t trans-code, but do be willing to respect the :invalid => :replace and :undef => :replace options.”
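
It’s worth sanity-checking what that call does to a string that is already valid UTF-8 but contains non-ASCII characters. This is a minimal sketch with made-up variable names, using the same options as above. Since the source is declared “binary”, every byte above 0x7F has no defined conversion to UTF-8, so :undef => :replace swaps each such byte for the replacement char whether or not it was part of a valid character.

```ruby
# "café" as valid UTF-8: the "é" is the two bytes C3 A9.
valid_utf8 = "caf\u00E9"

# Same call as above: source encoding declared as "binary".
converted = valid_utf8.encode("UTF-8", "binary",
                              :invalid => :replace, :undef => :replace)

# Both bytes of the (valid) "é" come back as U+FFFD.
puts converted.inspect

# Plain ASCII survives untouched, since bytes 0x00-0x7F do convert.
ascii_only = "plain ascii".encode("UTF-8", "binary",
                                  :invalid => :replace, :undef => :replace)
puts ascii_only.inspect
```

So the call does leave you with a valid string, but only because it throws away everything that isn’t ASCII.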

If you just do str.encode( str.encoding, :invalid => :replace, :undef => :replace), it’s always a no-op: ruby stdlib says “It’s already IN that encoding, I don’t need to do anything, done!”, and doesn’t touch your invalid bytes to replace them.

This isn’t, as far as I know, documented anywhere. It’s not, in my opinion, very obvious at all.  But, there it is.  I found this out in a blog post that I’ve unfortunately lost so I can’t give credit where it’s due — I have no idea how they discovered it, they just dropped it in passing in their blog as if it was something anyone might know.

The long history of this realization

So, I need to do this. I have input which is theoretically in UTF8. But it sometimes has bad bytes in it — bytes that are illegal for UTF8.

Which means as soon as you try to do much of anything with it, you’ll get an Encoding::InvalidByteSequenceError. You can rescue this exception, or check #valid_encoding? as soon as you read the input to discover it in advance, but then what? I guess you could just refuse to do anything else with that input, and say “Skipped that guy, it was illegal.”
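
For concreteness, here is a small sketch of both detection routes. The sample string and variable names are invented for illustration, and transcoding to UTF-16 is just one convenient operation that trips over the bad byte.

```ruby
# Input that claims to be UTF-8 but contains 0xFF, which is never legal in UTF-8.
raw = "abc\xFFdef".dup.force_encoding("UTF-8")

# Route 1: check up front, right after reading the input.
looks_ok = raw.valid_encoding?
puts "valid? #{looks_ok}"

# Route 2: rescue the exception when an operation hits the bad byte.
error_class = nil
begin
  raw.encode("UTF-16")
rescue Encoding::InvalidByteSequenceError => e
  error_class = e.class
  puts "rescued #{e.class}: #{e.message}"
end
```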

But often, what I want to do instead is recover and continue, replacing the bad bytes with question marks to let the user know there was a bad byte which could not be interpreted (or sometimes with the empty string, to just ignore it). This doesn’t seem like a weird thing to do. Plenty of other software does it, after all: open up a UTF8 doc with bad bytes in it in vi and see what happens. I’d think that would make it fairly obvious this is an ordinary thing to do.
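
On ruby 2.1 or later, String#scrub does exactly this. A minimal sketch, assuming ruby 2.1+ and using an invented sample string:

```ruby
# UTF-8 text with one illegal byte (0xFF) in the middle.
dirty = "abc\xFFdef".dup.force_encoding("UTF-8")

marked   = dirty.scrub("?")  # replace each bad sequence with a question mark
stripped = dirty.scrub("")   # or drop the bad bytes entirely
default  = dirty.scrub       # no argument: unicode replacement char U+FFFD

puts marked    # abc?def
puts stripped  # abcdef
puts default.inspect
```

All three results are valid UTF-8, and the original string is left unmodified (scrub! is the in-place version).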

But for some reason, I had a lot of trouble convincing anyone else in rubydom that this is something you’d ever want to do. Except my fellow library programmers, almost all of whom were like “Oh yeah, I need to do that all the time too.” Apparently our domain is such that we need to do this often, but most ruby devs don’t, I dunno.

I tried blogging the question, and posting my blog to reddit as a question.  People either didn’t understand what I was asking, or tried to convince me I didn’t really want to do that after all, or else didn’t have any solution. (Perhaps my attempt at an engaging title back-fired and made people defensive, sorry).  I tried asking on stackoverflow, same thing.

Encouraged by drbrain to do so, I filed as a bug with ruby the fact that the String stdlib was missing an API to easily remove bad bytes. The response was again mostly to say they didn’t understand the use case and it didn’t seem necessary. But even on the ruby tracker, nobody realized it was already in the stdlib! They instead argued that there was no need for it in stdlib, ha.

But I still needed to do it, and not just for strings in UTF8: sometimes a library function has to work on a string of any arbitrary encoding, replacing or removing the bad bytes in it.

And it wasn’t completely obvious how to do this, although it ended up not being too hard or complicated.
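
Here is a rough sketch of one pure-ruby way to do it that doesn’t depend on any particular encoding. The method name replace_invalid_chars is made up for this example, and this is not necessarily how the gem does it. It assumes that each_char hands back each illegal byte as its own one-byte “character” for which valid_encoding? is false.

```ruby
# Hypothetical helper: returns a copy of str with every illegal byte
# replaced by `replacement` (pass "" to remove them instead).
def replace_invalid_chars(str, replacement = "?")
  return str if str.valid_encoding?

  # Build the result in the same encoding as the input.
  cleaned = String.new.force_encoding(str.encoding)
  str.each_char do |ch|
    cleaned << (ch.valid_encoding? ? ch : replacement)
  end
  cleaned
end

utf8_bad = "abc\xFFdef".dup.force_encoding("UTF-8")
puts replace_invalid_chars(utf8_bad)      # abc?def
puts replace_invalid_chars(utf8_bad, "")  # abcdef

# Works the same for other encodings: 0xFF is not a legal byte in EUC-JP.
euc_bad = [0xA4, 0xA2, 0xFF].pack("C*").force_encoding("EUC-JP")
euc_clean = replace_invalid_chars(euc_bad)
puts euc_clean.encoding
puts euc_clean.valid_encoding?
```

Note this handles bad input one byte at a time, so a truncated multi-byte character comes out as several replacements rather than one.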

So I went and wrote my own gem to do it.  drbrain kindly showed me a way to make my gem more reliable and efficient, even though he presumably still didn’t understand why I’d ever want to do this.

Turns out it was built into stdlib all along, but I never knew it until recently, about 10 months after I first started asking about it.

I guess I’ll release a new version of my gem that simply wraps using String#encode with a binary source_encoding argument.

Meanwhile, most of rubydom still won’t understand why anyone would ever want to do this. (Thanks fellow code4libbers for keeping me sane). If you are still unconvinced of why this is a perfectly ordinary thing to do, I’ve learned I’m incapable of explaining it or convincing you, so I won’t try anymore.

I hope it’s not an accident in the ruby stdlib, and won’t go away in the future. If it does, I guess I can go back to my gem with its fairly simple implementation. If it is intentional, it seems like it would be nice if it were actually documented. But in the meantime, maybe this blog post will be findable by google, and save someone else that needs this function all the tzuris I went through to get to it!


6 Responses to removing illegal bytes for encoding in ruby 1.9+ strings

  1. Patrick says:

    Are you sure this is correct? It seems like transcoding from ‘binary’ just drops everything outside ASCII-8BIT:

    Example with valid UTF-8 string:
    '''
    2.0.0p195 :066 > temp = "集合Leap Motion应用的应用商店Airspace进入开发者测试阶段"
    => "集合Leap Motion应用的应用商店Airspace进入开发者测试阶段"
    2.0.0p195 :067 > temp.valid_encoding?
    => true
    2.0.0p195 :068 > temp.encoding
    => #<Encoding:UTF-8>
    2.0.0p195 :069 > temp.encode('UTF-8', 'binary', :undef => :replace, :invalid => :replace)
    => "������Leap Motion���������������������Airspace���������������������������"
    '''

  2. jrochkind says:

    Huh, your test sure seems to show it isn’t so. I swear it used to be so. This stuff IS confusing.

    Could it be a ruby 1.9.3 vs 2.0 thing? Nope, I can reproduce your failure case in ruby 1.9.3 too.

    I guess I got confused and was wrong? Drat. Thanks for pointing it out. Not sure what’s going on, honestly.

    Oh well, there’s still https://github.com/jrochkind/ensure_valid_encoding

  3. jrochkind says:

    Wow, that’s great Patrick thanks! Not til 2.1 though, heh.

    When I’ve tried reporting this as a problem/bug/feature-request, in public or in the ruby tracker, pretty much _everyone_ that responded said “I don’t understand why you’d ever want to do that.”

    Which is odd to me, cause me and all of my colleagues in my domain need to do it all the time. But I guess some ruby committer eventually had the same idea, great!

  4. Pingback: Benchmarking ruby Unicode normalization alternatives | Bibliographic Wilderness

  5. Ken D'Ambrosio says:

    O. M. G. Thank you so, so much. I don’t understand why most of the community doesn’t seem to understand that we frequently have to parse text (e.g., e-mail) that can have some truly funky encodings. I’ve tried most everything shy of pulling out my hair — this is much appreciated.
