checking/fixing bad bytes in ruby 1.9 char encoding

In ruby 1.9.3, every string is tagged with its character encoding, a fact which we all know and (um) love.

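For any given encoding, certain byte sequences are invalid. In UTF-8, for example, the byte \xC3 has to be followed by a continuation byte, so \xC3\x28 is a bad sequence.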

If you’re converting from one encoding to another with String#encode, ruby will raise an exception if it encounters such an invalid byte sequence; or you can have it replace the invalid byte sequence with a replacement char.
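To make those two behaviours concrete, here's a minimal sketch of transcoding a string that holds a bad UTF-8 byte. The sample string is the same one used in the irb session further down; picking ISO-8859-1 as the target encoding is just my assumption for illustration.

```ruby
# A UTF-8 tagged string where \xC3 is followed by a non-continuation byte.
bad = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")
bad.valid_encoding?   # false

# Default behaviour: transcoding raises on the invalid byte.
begin
  bad.encode("ISO-8859-1")
  raised = false
rescue Encoding::InvalidByteSequenceError
  raised = true
end

# With :invalid => :replace the bad byte is swapped for a replacement,
# which defaults to "?" when the target is not a Unicode encoding.
replaced = bad.encode("ISO-8859-1", :invalid => :replace)
custom   = bad.encode("ISO-8859-1", :invalid => :replace, :replace => "*")
```

Note that both calls return a string in the target encoding, which is exactly what you don't want if you only meant to clean up the original.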

But what if you’re not converting from one encoding to another, but you still want to check for and/or repair/replace bad bytes?

Oddly, as far as I can tell ruby gives you no way to do it. Clearly, this feature should be added. If I was any good at understanding ruby’s C code maybe I’d try to submit a patch, but it’s kind of all greek to me. But here’s what I think is the relevant file in MRI, if you have more C-fu than I.
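For what it's worth, later versions of ruby did add this: String#scrub arrived in Ruby 2.1. The sketch below assumes Ruby 2.1 or newer, so it won't help on 1.9.3 itself.

```ruby
# Requires Ruby 2.1+ (String#scrub does not exist in 1.9.3).
bad = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")

default_fix = bad.scrub        # bad bytes become U+FFFD for a UTF-8 string
star_fix    = bad.scrub("*")   # or a replacement of your choosing
empty_fix   = bad.scrub("")    # or nothing at all

# The block form receives the offending bytes, here shown as hex.
hex_fix = bad.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }
```

If you're stuck on 1.9, read on for the workarounds.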

One really lame way would be to transcode to a different encoding to take advantage of the features in #encode, then transcode back again when you’re done. But it has to be an encoding which can be ’round tripped’ to your original one, which means you can’t really write a general purpose solution for any old input encoding, and besides that’s just a really lame solution.
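For the record, that round-trip hack looks roughly like this. The helper name is hypothetical, and the sketch assumes a UTF-8 input, which can go to UTF-16LE and back without loss.

```ruby
# Hypothetical helper: only sensible for encodings that round-trip
# losslessly through UTF-16LE (UTF-8 is the assumed case here).
def replace_bad_bytes_via_round_trip(str)
  original = str.encoding
  # The first hop does the replacing; the second hop restores the encoding.
  str.encode("UTF-16LE", :invalid => :replace).encode(original)
end

bad   = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")
fixed = replace_bad_bytes_via_round_trip(bad)
fixed.valid_encoding?   # true
```

With a non-Unicode source encoding, the replacement char inserted on the first hop may not exist in the original encoding, so the trip back can fail. That's the 'round trip' problem in practice.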

So here’s an (also) really lame pure-ruby implementation. It’s probably not very performant. It tries to be somewhat compatible with #encode’s :invalid and :replace options. (The way those options work is a weird api if you ask me, but seems useful to be consistent with the options you’d use if you were transcoding). I would love it if someone else could give us a better solution or improve this code!

# Pass in a string; raises an Encoding::InvalidByteSequenceError
# if it contains an invalid byte for its encoding, otherwise
# returns an equivalent string.
#
# OR, like String#encode, pass in option `:invalid => :replace`
# to replace invalid bytes with a replacement string in the
# returned string. Pass in the char you'd like with option
# `:replace`; otherwise, like String#encode, it will use the
# unicode replacement char if it thinks it's a unicode encoding,
# else ascii '?'.
#
# In any case, the method will either raise or return a new string
# that is #valid_encoding?
def validate_encoding(str, options = {})
  str.chars.collect do |c|
    if c.valid_encoding?
      c
    else
      unless options[:invalid] == :replace
        # it ought to be filled out with all the metadata
        # this exception usually has, but what a pain!
        # Why isn't ruby doing this for us?
        raise  Encoding::InvalidByteSequenceError.new
      else
        options[:replace] || (
         # surely there's a better way to tell if
         # an encoding is a 'Unicode encoding form'
         # than this? What's wrong with you ruby 1.9?
         str.encoding.name.start_with?('UTF') ?
            "\uFFFD" :
            "?" )
      end
    end
  end.join
end

As you can see, there are several weird things here that make me think “why wasn’t ruby 1.9 designed more carefully?” I mean, String#encode talks about doing things differently with certain encodings that are ‘unicode encoding forms’, so why isn’t there a method #unicode_encoding_form? on Encoding? What the heck? Anyway.
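Lacking such a method, one option is to keep an explicit list instead of sniffing the encoding name. This is only a sketch: the helper names are hypothetical, and the list of five encodings is my assumption about what the String#encode docs mean by 'unicode encoding forms'. It also hands back the replacement in the string's own encoding, so a UTF-16 string doesn't end up with UTF-8 bytes mixed in.

```ruby
# Hypothetical helper: true for encodings assumed to be Unicode encoding forms.
def unicode_encoding?(encoding)
  [
    Encoding::UTF_8,
    Encoding::UTF_16BE, Encoding::UTF_16LE,
    Encoding::UTF_32BE, Encoding::UTF_32LE
  ].include?(encoding)
end

# Hypothetical helper: default replacement string, expressed in `encoding`.
def default_replacement_for(encoding)
  char = unicode_encoding?(encoding) ? "\uFFFD" : "?"
  char.encode(encoding)
end

unicode_encoding?(Encoding::UTF_8)        # true
unicode_encoding?(Encoding::ISO_8859_1)   # false
```

An explicit list also avoids false positives from names that merely start with 'UTF' without being one of these five.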

ruby-1.9.3-rc1 :002 > a = "bad: \xc3\x28 okay".force_encoding("utf-8")
 => "bad: \xC3( okay"
ruby-1.9.3-rc1 :003 > validate_encoding(a)
Encoding::InvalidByteSequenceError: Encoding::InvalidByteSequenceError
	from check_encoding.rb:9:in `block in validate_encoding'
	from check_encoding.rb:2:in `chars'
	from check_encoding.rb:2:in `each'
	from check_encoding.rb:2:in `collect'
	from check_encoding.rb:2:in `validate_encoding'
	from (irb):3
	from /Users/jrochkind/.rvm/rubies/ruby-1.9.3-rc1/bin/irb:16:in `<main>'
ruby-1.9.3-rc1 :004 > validate_encoding(a, :invalid => :replace)
 => "bad: �( okay"
ruby-1.9.3-rc1 :005 > validate_encoding(a, :invalid => :replace, :replace => "*")
 => "bad: *( okay"
ruby-1.9.3-rc1 :006 > validate_encoding(a, :invalid => :replace, :replace => "")
 => "bad: ( okay"