In ruby 1.9.3, every string is tagged with its character encoding, a fact which we all know and (um) love.
Certain byte sequences are invalid in any given encoding.
If you’re converting from one encoding to another with String#encode, ruby will raise an exception if it encounters such an invalid byte sequence; or you can have it replace the invalid byte sequence with a replacement char.
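To make that concrete, here is a small sketch of both behaviors. The sample string and the choice of UTF-16LE as the target encoding are just for illustration; `valid_encoding?` and the `:invalid => :replace` option are standard `String` methods and options.

```ruby
# "\xC3" opens a two-byte UTF-8 sequence, but "\x28" ("(") is not a
# valid continuation byte, so this string is invalid as UTF-8.
bad = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")

is_valid = bad.valid_encoding?   # false

# By default, transcoding raises when it reaches the bad byte.
error_class = nil
begin
  bad.encode("UTF-16LE")
rescue Encoding::InvalidByteSequenceError => e
  error_class = e.class
end

# With :invalid => :replace, the bad byte becomes a replacement char
# (U+FFFD here, because the target is a Unicode encoding).
replaced = bad.encode("UTF-16LE", :invalid => :replace)
```

Note that the only thing that triggers the check here is the transcoding itself, which is exactly the limitation the rest of this post is about.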
But what if you’re not converting from one encoding to another, but you still want to check for and/or repair/replace bad bytes?
Oddly, as far as I can tell ruby gives you no way to do it. Clearly, this feature should be added. If I were any good at understanding ruby’s C code, maybe I’d try to submit a patch, but it’s kind of all Greek to me. But here’s what I think is the relevant file in MRI, if you have more C-fu than I.
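A note for anyone reading this on a newer interpreter: Ruby 2.1 and later do ship `String#scrub`, which replaces invalid bytes without transcoding. It is not available on 1.9.3, so the sketch below assumes a 2.1+ ruby and does not help with the problem as stated here.

```ruby
# Requires Ruby 2.1 or later; String#scrub does not exist in 1.9.3.
bad = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")

default_fix = bad.scrub        # "bad: \uFFFD( okay"
star_fix    = bad.scrub("*")   # "bad: *( okay"

# The block form receives the offending bytes, handy for logging them.
hex_fix = bad.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }  # "bad: <c3>( okay"
```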
One really lame way would be to transcode to a different encoding to take advantage of the features in #encode, then transcode back again when you’re done. But it has to be an encoding which can be ‘round tripped’ to your original one, which means you can’t really write a general purpose solution for any old input encoding, and besides that’s just a really lame solution.
So here’s an (also) really lame pure-ruby implementation. It’s probably not very performant. It tries to be somewhat compatible with #encode’s :invalid and :replace options. (The way those options work is a weird API if you ask me, but it seems useful to be consistent with the options you’d use if you were transcoding.) I would love it if someone else could give us a better solution or improve this code!
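For the record, the round-trip trick looks roughly like this when the input is UTF-8. The helper name `scrub_utf8_via_round_trip` is made up for this sketch, and it is hard-wired to UTF-8 input with UTF-16LE as the intermediate encoding, which is exactly why it does not generalize.

```ruby
# Hypothetical helper: only meant for strings tagged as UTF-8.
# UTF-16LE is used as the intermediate encoding because any Unicode
# text can be converted to it and back without loss.
def scrub_utf8_via_round_trip(str, replacement = "\uFFFD")
  str.encode("UTF-16LE", :invalid => :replace, :replace => replacement)
     .encode("UTF-8")
end

bad = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")

fixed      = scrub_utf8_via_round_trip(bad)        # "bad: \uFFFD( okay"
fixed_star = scrub_utf8_via_round_trip(bad, "*")   # "bad: *( okay"
```

It works, but it pays for two full transcodes just to get at a validity check.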
# Pass in a string, will raise an Encoding::InvalidByteSequenceError
# if it contains an invalid byte for its encoding; otherwise
# returns an equivalent string.
#
# OR, like String#encode, pass in option `:invalid => :replace`
# to replace invalid bytes with a replacement string in the
# returned string. Pass in the
# char you'd like with option `:replace`, or will, like String#encode
# use the unicode replacement char if it thinks it's a unicode encoding,
# else ascii '?'.
#
# in any case, method will raise, or return a new string
# that is #valid_encoding?
def validate_encoding(str, options = {})
  str.chars.collect do |c|
    if c.valid_encoding?
      c
    else
      unless options[:invalid] == :replace
        # it ought to be filled out with all the metadata
        # this exception usually has, but what a pain!
        # Why isn't ruby doing this for us?
        raise Encoding::InvalidByteSequenceError.new
      else
        options[:replace] || (
          # surely there's a better way to tell if
          # an encoding is a 'Unicode encoding form'
          # than this? What's wrong with you ruby 1.9?
          str.encoding.name.start_with?('UTF') ? "\uFFFD" : "?"
        )
      end
    end
  end.join
end
As you can see, there are several weird things here that make me think “why wasn’t ruby 1.9 designed more carefully?” I mean, String#encode talks about doing things differently with certain encodings that are ‘unicode encoding forms’, so why isn’t there a method #unicode_encoding_form? on Encoding? What the heck? Anyway.
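Lacking that, a slightly less hand-wavy stand-in than `start_with?('UTF')` might be an explicit list of names. Both the constant and the method below are hypothetical, nothing like them exists on `Encoding`, and the list is my own guess at what should count as a Unicode encoding form.

```ruby
# Hypothetical: ruby has no built-in notion of a "Unicode encoding form".
# This list is an assumption about which encodings should qualify.
UNICODE_ENCODING_NAMES = %w[
  UTF-8 UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE
].freeze

# Takes an Encoding object, e.g. some_string.encoding.
def unicode_encoding_form?(encoding)
  UNICODE_ENCODING_NAMES.include?(encoding.name)
end

utf8_result   = unicode_encoding_form?("abc".encode("UTF-8").encoding)   # true
latin1_result = unicode_encoding_form?(Encoding::ISO_8859_1)             # false
```

Unlike the prefix check, this will not accidentally match oddballs such as UTF-7, at the cost of having to maintain the list by hand.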
ruby-1.9.3-rc1 :002 > a = "bad: \xc3\x28 okay".force_encoding("utf-8")
 => "bad: \xC3( okay"
ruby-1.9.3-rc1 :003 > validate_encoding(a)
Encoding::InvalidByteSequenceError: Encoding::InvalidByteSequenceError
	from check_encoding.rb:9:in `block in validate_encoding'
	from check_encoding.rb:2:in `chars'
	from check_encoding.rb:2:in `each'
	from check_encoding.rb:2:in `collect'
	from check_encoding.rb:2:in `validate_encoding'
	from (irb):3
	from /Users/jrochkind/.rvm/rubies/ruby-1.9.3-rc1/bin/irb:16:in `<main>'
ruby-1.9.3-rc1 :004 > validate_encoding(a, :invalid => :replace)
 => "bad: �( okay"
ruby-1.9.3-rc1 :005 > validate_encoding(a, :invalid => :replace, :replace => "*")
 => "bad: *( okay"
ruby-1.9.3-rc1 :006 > validate_encoding(a, :invalid => :replace, :replace => "")
 => "bad: ( okay"