In ruby 1.9.3, every string is tagged with its character encoding, a fact which we all know and (um) love.
Certain byte sequences are invalid in any given encoding.
If you’re converting from one encoding to another with String#encode, ruby will raise an exception if it encounters such an invalid byte sequence; or you can have it replace the invalid byte sequence with a replacement char.
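To make that concrete, here is a small sketch of both behaviours of String#encode. The sample string and the choice of UTF-16LE as the target are just illustrations, not anything special:

```ruby
# A UTF-8 string containing an invalid sequence (\xC3 must be followed
# by a continuation byte, but \x28 is a plain "(").
bad = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")

# The string as a whole can be checked, but that alone does not repair it.
is_valid = bad.valid_encoding?

# Transcoding with no options raises on the invalid byte.
raised = begin
  bad.encode("UTF-16LE")
  false
rescue Encoding::InvalidByteSequenceError
  true
end

# With :invalid => :replace, the bad byte is swapped for the replacement string.
replaced = bad.encode("UTF-16LE", :invalid => :replace, :replace => "?")

puts is_valid                                     # false
puts raised                                       # true
puts replaced == "bad: ?( okay".encode("UTF-16LE") # true
```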
But what if you’re not converting from one encoding to another, but you still want to check for and/or repair/replace bad bytes?
Oddly, as far as I can tell ruby gives you no way to do it. Clearly, this feature should be added. If I were any good at understanding ruby’s C code maybe I’d try to submit a patch, but it’s kind of all Greek to me. But here’s what I think is the relevant file in MRI, if you have more C-fu than I.
One really lame way would be to transcode to a different encoding to take advantage of the features in #encode, then transcode back again when you’re done. But it has to be an encoding which can be ’round tripped’ to your original one, which means you can’t really write a general purpose solution for any old input encoding, and besides that’s just a really lame solution.
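For illustration, a minimal sketch of that round-trip approach might look like the following. The helper name `repair_via_round_trip` is hypothetical, and the sketch assumes the input is in an encoding (such as UTF-8) that converts to UTF-16BE and back without loss, which is exactly the limitation described above:

```ruby
# Hypothetical helper: repair invalid bytes by transcoding to an
# intermediate encoding and back. Assumes str's encoding round-trips
# through UTF-16BE (true for UTF-8; not guaranteed for arbitrary encodings).
def repair_via_round_trip(str, replacement = "?")
  str.encode("UTF-16BE", :invalid => :replace, :replace => replacement).
      encode(str.encoding)
end

bad = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")
repaired = repair_via_round_trip(bad)

puts repaired                 # bad: ?( okay
puts repaired.valid_encoding? # true
```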
So here’s an (also) really lame pure-ruby implementation. It’s probably not very performant. It tries to be somewhat compatible with #encode’s :invalid and :replace options. (The way those options work is a weird API if you ask me, but it seems useful to be consistent with the options you’d use if you were transcoding.) I would love it if someone else could give us a better solution or improve this code!
# Pass in a string; raises an Encoding::InvalidByteSequenceError
# if it contains an invalid byte for its encoding, otherwise
# returns an equivalent string.
#
# OR, like String#encode, pass in option `:invalid => :replace`
# to replace invalid bytes with a replacement string in the
# returned string. Pass in the
# char you'd like with option `:replace`, or it will, like String#encode,
# use the unicode replacement char if it thinks it's a unicode encoding,
# else ascii '?'.
#
# In any case, the method will raise, or return a new string
# that is #valid_encoding?
def validate_encoding(str, options = {})
  str.chars.collect do |c|
    if c.valid_encoding?
      c
    else
      unless options[:invalid] == :replace
        # it ought to be filled out with all the metadata
        # this exception usually has, but what a pain!
        # Why isn't ruby doing this for us?
        raise Encoding::InvalidByteSequenceError.new
      else
        options[:replace] || (
          # surely there's a better way to tell if
          # an encoding is a 'Unicode encoding form'
          # than this? What's wrong with you ruby 1.9?
          str.encoding.name.start_with?('UTF') ?
            "\uFFFD" :
            "?" )
      end
    end
  end.join
end
As you can see, there are several weird things here that make me think “why wasn’t ruby 1.9 designed more carefully?” I mean, String#encode talks about doing things differently with certain encodings that are ‘unicode encoding forms’, so why isn’t there a method #unicode_encoding_form? on Encoding? What the heck? Anyway.
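Lacking such a method, one slightly less fragile alternative to matching on the encoding name is to compare against an explicit list of Encoding constants. This is only a sketch: `unicode_encoding?` and `default_replacement_for` are hypothetical helpers, not part of ruby, and the list below is my assumption about which encodings should count:

```ruby
# Hypothetical helper: true if the encoding is one of the Unicode
# encodings listed here. The list is an assumption, not ruby's own definition.
def unicode_encoding?(encoding)
  [
    Encoding::UTF_8,
    Encoding::UTF_16BE, Encoding::UTF_16LE,
    Encoding::UTF_32BE, Encoding::UTF_32LE
  ].include?(encoding)
end

# Hypothetical helper: pick a default replacement string and transcode it
# to the given encoding so it can be joined with strings in that encoding.
def default_replacement_for(encoding)
  (unicode_encoding?(encoding) ? "\uFFFD" : "?").encode(encoding)
end

puts unicode_encoding?(Encoding::UTF_8)      # true
puts unicode_encoding?(Encoding::ISO_8859_1) # false
```

Transcoding the replacement also sidesteps a compatibility error you would otherwise hit when joining a UTF-8 "\uFFFD" into, say, a UTF-16LE string.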
ruby-1.9.3-rc1 :002 > a = "bad: \xc3\x28 okay".force_encoding("utf-8")
=> "bad: \xC3( okay"
ruby-1.9.3-rc1 :003 > validate_encoding(a)
Encoding::InvalidByteSequenceError: Encoding::InvalidByteSequenceError
from check_encoding.rb:9:in `block in validate_encoding'
from check_encoding.rb:2:in `chars'
from check_encoding.rb:2:in `each'
from check_encoding.rb:2:in `collect'
from check_encoding.rb:2:in `validate_encoding'
from (irb):3
from /Users/jrochkind/.rvm/rubies/ruby-1.9.3-rc1/bin/irb:16:in `<main>'
ruby-1.9.3-rc1 :004 > validate_encoding(a, :invalid => :replace)
=> "bad: �( okay"
ruby-1.9.3-rc1 :005 > validate_encoding(a, :invalid => :replace, :replace => "*")
=> "bad: *( okay"
ruby-1.9.3-rc1 :006 > validate_encoding(a, :invalid => :replace, :replace => "")
=> "bad: ( okay"

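A note for readers on newer rubies: String#scrub, added in ruby 2.1 (so not available in the 1.9.3 discussed here), covers the replace case natively. A quick sketch using the same sample string:

```ruby
# Requires ruby 2.1 or later; String#scrub does not exist in 1.9.3.
bad = "bad: \xC3\x28 okay".dup.force_encoding("UTF-8")

default_scrubbed = bad.scrub      # uses U+FFFD for Unicode encodings
star_scrubbed    = bad.scrub("*") # custom replacement string
removed          = bad.scrub("")  # drop the invalid bytes entirely

puts star_scrubbed # bad: *( okay
puts removed       # bad: ( okay
```

Unlike validate_encoding above, scrub has no raise-on-invalid mode; for that you would still check #valid_encoding? yourself.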