ruby - How do I remove non UTF-8 characters from a String?
I need to remove non-UTF-8 characters from a string. Here is a snap of the text.
This is how it looks when I open the string in NPP and set the encoding to UTF-8:
I think ACK and FF are non-UTF-8 characters.
I tried str.scrub and str.encode. Neither of them seems to work: scrub returns the same result, and encode results in an error.
We have a few problems.
The biggest is that a Ruby String stores arbitrary bytes along with a supposed encoding, with no guarantee that the bytes are valid in that encoding, and with no obvious reason for that encoding to have been chosen. (I might be biased as a heavy user of Python 3. We would never speak of "changing a string from one encoding to another".)
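To make the "bytes plus a label" point concrete, here is a minimal sketch. The variable names are only illustrative. It relabels the same two bytes with force_encoding and checks that the bytes never change, only Ruby's interpretation of them.

```ruby
# The same two bytes under three different encoding labels.
raw       = "\xC3\xA9".b                                   # tagged ASCII-8BIT (binary)
as_utf8   = raw.dup.force_encoding(Encoding::UTF_8)        # relabel only, no conversion
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)   # relabel only, no conversion

# force_encoding never touches the underlying bytes...
same_bytes = [raw, as_utf8, as_latin1].map(&:bytes).uniq.size == 1

# ...but the label decides how many characters Ruby thinks it sees.
lengths = [raw.length, as_utf8.length, as_latin1.length]

# The label is not a promise: 0xFF can never appear in valid UTF-8,
# yet nothing stops us from tagging it as UTF-8.
bogus       = "\xFF".b.force_encoding(Encoding::UTF_8)
bogus_valid = bogus.valid_encoding?

puts same_bytes          # true
puts lengths.inspect     # [2, 1, 2]
puts bogus_valid         # false
```

So whenever a String misbehaves, it is worth checking both s.encoding and s.valid_encoding? before reaching for a conversion method.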
Fortunately, the editor did not eat your post, but it's hard to see that. I'm guessing that you decoded the string as Windows-1252 in order to display it, which only obscures the issue.
Here's your string of bytes as I see it:
>> s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b
=> "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K"
>> s.bytes
=> [6, 45, 126, 36, 65, 167, 114, 117, 71, 249, 34, 154, 12, 182, 47, 75]
And it does contain bytes that are not valid UTF-8.
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> String::new(s).force_encoding(Encoding::UTF_8).valid_encoding?
=> false
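This may also explain what happened with the two attempts in the question. The sketch below assumes the original str was tagged ASCII-8BIT like s here, which is only a guess. In that case scrub has nothing to fix and a bare encode has no mapping for the high bytes.

```ruby
s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b

# Every byte is "valid" in ASCII-8BIT, so scrub finds nothing to replace
# and hands back an identical copy.
scrub_was_noop = (s.scrub == s)

# A bare encode has no UTF-8 mapping for binary bytes >= 0x80, so it raises.
encode_error =
  begin
    s.encode(Encoding::UTF_8)
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

puts scrub_was_noop        # true
puts encode_error.inspect  # Encoding::UndefinedConversionError
```

If the string had been tagged UTF-8 instead, scrub would have replaced the invalid bytes, so the label matters as much as the method.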
We can ask Ruby to decode this as UTF-8 and insert � wherever it encounters bytes that are not valid UTF-8:
>> s.encode('utf-8', 'binary', :undef => :replace)
=> "\u0006-~$A�ruG�\"�\f�/K"
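Since the question asks to remove the bytes rather than mark them, here is a hedged sketch of two ways that might be done with an empty replacement string. The variable names are illustrative. The last step is an assumption about what the asker wants, because ACK (0x06) and FF (0x0C) are valid UTF-8 and survive both approaches.

```ruby
s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b

# Option 1: relabel the bytes as UTF-8, then drop whatever is invalid.
removed = s.dup.force_encoding(Encoding::UTF_8).scrub("")

# Option 2: convert from binary to UTF-8, replacing unmappable bytes with "".
removed_via_encode =
  s.encode("UTF-8", "binary", invalid: :replace, undef: :replace, replace: "")

# ACK and FF are ordinary ASCII control characters, hence valid UTF-8.
# Strip them separately if they are unwanted (assumption: they are).
printable = removed.delete("\u0006\f")

puts removed.inspect    # "\u0006-~$AruG\"\f/K"
puts printable.inspect  # "-~$AruG\"/K"
```

The two options agree on this string, but they are not equivalent in general: converting from 'binary' discards every byte at or above 0x80, including bytes that form valid multi-byte UTF-8 characters, whereas force_encoding followed by scrub keeps valid sequences and drops only the invalid ones.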