Multi-byte Strikes Again

Multi-byte support, while by and large now a well-supported staple of modern programming languages, still trips people up from time to time.  In particular, while UTF-8 is the de facto standard encoding, it does pose a problem when you get down to the character level.

I ran into this today while trying to shrink a potentially large bit of text to a manageable “chunk” of characters.  Ruby, it appears, treats a string as a sequence of 8-bit bytes rather than logical characters.  When grabbing a substring from a string of mixed Japanese and English, you end up with the dreaded 文字化け (mo-ji-ba-ke), something now only whispered about in darkened corners by veterans reliving the horror days before Unicode.
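
To make the problem concrete, here is a minimal sketch of what a byte-level cut does to mixed text. It assumes a Ruby with encoding-aware strings (1.9.3 or later), where ordinary slicing already respects characters, so it uses byteslice to reproduce the old byte-oriented behaviour. The variable names are mine, for illustration only.

```ruby
text = "私のマルチバイト文です。My multi-byte sentence."

# Each of the leading Japanese characters takes three bytes in UTF-8,
# so cutting after four bytes lands in the middle of the second character.
byte_chunk = text.byteslice(0, 4)

# The chunk now ends in a dangling partial character: mojibake in the making.
byte_chunk_is_valid = byte_chunk.valid_encoding?

# The string is longer in bytes than in logical characters.
byte_count = text.bytesize
char_count = text.chars.length

puts "bytes: #{byte_count}, characters: #{char_count}, valid chunk? #{byte_chunk_is_valid}"
```

The takeaway is simply that a chunk size measured in bytes says nothing reliable about where one character ends and the next begins.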

Well, the short of the long is a nice bit of hackery provided by 山下英孝.  Basically, the trick involves taking a slice after grabbing the characters directly from the string.

However, while this is a fun hack, it is still a hack.  The better way to handle this is to use the chars instance method on the String class.  This ensures that each character you get is a logical character (e.g. ‘a’ or ‘あ’) and not one of the physical bytes that make it up.

In summary, you want to use:

multi_byte_string = "私のマルチバイト文です。My multi-byte sentence."
# this is a hack of the physical characters
# you can, of course, use this, but you should not
# Instead, this is a much better approach, which uses
# all the UTF-8 goodness to get logical characters from the string
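
Since the two calls those comments describe are not shown, here is a rough sketch of what each approach might look like. The scan-based line is only my guess at the general shape of the hack, not the original snippet, and chunk_size is a hypothetical length chosen for illustration. It assumes a Ruby with encoding-aware strings (1.9 or later) where chars yields an array; older setups may behave differently.

```ruby
multi_byte_string = "私のマルチバイト文です。My multi-byte sentence."
chunk_size = 5 # hypothetical chunk length, for illustration only

# Hack (assumed form): match one character at a time with a regex,
# slice the resulting array, then glue the pieces back together.
hacked_chunk = multi_byte_string.scan(/./m)[0, chunk_size].join

# Preferred: ask the string for its logical characters directly.
logical_chunk = multi_byte_string.chars.first(chunk_size).join

puts hacked_chunk
puts logical_chunk
```

Both lines should produce the same chunk on such a Ruby; the difference is that chars states the intent outright instead of leaning on regex behaviour.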

