
ByteString Literals

There has recently been some interesting discussion in the “Surprising behavior of ByteString literals via IsString” issue for the bytestring package. The IsString instance does not work with multi-byte characters, and this still catches some people by surprise.
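
To make the surprise concrete, here is a minimal sketch of the truncation. It uses Data.ByteString.Char8.pack, which is documented to truncate each Char to 8 bits, on the assumption that the IsString instance under discussion behaves the same way. The helper name truncatedBytes is hypothetical.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import Data.Word (Word8)

-- | Bytes produced by Char8-style packing, which keeps only the low
-- 8 bits of each code point. (Hypothetical helper for illustration.)
truncatedBytes :: String -> [Word8]
truncatedBytes = BS.unpack . BS8.pack

main :: IO ()
main = do
  -- U+3042 (HIRAGANA LETTER A) is truncated to 0x42, the ASCII letter 'B'.
  print (truncatedBytes "\x3042")  -- [66]
  -- Characters up to U+00FF pass through unchanged.
  print (truncatedBytes "abc")     -- [97,98,99]
```

The result is one byte per character, with no error or warning, which is why the problem is easy to miss.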

As a software developer in Japan, where multi-byte strings are ubiquitous, I have learned to be very careful when using ByteStrings. I prefer to use Text to represent strings. This works very well when using Unicode encodings, as you can use the Data.Text.Encoding API to decode from and encode to ByteString when necessary. Japanese companies often use JIS encodings such as EUC-JP, ISO-2022-JP, Shift JIS, and CP932, however. One can still use Text in some cases, using text-icu or iconv to decode from and encode to ByteString. These encodings include characters that do not map one-to-one to Unicode, though, so decoding to Text and then encoding back to ByteString does not always produce the same bytes. When this is an issue, one has to work with the ByteStrings directly. I have even worked with some legacy systems that used bygone JIS encodings that are much older than those listed above; that was “fun.”
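
For the Unicode case, the decode and encode steps with Data.Text.Encoding might look like the following sketch. The helper name decodeUtf8Maybe is hypothetical; it wraps decodeUtf8' so that bytes in some other encoding (here, two bytes taken from a EUC-JP string) are rejected instead of raising an exception.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Decode UTF-8 bytes, returning Nothing when the input is not valid
-- UTF-8. (Hypothetical helper for illustration.)
decodeUtf8Maybe :: BS.ByteString -> Maybe T.Text
decodeUtf8Maybe = either (const Nothing) Just . TE.decodeUtf8'

main :: IO ()
main = do
  let text = T.pack "\x3042"
  -- Encoding to UTF-8 and decoding again gives back the original Text.
  print (decodeUtf8Maybe (TE.encodeUtf8 text) == Just text)  -- True
  -- 0xB3 cannot start a UTF-8 sequence, so these bytes are rejected.
  print (decodeUtf8Maybe (BS.pack [0xB3, 0xDA]))             -- Nothing
```

Decoding the JIS family of encodings needs a library such as text-icu or iconv instead; that is not shown here.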

When I need (short) ByteString constants that include multi-byte characters (in whatever encoding), I usually just write out the bytes in a list. Never use OverloadedStrings/IsString for this.

message :: ByteString
message = BS.pack
    [ 0xB3, 0xDA, 0xA4, 0xB7, 0xA4, 0xA4, 0xA4, 0xC7
    , 0xA4, 0xB9, 0xA4, 0xCD, 0xA1, 0xC4, 0x0A
    ]

This silly example (full code) prints an EUC-JP-encoded message. You can pipe it through iconv to view it in a UTF-8 terminal:

$ ./EUC-JP.hs | iconv -f euc-jp -t utf-8

On a slightly related topic, how does TTC convert from String to ByteString? TTC works with UTF-8 encoded strings (documentation), so a String is first converted to Text and then encoded to a UTF-8 ByteString (code):

instance Textual String where
  toBS = TE.encodeUtf8 . T.pack
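
The same conversion can be tried outside of TTC with a standalone function. This is only a sketch of the composition shown above, not TTC's API; the name stringToUtf8 is hypothetical.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Convert a String to a UTF-8 encoded ByteString by going through Text.
-- (Hypothetical standalone helper; TTC exposes this via its own class.)
stringToUtf8 :: String -> BS.ByteString
stringToUtf8 = TE.encodeUtf8 . T.pack

main :: IO ()
main = do
  -- U+3042 is encoded as the three bytes 0xE3 0x81 0x82.
  print (BS.unpack (stringToUtf8 "\x3042"))  -- [227,129,130]
  -- ASCII characters are encoded as one byte each.
  print (BS.length (stringToUtf8 "abc"))     -- 3
```

Unlike packing a String directly, this keeps every character intact, at the cost of producing more than one byte for characters outside ASCII.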

Travis Cardwell

