ByteString Literals
There has recently been some interesting discussion in the
“Surprising behavior of ByteString literals via
IsString” issue for
the bytestring
package. The IsString
instance
truncates each character to 8 bits, so it does not work with multi-byte
characters, and this still catches some people by surprise.
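To make the surprise concrete, here is a minimal sketch of what happens to a Japanese string that goes through the IsString instance. The names viaIsString and truncatedBytes are made up for this illustration; fromString is the function that OverloadedStrings calls for string literals.

```haskell
import qualified Data.ByteString as BS
import Data.String (fromString)
import Data.Word (Word8)

-- Hypothetical name: the ByteString that an OverloadedStrings literal
-- "あい" would produce, built explicitly through IsString.
viaIsString :: BS.ByteString
viaIsString = fromString "あい" -- U+3042, U+3044

-- Each Char keeps only its low 8 bits: 0x3042 -> 0x42, 0x3044 -> 0x44.
-- The result is the two ASCII bytes for "BD", not an encoding of "あい".
truncatedBytes :: [Word8]
truncatedBytes = BS.unpack viaIsString
```

Loading this in GHCi and evaluating truncatedBytes should show two bytes, where any real encoding of the two characters would need at least four.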
As a software developer in Japan, where multi-byte strings are
ubiquitous, I have learned to be very careful when using
ByteStrings. I prefer to use Text
to represent strings. This works very well when using Unicode encodings,
as you can use the Data.Text.Encoding
API to decode from and encode to ByteString when necessary.
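As a small sketch of that workflow, the following uses encodeUtf8 and decodeUtf8' from Data.Text.Encoding. The names greeting, encoded, and decoded are only for illustration, and the decoding error is converted to a String to keep the example short.

```haskell
import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical example value
greeting :: Text
greeting = T.pack "楽しい"

-- Encode to UTF-8 bytes at the boundary (file, socket, and so on).
encoded :: ByteString
encoded = TE.encodeUtf8 greeting

-- Decode incoming bytes; decodeUtf8' reports invalid UTF-8 as a Left
-- instead of throwing an exception.
decoded :: Either String Text
decoded = either (Left . show) Right (TE.decodeUtf8' encoded)
```

With UTF-8, decoding and encoding round-trip cleanly, so Text can be used everywhere inside the program.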
However, Japanese companies often use JIS encodings such
as EUC-JP,
ISO-2022-JP,
Shift JIS, and CP932.
One can still use Text in some cases, using text-icu
or iconv
to decode from and encode to ByteString. Such encodings
include characters that do not map one-to-one to Unicode, though, so
decoding to Text and then encoding back to
ByteString does not always produce the same bytes. When this is an
issue, one has to work with the ByteStrings directly. I
have even worked with some legacy systems that used bygone JIS encodings
that are much older than those listed above; that was
“fun.”
When I need (short) ByteString constants that
include multi-byte characters (in whatever encoding), I usually just
write out the bytes in a list. Never use
OverloadedStrings/IsString for this.
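import Data.ByteString (ByteString)
import qualified Data.ByteString as BS

-- "楽しいですね…" followed by a newline, encoded in EUC-JP
message :: ByteString
message = BS.pack
  [ 0xB3, 0xDA, 0xA4, 0xB7, 0xA4, 0xA4, 0xA4, 0xC7
  , 0xA4, 0xB9, 0xA4, 0xCD, 0xA1, 0xC4, 0x0A
  ]

This silly example (full code) prints an EUC-JP-encoded message. You can pipe it through iconv to view it in a UTF-8 terminal: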
$ ./EUC-JP.hs | iconv -f euc-jp -t utf-8
楽しいですね…
On a slightly related topic, how does TTC
convert from String to ByteString? TTC works
with UTF-8 encoded strings (documentation),
so a String is first converted to Text and
then encoded to a UTF-8 ByteString (code):
instance Textual String where
  ...
  toBS = TE.encodeUtf8 . T.pack
  ...
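To see the effect without depending on TTC itself, here is a minimal sketch that applies the same composition to a String. The function stringToUtf8 is a local stand-in written for this example, not part of the TTC API.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Local stand-in (hypothetical name) using the same composition as the
-- toBS definition quoted above.
stringToUtf8 :: String -> ByteString
stringToUtf8 = TE.encodeUtf8 . T.pack

-- U+3042 becomes the three UTF-8 bytes 0xE3 0x81 0x82.
utf8Bytes :: [Word8]
utf8Bytes = BS.unpack (stringToUtf8 "あ")
```

Unlike the IsString instance for ByteString, this keeps every character intact, at the cost of always producing UTF-8.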