ByteString Literals
There has recently been some interesting discussion in the
“Surprising behavior of ByteString literals via IsString” issue for
the bytestring package. The IsString instance does not work with
multi-byte characters, and this still catches some people by surprise.
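To make the surprise concrete, here is a minimal sketch of the truncation. It relies on Data.ByteString.Char8.pack, which is documented to keep only the low 8 bits of each Char. The helper names char8Bytes and isStringBytes are hypothetical, and the claim that the IsString instance behaves the same way is an assumption based on the bytestring versions I am aware of, so the program only prints the comparison rather than asserting it.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.String (fromString)
import Data.Word (Word8)

-- | Bytes produced by Char8-style packing, which keeps only the low
-- 8 bits of each Char. (Hypothetical helper name.)
char8Bytes :: String -> [Word8]
char8Bytes = BS.unpack . BC.pack

-- | Bytes produced by the IsString instance, which is what
-- OverloadedStrings uses for ByteString literals. (Hypothetical helper name.)
isStringBytes :: String -> [Word8]
isStringBytes s = BS.unpack (fromString s :: BS.ByteString)

main :: IO ()
main = do
  -- U+697D is cut down to its low byte 0x7D, so this prints [125].
  print (char8Bytes "楽")
  -- Assumption: the IsString instance truncates the same way.
  print (isStringBytes "楽" == char8Bytes "楽")
```

The single byte 0x7D is the ASCII character '}', which is not valid output for this character in any encoding.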
As a software developer in Japan, where multi-byte strings are
ubiquitous, I have learned to be very careful when using ByteStrings.
I prefer to use Text to represent strings. This works very well when
using Unicode encodings, as you can use the Data.Text.Encoding API to
decode from and encode to ByteString when necessary.
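As a rough illustration of that workflow, the sketch below encodes a Text value to UTF-8 and decodes it again using encodeUtf8 and decodeUtf8' from Data.Text.Encoding. The wrapper names toUtf8 and fromUtf8 are hypothetical; decodeUtf8' is used so that invalid input is reported as a value instead of an exception.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Encode Text as UTF-8 bytes. (Hypothetical wrapper name.)
toUtf8 :: T.Text -> BS.ByteString
toUtf8 = TE.encodeUtf8

-- | Decode UTF-8 bytes, returning Left with a message on invalid input.
-- (Hypothetical wrapper name.)
fromUtf8 :: BS.ByteString -> Either String T.Text
fromUtf8 = either (Left . show) Right . TE.decodeUtf8'

main :: IO ()
main = do
  let greeting = T.pack "楽しい"
      bytes = toUtf8 greeting
  -- Each of these three characters takes three bytes in UTF-8: prints 9.
  print (BS.length bytes)
  -- Decoding the encoded bytes gives back the original Text: prints True.
  print (fromUtf8 bytes == Right greeting)
```

With Unicode encodings this round trip is lossless, which is what makes Text a comfortable default representation.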
Japanese companies often use JIS encodings such as EUC-JP,
ISO-2022-JP, Shift JIS, and CP932, however. One can still use Text in
some cases, using text-icu or iconv to decode from and encode to
ByteString. Such encodings include characters that do not map
one-to-one to Unicode, though: decoding to Text and then encoding back
to ByteString does not always produce the same bytes. When this is an
issue, one has to work with the ByteStrings directly. I have even
worked with some legacy systems that used bygone JIS encodings that
are much older than those listed above; that was “fun.”
When I need (short) ByteString constants that include multi-byte
characters (in whatever encoding), I usually just write out the bytes
in a list. Never use OverloadedStrings/IsString for this.
    message :: ByteString
    message = BS.pack
      [ 0xB3, 0xDA, 0xA4, 0xB7, 0xA4, 0xA4, 0xA4, 0xC7
      , 0xA4, 0xB9, 0xA4, 0xCD, 0xA1, 0xC4, 0x0A
      ]
This silly example (full code) prints an EUC-JP-encoded message. You can pipe it through iconv to view it in a UTF-8 terminal:
$ ./EUC-JP.hs | iconv -f euc-jp -t utf-8
楽しいですね…
On a slightly related topic, how does TTC convert from String to
ByteString? TTC works with UTF-8 encoded strings (documentation), so a
String is first converted to Text and then encoded to a UTF-8
ByteString (code):
    instance Textual String where
      ...
      toBS = TE.encodeUtf8 . T.pack
      ...
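The same composition can be tried outside of TTC. The following is only a sketch of the technique, not TTC's code: stringToUtf8 is a hypothetical stand-alone function that applies T.pack and then TE.encodeUtf8 to a String.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical stand-alone helper: convert a String to Text, then
-- encode the Text as UTF-8 bytes.
stringToUtf8 :: String -> BS.ByteString
stringToUtf8 = TE.encodeUtf8 . T.pack

main :: IO ()
main = do
  -- U+697D is encoded as the three bytes E6 A5 BD: prints [230,165,189].
  print (BS.unpack (stringToUtf8 "楽"))
  -- ASCII characters stay one byte each: prints 5.
  print (BS.length (stringToUtf8 "hello"))
```

Unlike the IsString instance, this keeps every character intact. One caveat is that T.pack replaces invalid scalar values such as lone surrogates with U+FFFD, so those are not preserved.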