text Library Lenient Decoding
I have recently started working on TTC
again, after being pulled off of it for a while. As I wrote in the status report at the end of
June, I am working on a new design that I plan on releasing as a
separate package. The new design addresses many limitations of TTC.
One such limitation is that TTC
only supports lenient decoding of UTF-8. The new design
supports both strict and lenient decoding of many text encodings. Thanks to extensive testing, however, I discovered that lenient decoding of streams is broken in older versions of the text package. I am currently thinking about how to deal with this issue.
Minimal Demonstration
The following code is a minimal demonstration of the issue. The full source is available on GitHub. The program is written as a Cabal script, and you can change the with-compiler option to specify the version of GHC to test with.
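For reference, the header of such a Cabal script might look like the following. This is a sketch assuming cabal-install 3.8 or later (which supports the `{- project: -}` block); the actual script on GitHub may differ.

```haskell
#!/usr/bin/env cabal
{- cabal:
build-depends: base, bytestring, text
-}
{- project:
with-compiler: ghc-9.8.2
-}
```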
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where
import qualified Data.ByteString.Lazy as BSL
import qualified Data.Text.Encoding.Error as TEE
import Data.Text.Lazy ()
import qualified Data.Text.Lazy.Encoding as TLE
good, bad :: BSL.ByteString
good = "\0g\0o\0o\0d"
bad = "\0b\0a\0d\0"

main :: IO ()
main = do
  print good
  print $ TLE.decodeUtf16BEWith TEE.lenientDecode good
  print bad
  print $ TLE.decodeUtf16BEWith TEE.lenientDecode bad
This demonstration defines two lazy ByteString values and decodes them using the UTF-16BE encoding. UTF-16 is a variable-length encoding that encodes each Unicode code point using one or two 16-bit code units (two or four bytes), and the BE suffix specifies that the big-endian byte order is used.
Value good encodes string good, where the first byte of each encoded code point is null. Value bad has a trailing null byte, so it is not valid UTF-16BE. Indeed, any odd number of bytes cannot be valid UTF-16.
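These byte strings can also be produced by the strict encoder in Data.Text.Encoding; a minimal sketch:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Encoding as TE

-- Encoding the string "good" as UTF-16BE pairs each ASCII character
-- with a leading null byte, producing the same bytes as the good
-- value defined above
main :: IO ()
main = print (TE.encodeUtf16BE "good")
```

This prints "\NULg\NULo\NULo\NULd", matching the good value.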
The main function prints the ByteString values as well as the results of decoding them. The issue is with lenient decoding of streams, so it occurs when lazy decoding is performed.
In a version of the text package that does not have the issue, the output is as follows.
"\NULg\NULo\NULo\NULd"
"good"
"\NULb\NULa\NULd\NUL"
"bad\65533"
In this output, the ByteString values are as expected, the decoded good value is correct, and the decoded bad value includes the replacement character instead of the problematic trailing null, as desired. In a version of the text package that has the issue, the decoded bad value is an infinite stream of replacement characters. In my tests, this causes a problematic test to run indefinitely while using lots of processor time.
Note that lenient decoding of UTF-8 works fine. Indeed, TTC does lenient decoding of UTF-8 without issue.
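For comparison, here is a minimal sketch of lazy lenient UTF-8 decoding. The byte 0xFF can never occur in valid UTF-8, so lenient decoding replaces it, and lazy decoding terminates as expected.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Encoding.Error as TEE
import qualified Data.Text.Lazy.Encoding as TLE

-- The invalid byte 0xFF is replaced with U+FFFD (the replacement
-- character); the result is finite, unlike the broken UTF-16 case
main :: IO ()
main = print (TLE.decodeUtf8With TEE.lenientDecode "bad\xFF")
```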
Affected Versions
The CHANGELOG indicates that this issue was fixed in text-2.1.

TTC currently supports GHC 8.8 and above, and I test with the latest releases of those major versions.
| GHC    | text    | Status  |
|--------|---------|---------|
| 8.8.4  | 1.2.4.0 | broken  |
| 8.10.7 | 1.2.4.1 | broken  |
| 9.0.2  | 1.2.5.0 | broken  |
| 9.2.8  | 1.2.5.0 | broken  |
| 9.4.8  | 2.0.2   | broken  |
| 9.6.6  | 2.0.2   | broken  |
| 9.8.2  | 2.1.1   | working |
| 9.10.1 | 2.1.1   | working |
Mitigation
The text package is a GHC boot library that is distributed with GHC. While it is possible to use a version of the package that does not exactly match the distributed version, doing so can cause major headaches with dependency management (especially when using Nix) and generally should not be done in libraries. Using text-2.1 or later with GHC 8.8.4 is probably not feasible anyway.
My current idea for mitigating the issue is to use strict (non-streaming) lenient decoding when a broken version of the text package is being used. This limits what the library can be used for, so I will of course put warnings in the documentation for the affected encodings.
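One way this mitigation could be sketched is with CPP, selecting the implementation via the MIN_VERSION_text macro that Cabal provides. The decodeUtf16BELenient helper below is hypothetical, not part of TTC.

```haskell
{-# LANGUAGE CPP #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BSL
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical helper: lenient UTF-16BE decoding of a lazy ByteString
decodeUtf16BELenient :: BSL.ByteString -> TL.Text
#if MIN_VERSION_text(2,1,0)
-- Fixed versions: decode lazily, preserving streaming
decodeUtf16BELenient = TLE.decodeUtf16BEWith TEE.lenientDecode
#else
-- Broken versions: force the whole input and decode strictly, giving
-- up streaming but avoiding the infinite stream of replacement
-- characters
decodeUtf16BELenient =
  TL.fromStrict . TE.decodeUtf16BEWith TEE.lenientDecode . BSL.toStrict
#endif

main :: IO ()
main = print (decodeUtf16BELenient "\0b\0a\0d\0")
```

With either branch, the bad value from the demonstration decodes to a finite result ending in the replacement character.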