text Library Lenient Decoding
I have recently started working on TTC
again, after being pulled off of it for a while. As I wrote in the status report at the end of
June, I am working on a new design that I plan on releasing as a
separate package. The new design addresses many limitations of TTC.
One such limitation is that TTC
only supports lenient decoding of UTF-8. The new design
supports both strict and lenient decoding of many text encodings. Thanks to extensive testing, however, I discovered that lenient decoding of streams is broken in older versions of the text package. I am currently thinking about how to deal with this issue.
Minimal Demonstration
The following code is a minimal demonstration of the issue. The full source is available on GitHub. The program is written as a Cabal script, and you can change the with-compiler option to specify the version of GHC to test with.
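For reference, the header of such a Cabal script might look like the following. This is a sketch assuming cabal-install 3.8 or later (which supports the `{- project: -}` block); the actual script on GitHub may differ.

```haskell
#!/usr/bin/env cabal
{- cabal:
build-depends: base, bytestring, text
-}
{- project:
with-compiler: ghc-9.8.2
-}
```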
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where
import qualified Data.ByteString.Lazy as BSL
import qualified Data.Text.Encoding.Error as TEE
import Data.Text.Lazy ()
import qualified Data.Text.Lazy.Encoding as TLE
good, bad :: BSL.ByteString
good = "\0g\0o\0o\0d"
bad = "\0b\0a\0d\0"

main :: IO ()
main = do
  print good
  print $ TLE.decodeUtf16BEWith TEE.lenientDecode good
  print bad
  print $ TLE.decodeUtf16BEWith TEE.lenientDecode bad
This demonstration defines two lazy ByteString values and decodes them using the UTF-16BE encoding. UTF-16 is a variable-length encoding that encodes each Unicode code point using one or two 16-bit code units (two or four bytes), and the BE suffix specifies that the big-endian byte order is used.
Value good encodes string good, where the first byte of each encoded code point is null. Value bad has a trailing null byte, so it is not valid UTF-16BE. Indeed, any odd number of bytes cannot be valid UTF-16.
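These byte strings can also be produced by the strict encoder in Data.Text.Encoding; a minimal sketch:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Encoding as TE

-- Encoding the string "good" as UTF-16BE pairs each ASCII character
-- with a leading null byte, producing the same bytes as the good
-- value defined above
main :: IO ()
main = print (TE.encodeUtf16BE "good")
```

This prints "\NULg\NULo\NULo\NULd", matching the good value.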
The main function prints the ByteString values as well as the results of decoding them. The issue is with lenient decoding of streams, so it occurs when lazy decoding is performed.
In a version of the text package that does not have the issue, the output is as follows.
"\NULg\NULo\NULo\NULd"
"good"
"\NULb\NULa\NULd\NUL"
"bad\65533"
In this output, the ByteString values are as expected, the decoded good value is correct, and the decoded bad value includes the replacement character instead of the problematic trailing null, as desired. In a version of the text package that has the issue, the decoded bad value is an infinite stream of replacement characters. In my tests, this causes a problematic test to run indefinitely while using lots of processor time.
Note that lenient decoding of UTF-8 works fine. Indeed, TTC does lenient decoding of UTF-8 without issue.
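For comparison, here is a minimal sketch of lazy lenient UTF-8 decoding. The byte 0xFF can never occur in valid UTF-8, so lenient decoding replaces it, and lazy decoding terminates as expected.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Encoding.Error as TEE
import qualified Data.Text.Lazy.Encoding as TLE

-- The invalid byte 0xFF is replaced with U+FFFD (the replacement
-- character); the result is finite, unlike the broken UTF-16 case
main :: IO ()
main = print (TLE.decodeUtf8With TEE.lenientDecode "bad\xFF")
```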
Affected Versions
The CHANGELOG indicates that this issue was fixed in text-2.1.

TTC currently supports GHC 8.8 and above, and I test with the latest releases of those major versions.
| GHC    | text    | Status  |
|--------|---------|---------|
| 8.8.4  | 1.2.4.0 | broken  |
| 8.10.7 | 1.2.4.1 | broken  |
| 9.0.2  | 1.2.5.0 | broken  |
| 9.2.8  | 1.2.5.0 | broken  |
| 9.4.8  | 2.0.2   | broken  |
| 9.6.6  | 2.0.2   | broken  |
| 9.8.2  | 2.1.1   | working |
| 9.10.1 | 2.1.1   | working |
Mitigation
The text package is a GHC boot library that is distributed with GHC. While it is possible to use a version of the package that does not exactly match the distributed version, doing so can cause major headaches with dependency management (especially when using Nix) and generally should not be done in libraries. Using text-2.1 or later with GHC 8.8.4 is probably not feasible anyway.
My current idea for mitigating the issue is to use strict (non-streaming) lenient decoding when a broken version of the text package is being used. This limits what the library can be used for, so I will of course put warnings in the documentation for the affected encodings.
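One way this mitigation could be sketched is with CPP, selecting the implementation via the MIN_VERSION_text macro that Cabal provides. The decodeUtf16BELenient helper below is hypothetical, not part of TTC.

```haskell
{-# LANGUAGE CPP #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BSL
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical helper: lenient UTF-16BE decoding of a lazy ByteString
decodeUtf16BELenient :: BSL.ByteString -> TL.Text
#if MIN_VERSION_text(2,1,0)
-- Fixed versions: decode lazily, preserving streaming
decodeUtf16BELenient = TLE.decodeUtf16BEWith TEE.lenientDecode
#else
-- Broken versions: force the whole input and decode strictly, giving
-- up streaming but avoiding the infinite stream of replacement
-- characters
decodeUtf16BELenient =
  TL.fromStrict . TE.decodeUtf16BEWith TEE.lenientDecode . BSL.toStrict
#endif

main :: IO ()
main = print (decodeUtf16BELenient "\0b\0a\0d\0")
```

With either branch, the bad value from the demonstration decodes to a finite result ending in the replacement character.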