text Library Equality Bug

Working on the new design of TTC, I ran into another bug in older versions of the text package. In the affected versions, decoding can produce a lazy Text value whose internal representation violates a documented invariant, and such a value compares as non-equal to other values that represent the same text. Some of my tests hit this bug, causing failures.

Minimal Demonstration

The following code is a minimal demonstration of the issue. The full source is available on GitHub. In the cabal.project file, you can change the with-compiler option to specify the version of GHC to test with, as well as set the optional-packages option to use a vendored clone of the text package to test specific revisions.

{-# LANGUAGE OverloadedStrings #-}

module Main where

-- https://hackage.haskell.org/package/bytestring
import qualified Data.ByteString.Lazy as BSL

-- https://hackage.haskell.org/package/text
import qualified Data.Text.Encoding.Error as TEE
import qualified Data.Text.Internal.Lazy as TIL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Invalid UTF-8
invalid :: BSL.ByteString
invalid = "test \xe3"

-- Expected decoded value when ignoring errors
expected :: TL.Text
expected = "test "

main :: IO ()
main = do
    let actual = TLE.decodeUtf8With TEE.ignore invalid
    putStrLn $ "Expected: " ++ TIL.showStructure expected
    putStrLn $ "Actual:   " ++ TIL.showStructure actual
    putStrLn $
      "Equality of lazy text:   " ++ show (expected == actual)
    putStrLn $
      "Equality of strict text: "
        ++ show (TL.toStrict expected == TL.toStrict actual)

The invalid value is a ByteString that is not valid UTF-8: its final byte, 0xE3, begins a three-byte sequence that is never completed. The expected value is the Text that decoding should produce when errors are ignored.

The main function decodes the invalid value, ignoring errors. It prints the internal structure of the expected and actual values, the result of equality comparison of the values, and the result of equality comparison of the values after conversion to strict Text.

When using a version of the text package that does not have the issue, the output is as follows.

Expected: Chunk "test " Empty
Actual:   Chunk "test " Empty
Equality of lazy text:   True
Equality of strict text: True

When using a version of the text package that has the issue, the output is as follows.

Expected: Chunk "test " Empty
Actual:   Chunk "test " (Chunk "" Empty)
Equality of lazy text:   False
Equality of strict text: True

The following invariant is documented in the Data.Text.Internal.Lazy module:

The data type invariant for lazy Text: Every Text is either Empty or consists of non-null Texts

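The bug arises because decoding in the affected versions does not maintain this invariant: the actual value ends with an empty chunk (Chunk "" Empty). Lazy Text equality appears to rely on the invariant, so the comparison fails even though both values contain the same text, while the strict conversions compare as equal.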
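To see the invariant violation in isolation, the following sketch builds the same structure by hand with the constructors exported from Data.Text.Internal.Lazy, so it does not depend on the decoder. The names malformed and holdsInvariant are hypothetical helpers for illustration and are not part of the text package.

```haskell
module Main where

-- https://hackage.haskell.org/package/text
import qualified Data.Text as T
import qualified Data.Text.Internal.Lazy as TIL
import qualified Data.Text.Lazy as TL

-- Hand-built lazy Text with a trailing empty chunk, mirroring the
-- "Actual" structure printed by affected versions (hypothetical name)
malformed :: TL.Text
malformed = TIL.Chunk (T.pack "test ") (TIL.Chunk T.empty TIL.Empty)

-- Hypothetical helper: True when no chunk of the lazy Text is empty
holdsInvariant :: TL.Text -> Bool
holdsInvariant = all (not . T.null) . TL.toChunks

main :: IO ()
main = do
    putStrLn $ "Structure:        " ++ TIL.showStructure malformed
    putStrLn $ "Invariant holds:  " ++ show (holdsInvariant malformed)
    -- Prints whatever the installed version's Eq instance reports
    putStrLn $
      "Equal to literal: " ++ show (malformed == TL.pack "test ")
```

Because the value is constructed directly, this should behave the same way regardless of whether the installed decoder has the bug, which makes it a convenient way to check how equality treats a non-canonical value.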

Affected Versions

My tests indicate that this bug affects text versions 1.2.5.0, 2.0, and 2.0.1. When using the versions of the text package that are distributed with GHC, this bug affects GHC versions 9.0.2 and 9.2.8. The bug was introduced in revision 204f6ac2 (in May 2021), and it was resolved in revision 7ef771dc (in February 2023). I could not find any mention of the bug, so I suspect that it was never noticed and was fixed inadvertently.

Here is my full test log:

Status  Revision  Version  GHC
OK      1b275ff7  1.2.4.1  9.0.2
OK      8d1b6ff5           9.0.2
BUG     204f6ac2           9.0.2
BUG     89efc099  1.2.5.0  9.2.8
BUG     9075b05b  2.0      9.2.8
BUG     fdb06ff3  2.0.1    9.2.8
BUG     6f1917df           9.2.8
OK      7ef771dc           9.2.8
OK      e815d4d9  2.0.2    9.2.8

Mitigation

In general, users of the affected versions should probably avoid relying on equality of decoded lazy Text values. In my failing tests, I can mitigate the issue by comparing the strict representations.
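As a rough sketch of this mitigation, a test helper can compare the strict conversions, or rebuild a lazy value without its empty chunks before comparing. The names eqViaStrict and dropEmptyChunks are hypothetical and not part of the text package or TTC. The example value is built by hand to mimic the affected structure.

```haskell
module Main where

-- https://hackage.haskell.org/package/text
import qualified Data.Text as T
import qualified Data.Text.Internal.Lazy as TIL
import qualified Data.Text.Lazy as TL

-- Hypothetical helper: compare lazy Text values via their strict forms
eqViaStrict :: TL.Text -> TL.Text -> Bool
eqViaStrict a b = TL.toStrict a == TL.toStrict b

-- Hypothetical helper: rebuild a lazy Text without any empty chunks
dropEmptyChunks :: TL.Text -> TL.Text
dropEmptyChunks = TL.fromChunks . filter (not . T.null) . TL.toChunks

main :: IO ()
main = do
    let expected = TL.pack "test "
        -- Hand-built stand-in for a value decoded by an affected version
        actual = TIL.Chunk (T.pack "test ") (TIL.Chunk T.empty TIL.Empty)
    putStrLn $ "Equal via strict: " ++ show (eqViaStrict expected actual)
    putStrLn $ "Normalized:       "
      ++ TIL.showStructure (dropEmptyChunks actual)
```

Converting to strict Text loads the whole value into memory, which is usually acceptable in tests. Dropping empty chunks keeps the value lazy, at the cost of walking the chunk list.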

Author

Travis Cardwell
