text Library Equality Bug
Working on the new design of TTC, I ran into another bug in older versions of the text package. Some lazy Text representations do not maintain an invariant and compare as non-equal even though they represent the same text. Some of my tests hit this bug, causing failures.
Minimal Demonstration
The following code is a minimal demonstration of the issue. The full source is available on GitHub. In the cabal.project file, you can change the with-compiler option to specify the version of GHC to test with, and set the optional-packages option to use a vendored clone of the text package to test specific revisions.
{-# LANGUAGE OverloadedStrings #-}
module Main where
-- https://hackage.haskell.org/package/bytestring
import qualified Data.ByteString.Lazy as BSL
-- https://hackage.haskell.org/package/text
import qualified Data.Text.Encoding.Error as TEE
import qualified Data.Text.Internal.Lazy as TIL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
-- Invalid UTF-8
invalid :: BSL.ByteString
invalid = "test \xe3"

-- Expected decoded value when ignoring errors
expected :: TL.Text
expected = "test "

main :: IO ()
main = do
    let actual = TLE.decodeUtf8With TEE.ignore invalid
    putStrLn $ "Expected: " ++ TIL.showStructure expected
    putStrLn $ "Actual: " ++ TIL.showStructure actual
    putStrLn $
      "Equality of lazy text: " ++ show (expected == actual)
    putStrLn $
      "Equality of strict text: "
        ++ show (TL.toStrict expected == TL.toStrict actual)
Value invalid defines a ByteString that is not valid UTF-8. Value expected defines the expected decoded Text when ignoring errors.
The main function decodes the invalid value, ignoring errors. It prints the internal structure of the expected and actual values, the result of comparing the lazy values for equality, and the result of comparing them for equality after conversion to strict Text.
When using a version of the text package that does not have the issue, the output is as follows.
Expected: Chunk "test " Empty
Actual: Chunk "test " Empty
Equality of lazy text: True
Equality of strict text: True
When using a version of the text package that has the issue, the output is as follows.
Expected: Chunk "test " Empty
Actual: Chunk "test " (Chunk "" Empty)
Equality of lazy text: False
Equality of strict text: True
The following invariant is documented in the Data.Text.Internal.Lazy module:
"The data type invariant for lazy Text: Every Text is either Empty or consists of non-null Texts."
The bug arises because decoding with an affected version can leave an empty chunk in the internal representation (the Chunk "" Empty tail in the output above), which violates this invariant.
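The invariant can be checked directly by inspecting the chunks of a lazy Text value. The following is a minimal sketch, not code from the text package: satisfiesInvariant is a hypothetical helper name, and the malformed value is built by hand with the internal constructors to mimic the structure printed in the buggy output above, so it does not depend on which version of text is installed.

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main where

-- https://hackage.haskell.org/package/text
import qualified Data.Text as T
import qualified Data.Text.Internal.Lazy as TIL
import qualified Data.Text.Lazy as TL

-- | Hypothetical helper: True when no chunk of the lazy Text is empty,
-- which is what the documented invariant requires.
satisfiesInvariant :: TL.Text -> Bool
satisfiesInvariant = all (not . T.null) . TL.toChunks

main :: IO ()
main = do
    -- Built through the public API (OverloadedStrings).
    let wellFormed = "test " :: TL.Text
    -- Built by hand with the internal constructors, mimicking the
    -- structure that the affected decoder produces.
    let malformed = TIL.Chunk "test " (TIL.Chunk T.empty TIL.Empty)
    putStrLn $ "Well-formed: " ++ show (satisfiesInvariant wellFormed)
    putStrLn $ "Malformed: " ++ show (satisfiesInvariant malformed)
```

With these inputs, the check should succeed for the first value and fail for the second. A check like this could be applied to the decoded value in a test to tell whether the installed version of text is affected.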
Affected Versions
My tests indicate that this bug affects text versions 1.2.5.0, 2.0, and 2.0.1. When using the versions of the text package that are distributed with GHC, this bug affects GHC versions 9.0.2 and 9.2.8. The bug was introduced in revision 204f6ac2 (in May 2021), and it was resolved in revision 7ef771dc (in February 2023). I could not find any mention of the bug, so I suspect that it was never noticed and was fixed inadvertently.
Here is my full test log:
Status | Revision | Version | GHC |
---|---|---|---|
OK | 1b275ff7 | 1.2.4.1 | 9.0.2 |
OK | 8d1b6ff5 | | 9.0.2 |
BUG | 204f6ac2 | | 9.0.2 |
BUG | 89efc099 | 1.2.5.0 | 9.2.8 |
BUG | 9075b05b | 2.0 | 9.2.8 |
BUG | fdb06ff3 | 2.0.1 | 9.2.8 |
BUG | 6f1917df | | 9.2.8 |
OK | 7ef771dc | | 9.2.8 |
OK | e815d4d9 | 2.0.2 | 9.2.8 |
Mitigation
In general, users of the affected versions should probably avoid relying on equality of decoded lazy Text values. In my failing tests, I can mitigate the issue by comparing the strict representations.
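As a rough sketch of this mitigation, the comparison can be wrapped in a small helper. The names eqViaStrict and normalize are hypothetical and are not part of the text package. The normalize function is an alternative that I am assuming is acceptable when a lazy value is still needed afterwards: it rebuilds the value from its non-empty chunks so that the invariant holds again. The malformed value is constructed by hand to mimic the buggy decoder output.

```haskell
module Main where

-- https://hackage.haskell.org/package/text
import qualified Data.Text as T
import qualified Data.Text.Internal.Lazy as TIL
import qualified Data.Text.Lazy as TL

-- | Hypothetical helper: compare lazy Text values through their strict
-- representations, which do not depend on the chunk structure.
eqViaStrict :: TL.Text -> TL.Text -> Bool
eqViaStrict a b = TL.toStrict a == TL.toStrict b

-- | Hypothetical helper: rebuild a lazy Text from its non-empty chunks
-- so that the documented invariant holds again.
normalize :: TL.Text -> TL.Text
normalize = TL.fromChunks . filter (not . T.null) . TL.toChunks

main :: IO ()
main = do
    let expected = TL.pack "test "
    -- Hand-built value with a trailing empty chunk, as in the buggy output.
    let malformed =
          TIL.Chunk (T.pack "test ") (TIL.Chunk T.empty TIL.Empty)
    putStrLn $ "Via strict: " ++ show (eqViaStrict expected malformed)
    putStrLn $ "Normalized: " ++ show (normalize malformed == expected)
```

Comparing via strict Text forces the whole value into memory, so normalizing may be preferable for large or streamed input. For the short values in my tests, either approach should work.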