Encoding UTF-8 in Haskell

In Decoding UTF-8 in Haskell (and part 2), I experimented with the performance of different methods of decoding UTF-8. I had not noticed a significant performance difference between methods of encoding UTF-8, but I decided to write some quick benchmark programs to investigate.

Benchmarks

The benchmarks use a source of strict Text lines and write to a file using various methods. For the source content, I simply generate a sequence of relatively large numbers.

import qualified Data.Text as T
import Data.Word (Word64)

sourceText :: [T.Text]
sourceText =
    T.pack . show <$> enumFromTo (maxBound - 10000000) (maxBound :: Word64)
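To get a feel for how much data this source produces, here is a minimal sketch that parameterizes the same construction and counts the bytes that end up in the output file. The names sourceTextN and outputBytes are hypothetical helpers for illustration and are not part of the benchmark code. Since enumFromTo is inclusive and every value in the range has 20 decimal digits, the full source is 10,000,001 lines of 21 bytes each (including the newline), or roughly 210 MB.

```haskell
import qualified Data.Text as T
import Data.Word (Word64)

-- | Same construction as sourceText, with the offset from maxBound as a
-- parameter (hypothetical helper, not part of the benchmark code).
-- The range is inclusive, so the result has n + 1 lines.
sourceTextN :: Word64 -> [T.Text]
sourceTextN n =
    T.pack . show <$> enumFromTo (maxBound - n) (maxBound :: Word64)

-- | Number of bytes written when every line is followed by a newline.
-- Decimal digits are ASCII, so each character encodes to one UTF-8 byte.
outputBytes :: [T.Text] -> Int
outputBytes = sum . map ((+ 1) . T.length)
```

In GHCi, outputBytes (sourceTextN 2) counts three 20-digit lines plus their newlines.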

Text

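This benchmark, implemented in bench-text.hs, converts each line to lazy Text, adds newline characters, and writes the content to the output file. The UTF-8 encoding is done by the writeFile function. Note that writeFile encodes using the locale's encoding, so this relies on running with a UTF-8 locale.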

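-- Module names inferred from the qualified prefixes used below.
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

-- benchmark and dataPath are defined in the benchmark source (not shown here).
main :: IO ()
main
    = benchmark
    . TLIO.writeFile dataPath
    . TL.unlines
    . map TL.fromStrict
    $ sourceText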

The results on my system are as follows.

$ stack exec bench-text
Wall clock time: 14.037246691s
Maximum residency: 40.9 KB
Maximum slop: 20.6 KB
Productivity (wall clock time): 98.4 %
Productivity (CPU time): 98.6 %
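The writeFile function in Data.Text.Lazy.IO encodes with whatever encoding the handle gets from the locale. If the output must be UTF-8 regardless of the environment, one option is to open the handle and set its encoding explicitly. The following is a minimal sketch of that idea; writeFileUtf8 is a hypothetical name, and this variant was not part of the benchmarks above.

```haskell
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO
import System.IO (IOMode(WriteMode), hSetEncoding, utf8, withFile)

-- | Write lazy Text as UTF-8 regardless of the process locale
-- (hypothetical helper, not part of the benchmark code).
writeFileUtf8 :: FilePath -> TL.Text -> IO ()
writeFileUtf8 path content =
    withFile path WriteMode $ \h -> do
        -- Override the locale-derived encoding on this handle only.
        hSetEncoding h utf8
        TLIO.hPutStr h content
```

The handle is closed by withFile once all of the content has been written, so the lazy Text is still consumed incrementally.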

ByteString

This benchmark, implemented in bench-bs.hs, converts each line to lazy Text, adds newline characters, encodes the Text content to UTF-8 ByteString, and writes the content to the output file.

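-- Module names inferred from the qualified prefixes used below.
import qualified Data.ByteString.Lazy as BSL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

main :: IO ()
main
    = benchmark
    . BSL.writeFile dataPath
    . TLE.encodeUtf8
    . TL.unlines
    . map TL.fromStrict
    $ sourceText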

The results on my system are as follows.

$ stack exec bench-bs
Wall clock time: 10.610070498s
Maximum residency: 67.0 KB
Maximum slop: 19.7 KB
Productivity (wall clock time): 98.4 %
Productivity (CPU time): 98.3 %

ByteString Lines

This benchmark, implemented in bench-bslines.hs, encodes each Text line to UTF-8 ByteString, converts each line to lazy ByteString, adds newline characters, and writes the content to the output file.

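-- Module names inferred from the qualified prefixes used below.
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Text.Encoding as TE

main :: IO ()
main
    = benchmark
    . BSL.writeFile dataPath
    . BSL8.unlines
    . map (BSL.fromStrict . TE.encodeUtf8)
    $ sourceText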

The results on my system are as follows.

$ stack exec bench-bslines
Wall clock time: 13.27680386s
Maximum residency: 38.0 KB
Maximum slop: 15.2 KB
Productivity (wall clock time): 98.7 %
Productivity (CPU time): 98.6 %
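The bench-bs and bench-bslines programs differ only in whether the newlines are added before or after encoding. Both orders produce the same bytes, because a newline is a single ASCII byte that never occurs inside a multi-byte UTF-8 sequence. The sketch below extracts the two pure pipelines so that this can be checked on small inputs; the function names are hypothetical and chosen for illustration.

```haskell
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- | Join the lines as lazy Text, then encode (the order used in bench-bs).
encodeWhole :: [T.Text] -> BSL.ByteString
encodeWhole = TLE.encodeUtf8 . TL.unlines . map TL.fromStrict

-- | Encode each line, then join as lazy ByteString (the order used in
-- bench-bslines).
encodeLines :: [T.Text] -> BSL.ByteString
encodeLines = BSL8.unlines . map (BSL.fromStrict . TE.encodeUtf8)

-- | True when both orders yield identical bytes. Equality on lazy
-- ByteString compares content, not chunk boundaries.
samePipelines :: [T.Text] -> Bool
samePipelines ls = encodeWhole ls == encodeLines ls
```

Since the output is identical, any timing difference between the two benchmarks comes from how the work is done, not from what is written.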

Conduit ByteString

This benchmark, implemented in bench-cbs.hs, uses conduit. It adds newline characters, encodes the content to UTF-8 ByteString, and writes the content to the output file.

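-- Module names inferred from the qualified prefixes used below.
import Data.Conduit ((.|))
import qualified Data.Conduit as C
import qualified Data.Conduit.Combinators as CC

main :: IO ()
main
    = benchmark
    $ C.runConduitRes
    $ CC.yieldMany sourceText
    .| CC.unlines
    .| CC.encodeUtf8
    .| CC.sinkFile dataPath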

The results on my system are as follows.

$ stack exec bench-cbs
Wall clock time: 18.423908538s
Maximum residency: 37.9 KB
Maximum slop: 15.3 KB
Productivity (wall clock time): 98.6 %
Productivity (CPU time): 98.7 %

Conduit ByteString Lines

This benchmark, implemented in bench-cbslines.hs, also uses conduit. It encodes the Text lines to UTF-8 ByteString, adds newline characters, and writes the content to the output file.

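-- Module names inferred from the qualified prefixes used below.
import Data.Conduit ((.|))
import qualified Data.Conduit as C
import qualified Data.Conduit.Combinators as CC

main :: IO ()
main
    = benchmark
    $ C.runConduitRes
    $ CC.yieldMany sourceText
    .| CC.encodeUtf8
    .| CC.unlinesAscii
    .| CC.sinkFile dataPath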

The results on my system are as follows.

$ stack exec bench-cbslines
Wall clock time: 18.448139409s
Maximum residency: 37.9 KB
Maximum slop: 15.3 KB
Productivity (wall clock time): 98.7 %
Productivity (CPU time): 98.6 %

Observations

The following table shows an overview of the results.

Benchmark	Wall clock time (s)	Max residency (KB)	Max slop (KB)	Wall clock productivity (%)	CPU productivity (%)
`text`	`14.0`	`40.9`	`20.6`	`98.4`	`98.6`
`bs`	`10.6`	`67.0`	`19.7`	`98.4`	`98.3`
`bslines`	`13.3`	`38.0`	`15.2`	`98.7`	`98.6`
`cbs`	`18.4`	`37.9`	`15.3`	`98.6`	`98.7`
`cbslines`	`18.4`	`37.9`	`15.3`	`98.7`	`98.6`

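The performance difference is not as significant as that seen in decoding UTF-8. The ByteString test is consistently the fastest, however: in the runs above it takes 10.6 seconds against 14.0 seconds for the Text test, though with a higher maximum residency (67.0 KB against 40.9 KB). (I did not expect the ByteString test to perform any differently from the Text test!) The two conduit versions are the slowest at 18.4 seconds each, and I do not see a performance difference between them.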
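To make the comparison easier to read, the wall clock times can be expressed relative to the fastest benchmark. The following sketch does this with the rounded times from the table above; relativeToFastest is a hypothetical helper, and the ratios describe only the single run reported in this entry.

```haskell
import Data.List (sortOn)

-- | (benchmark name, wall clock seconds), copied from the table above.
timings :: [(String, Double)]
timings =
    [ ("text", 14.0)
    , ("bs", 10.6)
    , ("bslines", 13.3)
    , ("cbs", 18.4)
    , ("cbslines", 18.4)
    ]

-- | Each time divided by the fastest time, sorted with the fastest first.
relativeToFastest :: [(String, Double)] -> [(String, Double)]
relativeToFastest rows = [ (name, t / best) | (name, t) <- sortOn snd rows ]
  where
    best = minimum (map snd rows)
```

Calling relativeToFastest timings in GHCi lists bs first with a ratio of 1.0, followed by the other benchmarks as multiples of that time.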

Benchmarking note: I ran all benchmarks multiple times and saw consistent results. The results shown above are from running each benchmark again while writing this blog entry. (They are not hand-selected from multiple runs.)
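The benchmark helper used in the programs above is not shown in this entry. As a rough illustration of how figures like these can be collected, here is a sketch built on GHC.Stats and GHC.Clock from base. It is a hypothetical stand-in, not the actual helper: the name benchmarkSketch, the use of 1024 bytes per KB, and the definition of productivity as mutator time divided by total time are all assumptions. The RTS statistics are only available when they are enabled, for example by linking with -with-rtsopts=-T.

```haskell
import Control.Monad (when)
import Data.Word (Word64)
import GHC.Clock (getMonotonicTime)
import GHC.Stats (RTSStats(..), RtsTime, getRTSStats, getRTSStatsEnabled)
import Text.Printf (printf)

-- | Run an action, then print timing and memory figures
-- (hypothetical stand-in for the benchmark helper, not the original code).
benchmarkSketch :: IO a -> IO a
benchmarkSketch action = do
    start <- getMonotonicTime
    result <- action
    end <- getMonotonicTime
    printf "Wall clock time: %.3fs\n" (end - start)
    -- getRTSStats throws unless statistics were enabled with +RTS -T.
    enabled <- getRTSStatsEnabled
    when enabled $ do
        stats <- getRTSStats
        printf "Maximum residency: %.1f KB\n" (kb (max_live_bytes stats))
        printf "Maximum slop: %.1f KB\n" (kb (max_slop_bytes stats))
        printf "Productivity (wall clock time): %.1f %%\n"
            (percent (mutator_elapsed_ns stats) (elapsed_ns stats))
        printf "Productivity (CPU time): %.1f %%\n"
            (percent (mutator_cpu_ns stats) (cpu_ns stats))
    pure result
  where
    -- Assumption: 1 KB = 1024 bytes.
    kb :: Word64 -> Double
    kb n = fromIntegral n / 1024

    -- Assumption: productivity is mutator time over total time.
    percent :: RtsTime -> RtsTime -> Double
    percent part total
        | total <= 0 = 0
        | otherwise = 100 * fromIntegral part / fromIntegral total
```

The sketch returns the result of the wrapped action, so it can be used as benchmarkSketch (TLIO.writeFile path content) without changing the surrounding code.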

Author

Travis Cardwell

Published

May 14, 2021
