Encoding UTF-8 in Haskell

In Decoding UTF-8 in Haskell (and part 2), I experimented with the performance of different methods of decoding UTF-8. Although I had not noticed a significant performance difference between methods of encoding UTF-8, I decided to write some quick benchmark programs to investigate.

Benchmarks

The benchmarks use a source of strict Text lines and write to a file using various methods. For the source content, I simply generate the decimal representations of the 10,000,001 largest Word64 values, one per line.

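-- Qualified name assumed: T = Data.Text; Word64 comes from Data.Word.
-- One strict Text per value, from maxBound - 10000000 up to maxBound.
sourceText :: [T.Text]
sourceText =
    T.pack . show <$> enumFromTo (maxBound - 10000000) (maxBound :: Word64)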
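Every generated value has 20 decimal digits, all of them ASCII, so each line encodes to one byte per character. As a rough sanity check on the expected output size, here is a minimal sketch; the names lineBytes, firstValue, lastValue, and expectedFileBytes are illustrative and not taken from the benchmark source.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word64)

-- Number of bytes in the UTF-8 encoding of one source line (no newline).
lineBytes :: Word64 -> Int
lineBytes = BS.length . TE.encodeUtf8 . T.pack . show

-- Smallest and largest values produced by sourceText.
firstValue, lastValue :: Word64
firstValue = maxBound - 10000000
lastValue = maxBound

-- Expected output size, assuming every line has the same width as the
-- last one and is followed by a single newline byte.
expectedFileBytes :: Integer
expectedFileBytes = lineCount * (fromIntegral (lineBytes lastValue) + 1)
  where
    lineCount = fromIntegral (lastValue - firstValue) + 1
```

Both lineBytes firstValue and lineBytes lastValue evaluate to 20, so expectedFileBytes is 10,000,001 × 21 = 210,000,021 bytes, roughly 210 MB written per run.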

Text

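This benchmark, implemented in bench-text.hs, converts each line to lazy Text, adds newline characters, and writes the content to the output file. The encoding is done by the writeFile function, which uses the default (locale) encoding of the file handle, so the output is UTF-8 when the locale is UTF-8.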

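-- Qualified names assumed: TL = Data.Text.Lazy, TLIO = Data.Text.Lazy.IO.
-- benchmark and dataPath are helpers that are not shown here.
main :: IO ()
main
    = benchmark
    . TLIO.writeFile dataPath   -- encode and write the lazy Text
    . TL.unlines                -- append a newline to every line
    . map TL.fromStrict         -- strict Text -> lazy Text
    $ sourceText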

The results on my system are as follows.

$ stack exec bench-text
Wall clock time: 14.037246691s
Maximum residency: 40.9 KB
Maximum slop: 20.6 KB
Productivity (wall clock time): 98.4 %
Productivity (CPU time): 98.6 %
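Data.Text.Lazy.IO.writeFile encodes with the locale encoding of the handle, so the Text benchmark only measures UTF-8 encoding when the locale is UTF-8. A minimal sketch of a writer that forces UTF-8 regardless of locale follows; writeFileUtf8 is a hypothetical helper name, not part of the benchmark source or of the text library.

```haskell
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO
import System.IO (IOMode (WriteMode), hSetEncoding, utf8, withFile)

-- Write lazy Text to a file, encoding as UTF-8 whatever the locale is.
writeFileUtf8 :: FilePath -> TL.Text -> IO ()
writeFileUtf8 path content =
    withFile path WriteMode $ \h -> do
        hSetEncoding h utf8      -- override the locale default
        TLIO.hPutStr h content   -- chunks are encoded as they are written
```

Swapping this in for TLIO.writeFile in the pipeline above would keep the rest of the benchmark unchanged. Whether it affects the timings is not something measured here.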

ByteString

This benchmark, implemented in bench-bs.hs, converts each line to lazy Text, adds newline characters, encodes the Text content to UTF-8 ByteString, and writes the content to the output file.

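-- Qualified names assumed: BSL = Data.ByteString.Lazy,
-- TL = Data.Text.Lazy, TLE = Data.Text.Lazy.Encoding.
main :: IO ()
main
    = benchmark
    . BSL.writeFile dataPath    -- write the encoded bytes
    . TLE.encodeUtf8            -- lazy Text -> lazy UTF-8 ByteString
    . TL.unlines                -- append a newline to every line
    . map TL.fromStrict         -- strict Text -> lazy Text
    $ sourceText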

The results on my system are as follows.

$ stack exec bench-bs
Wall clock time: 10.610070498s
Maximum residency: 67.0 KB
Maximum slop: 19.7 KB
Productivity (wall clock time): 98.4 %
Productivity (CPU time): 98.3 %

ByteString Lines

This benchmark, implemented in bench-bslines.hs, encodes each Text line to UTF-8 ByteString, converts each line to lazy ByteString, adds newline characters, and writes the content to the output file.

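-- Qualified names assumed: BSL = Data.ByteString.Lazy,
-- BSL8 = Data.ByteString.Lazy.Char8, TE = Data.Text.Encoding.
main :: IO ()
main
    = benchmark
    . BSL.writeFile dataPath                 -- write the encoded bytes
    . BSL8.unlines                           -- append a newline byte to every line
    . map (BSL.fromStrict . TE.encodeUtf8)   -- encode each line, then make it lazy
    $ sourceText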

The results on my system are as follows.

$ stack exec bench-bslines
Wall clock time: 13.27680386s
Maximum residency: 38.0 KB
Maximum slop: 15.2 KB
Productivity (wall clock time): 98.7 %
Productivity (CPU time): 98.6 %
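The bs and bslines benchmarks differ only in where the encoding happens: once over the whole lazy Text, or line by line before the newlines are added. Both should produce identical bytes. The following sketch checks this on a small sample that includes non-ASCII text; encodeWhole, encodePerLine, sample, and pipelinesAgree are illustrative names, and the qualified imports are the conventional ones, assumed to match the benchmark source.

```haskell
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Same shape as bench-bs: add newlines to the Text, then encode it all.
encodeWhole :: [T.Text] -> BSL.ByteString
encodeWhole = TLE.encodeUtf8 . TL.unlines . map TL.fromStrict

-- Same shape as bench-bslines: encode each line, then add newline bytes.
encodePerLine :: [T.Text] -> BSL.ByteString
encodePerLine = BSL8.unlines . map (BSL.fromStrict . TE.encodeUtf8)

-- A few lines, including multi-byte characters.
sample :: [T.Text]
sample = map T.pack ["18446744073709551615", "naïve", "日本語"]

-- True when both pipelines yield the same bytes for the sample.
pipelinesAgree :: Bool
pipelinesAgree = encodeWhole sample == encodePerLine sample
```

Lazy ByteString equality compares contents rather than chunk boundaries, so pipelinesAgree holds even though the two pipelines chunk their output differently. Any timing difference between them therefore comes from how the work is done, not from what is written.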

Conduit ByteString

This benchmark, implemented in bench-cbs.hs, uses conduit. It adds newline characters, encodes the content to UTF-8 ByteString, and writes the content to the output file.

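-- Qualified names assumed: C = Data.Conduit, CC = Data.Conduit.Combinators.
main :: IO ()
main
    = benchmark
    $ C.runConduitRes
    $ CC.yieldMany sourceText   -- stream the strict Text lines
    .| CC.unlines               -- append a newline to every line
    .| CC.encodeUtf8            -- Text -> UTF-8 ByteString
    .| CC.sinkFile dataPath     -- write the bytes to the file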

The results on my system are as follows.

$ stack exec bench-cbs
Wall clock time: 18.423908538s
Maximum residency: 37.9 KB
Maximum slop: 15.3 KB
Productivity (wall clock time): 98.6 %
Productivity (CPU time): 98.7 %

Conduit ByteString Lines

This benchmark, implemented in bench-cbslines.hs, also uses conduit. It encodes the Text lines to UTF-8 ByteString, adds newline characters, and writes the content to the output file.

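-- Qualified names assumed: C = Data.Conduit, CC = Data.Conduit.Combinators.
main :: IO ()
main
    = benchmark
    $ C.runConduitRes
    $ CC.yieldMany sourceText   -- stream the strict Text lines
    .| CC.encodeUtf8            -- Text -> UTF-8 ByteString, line by line
    .| CC.unlinesAscii          -- append a newline byte to every line
    .| CC.sinkFile dataPath     -- write the bytes to the file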

The results on my system are as follows.

$ stack exec bench-cbslines
Wall clock time: 18.448139409s
Maximum residency: 37.9 KB
Maximum slop: 15.3 KB
Productivity (wall clock time): 98.7 %
Productivity (CPU time): 98.6 %
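The two conduit pipelines differ only in the order of the newline and encoding stages. A small sketch that runs both purely and compares the resulting bytes is shown below. It assumes the conduit package, with CC.sinkLazy standing in for CC.sinkFile so that no file is needed; the function names are illustrative.

```haskell
import qualified Data.ByteString.Lazy as BSL
import Data.Conduit (runConduitPure, (.|))
import qualified Data.Conduit.Combinators as CC
import qualified Data.Text as T

-- Same stages as bench-cbs: newlines on the Text, then encoding.
encodeAfterUnlines :: [T.Text] -> BSL.ByteString
encodeAfterUnlines ls = runConduitPure
    $ CC.yieldMany ls
    .| CC.unlines
    .| CC.encodeUtf8
    .| CC.sinkLazy

-- Same stages as bench-cbslines: encoding, then newline bytes.
encodeBeforeUnlines :: [T.Text] -> BSL.ByteString
encodeBeforeUnlines ls = runConduitPure
    $ CC.yieldMany ls
    .| CC.encodeUtf8
    .| CC.unlinesAscii
    .| CC.sinkLazy

-- True when both stage orders yield the same bytes for a small sample.
conduitPipelinesAgree :: Bool
conduitPipelinesAgree = encodeAfterUnlines ls == encodeBeforeUnlines ls
  where
    ls = map T.pack ["18446744073709551615", "naïve"]
```

Since the newline is a single ASCII byte in UTF-8, the order of the two stages does not change the output, which is consistent with the nearly identical timings of the two conduit benchmarks.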

Observations

The following table shows an overview of the results.

Benchmark   Time (s)   Max residency (KB)   Max slop (KB)   Wall clock productivity (%)   CPU productivity (%)
text        14.0       40.9                 20.6            98.4                          98.6
bs          10.6       67.0                 19.7            98.4                          98.3
bslines     13.3       38.0                 15.2            98.7                          98.6
cbs         18.4       37.9                 15.3            98.6                          98.7
cbslines    18.4       37.9                 15.3            98.7                          98.6

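The performance difference is not as significant as that seen in decoding UTF-8. The ByteString test is consistently the fastest, however: 10.6 s here, about 24% less wall clock time than the 14.0 s of the Text test, at the cost of a somewhat higher maximum residency (67.0 KB versus 40.9 KB). (I did not expect the ByteString test to perform any differently from the Text test!) The two conduit versions are the slowest at 18.4 s each, and I do not see a performance difference between them.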
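To put the wall clock times on a common scale, the following small sketch expresses each one as a multiple of the fastest run. The figures are copied from the runs reported above; wallClock and relativeToFastest are illustrative names.

```haskell
-- Wall clock times in seconds, copied from the benchmark output.
wallClock :: [(String, Double)]
wallClock =
    [ ("text", 14.037246691)
    , ("bs", 10.610070498)
    , ("bslines", 13.27680386)
    , ("cbs", 18.423908538)
    , ("cbslines", 18.448139409)
    ]

-- Each benchmark's time as a multiple of the fastest benchmark's time.
relativeToFastest :: [(String, Double)]
relativeToFastest = [(name, t / fastest) | (name, t) <- wallClock]
  where
    fastest = minimum (map snd wallClock)
```

With bs as the baseline at 1.0, text comes out at roughly 1.32, bslines at roughly 1.25, and both conduit versions at roughly 1.74.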

Benchmarking note: I ran all benchmarks multiple times and saw consistent results. The results shown above are those from running each benchmark again while writing this blog entry. (They are not hand-selected from multiple runs.)
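The benchmark helper used by these programs is not shown in this post. For readers who want to collect similar figures, one possible sketch built on GHC.Stats from base follows. The names benchmarkSketch and productivity are hypothetical, and the units (1 KB taken as 1000 bytes) and the productivity definition (mutator time over total time) are assumptions, not the original implementation.

```haskell
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import Text.Printf (printf)

-- Percentage of total time spent in the mutator (that is, not in the GC).
productivity :: Integral a => a -> a -> Double
productivity mutator total
    | total <= 0 = 0
    | otherwise = 100 * fromIntegral mutator / fromIntegral total

-- Hypothetical helper: run an action, then report RTS statistics.
-- Statistics are only collected when the program runs with +RTS -T.
-- Times are measured from RTS start, not from the start of the action.
benchmarkSketch :: IO () -> IO ()
benchmarkSketch action = do
    action
    enabled <- getRTSStatsEnabled
    if not enabled
        then putStrLn "RTS statistics disabled; rerun with +RTS -T"
        else do
            s <- getRTSStats
            printf "Wall clock time: %.3fs\n"
                (fromIntegral (elapsed_ns s) / 1e9 :: Double)
            printf "Maximum residency: %.1f KB\n"
                (fromIntegral (max_live_bytes s) / 1000 :: Double)
            printf "Maximum slop: %.1f KB\n"
                (fromIntegral (max_slop_bytes s) / 1000 :: Double)
            printf "Productivity (wall clock time): %.1f %%\n"
                (productivity (mutator_elapsed_ns s) (elapsed_ns s))
            printf "Productivity (CPU time): %.1f %%\n"
                (productivity (mutator_cpu_ns s) (cpu_ns s))
```

A program would wrap its pipeline as benchmarkSketch (TLIO.writeFile path content) and be run with +RTS -T. Note that max_live_bytes is only updated at major garbage collections, so very short runs may report zero.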