Encoding UTF-8 in Haskell
In Decoding UTF-8 in Haskell (and part 2), I experimented with the performance of different methods of decoding UTF-8. Though I have not noticed a significant difference in performance of encoding UTF-8 when using different methods, I decided to write some quick benchmark programs to investigate.
Benchmarks
The benchmarks use a source of strict Text lines and write to a file using various methods. For the source content, I simply generate a sequence of relatively large numbers.
sourceText :: [T.Text]
sourceText = T.pack . show <$> enumFromTo (maxBound - 10000000) (maxBound :: Word64)
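To get a feel for how much data each benchmark writes, the following sketch inspects the first generated line and estimates the output size. It relies only on the bounds in the definition above; the names firstLine, lineCount, and estimatedBytes are illustrative and do not come from the benchmark source.

```haskell
import qualified Data.Text as T
import Data.Word (Word64)

-- First value in the generated range (same lower bound as sourceText).
firstLine :: T.Text
firstLine = T.pack (show (maxBound - 10000000 :: Word64))

-- enumFromTo includes both endpoints, so there is one more line than the offset.
lineCount :: Integer
lineCount = 10000000 + 1

-- Every line consists of ASCII digits, so UTF-8 needs one byte per character,
-- plus one byte for the newline that each benchmark appends.
estimatedBytes :: Integer
estimatedBytes = lineCount * (fromIntegral (T.length firstLine) + 1)

main :: IO ()
main = do
  print firstLine
  print estimatedBytes
```

Every value in this range has 20 digits, so the estimate comes to 210,000,021 bytes, or roughly 210 MB per run.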
Text
This benchmark, implemented in bench-text.hs, converts each line to lazy Text, adds newline characters, and writes the content to the output file. The encoding is done by the writeFile function, which uses the default (locale) encoding of the file handle; that is UTF-8 on a system with a UTF-8 locale.
main :: IO ()
main = benchmark
  . TLIO.writeFile dataPath
  . TL.unlines
  . map TL.fromStrict
  $ sourceText
The results on my system are as follows.
$ stack exec bench-text
Wall clock time: 14.037246691s
Maximum residency: 40.9 KB
Maximum slop: 20.6 KB
Productivity (wall clock time): 98.4 %
Productivity (CPU time): 98.6 %
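Because the Text I/O functions encode according to the handle's settings, one way to make the UTF-8 choice explicit is to set the handle encoding before writing. The following is a minimal sketch of that idea, not the code used in bench-text.hs; the helper name writeFileUtf8 and the file name are hypothetical.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO
import System.IO (IOMode (WriteMode), hSetEncoding, utf8, withFile)

-- Write lazy Text to a file, forcing UTF-8 regardless of the locale.
-- (Hypothetical helper, not part of the benchmark source.)
writeFileUtf8 :: FilePath -> TL.Text -> IO ()
writeFileUtf8 path content =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    TLIO.hPutStr h content

main :: IO ()
main = do
  let path = "utf8-example.txt" -- hypothetical output path
  writeFileUtf8 path (TL.pack "caf\233") -- the text "café"
  bytes <- BS.readFile path
  -- The final character is encoded as the two bytes 195 and 169.
  print (BS.unpack bytes)
```

Reading the file back as raw bytes confirms the encoding independently of the locale in which the program runs.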
ByteString
This benchmark, implemented in bench-bs.hs, converts each line to lazy Text, adds newline characters, encodes the Text content to a UTF-8 ByteString, and writes the content to the output file.
main :: IO ()
main = benchmark
  . BSL.writeFile dataPath
  . TLE.encodeUtf8
  . TL.unlines
  . map TL.fromStrict
  $ sourceText
The results on my system are as follows.
$ stack exec bench-bs
Wall clock time: 10.610070498s
Maximum residency: 67.0 KB
Maximum slop: 19.7 KB
Productivity (wall clock time): 98.4 %
Productivity (CPU time): 98.3 %
ByteString Lines
This benchmark, implemented in bench-bslines.hs, encodes each Text line to a UTF-8 ByteString, converts each line to a lazy ByteString, adds newline characters, and writes the content to the output file.
main :: IO ()
main = benchmark
  . BSL.writeFile dataPath
  . BSL8.unlines
  . map (BSL.fromStrict . TE.encodeUtf8)
  $ sourceText
The results on my system are as follows.
$ stack exec bench-bslines
Wall clock time: 13.27680386s
Maximum residency: 38.0 KB
Maximum slop: 15.2 KB
Productivity (wall clock time): 98.7 %
Productivity (CPU time): 98.6 %
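The bs and bslines benchmarks differ only in whether the newlines are added before or after encoding, so they should produce identical bytes. The following sketch checks that on a small sample that includes a non-ASCII character; the names sample, encodeWhole, and encodeLines are illustrative, and the pipelines mirror the two programs above without the file output.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Small stand-in for sourceText; the last line contains a non-ASCII character.
sample :: [T.Text]
sample = ["1", "22", "na\239ve"]

-- As in bench-bs.hs: join the lines as lazy Text, then encode once.
encodeWhole :: [T.Text] -> BSL.ByteString
encodeWhole = TLE.encodeUtf8 . TL.unlines . map TL.fromStrict

-- As in bench-bslines.hs: encode each line, then join the encoded lines.
encodeLines :: [T.Text] -> BSL.ByteString
encodeLines = BSL8.unlines . map (BSL.fromStrict . TE.encodeUtf8)

main :: IO ()
main = print (encodeWhole sample == encodeLines sample)
```

Joining after encoding is safe because the newline is a single ASCII byte that never occurs inside a multi-byte UTF-8 sequence.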
Conduit ByteString
This benchmark, implemented in bench-cbs.hs, uses conduit. It adds newline characters, encodes the content to a UTF-8 ByteString, and writes the content to the output file.
main :: IO ()
main = benchmark
  $ C.runConduitRes
  $ CC.yieldMany sourceText
  .| CC.unlines
  .| CC.encodeUtf8
  .| CC.sinkFile dataPath
The results on my system are as follows.
$ stack exec bench-cbs
Wall clock time: 18.423908538s
Maximum residency: 37.9 KB
Maximum slop: 15.3 KB
Productivity (wall clock time): 98.6 %
Productivity (CPU time): 98.7 %
Conduit ByteString Lines
This benchmark, implemented in bench-cbslines.hs, also uses conduit. It encodes the Text lines to UTF-8 ByteString, adds newline characters, and writes the content to the output file.
main :: IO ()
main = benchmark
  $ C.runConduitRes
  $ CC.yieldMany sourceText
  .| CC.encodeUtf8
  .| CC.unlinesAscii
  .| CC.sinkFile dataPath
The results on my system are as follows.
$ stack exec bench-cbslines
Wall clock time: 18.448139409s
Maximum residency: 37.9 KB
Maximum slop: 15.3 KB
Productivity (wall clock time): 98.7 %
Productivity (CPU time): 98.6 %
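The two conduit pipelines differ only in the order of the newline and encoding stages. As a rough check that they emit the same bytes, the sketch below runs both orderings purely on a small sample and collects the output with sinkLazy instead of writing a file. It assumes the conduit package is available; the names sample, encodeAfterUnlines, and unlinesAfterEncode are illustrative.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Conduit (runConduitPure, (.|))
import qualified Data.ByteString.Lazy as BSL
import qualified Data.Conduit.Combinators as CC
import qualified Data.Text as T

-- Small stand-in for sourceText.
sample :: [T.Text]
sample = ["1", "22", "333"]

-- As in bench-cbs.hs: add newlines to the Text, then encode.
encodeAfterUnlines :: BSL.ByteString
encodeAfterUnlines =
  runConduitPure $
    CC.yieldMany sample .| CC.unlines .| CC.encodeUtf8 .| CC.sinkLazy

-- As in bench-cbslines.hs: encode each line, then add newline bytes.
unlinesAfterEncode :: BSL.ByteString
unlinesAfterEncode =
  runConduitPure $
    CC.yieldMany sample .| CC.encodeUtf8 .| CC.unlinesAscii .| CC.sinkLazy

main :: IO ()
main = print (encodeAfterUnlines == unlinesAfterEncode)
```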
Observations
The following table shows an overview of the results.
Benchmark | Wall clock time (s) | Max residency (KB) | Max slop (KB) | Wall clock productivity (%) | CPU productivity (%) |
---|---|---|---|---|---|
text | 14.0 | 40.9 | 20.6 | 98.4 | 98.6 |
bs | 10.6 | 67.0 | 19.7 | 98.4 | 98.3 |
bslines | 13.3 | 38.0 | 15.2 | 98.7 | 98.6 |
cbs | 18.4 | 37.9 | 15.3 | 98.6 | 98.7 |
cbslines | 18.4 | 37.9 | 15.3 | 98.7 | 98.6 |
The performance difference is not as significant as that seen in decoding UTF-8. I consistently see better performance from the ByteString test, however: 10.6 s compared to 14.0 s for the Text test in the runs above, at the cost of a somewhat higher maximum residency. (I did not expect the ByteString test to perform any differently from the Text test!) I do not see a performance difference between the two conduit versions.
Benchmarking note: I ran all benchmarks multiple times and saw consistent results. The results shown above are those from running each benchmark again while writing this blog entry. (They are not hand-selected from multiple runs.)
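The benchmark wrapper used in the programs above is not shown in this entry. As a rough idea of how such a helper could report similar figures, here is a minimal sketch using GHC.Clock and GHC.Stats from base. The name benchmarkSketch is hypothetical, the output format only approximates the listings above, and the memory statistics are only available when the program is run with the +RTS -T option.

```haskell
import Control.Monad (when)
import GHC.Clock (getMonotonicTime)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Run an action, then report elapsed wall clock time and, when the RTS
-- collects statistics (+RTS -T), the maximum residency and slop.
-- (Hypothetical stand-in for the benchmark function used above.)
benchmarkSketch :: IO a -> IO a
benchmarkSketch action = do
  start <- getMonotonicTime
  result <- action
  end <- getMonotonicTime
  putStrLn ("Wall clock time: " ++ show (end - start) ++ "s")
  statsEnabled <- getRTSStatsEnabled
  when statsEnabled $ do
    stats <- getRTSStats
    putStrLn ("Maximum residency: " ++ show (max_live_bytes stats) ++ " bytes")
    putStrLn ("Maximum slop: " ++ show (max_slop_bytes stats) ++ " bytes")
  pure result

main :: IO ()
main = benchmarkSketch (print (sum [1 .. 1000000 :: Int]))
```

Note that the maximum residency reported by the RTS is only sampled at major garbage collections, so short runs may report zero.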