Decoding UTF-8 in Haskell
In a previous blog entry, I mentioned that I wanted to support translation between the following Textual data types (in addition to handles and files) in LiterateX:

- String
- Text
- Lazy Text
- ByteString
- Lazy ByteString

Why would one explicitly implement support for Textual data types in such a library?
The API would be simpler if it just supported one
Textual data type. The implementation uses streams of
strict Text lines internally, transforming a stream of
input lines to a stream of output lines. An API that supports one
Textual data type could accept lazy Text and
generate a stream of strict Text lines. The stream of
strict Text lines that is output could then be returned as
lazy Text, with the newlines added. When using other
Textual data types, the calling function can perform the
conversion to/from lazy Text.
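As a sketch of what that single-type design could look like (hypothetical names, not LiterateX's actual API), the whole transformation might be wrapped as follows.
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Hypothetical single-type API: accept lazy Text, run a per-line
-- transformation over strict Text lines, and rebuild lazy Text with
-- the newlines restored
transformLazyText :: ([T.Text] -> [T.Text]) -> TL.Text -> TL.Text
transformLazyText processLines
    = TL.unlines
    . map TL.fromStrict
    . processLines
    . map TL.toStrict
    . TL.lines
Callers working with String or ByteString values would then convert to and from lazy Text around this function.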
I explicitly implemented support for Textual data types
in LiterateX because I have observed that the types used can
significantly affect the performance (runtime and memory usage) of the
software. In a project that I worked on in the past, benchmarking
indicated that decoding a UTF-8 ByteString to
Text was using a lot of memory. Through experimentation, I
discovered that splitting the (lazy) ByteString into lines
and decoding each line separately significantly reduced memory usage in
that application. Memory usage went from multiple gigabytes to less than
a gigabyte!
LiterateX is nowhere near as complex as that previous project, but my gut feeling is that I should use conduit and implement explicit support for each Textual data type in order to provide a high-quality implementation. This prioritizes performance over usability, but the usability of the API is still quite good.
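For example, per-type support could take the form of explicit conduit sources that each yield strict Text lines, with the rest of the pipeline shared. The following is a minimal sketch under that assumption (hypothetical names, not LiterateX's actual API); the ByteString source splits into lines before decoding, for the reasons discussed in the next section.
import Data.Conduit (ConduitT)
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Conduit.Combinators as CC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- Hypothetical source of strict Text lines from strict Text input
sourceTextLines :: Monad m => T.Text -> ConduitT i T.Text m ()
sourceTextLines = CC.yieldMany . T.lines

-- Hypothetical source of strict Text lines from a lazy UTF-8
-- ByteString, splitting into lines before decoding each line
sourceByteStringLines :: Monad m => BSL.ByteString -> ConduitT i T.Text m ()
sourceByteStringLines
    = CC.yieldMany
    . map (TE.decodeUtf8With TEE.lenientDecode . BSL.toStrict)
    . BSL8.lines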
UTF-8 Line Splitting
Before getting into benchmarks, I should make a note about UTF-8 line
splitting. The Data.ByteString.Char8 and
Data.ByteString.Lazy.Char8 modules in the bytestring
package provide an API for working with ByteString values
using 8-bit characters. Some of these functions should never be used with UTF-8 ByteString values, but others are safe because UTF-8 is backward compatible with US-ASCII.
With UTF-8, each Unicode code point is encoded using one to four bytes. Only the first 128 code points are encoded using one byte, and they have the same encoding as US-ASCII. All other code points are encoded using more than one byte, and every one of those bytes has the high bit set. This ensures that the bytes that encode US-ASCII characters never occur in the encoding of other code points.
Because of this encoding property, one does not have to decode UTF-8 in order to split lines safely. The line feed and carriage return characters are US-ASCII, so they can be located safely before (or even without) decoding.
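As a quick check of this property, the following example (hypothetical input, not from the benchmark data) confirms that splitting the encoded bytes on the newline byte produces the same lines as decoding first.
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- 'é' encodes as 0xC3 0xA9 and 'ö' as 0xC3 0xB6: every byte has the
-- high bit set, so byte 0x0A can only ever be a line feed
splitBeforeDecode :: Bool  -- evaluates to True
splitBeforeDecode =
    map TLE.decodeUtf8 (BSL8.lines bytes) == TL.lines (TLE.decodeUtf8 bytes)
  where
    bytes = TLE.encodeUtf8 (TL.pack "héllo\nwörld")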
Benchmarks
When I had a little bit of time over the weekend, I decided to write some minimal benchmarks. The benchmarks process a COVID-19 Case Surveillance Public Use Data CSV file.
$ wc --bytes COVID-19_Case_Surveillance_Public_Use_Data.csv | awk '{print $1}'
2893026390
$ wc --lines COVID-19_Case_Surveillance_Public_Use_Data.csv | awk '{print $1}'
24441352
Each benchmark processes strict Text lines. For the
actual processing, I just count the number of female entries.
countFemale :: Int -> T.Text -> Int
countFemale count line
    | ",Female," `T.isInfixOf` line = count + 1
    | otherwise = count
The number of female entries can be confirmed using the excellent xsv utility.
$ xsv frequency -s sex COVID-19_Case_Surveillance_Public_Use_Data.csv
field,value,count
sex,Female,12620227
sex,Male,11552444
sex,Unknown,214452
sex,Missing,53752
sex,Other,457
sex,NA,19
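The benchmark wrapper and dataPath used by each program below are presumably defined in the benchmark sources and are not shown here. A minimal sketch, assuming the wrapper simply times an IO action (the actual version evidently also reports the GHC RTS statistics shown in the results), might look like this.
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Minimal sketch of a timing wrapper; the real one also reports
-- maximum residency, slop, and productivity via GHC RTS statistics
benchmark :: IO () -> IO ()
benchmark action = do
    start <- getCurrentTime
    action
    end <- getCurrentTime
    putStrLn $ "Wall clock time: " ++ show (end `diffUTCTime` start)

dataPath :: FilePath
dataPath = "COVID-19_Case_Surveillance_Public_Use_Data.csv"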
Text
This benchmark, implemented in bench-text.hs,
reads the file as lazy Text, splits the contents into
lines, converts each line to strict Text, and processes the
lines.
main :: IO ()
main
    = benchmark
    $ print
    . foldl' countFemale 0
    . map TL.toStrict
    . TL.lines
    =<< TLIO.readFile dataPath
The results on my system are as follows.
$ stack exec bench-text
12620227
Wall clock time: 50.061926917s
Maximum residency: 62.8 KB
Maximum slop: 31.4 KB
Productivity (wall clock time): 99.1 %
Productivity (CPU time): 99.1 %
ByteString
This benchmark, implemented in bench-bs.hs,
reads the file as lazy ByteString, decodes the UTF-8
ByteString to Text, splits the contents into
lines, converts each line to strict Text, and processes the
lines.
main :: IO ()
main
    = benchmark
    $ print
    . foldl' countFemale 0
    . map TL.toStrict
    . TL.lines
    . TLE.decodeUtf8With TEE.lenientDecode
    =<< BSL.readFile dataPath
The results on my system are as follows.
$ stack exec bench-bs
12620227
Wall clock time: 43.120697698s
Maximum residency: 124.6 KB
Maximum slop: 27.0 KB
Productivity (wall clock time): 99.2 %
Productivity (CPU time): 99.2 %
ByteString Lines
This benchmark, implemented in bench-bslines.hs,
reads the file as lazy ByteString, splits the UTF-8
contents to lines, decodes each line to strict Text, and
processes the lines.
main :: IO ()
main
    = benchmark
    $ print
    . foldl' countFemale 0
    . map (TL.toStrict . TLE.decodeUtf8With TEE.lenientDecode)
    . BSL8.lines
    =<< BSL.readFile dataPath
The results on my system are as follows.
$ stack exec bench-bslines
12620227
Wall clock time: 16.983949916s
Maximum residency: 96.2 KB
Maximum slop: 30.8 KB
Productivity (wall clock time): 98.8 %
Productivity (CPU time): 98.9 %
Conduit ByteString
This benchmark, implemented in bench-cbs.hs,
uses conduit to read and process the file. It reads the
file as ByteString, decodes the UTF-8 content to
Text, splits the content to lines, and processes the
lines.
main :: IO ()
main
    = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath
    .| CC.decodeUtf8Lenient
    .| CC.linesUnbounded
    .| CC.foldl countFemale 0
The results on my system are as follows.
$ stack exec bench-cbs
12620227
Wall clock time: 68.559410489s
Maximum residency: 104.5 KB
Maximum slop: 14.4 KB
Productivity (wall clock time): 96.2 %
Productivity (CPU time): 96.4 %
Conduit ByteString Lines
This benchmark, implemented in bench-cbslines.hs,
also uses conduit to read and process the file. It reads
the file as ByteString, splits the UTF-8 contents to lines,
decodes each line to Text, and processes the lines.
main :: IO ()
main
    = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath
    .| CC.linesUnboundedAscii
    .| CC.decodeUtf8Lenient
    .| CC.foldl countFemale 0
The results on my system are as follows.
$ stack exec bench-cbslines
12620227
Wall clock time: 40.910613058s
Maximum residency: 76.1 KB
Maximum slop: 18.5 KB
Productivity (wall clock time): 98.5 %
Productivity (CPU time): 98.6 %
There is an issue with this code! See Decoding UTF-8 in Haskell (Part 2) for an updated version.
Observations
The following table shows an overview of the results.
| Benchmark | Time (s) | Mem (KB) | Slop (KB) | Clock Prod (%) | CPU Prod (%) |
|---|---|---|---|---|---|
| text | 50.0 | 62.8 | 31.4 | 99.1 | 99.1 |
| bs | 43.1 | 124.6 | 27.0 | 99.2 | 99.2 |
| bslines | 17.0 | 96.2 | 30.8 | 98.8 | 98.9 |
| cbs | 68.6 | 104.5 | 14.4 | 96.2 | 96.4 |
| cbslines | 40.9 | 76.1 | 18.5 | 98.5 | 98.6 |
In these benchmarks, there is not a big difference in memory usage, but there is a big difference in runtime. Splitting the UTF-8 ByteString contents into lines and decoding each line separately is significantly faster: less than half the runtime of decoding the whole stream first. The gain is not as large when using conduit, however.
Benchmarking note: I ran all benchmarks multiple times and saw consistent results. The results shown above are from running each benchmark once more while writing this blog entry. (They are not hand-selected from multiple runs.)