# Decoding UTF-8 in Haskell
In a previous blog entry, I mentioned that I wanted to support translation between the following `Textual` data types (in addition to handles and files) in LiterateX:
- `String`
- `Text`
    - lazy `Text`
- `ByteString`
    - lazy `ByteString`
Why would one explicitly implement support for `Textual` data types in such a library? The API would be simpler if it just supported one `Textual` data type. The implementation uses streams of strict `Text` lines internally, transforming a stream of input lines to a stream of output lines. An API that supports one `Textual` data type could accept lazy `Text` and generate a stream of strict `Text` lines. The stream of strict `Text` lines that is output could then be returned as lazy `Text`, with the newlines added. When using other `Textual` data types, the calling function can perform the conversion to/from lazy `Text`.
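To illustrate, here is a minimal sketch of that hypothetical single-type design. The names `transformLazy` and `f` are illustrative, not LiterateX's actual API:

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Sketch: accept lazy Text, run a line-oriented transformation over
-- strict Text lines, and rebuild lazy Text with the newlines re-added.
transformLazy :: ([T.Text] -> [T.Text]) -> TL.Text -> TL.Text
transformLazy f
    = TL.unlines          -- re-add the newlines
    . map TL.fromStrict   -- strict output lines back to lazy Text
    . f                   -- transform the stream of lines
    . map TL.toStrict     -- lazy input lines to strict Text
    . TL.lines            -- split the input into lines
```

Callers using `String` or `ByteString` would then convert to and from lazy `Text` around this one function.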
I explicitly implemented support for `Textual` data types in LiterateX because I have observed that the types used can significantly affect the performance (runtime and memory usage) of the software. In a project that I worked on in the past, benchmarking indicated that decoding a UTF-8 `ByteString` to `Text` was using a lot of memory. Through experimentation, I discovered that splitting the (lazy) `ByteString` into lines and decoding each line separately significantly reduced memory usage in that application. Memory usage went from multiple gigabytes to less than a gigabyte!
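The fix amounted to something like the following sketch (illustrative, not the original project's code); the same approach is benchmarked below as `bslines`:

```haskell
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- Decode one line at a time instead of the whole input at once, so
-- only a single line's worth of bytes is processed by each decode.
decodeByLine :: BSL.ByteString -> [T.Text]
decodeByLine =
    map (TE.decodeUtf8With TEE.lenientDecode . BSL.toStrict) . BSL8.lines
```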
LiterateX is nowhere near as complex as that previous project, but my gut feeling is that I should use `conduit` and provide explicit implementations for each `Textual` data type in order to provide a high-quality implementation. This prioritizes performance over usability, but the usability of the API is still quite good.
## UTF-8 Line Splitting
Before getting into benchmarks, I should make a note about UTF-8 line splitting. The `Data.ByteString.Char8` and `Data.ByteString.Lazy.Char8` modules in the `bytestring` package provide an API for working with `ByteString` values using 8-bit characters. Some of these functions should never be used with UTF-8 `ByteString` values, but other functions are safe to use with UTF-8 `ByteString` values because UTF-8 is backward compatible with US-ASCII.
With UTF-8, each Unicode code point is encoded using one to four bytes. Only the first 128 code points are encoded using one byte, and they have the same encoding as US-ASCII. All other code points are encoded using more than one byte, where every byte has the high bit set. This ensures that the bytes that encode US-ASCII never occur in the encoding of other code points.
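For example, the following GHCi session (illustrative) shows the bytes of a string containing a one-byte character (`A`), a two-byte character (`é`), and a three-byte character (`€`); every byte of the multi-byte characters is 128 or greater:

```haskell
ghci> :set -XOverloadedStrings
ghci> import qualified Data.ByteString as BS
ghci> import qualified Data.Text.Encoding as TE
ghci> BS.unpack (TE.encodeUtf8 "Aé€")
[65,195,169,226,130,172]
```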
Because of this encoding property, one does not have to decode UTF-8 to safely split lines. The line feed and carriage return characters are US-ASCII and can therefore be safely processed without (before) decoding.
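Concretely, a function like `Data.ByteString.Lazy.Char8.lines` is safe on UTF-8 input, while per-character functions from the same module are not. A small sketch:

```haskell
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Char as Char

-- Safe: the LF byte (0x0A) is US-ASCII, so it never occurs inside a
-- multi-byte UTF-8 sequence, and splitting on it cannot corrupt one.
splitUtf8Lines :: BSL.ByteString -> [BSL.ByteString]
splitUtf8Lines = BSL8.lines

-- NOT safe: BSL8.map treats each byte as a character, which corrupts
-- the bytes of multi-byte code points.
badUpper :: BSL.ByteString -> BSL.ByteString
badUpper = BSL8.map Char.toUpper
```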
## Benchmarks

When I had a little bit of time over the weekend, I decided to write some minimal benchmarks. The benchmarks process a COVID-19 Case Surveillance Public Use Data CSV file.

```
$ wc --bytes COVID-19_Case_Surveillance_Public_Use_Data.csv | awk '{print $1}'
2893026390
$ wc --lines COVID-19_Case_Surveillance_Public_Use_Data.csv | awk '{print $1}'
24441352
```
Each benchmark processes strict `Text` lines. For the actual processing, I just count the number of female entries.

```haskell
countFemale :: Int -> T.Text -> Int
countFemale count line
    | ",Female," `T.isInfixOf` line = count + 1
    | otherwise                     = count
```
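For example, in GHCi (illustrative data, with `countFemale` in scope):

```haskell
ghci> :set -XOverloadedStrings
ghci> import Data.List (foldl')
ghci> foldl' countFemale 0 ["1,Female,US", "2,Male,US", "3,Female,GB"]
2
```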
The number of female entries can be confirmed using the excellent `xsv` utility.

```
$ xsv frequency -s sex COVID-19_Case_Surveillance_Public_Use_Data.csv
field,value,count
sex,Female,12620227
sex,Male,11552444
sex,Unknown,214452
sex,Missing,53752
sex,Other,457
sex,NA,19
```
### Text

This benchmark, implemented in `bench-text.hs`, reads the file as lazy `Text`, splits the contents into lines, converts each line to strict `Text`, and processes the lines.

```haskell
main :: IO ()
main = benchmark
    $ print
    . foldl' countFemale 0
    . map TL.toStrict
    . TL.lines
    =<< TLIO.readFile dataPath
```
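The `benchmark` wrapper (shared by all of these programs) is not shown in these snippets. A minimal sketch of what such a helper might look like, assuming the program is run with `+RTS -T` so that runtime statistics are available (hypothetical; the actual helper may differ):

```haskell
import qualified Data.Time.Clock as Clock
import qualified GHC.Stats as Stats

-- Hypothetical sketch: time an IO action and report RTS statistics.
-- Requires running the program with +RTS -T.
benchmark :: IO () -> IO ()
benchmark action = do
    start <- Clock.getCurrentTime
    action
    end <- Clock.getCurrentTime
    putStrLn $ "Wall clock time: " ++ show (Clock.diffUTCTime end start)
    stats <- Stats.getRTSStats
    putStrLn $ "Maximum residency: "
      ++ show (Stats.max_live_bytes stats) ++ " bytes"
```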
The results on my system are as follows.
```
$ stack exec bench-text
12620227
Wall clock time: 50.061926917s
Maximum residency: 62.8 KB
Maximum slop: 31.4 KB
Productivity (wall clock time): 99.1 %
Productivity (CPU time): 99.1 %
```
### ByteString

This benchmark, implemented in `bench-bs.hs`, reads the file as lazy `ByteString`, decodes the UTF-8 `ByteString` to `Text`, splits the contents into lines, converts each line to strict `Text`, and processes the lines.

```haskell
main :: IO ()
main = benchmark
    $ print
    . foldl' countFemale 0
    . map TL.toStrict
    . TL.lines
    . TLE.decodeUtf8With TEE.lenientDecode
    =<< BSL.readFile dataPath
```
The results on my system are as follows.
```
$ stack exec bench-bs
12620227
Wall clock time: 43.120697698s
Maximum residency: 124.6 KB
Maximum slop: 27.0 KB
Productivity (wall clock time): 99.2 %
Productivity (CPU time): 99.2 %
```
### ByteString Lines

This benchmark, implemented in `bench-bslines.hs`, reads the file as lazy `ByteString`, splits the UTF-8 contents into lines, decodes each line to strict `Text`, and processes the lines.

```haskell
main :: IO ()
main = benchmark
    $ print
    . foldl' countFemale 0
    . map (TL.toStrict . TLE.decodeUtf8With TEE.lenientDecode)
    . BSL8.lines
    =<< BSL.readFile dataPath
```
The results on my system are as follows.
```
$ stack exec bench-bslines
12620227
Wall clock time: 16.983949916s
Maximum residency: 96.2 KB
Maximum slop: 30.8 KB
Productivity (wall clock time): 98.8 %
Productivity (CPU time): 98.9 %
```
### Conduit ByteString

This benchmark, implemented in `bench-cbs.hs`, uses `conduit` to read and process the file. It reads the file as `ByteString`, decodes the UTF-8 content to `Text`, splits the content into lines, and processes the lines.

```haskell
main :: IO ()
main = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath    -- stream the file as ByteString chunks
   .| CC.decodeUtf8Lenient      -- decode the UTF-8 stream to Text
   .| CC.linesUnbounded         -- split the Text stream into lines
   .| CC.foldl countFemale 0    -- count the matching lines
```
The results on my system are as follows.
```
$ stack exec bench-cbs
12620227
Wall clock time: 68.559410489s
Maximum residency: 104.5 KB
Maximum slop: 14.4 KB
Productivity (wall clock time): 96.2 %
Productivity (CPU time): 96.4 %
```
### Conduit ByteString Lines

This benchmark, implemented in `bench-cbslines.hs`, also uses `conduit` to read and process the file. It reads the file as `ByteString`, splits the UTF-8 contents into lines, decodes each line to `Text`, and processes the lines.

```haskell
main :: IO ()
main = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath     -- stream the file as ByteString chunks
   .| CC.linesUnboundedAscii     -- split on the LF byte, before decoding
   .| CC.decodeUtf8Lenient       -- decode each line to Text
   .| CC.foldl countFemale 0     -- count the matching lines
```
The results on my system are as follows.
```
$ stack exec bench-cbslines
12620227
Wall clock time: 40.910613058s
Maximum residency: 76.1 KB
Maximum slop: 18.5 KB
Productivity (wall clock time): 98.5 %
Productivity (CPU time): 98.6 %
```
**There is an issue with this code!** See Decoding UTF-8 in Haskell (Part 2) for an updated version.
## Observations
The following table shows an overview of the results.

| Benchmark | Time (s) | Mem (KB) | Slop (KB) | Clock Prod (%) | CPU Prod (%) |
|---|---|---|---|---|---|
| text | 50.0 | 62.8 | 31.4 | 99.1 | 99.1 |
| bs | 43.1 | 124.6 | 27.0 | 99.2 | 99.2 |
| bslines | 17.0 | 96.2 | 30.8 | 98.8 | 98.9 |
| cbs | 68.6 | 104.5 | 14.4 | 96.2 | 96.4 |
| cbslines | 40.9 | 76.1 | 18.5 | 98.5 | 98.6 |
In these benchmarks, there is not a big difference in memory usage, but there is a difference in runtime. Splitting the UTF-8 `ByteString` contents into lines and decoding each line separately results in significantly faster performance. The performance gain is not as significant when using `conduit`, however.
Benchmarking note: I ran all benchmarks multiple times and saw consistent results. The results shown above are from running each benchmark once more while writing this blog entry. (They are not hand-selected from multiple runs.)