
Decoding UTF-8 in Haskell

In a previous blog entry, I mentioned that I wanted to support translation between the following Textual data types (in addition to handles and files) in LiterateX:

  • String
  • Text
  • Lazy Text
  • ByteString
  • Lazy ByteString

Why would one explicitly implement support for each of these types in such a library? The API would be simpler if it supported just one. The implementation uses streams of strict Text lines internally, transforming a stream of input lines into a stream of output lines. An API that supports a single type could accept lazy Text, split it into a stream of strict Text lines, and return the output stream as lazy Text with the newlines added. Callers using any other Textual data type could then convert to and from lazy Text themselves, as sketched below.
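As a rough sketch of that single-type design (the names here are hypothetical and illustrative, not LiterateX's actual API), the wrappers might look like this:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Hypothetical core: transform a stream of strict Text lines.  Lists
-- stand in for the stream type here; LiterateX streams with conduit.
processLines :: [T.Text] -> [T.Text]
processLines = map ("-- " <>)  -- placeholder transformation

-- The single-type API: lazy Text in, lazy Text out
processLazyText :: TL.Text -> TL.Text
processLazyText
    = TL.unlines . map TL.fromStrict  -- rejoin lines, adding newlines
    . processLines                    -- transform the stream of lines
    . map TL.toStrict . TL.lines      -- split into strict Text lines

-- A caller using another type converts at the boundary
processString :: String -> String
processString = TL.unpack . processLazyText . TL.pack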

I explicitly implemented support for Textual data types in LiterateX because I have observed that the types used can significantly affect the performance (runtime and memory usage) of the software. In a project that I worked on in the past, benchmarking indicated that decoding a UTF-8 ByteString to Text was using a lot of memory. Through experimentation, I discovered that splitting the (lazy) ByteString into lines and decoding each line separately significantly reduced memory usage in that application. Memory usage went from multiple gigabytes to less than a gigabyte!

LiterateX is nowhere near as complex as that previous project, but my gut feeling is that I should use conduit and provide explicit implementations for each Textual data type in order to provide a high-quality implementation. This prioritizes performance over simplicity, but the usability of the API is still quite good.

UTF-8 Line Splitting

Before getting into benchmarks, I should make a note about UTF-8 line splitting. The Data.ByteString.Char8 and Data.ByteString.Lazy.Char8 modules in the bytestring package provide an API for working with ByteString values using 8-bit characters. Some of these functions should never be used with UTF-8 ByteString values, but others are safe because UTF-8 is backward compatible with US-ASCII.

With UTF-8, each Unicode code point is encoded using one to four bytes. Only the first 128 code points are encoded using a single byte, and they have the same encoding as US-ASCII. All other code points are encoded using multiple bytes, each of which has the high bit set. This ensures that the bytes that encode US-ASCII characters never occur in the encoding of any other code point.

Because of this encoding property, one does not need to decode UTF-8 in order to split lines safely. The line feed and carriage return characters are US-ASCII, so they can be found and handled before (or entirely without) decoding.
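For example, the following sketch (illustrative, not from LiterateX) splits a UTF-8 ByteString containing multi-byte code points into lines without decoding it first:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Text.Lazy.Encoding as TLE

-- BSL8.lines splits on the line feed byte (0x0A).  This is safe for
-- UTF-8 input because 0x0A never occurs within a multi-byte sequence:
-- every byte of such a sequence has the high bit set.
main :: IO ()
main = mapM_ BSL8.putStrLn . BSL8.lines $
    TLE.encodeUtf8 "naïve\nrésumé\nfaçade"

Running this prints the three words on separate lines; the multi-byte characters pass through the split untouched.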

Benchmarks

When I had a little bit of time over the weekend, I decided to write some minimal benchmarks. The benchmarks process a COVID-19 Case Surveillance Public Use Data CSV file.

$ wc --bytes COVID-19_Case_Surveillance_Public_Use_Data.csv | awk '{print $1}'
2893026390
$ wc --lines COVID-19_Case_Surveillance_Public_Use_Data.csv | awk '{print $1}'
24441352

Each benchmark processes strict Text lines. For the actual processing, I just count the number of female entries. (The snippets below use the usual qualified imports: T for Data.Text, TL for Data.Text.Lazy, TLIO for Data.Text.Lazy.IO, TLE for Data.Text.Lazy.Encoding, TEE for Data.Text.Encoding.Error, BSL for Data.ByteString.Lazy, BSL8 for Data.ByteString.Lazy.Char8, and C/CC for the conduit modules.)

-- Increment the count when the line contains a female entry
countFemale :: Int -> T.Text -> Int
countFemale count line
    | ",Female," `T.isInfixOf` line = count + 1
    | otherwise                     = count
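For example, folding countFemale over some sample lines counts those that contain a female entry (a hypothetical GHCi session, assuming OverloadedStrings is enabled and Data.List.foldl' is in scope):

> foldl' countFemale 0 ["1,Female,NY", "2,Male,CA", "3,Female,TX"]
2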

The number of female entries can be confirmed using the excellent xsv utility.

$ xsv frequency -s sex COVID-19_Case_Surveillance_Public_Use_Data.csv
field,value,count
sex,Female,12620227
sex,Male,11552444
sex,Unknown,214452
sex,Missing,53752
sex,Other,457
sex,NA,19
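The benchmark wrapper shared by these programs is not shown in this entry. The statistics reported below correspond to what GHC's runtime system tracks, so a minimal, hypothetical sketch of such a wrapper (assuming the program is built with -rtsopts and run with +RTS -T; the real helper may differ) could be:

import GHC.Stats (RTSStats (..), getRTSStats)
import Text.Printf (printf)

-- Run an action, then report runtime-system statistics.  This is a
-- sketch: it assumes RTS statistics are enabled via the -T RTS option.
benchmark :: IO () -> IO ()
benchmark action = do
    action
    stats <- getRTSStats
    let toSecs ns   = fromIntegral ns / 1e9 :: Double
        toKB b      = fromIntegral b / 1000 :: Double
        percent x y = 100 * fromIntegral x / fromIntegral y :: Double
    printf "Wall clock time: %.9fs\n" (toSecs $ elapsed_ns stats)
    printf "Maximum residency: %.1f KB\n" (toKB $ max_live_bytes stats)
    printf "Maximum slop: %.1f KB\n" (toKB $ max_slop_bytes stats)
    printf "Productivity (wall clock time): %.1f %%\n" $
      percent (mutator_elapsed_ns stats) (elapsed_ns stats)
    printf "Productivity (CPU time): %.1f %%\n" $
      percent (mutator_cpu_ns stats) (cpu_ns stats)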

Text

This benchmark, implemented in bench-text.hs, reads the file as lazy Text, splits the contents into lines, converts each line to strict Text, and processes the lines.

main :: IO ()
main
    = benchmark
    $ print
    . foldl' countFemale 0      -- process the lines
    . map TL.toStrict           -- convert each line to strict Text
    . TL.lines                  -- split the contents into lines
    =<< TLIO.readFile dataPath  -- read the file as lazy Text

The results on my system are as follows.

$ stack exec bench-text
12620227
Wall clock time: 50.061926917s
Maximum residency: 62.8 KB
Maximum slop: 31.4 KB
Productivity (wall clock time): 99.1 %
Productivity (CPU time): 99.1 %

ByteString

This benchmark, implemented in bench-bs.hs, reads the file as lazy ByteString, decodes the UTF-8 ByteString to Text, splits the contents into lines, converts each line to strict Text, and processes the lines.

main :: IO ()
main
    = benchmark
    $ print
    . foldl' countFemale 0                  -- process the lines
    . map TL.toStrict                       -- convert each line to strict Text
    . TL.lines                              -- split the contents into lines
    . TLE.decodeUtf8With TEE.lenientDecode  -- decode the UTF-8 ByteString
    =<< BSL.readFile dataPath               -- read the file as lazy ByteString

The results on my system are as follows.

$ stack exec bench-bs
12620227
Wall clock time: 43.120697698s
Maximum residency: 124.6 KB
Maximum slop: 27.0 KB
Productivity (wall clock time): 99.2 %
Productivity (CPU time): 99.2 %

ByteString Lines

This benchmark, implemented in bench-bslines.hs, reads the file as lazy ByteString, splits the UTF-8 contents into lines, decodes each line to strict Text, and processes the lines.

main :: IO ()
main
    = benchmark
    $ print
    . foldl' countFemale 0     -- process the lines
    . map (TL.toStrict . TLE.decodeUtf8With TEE.lenientDecode)
                               -- decode each line to strict Text
    . BSL8.lines               -- split the UTF-8 contents into lines
    =<< BSL.readFile dataPath  -- read the file as lazy ByteString

The results on my system are as follows.

$ stack exec bench-bslines
12620227
Wall clock time: 16.983949916s
Maximum residency: 96.2 KB
Maximum slop: 30.8 KB
Productivity (wall clock time): 98.8 %
Productivity (CPU time): 98.9 %

Conduit ByteString

This benchmark, implemented in bench-cbs.hs, uses conduit to read and process the file. It reads the file as ByteString, decodes the UTF-8 content to Text, splits the content into lines, and processes the lines.

main :: IO ()
main
    = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath   -- stream the file as ByteString chunks
    .| CC.decodeUtf8Lenient    -- decode the UTF-8 content to Text
    .| CC.linesUnbounded       -- split the content into lines
    .| CC.foldl countFemale 0  -- process the lines

The results on my system are as follows.

$ stack exec bench-cbs
12620227
Wall clock time: 68.559410489s
Maximum residency: 104.5 KB
Maximum slop: 14.4 KB
Productivity (wall clock time): 96.2 %
Productivity (CPU time): 96.4 %

Conduit ByteString Lines

This benchmark, implemented in bench-cbslines.hs, also uses conduit to read and process the file. It reads the file as ByteString, splits the UTF-8 contents into lines, decodes each line to Text, and processes the lines.

main :: IO ()
main
    = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath    -- stream the file as ByteString chunks
    .| CC.linesUnboundedAscii   -- split the UTF-8 contents into lines
    .| CC.decodeUtf8Lenient     -- decode each line to Text
    .| CC.foldl countFemale 0   -- process the lines

The results on my system are as follows.

$ stack exec bench-cbslines
12620227
Wall clock time: 40.910613058s
Maximum residency: 76.1 KB
Maximum slop: 18.5 KB
Productivity (wall clock time): 98.5 %
Productivity (CPU time): 98.6 %

There is an issue with this code! See Decoding UTF-8 in Haskell (Part 2) for an updated version.

Observations

The following table shows an overview of the results.

Benchmark  Time (s)  Residency (KB)  Slop (KB)  Wall Prod (%)  CPU Prod (%)
text           50.0            62.8       31.4           99.1          99.1
bs             43.1           124.6       27.0           99.2          99.2
bslines        17.0            96.2       30.8           98.8          98.9
cbs            68.6           104.5       14.4           96.2          96.4
cbslines       40.9            76.1       18.5           98.5          98.6

In these benchmarks, the differences in memory usage are small, but the differences in runtime are substantial. Splitting the UTF-8 ByteString contents into lines and decoding each line separately is significantly faster: roughly 2.5 times faster without conduit (17.0s versus 43.1s) and roughly 1.7 times faster with conduit (40.9s versus 68.6s).

Benchmarking note: I ran all benchmarks multiple times and saw consistent results. The results shown above are from running each benchmark once more while writing this blog entry; they are not hand-selected from multiple runs.

Author

Travis Cardwell
