Decoding UTF-8 in Haskell
In a previous blog entry, I mentioned that I wanted to support translation between the following Textual data types (in addition to handles and files) in LiterateX:

- String
- Text
- Lazy Text
- ByteString
- Lazy ByteString

Why would one explicitly implement support for Textual data types in such a library?
The API would be simpler if it just supported one
Textual data type. The implementation uses streams of
strict Text lines internally, transforming a stream of
input lines to a stream of output lines. An API that supports one
Textual data type could accept lazy Text and
generate a stream of strict Text lines. The stream of
strict Text lines that is output could then be returned as
lazy Text, with the newlines added. When using other
Textual data types, the calling function can perform the
conversion to/from lazy Text.
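As a sketch of what that single-type design could look like (hypothetical names, not LiterateX's actual API), the whole transformation might be wrapped as follows.
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Hypothetical single-type API: accept lazy Text, run a per-line
-- transformation over strict Text lines, and rebuild lazy Text with
-- the newlines restored
transformLazyText :: ([T.Text] -> [T.Text]) -> TL.Text -> TL.Text
transformLazyText processLines
    = TL.unlines
    . map TL.fromStrict
    . processLines
    . map TL.toStrict
    . TL.lines
Callers working with String or ByteString values would then convert to and from lazy Text around this function.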
I explicitly implemented support for Textual data types
in LiterateX because I have observed that the types used can
significantly affect the performance (runtime and memory usage) of the
software. In a project that I worked on in the past, benchmarking
indicated that decoding a UTF-8 ByteString to
Text was using a lot of memory. Through experimentation, I
discovered that splitting the (lazy) ByteString into lines
and decoding each line separately significantly reduced memory usage in
that application. Memory usage went from multiple gigabytes to less than
a gigabyte!
LiterateX is nowhere near as complex as that previous project, but my gut feeling is that I should use conduit and implement explicit support for each Textual data type in order to provide a high-quality implementation. This prioritizes performance over usability, but the usability of the API is still quite good.
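For example, per-type support could take the form of explicit conduit sources that each yield strict Text lines, with the rest of the pipeline shared. The following is a minimal sketch under that assumption (hypothetical names, not LiterateX's actual API); the ByteString source splits into lines before decoding, for the reasons discussed in the next section.
import Data.Conduit (ConduitT)
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Conduit.Combinators as CC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- Hypothetical source of strict Text lines from strict Text input
sourceTextLines :: Monad m => T.Text -> ConduitT i T.Text m ()
sourceTextLines = CC.yieldMany . T.lines

-- Hypothetical source of strict Text lines from a lazy UTF-8
-- ByteString, splitting into lines before decoding each line
sourceByteStringLines :: Monad m => BSL.ByteString -> ConduitT i T.Text m ()
sourceByteStringLines
    = CC.yieldMany
    . map (TE.decodeUtf8With TEE.lenientDecode . BSL.toStrict)
    . BSL8.lines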
UTF-8 Line Splitting
Before getting into benchmarks, I should make a note about UTF-8 line
splitting. The Data.ByteString.Char8 and
Data.ByteString.Lazy.Char8 modules in the bytestring
package provide an API for working with ByteString values
using 8-bit characters. Some of these functions should never be used with UTF-8 ByteString values, but others are safe because UTF-8 is backward compatible with US-ASCII.
With UTF-8, each Unicode code point is encoded using one to four bytes. Only the first 128 code points are encoded using one byte, and they have the same encoding as US-ASCII. All other code points are encoded using more than one byte, and every one of those bytes has the high bit set. This ensures that the bytes that encode US-ASCII characters never occur in the encoding of other code points.
Because of this encoding property, one does not have to decode UTF-8 in order to split lines safely. The line feed and carriage return characters are US-ASCII, so they can be located safely before (or even without) decoding.
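As a quick check of this property, the following example (hypothetical input, not from the benchmark data) confirms that splitting the encoded bytes on the newline byte produces the same lines as decoding first.
import qualified Data.ByteString.Lazy.Char8 as BSL8
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- 'é' encodes as 0xC3 0xA9 and 'ö' as 0xC3 0xB6: every byte has the
-- high bit set, so byte 0x0A can only ever be a line feed
splitBeforeDecode :: Bool  -- evaluates to True
splitBeforeDecode =
    map TLE.decodeUtf8 (BSL8.lines bytes) == TL.lines (TLE.decodeUtf8 bytes)
  where
    bytes = TLE.encodeUtf8 (TL.pack "héllo\nwörld")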
Benchmarks
When I had a little bit of time over the weekend, I decided to write some minimal benchmarks. The benchmarks process a COVID-19 Case Surveillance Public Use Data CSV file.
$ wc --bytes COVID-19_Case_Surveillance_Public_Use_Data.csv | awk '{print $1}'
2893026390
$ wc --lines COVID-19_Case_Surveillance_Public_Use_Data.csv | awk '{print $1}'
24441352
Each benchmark processes strict Text lines. For the
actual processing, I just count the number of female entries.
countFemale :: Int -> T.Text -> Int
countFemale count line
    | ",Female," `T.isInfixOf` line = count + 1
    | otherwise = count
The number of female entries can be confirmed using the excellent xsv utility.
$ xsv frequency -s sex COVID-19_Case_Surveillance_Public_Use_Data.csv
field,value,count
sex,Female,12620227
sex,Male,11552444
sex,Unknown,214452
sex,Missing,53752
sex,Other,457
sex,NA,19
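The benchmark wrapper and dataPath used by each program below are presumably defined in the benchmark sources and are not shown here. A minimal sketch, assuming the wrapper simply times an IO action (the actual version evidently also reports the GHC RTS statistics shown in the results), might look like this.
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Minimal sketch of a timing wrapper; the real one also reports
-- maximum residency, slop, and productivity via GHC RTS statistics
benchmark :: IO () -> IO ()
benchmark action = do
    start <- getCurrentTime
    action
    end <- getCurrentTime
    putStrLn $ "Wall clock time: " ++ show (end `diffUTCTime` start)

dataPath :: FilePath
dataPath = "COVID-19_Case_Surveillance_Public_Use_Data.csv"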
Text
This benchmark, implemented in bench-text.hs,
reads the file as lazy Text, splits the contents into
lines, converts each line to strict Text, and processes the
lines.
main :: IO ()
main
    = benchmark
    $ print
    . foldl' countFemale 0
    . map TL.toStrict
    . TL.lines
    =<< TLIO.readFile dataPath
The results on my system are as follows.
$ stack exec bench-text
12620227
Wall clock time: 50.061926917s
Maximum residency: 62.8 KB
Maximum slop: 31.4 KB
Productivity (wall clock time): 99.1 %
Productivity (CPU time): 99.1 %
ByteString
This benchmark, implemented in bench-bs.hs,
reads the file as lazy ByteString, decodes the UTF-8
ByteString to Text, splits the contents into
lines, converts each line to strict Text, and processes the
lines.
main :: IO ()
main
    = benchmark
    $ print
    . foldl' countFemale 0
    . map TL.toStrict
    . TL.lines
    . TLE.decodeUtf8With TEE.lenientDecode
    =<< BSL.readFile dataPath
The results on my system are as follows.
$ stack exec bench-bs
12620227
Wall clock time: 43.120697698s
Maximum residency: 124.6 KB
Maximum slop: 27.0 KB
Productivity (wall clock time): 99.2 %
Productivity (CPU time): 99.2 %
ByteString Lines
This benchmark, implemented in bench-bslines.hs,
reads the file as lazy ByteString, splits the UTF-8
contents to lines, decodes each line to strict Text, and
processes the lines.
main :: IO ()
main
    = benchmark
    $ print
    . foldl' countFemale 0
    . map (TL.toStrict . TLE.decodeUtf8With TEE.lenientDecode)
    . BSL8.lines
    =<< BSL.readFile dataPath
The results on my system are as follows.
$ stack exec bench-bslines
12620227
Wall clock time: 16.983949916s
Maximum residency: 96.2 KB
Maximum slop: 30.8 KB
Productivity (wall clock time): 98.8 %
Productivity (CPU time): 98.9 %
Conduit ByteString
This benchmark, implemented in bench-cbs.hs,
uses conduit to read and process the file. It reads the
file as ByteString, decodes the UTF-8 content to
Text, splits the content to lines, and processes the
lines.
main :: IO ()
main
    = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath
    .| CC.decodeUtf8Lenient
    .| CC.linesUnbounded
    .| CC.foldl countFemale 0
The results on my system are as follows.
$ stack exec bench-cbs
12620227
Wall clock time: 68.559410489s
Maximum residency: 104.5 KB
Maximum slop: 14.4 KB
Productivity (wall clock time): 96.2 %
Productivity (CPU time): 96.4 %
Conduit ByteString Lines
This benchmark, implemented in bench-cbslines.hs,
also uses conduit to read and process the file. It reads
the file as ByteString, splits the UTF-8 contents to lines,
decodes each line to Text, and processes the lines.
main :: IO ()
main
    = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath
    .| CC.linesUnboundedAscii
    .| CC.decodeUtf8Lenient
    .| CC.foldl countFemale 0
The results on my system are as follows.
$ stack exec bench-cbslines
12620227
Wall clock time: 40.910613058s
Maximum residency: 76.1 KB
Maximum slop: 18.5 KB
Productivity (wall clock time): 98.5 %
Productivity (CPU time): 98.6 %
There is an issue with this code! See Decoding UTF-8 in Haskell (Part 2) for an updated version.
Observations
The following table shows an overview of the results.
| Benchmark | Time (s) | Mem (KB) | Slop (KB) | Clock Prod (%) | CPU Prod (%) |
|---|---|---|---|---|---|
| text | 50.0 | 62.8 | 31.4 | 99.1 | 99.1 |
| bs | 43.1 | 124.6 | 27.0 | 99.2 | 99.2 |
| bslines | 17.0 | 96.2 | 30.8 | 98.8 | 98.9 |
| cbs | 68.6 | 104.5 | 14.4 | 96.2 | 96.4 |
| cbslines | 40.9 | 76.1 | 18.5 | 98.5 | 98.6 |
In these benchmarks, there is not a big difference in memory usage, but there is a big difference in runtime. Splitting the UTF-8 ByteString contents into lines and decoding each line separately is significantly faster: less than half the runtime of decoding the whole stream first. The gain is not as large when using conduit, however.
Benchmarking note: I ran all benchmarks multiple times and saw consistent results. The results shown above are from running each benchmark once more while writing this blog entry. (They are not hand-selected from multiple runs.)