Decoding UTF-8 in Haskell (Part 2)

I made some progress on LiterateX yesterday, implementing the optimization discussed in Decoding UTF-8 In Haskell as well as adding support for GitHub Flavored Markdown. In doing so, I discovered an issue in the Conduit ByteString Lines approach!

The Issue

When splitting a UTF-8 ByteString into lines and then decoding each line to Text using decodeUtf8Lenient, blank lines are lost! The following code, implemented in issue-demo.hs, is a simple demonstration of the issue.

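{-# LANGUAGE OverloadedStrings #-}

-- Imports inferred from the qualified names used below.
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BS8
import Data.Conduit ((.|))
import qualified Data.Conduit as C
import qualified Data.Conduit.Combinators as CC

source :: ByteString
source = BS8.unlines ["first", "", "", "last"]

main :: IO ()
main
    = mapM_ print
    . C.runConduitPure
    $ C.yield source
    .| CC.linesUnboundedAscii
    .| CC.decodeUtf8Lenient
    .| CC.sinkList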

This code processes four lines, including two blank lines in the middle. It splits the UTF-8 ByteString into lines, decodes each line to Text, collects the results in a list, and prints the items of the list. Running this demonstration shows that the blank lines are lost.

$ stack exec issue-demo
"first"
"last"

The source code for decodeUtf8Lenient makes it clear what is happening. The function decodes a stream of UTF-8 ByteString chunks that may be split at arbitrary places, even in the middle of the byte sequence that encodes a single Unicode code point. Since linesUnboundedAscii strips newline characters, blank lines become empty strings, and decodeUtf8Lenient ignores empty strings.
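To illustrate why a stream decoder has to treat its input as arbitrary chunks, here is a minimal sketch that uses only the text and bytestring packages. It is not the conduit implementation; the names chunks, decodeEachChunk, and decodeStream are hypothetical and exist only for this example.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- U+00E9 (e with acute accent) is encoded in UTF-8 as the two bytes
-- 0xC3 0xA9.  Here the two bytes arrive in separate chunks.
chunks :: [BS.ByteString]
chunks = [BS.pack [0xC3], BS.pack [0xA9]]

-- Decode each chunk on its own.  Neither half of the split sequence is
-- valid by itself, so lenient decoding cannot recover the character.
decodeEachChunk :: [BS.ByteString] -> [T.Text]
decodeEachChunk = map (TE.decodeUtf8With TEE.lenientDecode)

-- Decode the chunks as one stream.  Undecoded trailing bytes are carried
-- over to the next chunk, and empty output is dropped.  (Hypothetical
-- helper; bytes left over at the end of the stream are not flushed.)
decodeStream :: [BS.ByteString] -> [T.Text]
decodeStream = go (TE.streamDecodeUtf8With TEE.lenientDecode)
  where
    go _ [] = []
    go decode (bs : rest) =
      case decode bs of
        TE.Some t _ next
          | T.null t  -> go next rest
          | otherwise -> t : go next rest

main :: IO ()
main = do
  print (decodeEachChunk chunks)
  print (decodeStream chunks)
```

In this sketch, the first chunk produces no text because its byte is held back until the rest of the sequence arrives. For a stream of arbitrary chunks, dropping that empty output loses nothing, but the same behavior discards information when each chunk is a line and an empty chunk means a blank line.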

I did not notice this yesterday because CSV files do not contain blank lines. The project I worked on before also did not have blank lines.

The Fix

The decodeUtf8Lenient function is not intended to work with lines. The issue can be fixed by using a function that works with lines instead. The following code, implemented in issue-fix.hs, is a simple demonstration of the fix.

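{-# LANGUAGE OverloadedStrings #-}

-- Imports inferred from the qualified names used below.
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BS8
import Data.Conduit ((.|))
import qualified Data.Conduit as C
import qualified Data.Conduit.Combinators as CC
import Data.Text (Text)
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

source :: ByteString
source = BS8.unlines ["first", "", "", "last"]

-- Decode every incoming line on its own, yielding empty lines too.
decodeUtf8LinesLenient
  :: Monad m
  => C.ConduitT ByteString Text m ()
decodeUtf8LinesLenient =
    C.awaitForever $ C.yield . TE.decodeUtf8With TEE.lenientDecode

main :: IO ()
main
    = mapM_ print
    . C.runConduitPure
    $ C.yield source
    .| CC.linesUnboundedAscii
    .| decodeUtf8LinesLenient
    .| CC.sinkList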

The decodeUtf8LinesLenient function simply decodes each line, yielding every line, even if it is empty. Running this demonstration shows that the blank lines are no longer lost.

$ stack exec issue-fix
"first"
""
""
"last"
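Decoding each line separately is safe for multi-byte characters because every byte of a multi-byte UTF-8 sequence has its high bit set, so the newline byte 0x0A never occurs inside one. The following conduit-free sketch checks this on a small sample. BS8.lines stands in for linesUnboundedAscii here, which is an assumption for illustration, and the names sample and decodeLines are hypothetical.

```haskell
import qualified Data.ByteString.Char8 as BS8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- Sample text with multi-byte characters and two blank lines.
-- "\239" is U+00EF; "\26085\26412\35486" are three CJK characters.
sample :: T.Text
sample = T.unlines (map T.pack ["na\239ve", "", "", "\26085\26412\35486"])

-- Split the encoded bytes at newline bytes first, then decode each line
-- on its own, keeping empty lines.
decodeLines :: BS8.ByteString -> [T.Text]
decodeLines = map (TE.decodeUtf8With TEE.lenientDecode) . BS8.lines

main :: IO ()
main = do
  let decoded = decodeLines (TE.encodeUtf8 sample)
  mapM_ print decoded
  -- Splitting bytes and decoding agrees with splitting the decoded text.
  print (decoded == T.lines sample)
```

The final comparison confirms that splitting the bytes and then decoding gives the same lines, blank ones included, as splitting the original text.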

Updated Benchmark

The updated benchmark program, implemented in bench-cbslines.hs, is as follows.

decodeUtf8LinesLenient
  :: Monad m
  => C.ConduitT ByteString Text m ()
decodeUtf8LinesLenient =
    C.awaitForever $ C.yield . TE.decodeUtf8With TEE.lenientDecode

main :: IO ()
main
    = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath
    .| CC.linesUnboundedAscii
    .| decodeUtf8LinesLenient
    .| CC.foldl countFemale 0

The results on my system are as follows.

$ stack exec bench-cbslines
12620227
Wall clock time: 37.579308932s
Maximum residency: 71.4 KB
Maximum slop: 18.7 KB
Productivity (wall clock time): 99.0 %
Productivity (CPU time): 99.1 %

The improvement in runtime and productivity is small but consistent.