Decoding UTF-8 in Haskell (Part 2)
I made some progress on LiterateX yesterday, implementing the optimization discussed in Decoding UTF-8 In Haskell as well as adding support for GitHub Flavored Markdown. In doing so, I discovered an issue in the Conduit ByteString Lines approach!
The Issue
When a UTF-8 ByteString is split into lines and each line is
then decoded to Text using decodeUtf8Lenient, blank lines are
lost! The following code, implemented in issue-demo.hs,
is a simple demonstration of the issue.
source :: ByteString
source = BS8.unlines ["first", "", "", "last"]

main :: IO ()
main
    = mapM_ print
    . C.runConduitPure
    $ C.yield source
    .| CC.linesUnboundedAscii
    .| CC.decodeUtf8Lenient
    .| CC.sinkList

This code processes four lines, including two blank lines in the
middle. It splits the UTF-8 ByteString into lines, decodes each line to
Text, collects the results in a list, and prints the items
of the list. Running the demonstration shows that the blank lines are
lost.
$ stack exec issue-demo
"first"
"last"
The source code for decodeUtf8Lenient
makes it clear what is happening. The function decodes streams
of UTF-8 ByteString that may be split at arbitrary places,
even in the middle of a sequence of bytes that encodes a single Unicode
code point. Since linesUnboundedAscii strips newline
characters, blank lines become empty strings, and
decodeUtf8Lenient ignores empty strings.
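
To make the stream-decoding behavior concrete, here is a minimal sketch that uses only the text and bytestring packages. It is not the conduit implementation: decodeStream is a hypothetical helper that mimics a stream decoder by treating its input as arbitrary pieces of one continuous stream and yielding nothing for a piece that decodes to no text.

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- | Hypothetical helper (not part of conduit): decode a list of chunks as
-- one continuous UTF-8 stream, skipping any chunk that produces no text.
-- For brevity, bytes of an incomplete sequence left over after the final
-- chunk are silently dropped.
decodeStream :: [ByteString] -> [Text]
decodeStream = go (TE.streamDecodeUtf8With TEE.lenientDecode)
  where
    go _ [] = []
    go decode (chunk : rest) =
      case decode chunk of
        TE.Some text _undecoded next
          | T.null text -> go next rest         -- nothing decoded: yield nothing
          | otherwise   -> text : go next rest  -- yield the decoded text

main :: IO ()
main = do
  -- "é" is encoded as 0xC3 0xA9; the chunk boundary falls between the bytes.
  print $ decodeStream [BS.pack [0x63, 0x61, 0x66, 0xC3], BS.pack [0xA9]]
  -- prints ["caf","\233"]

  -- Lines with the newlines already stripped, as in the demonstration.
  print $ decodeStream ["first", "", "", "last"]
  -- prints ["first","last"]
```

For a stream decoder, an empty chunk carries no information, so skipping it is reasonable. The behavior only becomes a problem when each chunk is meant to represent a line.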
I did not notice this yesterday because the CSV data does not contain blank lines. The input in the project I worked on before did not contain blank lines either.
The Fix
The decodeUtf8Lenient function is not intended to work
with lines. The issue can be fixed by using a function that works with
lines instead. The following code, implemented in issue-fix.hs,
is a simple demonstration of the fix.
source :: ByteString
source = BS8.unlines ["first", "", "", "last"]

decodeUtf8LinesLenient
    :: Monad m
    => C.ConduitT ByteString Text m ()
decodeUtf8LinesLenient =
    C.awaitForever $ C.yield . TE.decodeUtf8With TEE.lenientDecode

main :: IO ()
main
    = mapM_ print
    . C.runConduitPure
    $ C.yield source
    .| CC.linesUnboundedAscii
    .| decodeUtf8LinesLenient
    .| CC.sinkList

The decodeUtf8LinesLenient function simply decodes each
line, yielding every line, even if it is empty. Running the
demonstration shows that the blank lines are no longer lost.
$ stack exec issue-fix
"first"
""
""
"last"
Updated Benchmark
The updated benchmark program, implemented in bench-cbslines.hs,
is as follows.
decodeUtf8LinesLenient
    :: Monad m
    => C.ConduitT ByteString Text m ()
decodeUtf8LinesLenient =
    C.awaitForever $ C.yield . TE.decodeUtf8With TEE.lenientDecode

main :: IO ()
main
    = benchmark
    $ (print =<<)
    $ C.runConduitRes
    $ CC.sourceFile dataPath
    .| CC.linesUnboundedAscii
    .| decodeUtf8LinesLenient
    .| CC.foldl countFemale 0

The results on my system are as follows.
$ stack exec bench-cbslines
12620227
Wall clock time: 37.579308932s
Maximum residency: 71.4 KB
Maximum slop: 18.7 KB
Productivity (wall clock time): 99.0 %
Productivity (CPU time): 99.1 %
Compared with the previous version of this benchmark, there is a small but consistent improvement in run time and productivity.