Decoding UTF-8 in Haskell (Part 2)
I made some progress on LiterateX yesterday, implementing the optimization discussed in Decoding UTF-8 In Haskell as well as adding support for GitHub Flavored Markdown. In doing so, I discovered an issue in the Conduit ByteString Lines approach!
The Issue
When splitting the UTF-8 ByteString into lines and then decoding each line to Text using decodeUtf8Lenient, blank lines are lost! The following code, implemented in issue-demo.hs, is a simple demonstration of the issue.
source :: ByteString
source = BS8.unlines ["first", "", "", "last"]

main :: IO ()
main
  = mapM_ print
  . C.runConduitPure
  $ C.yield source
  .| CC.linesUnboundedAscii
  .| CC.decodeUtf8Lenient
  .| CC.sinkList
This code processes four lines, including two blank lines in the middle. It splits the UTF-8 ByteString into lines, decodes each one to Text, collects the results in a list, and prints the items of the list. Running the demonstration shows that the blank lines are lost.
$ stack exec issue-demo
"first"
"last"
The source code for decodeUtf8Lenient makes it clear what is happening. The function decodes streams of UTF-8 ByteString chunks that may be split at arbitrary places, even in the middle of a sequence of bytes that encodes a single Unicode code point. Since linesUnboundedAscii strips newline characters, blank lines become empty strings, and decodeUtf8Lenient ignores empty strings.
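The behavior can be reproduced without conduit. The following is a minimal sketch, not conduit's actual implementation: decodeChunksLenient, blankLinesDemo, and splitCodePointDemo are hypothetical names, and the model is built on streamDecodeUtf8With from the text package. It assumes that a chunk-oriented decoder skips empty input and emits nothing when no complete code point has been decoded yet, which is the behavior described above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- | Simplified model of a chunk-oriented lenient decoder (hypothetical
-- helper, not conduit's code).  Bytes still pending at the end of the
-- input are dropped here to keep the sketch short.
decodeChunksLenient :: [ByteString] -> [Text]
decodeChunksLenient = go (TE.streamDecodeUtf8With TEE.lenientDecode)
  where
    go _ [] = []
    go step (chunk : rest)
      -- An empty chunk carries no bytes, so a stream decoder skips it.
      | BS.null chunk = go step rest
      | otherwise =
          case step chunk of
            TE.Some text _pending next
              -- Nothing decoded yet (for example, only the first byte of
              -- a multi-byte sequence has arrived), so nothing is emitted.
              | T.null text -> go next rest
              | otherwise -> text : go next rest

-- | Blank lines reach the decoder as empty chunks and vanish.
blankLinesDemo :: [Text]
blankLinesDemo = decodeChunksLenient ["first", "", "", "last"]

-- | "é" is encoded as the bytes 0xC3 0xA9, split here across two chunks.
splitCodePointDemo :: Text
splitCodePointDemo =
  T.concat
    (decodeChunksLenient [BS.pack [0x63, 0x61, 0x66, 0xC3], BS.pack [0xA9]])
```

Loaded in GHCi, blankLinesDemo contains only the two non-empty items, while splitCodePointDemo still decodes the split code point correctly. For a stream decoder, an empty chunk is indistinguishable from no chunk at all, which is reasonable for arbitrary chunks but wrong for lines.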
I did not notice this yesterday because CSV files do not contain blank lines. The project I worked on before also did not have blank lines.
The Fix
The decodeUtf8Lenient function is not intended to work with lines. The issue can be fixed by using a function that works with lines instead. The following code, implemented in issue-fix.hs, is a simple demonstration of the fix.
source :: ByteString
source = BS8.unlines ["first", "", "", "last"]

decodeUtf8LinesLenient
  :: Monad m
  => C.ConduitT ByteString Text m ()
decodeUtf8LinesLenient =
  C.awaitForever $ C.yield . TE.decodeUtf8With TEE.lenientDecode

main :: IO ()
main
  = mapM_ print
  . C.runConduitPure
  $ C.yield source
  .| CC.linesUnboundedAscii
  .| decodeUtf8LinesLenient
  .| CC.sinkList
The decodeUtf8LinesLenient function simply decodes each line, yielding every line, even if it is empty. Running the demonstration shows that the blank lines are no longer lost.
$ stack exec issue-fix
"first"
""
""
"last"
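Decoding each line on its own is safe because the newline byte (0x0A) never occurs inside a multi-byte UTF-8 sequence, so splitting on it cannot cut a code point in half. The following sketch checks this with pure functions from the bytestring and text packages. The names decodeLenient, splitThenDecode, decodeThenSplit, and sample are hypothetical, and BS8.lines is used only as a pure stand-in for linesUnboundedAscii.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BS8
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- | Lenient decoding of one complete piece of input.
decodeLenient :: ByteString -> Text
decodeLenient = TE.decodeUtf8With TEE.lenientDecode

-- | Split on the newline byte first, then decode each line on its own.
splitThenDecode :: ByteString -> [Text]
splitThenDecode = map decodeLenient . BS8.lines

-- | Decode the whole input first, then split the text into lines.
decodeThenSplit :: ByteString -> [Text]
decodeThenSplit = T.lines . decodeLenient

-- | Sample input (made up for this sketch) with non-ASCII text and two
-- blank lines in the middle.
sample :: ByteString
sample = TE.encodeUtf8 (T.unlines ["naïve", "", "", "цель"])
```

For this sample, splitThenDecode and decodeThenSplit return the same four items, blank lines included, so the per-line approach loses nothing compared to decoding the whole input at once.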
Updated Benchmark
The updated benchmark program, implemented in bench-cbslines.hs, is as follows.
decodeUtf8LinesLenient
  :: Monad m
  => C.ConduitT ByteString Text m ()
decodeUtf8LinesLenient =
  C.awaitForever $ C.yield . TE.decodeUtf8With TEE.lenientDecode

main :: IO ()
main
  = benchmark
  $ (print =<<)
  $ C.runConduitRes
  $ CC.sourceFile dataPath
  .| CC.linesUnboundedAscii
  .| decodeUtf8LinesLenient
  .| CC.foldl countFemale 0
The results on my system are as follows.
$ stack exec bench-cbslines
12620227
Wall clock time: 37.579308932s
Maximum residency: 71.4 KB
Maximum slop: 18.7 KB
Productivity (wall clock time): 99.0 %
Productivity (CPU time): 99.1 %
There is a small but consistent improvement in runtime and productivity over the previous version of the benchmark.