Parsing Bounded Integers
I really like the numeric types that Haskell provides, but one must always keep in mind that the bounded integral types overflow silently, which can lead to bugs. I am using the attoparsec library to do some parsing and need to parse a bounded integral value such that overflow is considered an error. This blog entry discusses various implementations, with tests to demonstrate the issue.
The source code is available on GitHub.
Update: I realized that I missed a significant issue with the DecimalMaxSO and DecimalMaxFixed implementations. See the part 2 blog entry for details. Note that some tests are changed.
Decimal
The attoparsec library provides a decimal function that is implemented as follows:
decimal :: Integral a => Parser a
= T.foldl' step 0 `fmap` takeWhile1 isDecimal
decimal where step a c = a * 10 + fromIntegral (ord c - 48)
If this function is used to parse a bounded integer from a number
that is larger than the maximum bound, the accumulator overflows
silently and there is no failure. For example, consider the Word8
8-bit unsigned integer type, which can represent integers between 0 and
255. The following REPL session demonstrates the overflow that occurs
when parsing 1001
.
λ: import Data.Word
λ: (minBound, maxBound) :: (Word8, Word8)
(0,255)
it :: (Word8, Word8)
λ: 100 * 10 + 1 :: Word8
233
it :: Word8
The following function, which parses just a Word8
using
the decimal
function, can be used for testing.
parseWord8 :: Text -> Either String Word8
= AT.parseOnly (AT.decimal <* AT.endOfInput) parseWord8
The test cases are as follows:
tests :: TestTree
= testGroup "Decimal"
tests "empty" $ Left "not enough input" @=? parseWord8 ""
[ testCase "zero" $ Right 0 @=? parseWord8 "0"
, testCase "max" $ Right 255 @=? parseWord8 "255"
, testCase "large1" $ Left "Failed reading: overflow" @=? parseWord8 "256"
, testCase "large2" $
, testCase Left "Failed reading: overflow" @=? parseWord8 "1001"
"negative" $
, testCase Left "Failed reading: takeWhile1" @=? parseWord8 "-1"
"imaginary" $ Left "endOfInput" @=? parseWord8 "5i"
, testCase ]
Running the tests results in the following output:
Decimal
empty: OK
zero: OK
max: OK
large1: FAIL
test/Decimal/Test.hs:32:
expected: Left "Failed reading: overflow"
but got: Right 0
Use -p '/Decimal.large1/' to rerun this test only.
large2: FAIL
test/Decimal/Test.hs:34:
expected: Left "Failed reading: overflow"
but got: Right 233
Use -p '/Decimal.large2/' to rerun this test only.
negative: OK
imaginary: OK
DecimalBounded
This issue can be avoided by first parsing to Integer
and then checking that the value is not too large before
converting to the bounded type. This works because the
Integer
type is unbounded. The following function
implements this strategy. Note that it requires the ScopedTypeVariables
extension in order to use the maxBound
for the target
type.
boundedDecimal :: forall a. (Bounded a, Integral a) => AT.Parser a
= do
boundedDecimal <- AT.decimal
i if (i :: Integer) > fromIntegral (maxBound :: a)
then fail "overflow"
else return $ fromIntegral i
The parseWord8
function is changed to use the above
function.
parseWord8 :: Text -> Either String Word8
= AT.parseOnly (boundedDecimal <* AT.endOfInput) parseWord8
Run against the same tests, all tests pass.
DecimalBounded
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
DecimalBoundedStrict
I need the parser to support two different modes. In strict mode, the string that is parsed must be equal to the string that is rendered from the parsed value, forming an isomorphism. In particular, leading zeros are not allowed in strict mode. In lenient mode, leading zeros are allowed.
The following function implements a strict parser that does not accept leading zeros. When the first character is a zero character, it looks for additional digits and throws a “leading zero” error if there are any. This results in more helpful feedback than the error resulting from stopping parsing after the leading zero, returning zero, and then failing with unexpected digits.
strictDecimal :: forall a. Integral a => AT.Parser a
= (<?> "strictDecimal") $ do
strictDecimal <- AT.digit
d <- AT.takeWhile isDigit
ds if d == '0' && not (T.null ds)
then fail "leading zero"
else return $ T.foldl' (\a c -> a * 10 + fromChar c) (fromChar d) ds
where
fromChar :: Char -> a
= fromIntegral $ ord c - 48
fromChar c {-# INLINE fromChar #-}
The boundedDecimal
function is changed to use the above
function if the isStrict
flag is set. The
decimal
function is used otherwise.
boundedDecimal :: forall a. (Bounded a, Integral a) => Bool -> AT.Parser a
= (<?> "boundedDecimal") $ do
boundedDecimal isStrict <- if isStrict then strictDecimal else AT.decimal <?> "decimal"
i if (i :: Integer) > fromIntegral (maxBound :: a)
then fail "overflow"
else return $ fromIntegral i
The parseWord8
function is changed to pass on the
isStrict
flag.
parseWord8 :: Bool -> Text -> Either String Word8
= AT.parseOnly (boundedDecimal isStrict <* AT.endOfInput) parseWord8 isStrict
The strict mode test cases are as follows:
strictTests :: TestTree
= testGroup "Strict"
strictTests "empty" $
[ testCase Left "boundedDecimal > strictDecimal > digit: not enough input"
@=? parseWord8 True ""
"zero" $ Right 0 @=? parseWord8 True "0"
, testCase "max" $ Right 255 @=? parseWord8 True "255"
, testCase "large1" $
, testCase Left "boundedDecimal: Failed reading: overflow"
@=? parseWord8 True "256"
"large2" $
, testCase Left "boundedDecimal: Failed reading: overflow"
@=? parseWord8 True "1001"
"negative" $
, testCase Left "boundedDecimal > strictDecimal > digit: Failed reading: satisfy"
@=? parseWord8 True "-1"
"imaginary" $ Left "endOfInput" @=? parseWord8 True "5i"
, testCase "leadingZero" $
, testCase Left "boundedDecimal > strictDecimal: Failed reading: leading zero"
@=? parseWord8 True "042"
]
The lenient mode test cases are as follows:
lenientTests :: TestTree
= testGroup "Lenient"
lenientTests "empty" $
[ testCase Left "boundedDecimal > decimal: not enough input"
@=? parseWord8 False ""
"zero" $ Right 0 @=? parseWord8 False "0"
, testCase "max" $ Right 255 @=? parseWord8 False "255"
, testCase "large1" $
, testCase Left "boundedDecimal: Failed reading: overflow"
@=? parseWord8 False "256"
"large2" $
, testCase Left "boundedDecimal: Failed reading: overflow"
@=? parseWord8 False "1001"
"negative" $
, testCase Left "boundedDecimal > decimal: Failed reading: takeWhile1"
@=? parseWord8 False "-1"
"imaginary" $ Left "endOfInput" @=? parseWord8 False "5i"
, testCase "leadingZero" $ Right 42 @=? parseWord8 False "042"
, testCase ]
All of the tests pass:
DecimalBoundedStrict
Strict
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
leadingZero: OK
Lenient
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
leadingZero: OK
Note that attoparsec error messages are intended for developers. If you need better error messages, consider using the megaparsec library instead.
DecimalMaxSO
The boundedDecimal
function may do too much work. For
example, the Integer
value for 100 digits is computed while
it is clear after three digits that there is a Word8
overflow. It is possible to stop parsing by using the scan
function. I found an existing
solution on Stack Overflow.
decimalMax :: Integral a => Integer -> Parser a
= do
decimalMax dMax let numDigs = ceiling $ log (fromIntegral(dMax+1)) / log 10
= foldl' (\s d -> s*10+fromIntegral (d-48)) 0 . B.unpack
getVal <- getVal <$> scan 0 (\n c ->
val if n > numDigs || not (isDigit c) then Nothing else Just (n+1))
if val <= dMax
then return $ fromIntegral val
else fail $ "decimalMax: parsed decimal exceeded" ++ show dMax
As is common with code found online, this implementation has some issues. Some ambiguous types lead to poor defaults, so I added some type annotations in my test code to get rid of the compilation warnings. More significantly, the parser can only fail when the parsed value exceeds the specified maximum. If a non-digit is seen, or there is no input at all, the parser returns zero without failure. This can lead to confusing error messages!
The parseWord8
function is changed to call this function
with the appropriate maximum value.
parseWord8 :: ByteString -> Either String Word8
= ABS8.parseOnly $
parseWord8 fromIntegral @Word8 maxBound) <* ABS8.endOfInput decimalMax (
The test cases are the same as above, though some error messages are updated. Running the tests results in the following output:
DecimalMaxSO
empty: FAIL
test/DecimalMaxSO/Test.hs:35:
expected: Left "Failed reading: expected decimal"
but got: Right 0
Use -p '/DecimalMaxSO.empty/' to rerun this test only.
zero: OK
max: OK
large1: OK
large1: OK
negative: FAIL
test/DecimalMaxSO/Test.hs:45:
expected: Left "Failed reading: expected decimal"
but got: Left "endOfInput"
Use -p '/DecimalMaxSO.negative/' to rerun this test only.
imaginary: OK
The empty
test shows that the empty string is indeed
parsed as a zero. The negative
test parses string
-1
. The decimalMax
function sees the
-
character and successfully parses a zero without
consuming any characters. The endOfInput
function then
fails because input -1
remains.
DecimalMaxFixed
It is easy to fix the decimalMax
function.
decimalMax :: Integral a => Integer -> ABS8.Parser a
= (<?> "decimalMax") $ do
decimalMax valueMax <- ABS8.scan 0 $ \numDigits c ->
bs if numDigits <= numDigitsMax && isDigit c
then Just $ numDigits + 1
else Nothing
$ fail "expected decimal"
when (BS.null bs) let value = BS.foldl' (\acc d -> acc * 10 + fromIntegral (d - 48)) 0 bs
if value > valueMax
then fail $ "decimal exceeds " ++ show valueMax
else return $ fromIntegral value
where
numDigitsMax :: Int
= ceiling @Float $ logBase 10 (fromIntegral valueMax + 1) numDigitsMax
Aside from adding type annotations and using different variable names, the changes are as follows:
- The
ByteString
returned by the scan is captured, allowing an error to be thrown when it is empty. - The
(<?>)
operator is used to add context to all error messages instead of writing the context in each error message separately. - Using
BS.foldl'
is preferred over usingfoldl'
andBS.unpack
. - Using
logBase
is preferred over usinglog a / log b
. - The condition in the scan is inverted so that the
not
call is not needed. - In calculating
numDigitsMax
,valueMax
is converted to aFloat
before incrementing it, making the increment a native floating point operation instead of an arbitrary precision operation.
The test cases are the same as above, though some error messages are updated. All of the tests pass:
DecimalMaxFixed
empty: OK
zero: OK
max: OK
large1: OK
large1: OK
negative: OK
imaginary: OK
DecimalMaxStrict
The decimalMax
parser now works, but I need support for
strict and lenient parsing. I also need to parse Unicode text, not ASCII
bytes. I wrote my own implementation to address these issues, and it
takes a different approach.
decimalMax :: (Integral a, Integral b) => Bool -> a -> AT.Parser b
= (<?> "decimalMax") $ do
decimalMax isStrict valueMax <- AT.runScanner (Just 0) $ \mN c -> case mN of
(t, mN) Just n
| isDigit c -> Just $
let n' = n * 10 + fromIntegral (ord c - 48)
in if n' > valueMax then Nothing else Just n'
| otherwise -> Nothing
Nothing -> Nothing
case (mN, T.uncons t) of
Just n, Just (d, ds))
(| isStrict && d == '0' && not (T.null ds) -> fail "leading zero"
| otherwise -> return $ fromIntegral n
Just{}, Nothing) -> fail "expected decimal"
(Nothing, _digits) -> fail "overflow" (
Instead of counting digits, this version calculates the value as the
digits are scanned. The runScanner
function is used, so the state is returned as well as the parsed string.
The state needs to represent the parsed value as well as an overflow, so
a Maybe
type is used where Nothing
indicates
an overflow.
Previous versions of decimalMax
used the unbounded
Integer
type to accumulate the value during parsing. This
implementation allows the developer to choose which type to use, as the
type just needs to be sufficiently larger than the target type. The
word8
parser uses an Word
. A
Word16
would suffice, but the native word size is preferred
for good performance.
word8 :: Bool -> AT.Parser Word8
= decimalMax isStrict $ fromIntegral @Word8 @Word maxBound word8 isStrict
The parseWord8
function is changed to call this
function, passing on the isStrict
flag.
parseWord8 :: Bool -> Text -> Either String Word8
= AT.parseOnly (word8 isStrict <* AT.endOfInput) parseWord8 isStrict
The tests are the same as those for DecimalBoundedStrict, with slightly different error messages. All of the tests pass:
DecimalMaxStrict
Strict
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
leadingZero: OK
Lenient
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
leadingZero: OK
Benchmarks
The DecimalBoundedStrict and DecimalMaxStrict versions are both sufficient for my needs. How do they compare when benchmarked? I prepared some benchmarks, and DecimalMaxStrict consistently performs better than DecimalBoundedStrict on my computer.