Parsing Bounded Integers

I really like the numeric types that Haskell provides, but one must always keep in mind that the bounded integral types overflow silently, which can lead to bugs. I am using the attoparsec library to do some parsing and need to parse a bounded integral value such that overflow is considered an error. This blog entry discusses various implementations, with tests to demonstrate the issue.

The source code is available on GitHub.

Update: I realized that I missed a significant issue with the DecimalMaxSO and DecimalMaxFixed implementations. See the part 2 blog entry for details. Note that some tests are changed.

Decimal

The attoparsec library provides a decimal function that is implemented as follows:

decimal :: Integral a => Parser a
decimal = T.foldl' step 0 `fmap` takeWhile1 isDecimal
  where step a c = a * 10 + fromIntegral (ord c - 48)

If this function is used to parse a bounded integer from a number that is larger than the maximum bound, the accumulator overflows silently and there is no failure. For example, consider the Word8 8-bit unsigned integer type, which can represent integers between 0 and 255. The following REPL session demonstrates the overflow that occurs when parsing 1001.

λ: import Data.Word
λ: (minBound, maxBound) :: (Word8, Word8)
(0,255)
it :: (Word8, Word8)
λ: 100 * 10 + 1 :: Word8
233
it :: Word8

The following function, which parses just a Word8 using the decimal function, can be used for testing.

parseWord8 :: Text -> Either String Word8
parseWord8 = AT.parseOnly (AT.decimal <* AT.endOfInput)

The test cases are as follows:

tests :: TestTree
tests = testGroup "Decimal"
    [ testCase "empty" $ Left "not enough input" @=? parseWord8 ""
    , testCase "zero" $ Right 0 @=? parseWord8 "0"
    , testCase "max" $ Right 255 @=? parseWord8 "255"
    , testCase "large1" $ Left "Failed reading: overflow" @=? parseWord8 "256"
    , testCase "large2" $
        Left "Failed reading: overflow" @=? parseWord8 "1001"
    , testCase "negative" $
        Left "Failed reading: takeWhile1" @=? parseWord8 "-1"
    , testCase "imaginary" $ Left "endOfInput" @=? parseWord8 "5i"
    ]

Running the tests results in the following output:

Decimal
  empty:         OK
  zero:          OK
  max:           OK
  large1:        FAIL
    test/Decimal/Test.hs:32:
    expected: Left "Failed reading: overflow"
     but got: Right 0
    Use -p '/Decimal.large1/' to rerun this test only.
  large2:        FAIL
    test/Decimal/Test.hs:34:
    expected: Left "Failed reading: overflow"
     but got: Right 233
    Use -p '/Decimal.large2/' to rerun this test only.
  negative:      OK
  imaginary:     OK

DecimalBounded

This issue can be avoided by first parsing to Integer and then checking that the value is not too large before converting to the bounded type. This works because the Integer type is unbounded. The following function implements this strategy. Note that it requires the ScopedTypeVariables extension in order to use the maxBound for the target type.

boundedDecimal :: forall a. (Bounded a, Integral a) => AT.Parser a
boundedDecimal = do
    i <- AT.decimal
    if (i :: Integer) > fromIntegral (maxBound :: a)
      then fail "overflow"
      else return $ fromIntegral i

The parseWord8 function is changed to use the above function.

parseWord8 :: Text -> Either String Word8
parseWord8 = AT.parseOnly (boundedDecimal <* AT.endOfInput)

Run against the same tests, all tests pass.

DecimalBounded
  empty:         OK
  zero:          OK
  max:           OK
  large1:        OK
  large2:        OK
  negative:      OK
  imaginary:     OK

DecimalBoundedStrict

I need the parser to support two different modes. In strict mode, the string that is parsed must be equal to the string that is rendered from the parsed value, forming an isomorphism. In particular, leading zeros are not allowed in strict mode. In lenient mode, leading zeros are allowed.

The following function implements a strict parser that does not accept leading zeros. When the first character is a zero character, it looks for additional digits and throws a “leading zero” error if there are any. This results in more helpful feedback than the error resulting from stopping parsing after the leading zero, returning zero, and then failing with unexpected digits.

strictDecimal :: forall a. Integral a => AT.Parser a
strictDecimal = (<?> "strictDecimal") $ do
    d <- AT.digit
    ds <- AT.takeWhile isDigit
    if d == '0' && not (T.null ds)
      then fail "leading zero"
      else return $ T.foldl' (\a c -> a * 10 + fromChar c) (fromChar d) ds
  where
    fromChar :: Char -> a
    fromChar c = fromIntegral $ ord c - 48
    {-# INLINE fromChar #-}

The boundedDecimal function is changed to use the above function if the isStrict flag is set. The decimal function is used otherwise.

boundedDecimal :: forall a. (Bounded a, Integral a) => Bool -> AT.Parser a
boundedDecimal isStrict = (<?> "boundedDecimal") $ do
    i <- if isStrict then strictDecimal else AT.decimal <?> "decimal"
    if (i :: Integer) > fromIntegral (maxBound :: a)
      then fail "overflow"
      else return $ fromIntegral i

The parseWord8 function is changed to pass on the isStrict flag.

parseWord8 :: Bool -> Text -> Either String Word8
parseWord8 isStrict = AT.parseOnly (boundedDecimal isStrict <* AT.endOfInput)

The strict mode test cases are as follows:

strictTests :: TestTree
strictTests = testGroup "Strict"
    [ testCase "empty" $
        Left "boundedDecimal > strictDecimal > digit: not enough input"
          @=? parseWord8 True ""
    , testCase "zero" $ Right 0 @=? parseWord8 True "0"
    , testCase "max" $ Right 255 @=? parseWord8 True "255"
    , testCase "large1" $
        Left "boundedDecimal: Failed reading: overflow"
          @=? parseWord8 True "256"
    , testCase "large2" $
        Left "boundedDecimal: Failed reading: overflow"
          @=? parseWord8 True "1001"
    , testCase "negative" $
        Left "boundedDecimal > strictDecimal > digit: Failed reading: satisfy"
          @=? parseWord8 True "-1"
    , testCase "imaginary" $ Left "endOfInput" @=? parseWord8 True "5i"
    , testCase "leadingZero" $
        Left "boundedDecimal > strictDecimal: Failed reading: leading zero"
          @=? parseWord8 True "042"
    ]

The lenient mode test cases are as follows:

lenientTests :: TestTree
lenientTests = testGroup "Lenient"
    [ testCase "empty" $
        Left "boundedDecimal > decimal: not enough input"
          @=? parseWord8 False ""
    , testCase "zero" $ Right 0 @=? parseWord8 False "0"
    , testCase "max" $ Right 255 @=? parseWord8 False "255"
    , testCase "large1" $
        Left "boundedDecimal: Failed reading: overflow"
          @=? parseWord8 False "256"
    , testCase "large2" $
        Left "boundedDecimal: Failed reading: overflow"
          @=? parseWord8 False "1001"
    , testCase "negative" $
        Left "boundedDecimal > decimal: Failed reading: takeWhile1"
          @=? parseWord8 False "-1"
    , testCase "imaginary" $ Left "endOfInput" @=? parseWord8 False "5i"
    , testCase "leadingZero" $ Right 42 @=? parseWord8 False "042"
    ]

All of the tests pass:

DecimalBoundedStrict
  Strict
    empty:       OK
    zero:        OK
    max:         OK
    large1:      OK
    large2:      OK
    negative:    OK
    imaginary:   OK
    leadingZero: OK
  Lenient
    empty:       OK
    zero:        OK
    max:         OK
    large1:      OK
    large2:      OK
    negative:    OK
    imaginary:   OK
    leadingZero: OK

Note that attoparsec error messages are intended for developers. If you need better error messages, consider using the megaparsec library instead.

DecimalMaxSO

The boundedDecimal function may do too much work. For example, the Integer value for 100 digits is computed while it is clear after three digits that there is a Word8 overflow. It is possible to stop parsing by using the scan function. I found an existing solution on Stack Overflow.

decimalMax :: Integral a => Integer -> Parser a
decimalMax dMax = do
  let numDigs = ceiling $ log (fromIntegral(dMax+1)) / log 10
      getVal = foldl' (\s d -> s*10+fromIntegral (d-48)) 0 . B.unpack
  val <- getVal <$> scan 0 (\n c ->
          if n > numDigs || not (isDigit c) then Nothing else Just (n+1))
  if val <= dMax
    then return $ fromIntegral val
    else fail $ "decimalMax: parsed decimal exceeded" ++ show dMax

As is common with code found online, this implementation has some issues. Some ambiguous types lead to poor defaults, so I added some type annotations in my test code to get rid of the compilation warnings. More significantly, the parser can only fail when the parsed value exceeds the specified maximum. If a non-digit is seen, or there is no input at all, the parser returns zero without failure. This can lead to confusing error messages!

The parseWord8 function is changed to call this function with the appropriate maximum value.

parseWord8 :: ByteString -> Either String Word8
parseWord8 = ABS8.parseOnly $
    decimalMax (fromIntegral @Word8 maxBound) <* ABS8.endOfInput

The test cases are the same as above, though some error messages are updated. Running the tests results in the following output:

DecimalMaxSO
  empty:         FAIL
    test/DecimalMaxSO/Test.hs:35:
    expected: Left "Failed reading: expected decimal"
     but got: Right 0
    Use -p '/DecimalMaxSO.empty/' to rerun this test only.
  zero:          OK
  max:           OK
  large1:        OK
  large1:        OK
  negative:      FAIL
    test/DecimalMaxSO/Test.hs:45:
    expected: Left "Failed reading: expected decimal"
     but got: Left "endOfInput"
    Use -p '/DecimalMaxSO.negative/' to rerun this test only.
  imaginary:     OK

The empty test shows that the empty string is indeed parsed as a zero. The negative test parses string -1. The decimalMax function sees the - character and successfully parses a zero without consuming any characters. The endOfInput function then fails because input -1 remains.

DecimalMaxFixed

It is easy to fix the decimalMax function.

decimalMax :: Integral a => Integer -> ABS8.Parser a
decimalMax valueMax = (<?> "decimalMax") $ do
    bs <- ABS8.scan 0 $ \numDigits c ->
      if numDigits <= numDigitsMax && isDigit c
        then Just $ numDigits + 1
        else Nothing
    when (BS.null bs) $ fail "expected decimal"
    let value = BS.foldl' (\acc d -> acc * 10 + fromIntegral (d - 48)) 0 bs
    if value > valueMax
      then fail $ "decimal exceeds " ++ show valueMax
      else return $ fromIntegral value
  where
    numDigitsMax :: Int
    numDigitsMax = ceiling @Float $ logBase 10 (fromIntegral valueMax + 1)

Aside from adding type annotations and using different variable names, the changes are as follows:

The ByteString returned by the scan is captured, allowing an error to be thrown when it is empty.
The (<?>) operator is used to add context to all error messages instead of writing the context in each error message separately.
Using BS.foldl' is preferred over using foldl' and BS.unpack.
Using logBase is preferred over using log a / log b.
The condition in the scan is inverted so that the not call is not needed.
In calculating numDigitsMax, valueMax is converted to a Float before incrementing it, making the increment a native floating point operation instead of an arbitrary precision operation.

The test cases are the same as above, though some error messages are updated. All of the tests pass:

DecimalMaxFixed
  empty:         OK
  zero:          OK
  max:           OK
  large1:        OK
  large1:        OK
  negative:      OK
  imaginary:     OK

DecimalMaxStrict

The decimalMax parser now works, but I need support for strict and lenient parsing. I also need to parse Unicode text, not ASCII bytes. I wrote my own implementation to address these issues, and it takes a different approach.

decimalMax :: (Integral a, Integral b) => Bool -> a -> AT.Parser b
decimalMax isStrict valueMax = (<?> "decimalMax") $ do
    (t, mN) <- AT.runScanner (Just 0) $ \mN c -> case mN of
      Just n
        | isDigit c -> Just $
            let n' = n * 10 + fromIntegral (ord c - 48)
            in  if n' > valueMax then Nothing else Just n'
        | otherwise -> Nothing
      Nothing -> Nothing
    case (mN, T.uncons t) of
      (Just n, Just (d, ds))
        | isStrict && d == '0' && not (T.null ds) -> fail "leading zero"
        | otherwise -> return $ fromIntegral n
      (Just{}, Nothing) -> fail "expected decimal"
      (Nothing, _digits) -> fail "overflow"

Instead of counting digits, this version calculates the value as the digits are scanned. The runScanner function is used, so the state is returned as well as the parsed string. The state needs to represent the parsed value as well as an overflow, so a Maybe type is used where Nothing indicates an overflow.

Previous versions of decimalMax used the unbounded Integer type to accumulate the value during parsing. This implementation allows the developer to choose which type to use, as the type just needs to be sufficiently larger than the target type. The word8 parser uses an Word. A Word16 would suffice, but the native word size is preferred for good performance.

word8 :: Bool -> AT.Parser Word8
word8 isStrict = decimalMax isStrict $ fromIntegral @Word8 @Word maxBound

The parseWord8 function is changed to call this function, passing on the isStrict flag.

parseWord8 :: Bool -> Text -> Either String Word8
parseWord8 isStrict = AT.parseOnly (word8 isStrict <* AT.endOfInput)

The tests are the same as those for DecimalBoundedStrict, with slightly different error messages. All of the tests pass:

DecimalMaxStrict
  Strict
    empty:       OK
    zero:        OK
    max:         OK
    large1:      OK
    large2:      OK
    negative:    OK
    imaginary:   OK
    leadingZero: OK
  Lenient
    empty:       OK
    zero:        OK
    max:         OK
    large1:      OK
    large2:      OK
    negative:    OK
    imaginary:   OK
    leadingZero: OK

Benchmarks

The DecimalBoundedStrict and DecimalMaxStrict versions are both sufficient for my needs. How do they compare when benchmarked? I prepared some benchmarks, and DecimalMaxStrict consistently performs better than DecimalBoundedStrict on my computer.

Valid Strict Data

Valid Lenient Data

Overflow Strict Data

Overflow Lenient Data

Author

Travis Cardwell

Published

October 20, 2021

Revised

October 21, 2021