Parsing Bounded Integers
I really like the numeric types that Haskell provides, but one must always keep in mind that the bounded integral types overflow silently, which can lead to bugs. I am using the attoparsec library to do some parsing and need to parse a bounded integral value such that overflow is considered an error. This blog entry discusses various implementations, with tests to demonstrate the issue.
The source code is available on GitHub.
Update: I realized that I missed a significant issue with the DecimalMaxSO and DecimalMaxFixed implementations. See the part 2 blog entry for details. Note that some tests are changed.
Decimal
The attoparsec library provides a decimal function that is implemented as follows:
decimal :: Integral a => Parser a
decimal = T.foldl' step 0 `fmap` takeWhile1 isDecimal
  where step a c = a * 10 + fromIntegral (ord c - 48)If this function is used to parse a bounded integer from a number
that is larger than the maximum bound, the accumulator overflows
silently and there is no failure. For example, consider the Word8
8-bit unsigned integer type, which can represent integers between 0 and
255. The following REPL session demonstrates the overflow that occurs
when parsing 1001.
λ: import Data.Word
λ: (minBound, maxBound) :: (Word8, Word8)
(0,255)
it :: (Word8, Word8)
λ: 100 * 10 + 1 :: Word8
233
it :: Word8The following function, which parses just a Word8 using
the decimal function, can be used for testing.
parseWord8 :: Text -> Either String Word8
parseWord8 = AT.parseOnly (AT.decimal <* AT.endOfInput)The test cases are as follows:
tests :: TestTree
tests = testGroup "Decimal"
    [ testCase "empty" $ Left "not enough input" @=? parseWord8 ""
    , testCase "zero" $ Right 0 @=? parseWord8 "0"
    , testCase "max" $ Right 255 @=? parseWord8 "255"
    , testCase "large1" $ Left "Failed reading: overflow" @=? parseWord8 "256"
    , testCase "large2" $
        Left "Failed reading: overflow" @=? parseWord8 "1001"
    , testCase "negative" $
        Left "Failed reading: takeWhile1" @=? parseWord8 "-1"
    , testCase "imaginary" $ Left "endOfInput" @=? parseWord8 "5i"
    ]Running the tests results in the following output:
Decimal
  empty:         OK
  zero:          OK
  max:           OK
  large1:        FAIL
    test/Decimal/Test.hs:32:
    expected: Left "Failed reading: overflow"
     but got: Right 0
    Use -p '/Decimal.large1/' to rerun this test only.
  large2:        FAIL
    test/Decimal/Test.hs:34:
    expected: Left "Failed reading: overflow"
     but got: Right 233
    Use -p '/Decimal.large2/' to rerun this test only.
  negative:      OK
  imaginary:     OKDecimalBounded
This issue can be avoided by first parsing to Integer
and then checking that the value is not too large before
converting to the bounded type. This works because the
Integer type is unbounded. The following function
implements this strategy. Note that it requires the ScopedTypeVariables
extension in order to use the maxBound for the target
type.
boundedDecimal :: forall a. (Bounded a, Integral a) => AT.Parser a
boundedDecimal = do
    i <- AT.decimal
    if (i :: Integer) > fromIntegral (maxBound :: a)
      then fail "overflow"
      else return $ fromIntegral iThe parseWord8 function is changed to use the above
function.
parseWord8 :: Text -> Either String Word8
parseWord8 = AT.parseOnly (boundedDecimal <* AT.endOfInput)Run against the same tests, all tests pass.
DecimalBounded
  empty:         OK
  zero:          OK
  max:           OK
  large1:        OK
  large2:        OK
  negative:      OK
  imaginary:     OKDecimalBoundedStrict
I need the parser to support two different modes. In strict mode, the string that is parsed must be equal to the string that is rendered from the parsed value, forming an isomorphism. In particular, leading zeros are not allowed in strict mode. In lenient mode, leading zeros are allowed.
The following function implements a strict parser that does not accept leading zeros. When the first character is a zero character, it looks for additional digits and throws a “leading zero” error if there are any. This results in more helpful feedback than the error resulting from stopping parsing after the leading zero, returning zero, and then failing with unexpected digits.
strictDecimal :: forall a. Integral a => AT.Parser a
strictDecimal = (<?> "strictDecimal") $ do
    d <- AT.digit
    ds <- AT.takeWhile isDigit
    if d == '0' && not (T.null ds)
      then fail "leading zero"
      else return $ T.foldl' (\a c -> a * 10 + fromChar c) (fromChar d) ds
  where
    fromChar :: Char -> a
    fromChar c = fromIntegral $ ord c - 48
    {-# INLINE fromChar #-}The boundedDecimal function is changed to use the above
function if the isStrict flag is set. The
decimal function is used otherwise.
boundedDecimal :: forall a. (Bounded a, Integral a) => Bool -> AT.Parser a
boundedDecimal isStrict = (<?> "boundedDecimal") $ do
    i <- if isStrict then strictDecimal else AT.decimal <?> "decimal"
    if (i :: Integer) > fromIntegral (maxBound :: a)
      then fail "overflow"
      else return $ fromIntegral iThe parseWord8 function is changed to pass on the
isStrict flag.
parseWord8 :: Bool -> Text -> Either String Word8
parseWord8 isStrict = AT.parseOnly (boundedDecimal isStrict <* AT.endOfInput)The strict mode test cases are as follows:
strictTests :: TestTree
strictTests = testGroup "Strict"
    [ testCase "empty" $
        Left "boundedDecimal > strictDecimal > digit: not enough input"
          @=? parseWord8 True ""
    , testCase "zero" $ Right 0 @=? parseWord8 True "0"
    , testCase "max" $ Right 255 @=? parseWord8 True "255"
    , testCase "large1" $
        Left "boundedDecimal: Failed reading: overflow"
          @=? parseWord8 True "256"
    , testCase "large2" $
        Left "boundedDecimal: Failed reading: overflow"
          @=? parseWord8 True "1001"
    , testCase "negative" $
        Left "boundedDecimal > strictDecimal > digit: Failed reading: satisfy"
          @=? parseWord8 True "-1"
    , testCase "imaginary" $ Left "endOfInput" @=? parseWord8 True "5i"
    , testCase "leadingZero" $
        Left "boundedDecimal > strictDecimal: Failed reading: leading zero"
          @=? parseWord8 True "042"
    ]The lenient mode test cases are as follows:
lenientTests :: TestTree
lenientTests = testGroup "Lenient"
    [ testCase "empty" $
        Left "boundedDecimal > decimal: not enough input"
          @=? parseWord8 False ""
    , testCase "zero" $ Right 0 @=? parseWord8 False "0"
    , testCase "max" $ Right 255 @=? parseWord8 False "255"
    , testCase "large1" $
        Left "boundedDecimal: Failed reading: overflow"
          @=? parseWord8 False "256"
    , testCase "large2" $
        Left "boundedDecimal: Failed reading: overflow"
          @=? parseWord8 False "1001"
    , testCase "negative" $
        Left "boundedDecimal > decimal: Failed reading: takeWhile1"
          @=? parseWord8 False "-1"
    , testCase "imaginary" $ Left "endOfInput" @=? parseWord8 False "5i"
    , testCase "leadingZero" $ Right 42 @=? parseWord8 False "042"
    ]All of the tests pass:
DecimalBoundedStrict
  Strict
    empty:       OK
    zero:        OK
    max:         OK
    large1:      OK
    large2:      OK
    negative:    OK
    imaginary:   OK
    leadingZero: OK
  Lenient
    empty:       OK
    zero:        OK
    max:         OK
    large1:      OK
    large2:      OK
    negative:    OK
    imaginary:   OK
    leadingZero: OKNote that attoparsec error messages are intended for developers. If you need better error messages, consider using the megaparsec library instead.
DecimalMaxSO
The boundedDecimal function may do too much work. For
example, the Integer value for 100 digits is computed while
it is clear after three digits that there is a Word8
overflow. It is possible to stop parsing by using the scan
function. I found an existing
solution on Stack Overflow.
decimalMax :: Integral a => Integer -> Parser a
decimalMax dMax = do
  let numDigs = ceiling $ log (fromIntegral(dMax+1)) / log 10
      getVal = foldl' (\s d -> s*10+fromIntegral (d-48)) 0 . B.unpack
  val <- getVal <$> scan 0 (\n c ->
          if n > numDigs || not (isDigit c) then Nothing else Just (n+1))
  if val <= dMax
    then return $ fromIntegral val
    else fail $ "decimalMax: parsed decimal exceeded" ++ show dMaxAs is common with code found online, this implementation has some issues. Some ambiguous types lead to poor defaults, so I added some type annotations in my test code to get rid of the compilation warnings. More significantly, the parser can only fail when the parsed value exceeds the specified maximum. If a non-digit is seen, or there is no input at all, the parser returns zero without failure. This can lead to confusing error messages!
The parseWord8 function is changed to call this function
with the appropriate maximum value.
parseWord8 :: ByteString -> Either String Word8
parseWord8 = ABS8.parseOnly $
    decimalMax (fromIntegral @Word8 maxBound) <* ABS8.endOfInputThe test cases are the same as above, though some error messages are updated. Running the tests results in the following output:
DecimalMaxSO
  empty:         FAIL
    test/DecimalMaxSO/Test.hs:35:
    expected: Left "Failed reading: expected decimal"
     but got: Right 0
    Use -p '/DecimalMaxSO.empty/' to rerun this test only.
  zero:          OK
  max:           OK
  large1:        OK
  large1:        OK
  negative:      FAIL
    test/DecimalMaxSO/Test.hs:45:
    expected: Left "Failed reading: expected decimal"
     but got: Left "endOfInput"
    Use -p '/DecimalMaxSO.negative/' to rerun this test only.
  imaginary:     OKThe empty test shows that the empty string is indeed
parsed as a zero. The negative test parses string
-1. The decimalMax function sees the
- character and successfully parses a zero without
consuming any characters. The endOfInput function then
fails because input -1 remains.
DecimalMaxFixed
It is easy to fix the decimalMax function.
decimalMax :: Integral a => Integer -> ABS8.Parser a
decimalMax valueMax = (<?> "decimalMax") $ do
    bs <- ABS8.scan 0 $ \numDigits c ->
      if numDigits <= numDigitsMax && isDigit c
        then Just $ numDigits + 1
        else Nothing
    when (BS.null bs) $ fail "expected decimal"
    let value = BS.foldl' (\acc d -> acc * 10 + fromIntegral (d - 48)) 0 bs
    if value > valueMax
      then fail $ "decimal exceeds " ++ show valueMax
      else return $ fromIntegral value
  where
    numDigitsMax :: Int
    numDigitsMax = ceiling @Float $ logBase 10 (fromIntegral valueMax + 1)Aside from adding type annotations and using different variable names, the changes are as follows:
- The ByteStringreturned by the scan is captured, allowing an error to be thrown when it is empty.
- The (<?>)operator is used to add context to all error messages instead of writing the context in each error message separately.
- Using BS.foldl'is preferred over usingfoldl'andBS.unpack.
- Using logBaseis preferred over usinglog a / log b.
- The condition in the scan is inverted so that the notcall is not needed.
- In calculating numDigitsMax,valueMaxis converted to aFloatbefore incrementing it, making the increment a native floating point operation instead of an arbitrary precision operation.
The test cases are the same as above, though some error messages are updated. All of the tests pass:
DecimalMaxFixed
  empty:         OK
  zero:          OK
  max:           OK
  large1:        OK
  large1:        OK
  negative:      OK
  imaginary:     OKDecimalMaxStrict
The decimalMax parser now works, but I need support for
strict and lenient parsing. I also need to parse Unicode text, not ASCII
bytes. I wrote my own implementation to address these issues, and it
takes a different approach.
decimalMax :: (Integral a, Integral b) => Bool -> a -> AT.Parser b
decimalMax isStrict valueMax = (<?> "decimalMax") $ do
    (t, mN) <- AT.runScanner (Just 0) $ \mN c -> case mN of
      Just n
        | isDigit c -> Just $
            let n' = n * 10 + fromIntegral (ord c - 48)
            in  if n' > valueMax then Nothing else Just n'
        | otherwise -> Nothing
      Nothing -> Nothing
    case (mN, T.uncons t) of
      (Just n, Just (d, ds))
        | isStrict && d == '0' && not (T.null ds) -> fail "leading zero"
        | otherwise -> return $ fromIntegral n
      (Just{}, Nothing) -> fail "expected decimal"
      (Nothing, _digits) -> fail "overflow"Instead of counting digits, this version calculates the value as the
digits are scanned. The runScanner
function is used, so the state is returned as well as the parsed string.
The state needs to represent the parsed value as well as an overflow, so
a Maybe type is used where Nothing indicates
an overflow.
Previous versions of decimalMax used the unbounded
Integer type to accumulate the value during parsing. This
implementation allows the developer to choose which type to use, as the
type just needs to be sufficiently larger than the target type. The
word8 parser uses an Word. A
Word16 would suffice, but the native word size is preferred
for good performance.
word8 :: Bool -> AT.Parser Word8
word8 isStrict = decimalMax isStrict $ fromIntegral @Word8 @Word maxBoundThe parseWord8 function is changed to call this
function, passing on the isStrict flag.
parseWord8 :: Bool -> Text -> Either String Word8
parseWord8 isStrict = AT.parseOnly (word8 isStrict <* AT.endOfInput)The tests are the same as those for DecimalBoundedStrict, with slightly different error messages. All of the tests pass:
DecimalMaxStrict
  Strict
    empty:       OK
    zero:        OK
    max:         OK
    large1:      OK
    large2:      OK
    negative:    OK
    imaginary:   OK
    leadingZero: OK
  Lenient
    empty:       OK
    zero:        OK
    max:         OK
    large1:      OK
    large2:      OK
    negative:    OK
    imaginary:   OK
    leadingZero: OKBenchmarks
The DecimalBoundedStrict and DecimalMaxStrict versions are both sufficient for my needs. How do they compare when benchmarked? I prepared some benchmarks, and DecimalMaxStrict consistently performs better than DecimalBoundedStrict on my computer.