# Parsing Bounded Integers

I really like the numeric types that Haskell provides, but one must always keep in mind that the bounded integral types overflow silently, which can lead to bugs. I am using the attoparsec library to do some parsing and need to parse a bounded integral value such that overflow is considered an error. This blog entry discusses various implementations, with tests to demonstrate the issue.

The source code is available on GitHub.

Update: I realized that I missed a significant issue with the DecimalMaxSO and DecimalMaxFixed implementations. See the part 2 blog entry for details. Note that some tests are changed.

## Decimal

The attoparsec library provides a decimal function that is implemented as follows:

```
decimal :: Integral a => Parser a
= T.foldl' step 0 `fmap` takeWhile1 isDecimal
decimal where step a c = a * 10 + fromIntegral (ord c - 48)
```

If this function is used to parse a bounded integer from a number
that is larger than the maximum bound, the accumulator overflows
silently and there is no failure. For example, consider the Word8
8-bit unsigned integer type, which can represent integers between 0 and
255. The following REPL session demonstrates the overflow that occurs
when parsing `1001`

.

```
λ: import Data.Word
λ: (minBound, maxBound) :: (Word8, Word8)
(0,255)
it :: (Word8, Word8)
λ: 100 * 10 + 1 :: Word8
233
it :: Word8
```

The following function, which parses just a `Word8`

using
the `decimal`

function, can be used for testing.

```
parseWord8 :: Text -> Either String Word8
= AT.parseOnly (AT.decimal <* AT.endOfInput) parseWord8
```

The test cases are as follows:

```
tests :: TestTree
= testGroup "Decimal"
tests "empty" $ Left "not enough input" @=? parseWord8 ""
[ testCase "zero" $ Right 0 @=? parseWord8 "0"
, testCase "max" $ Right 255 @=? parseWord8 "255"
, testCase "large1" $ Left "Failed reading: overflow" @=? parseWord8 "256"
, testCase "large2" $
, testCase Left "Failed reading: overflow" @=? parseWord8 "1001"
"negative" $
, testCase Left "Failed reading: takeWhile1" @=? parseWord8 "-1"
"imaginary" $ Left "endOfInput" @=? parseWord8 "5i"
, testCase ]
```

Running the tests results in the following output:

```
Decimal
empty: OK
zero: OK
max: OK
large1: FAIL
test/Decimal/Test.hs:32:
expected: Left "Failed reading: overflow"
but got: Right 0
Use -p '/Decimal.large1/' to rerun this test only.
large2: FAIL
test/Decimal/Test.hs:34:
expected: Left "Failed reading: overflow"
but got: Right 233
Use -p '/Decimal.large2/' to rerun this test only.
negative: OK
imaginary: OK
```

## DecimalBounded

This issue can be avoided by first parsing to `Integer`

and then checking that the value is not too large *before*
converting to the bounded type. This works because the
`Integer`

type is unbounded. The following function
implements this strategy. Note that it requires the ScopedTypeVariables
extension in order to use the `maxBound`

for the target
type.

```
boundedDecimal :: forall a. (Bounded a, Integral a) => AT.Parser a
= do
boundedDecimal <- AT.decimal
i if (i :: Integer) > fromIntegral (maxBound :: a)
then fail "overflow"
else return $ fromIntegral i
```

The `parseWord8`

function is changed to use the above
function.

```
parseWord8 :: Text -> Either String Word8
= AT.parseOnly (boundedDecimal <* AT.endOfInput) parseWord8
```

Run against the same tests, all tests pass.

```
DecimalBounded
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
```

## DecimalBoundedStrict

I need the parser to support two different modes. In *strict*
mode, the string that is parsed must be equal to the string that is
rendered from the parsed value, forming an isomorphism. In particular,
leading zeros are not allowed in strict mode. In *lenient* mode,
leading zeros are allowed.

The following function implements a strict parser that does not accept leading zeros. When the first character is a zero character, it looks for additional digits and throws a “leading zero” error if there are any. This results in more helpful feedback than the error resulting from stopping parsing after the leading zero, returning zero, and then failing with unexpected digits.

```
strictDecimal :: forall a. Integral a => AT.Parser a
= (<?> "strictDecimal") $ do
strictDecimal <- AT.digit
d <- AT.takeWhile isDigit
ds if d == '0' && not (T.null ds)
then fail "leading zero"
else return $ T.foldl' (\a c -> a * 10 + fromChar c) (fromChar d) ds
where
fromChar :: Char -> a
= fromIntegral $ ord c - 48
fromChar c {-# INLINE fromChar #-}
```

The `boundedDecimal`

function is changed to use the above
function if the `isStrict`

flag is set. The
`decimal`

function is used otherwise.

```
boundedDecimal :: forall a. (Bounded a, Integral a) => Bool -> AT.Parser a
= (<?> "boundedDecimal") $ do
boundedDecimal isStrict <- if isStrict then strictDecimal else AT.decimal <?> "decimal"
i if (i :: Integer) > fromIntegral (maxBound :: a)
then fail "overflow"
else return $ fromIntegral i
```

The `parseWord8`

function is changed to pass on the
`isStrict`

flag.

```
parseWord8 :: Bool -> Text -> Either String Word8
= AT.parseOnly (boundedDecimal isStrict <* AT.endOfInput) parseWord8 isStrict
```

The strict mode test cases are as follows:

```
strictTests :: TestTree
= testGroup "Strict"
strictTests "empty" $
[ testCase Left "boundedDecimal > strictDecimal > digit: not enough input"
@=? parseWord8 True ""
"zero" $ Right 0 @=? parseWord8 True "0"
, testCase "max" $ Right 255 @=? parseWord8 True "255"
, testCase "large1" $
, testCase Left "boundedDecimal: Failed reading: overflow"
@=? parseWord8 True "256"
"large2" $
, testCase Left "boundedDecimal: Failed reading: overflow"
@=? parseWord8 True "1001"
"negative" $
, testCase Left "boundedDecimal > strictDecimal > digit: Failed reading: satisfy"
@=? parseWord8 True "-1"
"imaginary" $ Left "endOfInput" @=? parseWord8 True "5i"
, testCase "leadingZero" $
, testCase Left "boundedDecimal > strictDecimal: Failed reading: leading zero"
@=? parseWord8 True "042"
]
```

The lenient mode test cases are as follows:

```
lenientTests :: TestTree
= testGroup "Lenient"
lenientTests "empty" $
[ testCase Left "boundedDecimal > decimal: not enough input"
@=? parseWord8 False ""
"zero" $ Right 0 @=? parseWord8 False "0"
, testCase "max" $ Right 255 @=? parseWord8 False "255"
, testCase "large1" $
, testCase Left "boundedDecimal: Failed reading: overflow"
@=? parseWord8 False "256"
"large2" $
, testCase Left "boundedDecimal: Failed reading: overflow"
@=? parseWord8 False "1001"
"negative" $
, testCase Left "boundedDecimal > decimal: Failed reading: takeWhile1"
@=? parseWord8 False "-1"
"imaginary" $ Left "endOfInput" @=? parseWord8 False "5i"
, testCase "leadingZero" $ Right 42 @=? parseWord8 False "042"
, testCase ]
```

All of the tests pass:

```
DecimalBoundedStrict
Strict
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
leadingZero: OK
Lenient
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
leadingZero: OK
```

Note that attoparsec error messages are intended for developers. If you need better error messages, consider using the megaparsec library instead.

## DecimalMaxSO

The `boundedDecimal`

function may do too much work. For
example, the `Integer`

value for 100 digits is computed while
it is clear after three digits that there is a `Word8`

overflow. It is possible to stop parsing by using the scan
function. I found an existing
solution on Stack Overflow.

```
decimalMax :: Integral a => Integer -> Parser a
= do
decimalMax dMax let numDigs = ceiling $ log (fromIntegral(dMax+1)) / log 10
= foldl' (\s d -> s*10+fromIntegral (d-48)) 0 . B.unpack
getVal <- getVal <$> scan 0 (\n c ->
val if n > numDigs || not (isDigit c) then Nothing else Just (n+1))
if val <= dMax
then return $ fromIntegral val
else fail $ "decimalMax: parsed decimal exceeded" ++ show dMax
```

As is common with code found online, this implementation has some issues. Some ambiguous types lead to poor defaults, so I added some type annotations in my test code to get rid of the compilation warnings. More significantly, the parser can only fail when the parsed value exceeds the specified maximum. If a non-digit is seen, or there is no input at all, the parser returns zero without failure. This can lead to confusing error messages!

The `parseWord8`

function is changed to call this function
with the appropriate maximum value.

```
parseWord8 :: ByteString -> Either String Word8
= ABS8.parseOnly $
parseWord8 fromIntegral @Word8 maxBound) <* ABS8.endOfInput decimalMax (
```

The test cases are the same as above, though some error messages are updated. Running the tests results in the following output:

```
DecimalMaxSO
empty: FAIL
test/DecimalMaxSO/Test.hs:35:
expected: Left "Failed reading: expected decimal"
but got: Right 0
Use -p '/DecimalMaxSO.empty/' to rerun this test only.
zero: OK
max: OK
large1: OK
large1: OK
negative: FAIL
test/DecimalMaxSO/Test.hs:45:
expected: Left "Failed reading: expected decimal"
but got: Left "endOfInput"
Use -p '/DecimalMaxSO.negative/' to rerun this test only.
imaginary: OK
```

The `empty`

test shows that the empty string is indeed
parsed as a zero. The `negative`

test parses string
`-1`

. The `decimalMax`

function sees the
`-`

character and successfully parses a zero *without*
consuming any characters. The `endOfInput`

function then
fails because input `-1`

remains.

## DecimalMaxFixed

It is easy to fix the `decimalMax`

function.

```
decimalMax :: Integral a => Integer -> ABS8.Parser a
= (<?> "decimalMax") $ do
decimalMax valueMax <- ABS8.scan 0 $ \numDigits c ->
bs if numDigits <= numDigitsMax && isDigit c
then Just $ numDigits + 1
else Nothing
$ fail "expected decimal"
when (BS.null bs) let value = BS.foldl' (\acc d -> acc * 10 + fromIntegral (d - 48)) 0 bs
if value > valueMax
then fail $ "decimal exceeds " ++ show valueMax
else return $ fromIntegral value
where
numDigitsMax :: Int
= ceiling @Float $ logBase 10 (fromIntegral valueMax + 1) numDigitsMax
```

Aside from adding type annotations and using different variable names, the changes are as follows:

- The
`ByteString`

returned by the scan is captured, allowing an error to be thrown when it is empty. - The
`(<?>)`

operator is used to add context to all error messages instead of writing the context in each error message separately. - Using
`BS.foldl'`

is preferred over using`foldl'`

and`BS.unpack`

. - Using
`logBase`

is preferred over using`log a / log b`

. - The condition in the scan is inverted so that the
`not`

call is not needed. - In calculating
`numDigitsMax`

,`valueMax`

is converted to a`Float`

before incrementing it, making the increment a native floating point operation instead of an arbitrary precision operation.

The test cases are the same as above, though some error messages are updated. All of the tests pass:

```
DecimalMaxFixed
empty: OK
zero: OK
max: OK
large1: OK
large1: OK
negative: OK
imaginary: OK
```

## DecimalMaxStrict

The `decimalMax`

parser now works, but I need support for
strict and lenient parsing. I also need to parse Unicode text, not ASCII
bytes. I wrote my own implementation to address these issues, and it
takes a different approach.

```
decimalMax :: (Integral a, Integral b) => Bool -> a -> AT.Parser b
= (<?> "decimalMax") $ do
decimalMax isStrict valueMax <- AT.runScanner (Just 0) $ \mN c -> case mN of
(t, mN) Just n
| isDigit c -> Just $
let n' = n * 10 + fromIntegral (ord c - 48)
in if n' > valueMax then Nothing else Just n'
| otherwise -> Nothing
Nothing -> Nothing
case (mN, T.uncons t) of
Just n, Just (d, ds))
(| isStrict && d == '0' && not (T.null ds) -> fail "leading zero"
| otherwise -> return $ fromIntegral n
Just{}, Nothing) -> fail "expected decimal"
(Nothing, _digits) -> fail "overflow" (
```

Instead of counting digits, this version calculates the value as the
digits are scanned. The runScanner
function is used, so the state is returned as well as the parsed string.
The state needs to represent the parsed value as well as an overflow, so
a `Maybe`

type is used where `Nothing`

indicates
an overflow.

Previous versions of `decimalMax`

used the unbounded
`Integer`

type to accumulate the value during parsing. This
implementation allows the developer to choose which type to use, as the
type just needs to be sufficiently larger than the target type. The
`word8`

parser uses an `Word`

. A
`Word16`

would suffice, but the native word size is preferred
for good performance.

```
word8 :: Bool -> AT.Parser Word8
= decimalMax isStrict $ fromIntegral @Word8 @Word maxBound word8 isStrict
```

The `parseWord8`

function is changed to call this
function, passing on the `isStrict`

flag.

```
parseWord8 :: Bool -> Text -> Either String Word8
= AT.parseOnly (word8 isStrict <* AT.endOfInput) parseWord8 isStrict
```

The tests are the same as those for DecimalBoundedStrict, with slightly different error messages. All of the tests pass:

```
DecimalMaxStrict
Strict
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
leadingZero: OK
Lenient
empty: OK
zero: OK
max: OK
large1: OK
large2: OK
negative: OK
imaginary: OK
leadingZero: OK
```

## Benchmarks

The DecimalBoundedStrict and DecimalMaxStrict versions are both sufficient for my needs. How do they compare when benchmarked? I prepared some benchmarks, and DecimalMaxStrict consistently performs better than DecimalBoundedStrict on my computer.