Aeson Object Design
This weekend, there was a lot of discussion about a denial-of-service vulnerability in aeson, the most popular JSON library for Haskell. Tom Sydney Kerckhove published a blog post titled JSON Vulnerability in Haskell’s Aeson library, and there was discussion on Reddit, Discourse, and a GitHub issue.
Hash Collision Vulnerability
The vulnerability itself is an old and well known issue that every hash table
implementation must consider, in any programming language. On average,
hash tables have constant (O(1)
) time complexity
for insertion, search, and deletion. This is possible because the
O(1)
calculation of the hash of a key is sufficient to
determine the index in the hash table where the entry is stored. When
multiple keys have the same hash (called collision),
however, multiple entries must be stored at the same index. These
entries are usually stored in a linked list, which requires linear
search. If every key in a hash table has the same hash, this results in
the O(n)
worst case time complexity for insertion, search,
and deletion. When an attacker knows the hash algorithm being used by
the data structure used to store JSON objects, they can create JSON that
includes many keys that have the same hash, forcing the worst case time
complexity. By keeping the server busy with hash table operations, the
attacker can cause denial of service.
The aeson
library implements JSON objects using a HashMap
, provided
by the unordered-containers
library, and the hash algorithm is defined in the hashable
library, which contains a warning that it is susceptible to
denial-of-service attacks. In 2018, a team at FP Complete investigated that
warning and was able to create an exploit that demonstrates the threat.
They spent some time communicating with maintainers about how to resolve
the issue and then published the exploit so that the Haskell community
could demand a quick resolution.
Solutions
The blog post suggests some ways to fix the issue:
- Using a better hash algorithm would make it very difficult for an attacker to find keys that result in collisions. This would decrease performance, unfortunately.
- The hash algorithm could use a random salt that is generated when the program starts, to make it more difficult to attack. This is not a proper solution, however, because an attacker may be able to determine the salt and still attack.
- Changing the implementation of
HashMap
to store entries in aHashMap
that uses a different salt gets rid of the linked list that results in the worst case time complexity. This strategy is usually not done because of the added implementation complexity and decreased performance for small hash tables, but the team implemented it and found that it worked well. The maintainers rejected the change because there is no proof that such an implementation does not also have vulnerabilities. How difficult it is to create keys that collide with two different salt values? - Using a different data structure to represent JSON objects would
avoid hashes altogether. They author suggested using the
Map
data structure, provided by the containers library. This would also decrease performance.
The maintainers are currently working on a change in aeson to address
the issue. In the proposed change, an Object
will be
implemented using an opaque data structure named TextMap
,
with a minimal API. This is a breaking change, but it will allow the
maintainers to experiment with different implementations without
additional breaking changes. The TextMap
data structure
will initially be a newtype
around HashMap
or
Map
, and a compile-time flag will allow users to select the
desired implementation: HashMap
for better performance or
Map
for better security.
One limitation of this current fix is that the selection affects the whole program. One may want to prioritize security on a public API while prioritizing performance on an internal or administrative API, but this is not possible when using the compile-time flag.
Entry List
There are a few limitations caused by using HashMap
or
Map
to represent JSON objects.
One limitation is that these data structure cannot handle objects with duplicate keys. JSON is based on JavaScript and is unfortunately quite poorly defined. It does not disallow duplicate keys, but they are very rarely used in practice and generally indicate poor design. This limitation is therefore generally ignored.
Another limitation is that these data structures do not allow the developer to specify the order of entries. JSON is generally used as a format that can be easily parsed by machines, and the order of object entries generally does not matter. YAML is a format for humans, however, and sometimes the order of entries is important!
Currently, I do not know of a Haskell library that supports writing YAML with fixed-order object entries. When that is a requirement, options are to write a new library, mangle the YAML output of an existing library to fix the order, or to use a different tool. Unfortunately, this issue sometimes results in having to use a different language altogether.
Both of these limitations would be resolved by representing JSON
objects as a list of entries (implemented using Vector
for
efficiency):
type ObjectEntries = Vector (Text, Value)
When creating a Value
to be encoded as JSON/YAML, the
developer specifies the order of entries, and even duplicate entries are
possible. When decoding JSON/YAML, the Object
represents
the source order, and duplicate entries are not lost. It is then up to
the developer to choose how to use this data according to the
requirements of the application. The data can be read into a
HashMap
for optimal search performance, it can be read into
a Map
for better security, etc.
The challenging part of this design is type conversion. The aeson library
provides a FromJSON
type class, and instances convert from
a JSON representation to a specific data type. This conversion usually
involves search of object keys, as in the following example from the
documentation:
instance FromJSON Coord where
= withObject "Coord" $ \v -> Coord
parseJSON <$> v .: "x"
<*> v .: "y"
Using ObjectEntries
directly would result in linear time
complexity. When writing manual instances, it would be possible to
specify the data structure to use. As an example design, a type class
could be used so that operators work with any data type (with an
instance).
class Object o where
lookup :: Text -> o -> Maybe Value
(.:) :: (FromJSON a, Object o) => o -> Text -> Parser a
For example, the instances for ObjectEntries
(O(n)
) and HashMap
(O(1)
best
case and O(n)
worst case) could be implemented as
follows:
instance Object ObjectEntries where
lookup k = fmap snd . Vector.find ((== k) . fst)
instance Object (HashMap Text Value) where
lookup = HashMap.lookup
The withObject
function could be renamed to
withObjectEntries
to indicate that it works with a list
(Vector
) by default:
withObjectEntries :: String
-> (ObjectEntries -> Parser a)
-> Value
-> Parser a
If desired, helper functions could be use to perform conversions. For
example, an asHashMap
function could load the
ObjectEntries
into a HashMap
as follows:
asHashMap :: (HashMap Text Value -> Parser a)
-> ObjectEntries
-> Parser a
= f . HashMap.fromList . Vector.toList asHashMap f
The example instance definition could continue to use a
HashMap
for lookup when implemented as follows:
instance FromJSON Coord where
= withObjectEntries "Coord" . asHashMap $ \v -> Coord
parseJSON <$> v .: "x"
<*> v .: "y"
By exposing the Object
type class, users would even be
able to implement support for data types that are not provided in the
library.
What about Generic
instances? Perhaps the library could
provide a suitable default (prioritize security?) and expose a
compile-time flag that allows users to select an alternative, much like
the current fix. The flag would select which conversion function to use
(asHashMap
to use HashMap
, etc.). Use of the
compile-time flag results in the same limitation as the current fix, but
at least in this case it would be possible to write some instances
manually to select a different data structure.
Ordered Map
While it would not support duplicate keys, use of a map that preserves the insertion order would resolve the order issue. This is what Python does. In Python, the entries are stored in insertion order, and the sparse table of hash table indexes is separate. (This design was suggested in 2012, released in Python 3.6 and made an official part of the language from Python 3.7.) In Python, this change actually resulted in both smaller memory usage and better iteration performance!
I do not know of a Haskell library that provides such a data
structure. Specifically, I want a data structure that provides the same
API and performance of HashMap
where the data structure is
traversed in order of insertion, with toList
returning a
list of entries in order of insertion. I would like to develop such a
library myself if I can find the time to do so.
Ordered Map Wrapper
Another idea is to provide an OrderedMap
data type that
wraps an existing map data structure such as HashMap
or
Map
and uses a DList
(from the dlist package) or
similar data structure to keep track of the insertion order. While this
would likely not perform as well as a standalone data type, it would
allow the user to choose which type of map to use. The performance
should not be too bad for the target use case, where entries are
inserted and then iterated (without updates or deletions).
Yesterday, I wrote a quick prototype to explore the design. In the prototype, I use the following definition:
data OrderedMap t k v
= OrderedMap
omMap :: !(t k v)
{ omList :: !(DList (k ,v))
, }
As an example of the API, consider the insert
function.
The OrderedMap
implementation uses a function that checks
for membership and does the insertion for the wrapped map type. In the
target use case, insertions are for new keys, in which case
snoc
can be used on the DList
(O(1)
). In cases where the insertion overwrites a value,
the DList
has to be modified, requiring a linear
(O(n)
) traversal.
insert :: Eq k
=> (k -> v -> t k v -> (Bool, t k v))
-> k
-> v
-> OrderedMap t k v
-> OrderedMap t k v
=
insert mapMemberInsert key value omap let (isMember, newMap) = mapMemberInsert key value $ omMap omap
g| isMember = dlistAdjust (const value) key
| otherwise = flip DList.snoc (key, value)
in OrderedMap
= newMap
{ omMap = g $ omList omap
, omList
}
-- helper function
dlistAdjust :: Eq k
=> (v -> v)
-> k
-> DList (k, v)
-> DList (k, v)
= DList.map $ \(k, v) -> (k, if k == key then f v else v) dlistAdjust f key
Separate modules provide implementations for specific map types. For
example, a type synonym for an OrderedMap
that wraps the
Map
type is as follows:
type OrderedMap = OM.OrderedMap Map
The insert
function uses the
insertLookupWithKey
function to perform the membership test
and insertion in one O(log n)
traversal of the data
structure:
insert :: Ord k
=> k
-> v
-> OrderedMap k v
-> OrderedMap k v
= OM.insert memberInsert
insert where
memberInsert :: Ord k => k -> v -> Map k v -> (Bool, Map k v)
memberInsert key value= first isJust
. Map.insertLookupWithKey useNewValue key value
useNewValue :: (k -> v -> v -> v)
= newValue useNewValue _key newValue _oldValue
I would like to implement some more modules and run some benchmarks to see how bad the performance is.