Aeson Object Design
This weekend, there was a lot of discussion about a denial-of-service vulnerability in aeson, the most popular JSON library for Haskell. Tom Sydney Kerckhove published a blog post titled JSON Vulnerability in Haskell’s Aeson library, and there was discussion on Reddit, Discourse, and a GitHub issue.
Hash Collision Vulnerability
The vulnerability itself is an old and well known issue that every hash table
implementation must consider, in any programming language. On average,
hash tables have constant (O(1)) time complexity
for insertion, search, and deletion. This is possible because the
O(1) calculation of the hash of a key is sufficient to
determine the index in the hash table where the entry is stored. When
multiple keys have the same hash (called collision),
however, multiple entries must be stored at the same index. These
entries are usually stored in a linked list, which requires linear
search. If every key in a hash table has the same hash, this results in
the O(n) worst case time complexity for insertion, search,
and deletion. When an attacker knows the hash algorithm being used by
the data structure used to store JSON objects, they can create JSON that
includes many keys that have the same hash, forcing the worst case time
complexity. By keeping the server busy with hash table operations, the
attacker can cause denial of service.
The aeson
library implements JSON objects using a HashMap, provided
by the unordered-containers
library, and the hash algorithm is defined in the hashable
library, which contains a warning that it is susceptible to
denial-of-service attacks. In 2018, a team at FP Complete investigated that
warning and was able to create an exploit that demonstrates the threat.
They spent some time communicating with maintainers about how to resolve
the issue and then published the exploit so that the Haskell community
could demand a quick resolution.
Solutions
The blog post suggests some ways to fix the issue:
- Using a better hash algorithm would make it very difficult for an attacker to find keys that result in collisions. This would decrease performance, unfortunately.
- The hash algorithm could use a random salt that is generated when the program starts, to make it more difficult to attack. This is not a proper solution, however, because an attacker may be able to determine the salt and still attack.
- Changing the implementation of
HashMapto store entries in aHashMapthat uses a different salt gets rid of the linked list that results in the worst case time complexity. This strategy is usually not done because of the added implementation complexity and decreased performance for small hash tables, but the team implemented it and found that it worked well. The maintainers rejected the change because there is no proof that such an implementation does not also have vulnerabilities. How difficult it is to create keys that collide with two different salt values? - Using a different data structure to represent JSON objects would
avoid hashes altogether. They author suggested using the
Mapdata structure, provided by the containers library. This would also decrease performance.
The maintainers are currently working on a change in aeson to address
the issue. In the proposed change, an Object will be
implemented using an opaque data structure named TextMap,
with a minimal API. This is a breaking change, but it will allow the
maintainers to experiment with different implementations without
additional breaking changes. The TextMap data structure
will initially be a newtype around HashMap or
Map, and a compile-time flag will allow users to select the
desired implementation: HashMap for better performance or
Map for better security.
One limitation of this current fix is that the selection affects the whole program. One may want to prioritize security on a public API while prioritizing performance on an internal or administrative API, but this is not possible when using the compile-time flag.
Entry List
There are a few limitations caused by using HashMap or
Map to represent JSON objects.
One limitation is that these data structure cannot handle objects with duplicate keys. JSON is based on JavaScript and is unfortunately quite poorly defined. It does not disallow duplicate keys, but they are very rarely used in practice and generally indicate poor design. This limitation is therefore generally ignored.
Another limitation is that these data structures do not allow the developer to specify the order of entries. JSON is generally used as a format that can be easily parsed by machines, and the order of object entries generally does not matter. YAML is a format for humans, however, and sometimes the order of entries is important!
Currently, I do not know of a Haskell library that supports writing YAML with fixed-order object entries. When that is a requirement, options are to write a new library, mangle the YAML output of an existing library to fix the order, or to use a different tool. Unfortunately, this issue sometimes results in having to use a different language altogether.
Both of these limitations would be resolved by representing JSON
objects as a list of entries (implemented using Vector for
efficiency):
type ObjectEntries = Vector (Text, Value)When creating a Value to be encoded as JSON/YAML, the
developer specifies the order of entries, and even duplicate entries are
possible. When decoding JSON/YAML, the Object represents
the source order, and duplicate entries are not lost. It is then up to
the developer to choose how to use this data according to the
requirements of the application. The data can be read into a
HashMap for optimal search performance, it can be read into
a Map for better security, etc.
The challenging part of this design is type conversion. The aeson library
provides a FromJSON type class, and instances convert from
a JSON representation to a specific data type. This conversion usually
involves search of object keys, as in the following example from the
documentation:
instance FromJSON Coord where
parseJSON = withObject "Coord" $ \v -> Coord
<$> v .: "x"
<*> v .: "y"Using ObjectEntries directly would result in linear time
complexity. When writing manual instances, it would be possible to
specify the data structure to use. As an example design, a type class
could be used so that operators work with any data type (with an
instance).
class Object o where
lookup :: Text -> o -> Maybe Value
(.:) :: (FromJSON a, Object o) => o -> Text -> Parser aFor example, the instances for ObjectEntries
(O(n)) and HashMap (O(1) best
case and O(n) worst case) could be implemented as
follows:
instance Object ObjectEntries where
lookup k = fmap snd . Vector.find ((== k) . fst)
instance Object (HashMap Text Value) where
lookup = HashMap.lookupThe withObject function could be renamed to
withObjectEntries to indicate that it works with a list
(Vector) by default:
withObjectEntries
:: String
-> (ObjectEntries -> Parser a)
-> Value
-> Parser aIf desired, helper functions could be use to perform conversions. For
example, an asHashMap function could load the
ObjectEntries into a HashMap as follows:
asHashMap
:: (HashMap Text Value -> Parser a)
-> ObjectEntries
-> Parser a
asHashMap f = f . HashMap.fromList . Vector.toListThe example instance definition could continue to use a
HashMap for lookup when implemented as follows:
instance FromJSON Coord where
parseJSON = withObjectEntries "Coord" . asHashMap $ \v -> Coord
<$> v .: "x"
<*> v .: "y"By exposing the Object type class, users would even be
able to implement support for data types that are not provided in the
library.
What about Generic instances? Perhaps the library could
provide a suitable default (prioritize security?) and expose a
compile-time flag that allows users to select an alternative, much like
the current fix. The flag would select which conversion function to use
(asHashMap to use HashMap, etc.). Use of the
compile-time flag results in the same limitation as the current fix, but
at least in this case it would be possible to write some instances
manually to select a different data structure.
Ordered Map
While it would not support duplicate keys, use of a map that preserves the insertion order would resolve the order issue. This is what Python does. In Python, the entries are stored in insertion order, and the sparse table of hash table indexes is separate. (This design was suggested in 2012, released in Python 3.6 and made an official part of the language from Python 3.7.) In Python, this change actually resulted in both smaller memory usage and better iteration performance!
I do not know of a Haskell library that provides such a data
structure. Specifically, I want a data structure that provides the same
API and performance of HashMap where the data structure is
traversed in order of insertion, with toList returning a
list of entries in order of insertion. I would like to develop such a
library myself if I can find the time to do so.
Ordered Map Wrapper
Another idea is to provide an OrderedMap data type that
wraps an existing map data structure such as HashMap or
Map and uses a DList (from the dlist package) or
similar data structure to keep track of the insertion order. While this
would likely not perform as well as a standalone data type, it would
allow the user to choose which type of map to use. The performance
should not be too bad for the target use case, where entries are
inserted and then iterated (without updates or deletions).
Yesterday, I wrote a quick prototype to explore the design. In the prototype, I use the following definition:
data OrderedMap t k v
= OrderedMap
{ omMap :: !(t k v)
, omList :: !(DList (k ,v))
}As an example of the API, consider the insert function.
The OrderedMap implementation uses a function that checks
for membership and does the insertion for the wrapped map type. In the
target use case, insertions are for new keys, in which case
snoc can be used on the DList
(O(1)). In cases where the insertion overwrites a value,
the DList has to be modified, requiring a linear
(O(n)) traversal.
insert
:: Eq k
=> (k -> v -> t k v -> (Bool, t k v))
-> k
-> v
-> OrderedMap t k v
-> OrderedMap t k v
insert mapMemberInsert key value omap =
let (isMember, newMap) = mapMemberInsert key value $ omMap omap
g
| isMember = dlistAdjust (const value) key
| otherwise = flip DList.snoc (key, value)
in OrderedMap
{ omMap = newMap
, omList = g $ omList omap
}
-- helper function
dlistAdjust
:: Eq k
=> (v -> v)
-> k
-> DList (k, v)
-> DList (k, v)
dlistAdjust f key = DList.map $ \(k, v) -> (k, if k == key then f v else v)Separate modules provide implementations for specific map types. For
example, a type synonym for an OrderedMap that wraps the
Map type is as follows:
type OrderedMap = OM.OrderedMap MapThe insert function uses the
insertLookupWithKey function to perform the membership test
and insertion in one O(log n) traversal of the data
structure:
insert
:: Ord k
=> k
-> v
-> OrderedMap k v
-> OrderedMap k v
insert = OM.insert memberInsert
where
memberInsert :: Ord k => k -> v -> Map k v -> (Bool, Map k v)
memberInsert key value
= first isJust
. Map.insertLookupWithKey useNewValue key value
useNewValue :: (k -> v -> v -> v)
useNewValue _key newValue _oldValue = newValueI would like to implement some more modules and run some benchmarks to see how bad the performance is.