Skip to main content

Aeson Object Design

This weekend, there was a lot of discussion about a denial-of-service vulnerability in aeson, the most popular JSON library for Haskell. Tom Sydney Kerckhove published a blog post titled JSON Vulnerability in Haskell’s Aeson library, and there was discussion on Reddit, Discourse, and a GitHub issue.

Hash Collision Vulnerability

The vulnerability itself is an old and well known issue that every hash table implementation must consider, in any programming language. On average, hash tables have constant (O(1)) time complexity for insertion, search, and deletion. This is possible because the O(1) calculation of the hash of a key is sufficient to determine the index in the hash table where the entry is stored. When multiple keys have the same hash (called collision), however, multiple entries must be stored at the same index. These entries are usually stored in a linked list, which requires linear search. If every key in a hash table has the same hash, this results in the O(n) worst case time complexity for insertion, search, and deletion. When an attacker knows the hash algorithm being used by the data structure used to store JSON objects, they can create JSON that includes many keys that have the same hash, forcing the worst case time complexity. By keeping the server busy with hash table operations, the attacker can cause denial of service.

The aeson library implements JSON objects using a HashMap, provided by the unordered-containers library, and the hash algorithm is defined in the hashable library, which contains a warning that it is susceptible to denial-of-service attacks. In 2018, a team at FP Complete investigated that warning and was able to create an exploit that demonstrates the threat. They spent some time communicating with maintainers about how to resolve the issue and then published the exploit so that the Haskell community could demand a quick resolution.

Solutions

The blog post suggests some ways to fix the issue:

  • Using a better hash algorithm would make it very difficult for an attacker to find keys that result in collisions. This would decrease performance, unfortunately.
  • The hash algorithm could use a random salt that is generated when the program starts, to make it more difficult to attack. This is not a proper solution, however, because an attacker may be able to determine the salt and still attack.
  • Changing the implementation of HashMap to store entries in a HashMap that uses a different salt gets rid of the linked list that results in the worst case time complexity. This strategy is usually not done because of the added implementation complexity and decreased performance for small hash tables, but the team implemented it and found that it worked well. The maintainers rejected the change because there is no proof that such an implementation does not also have vulnerabilities. How difficult it is to create keys that collide with two different salt values?
  • Using a different data structure to represent JSON objects would avoid hashes altogether. They author suggested using the Map data structure, provided by the containers library. This would also decrease performance.

The maintainers are currently working on a change in aeson to address the issue. In the proposed change, an Object will be implemented using an opaque data structure named TextMap, with a minimal API. This is a breaking change, but it will allow the maintainers to experiment with different implementations without additional breaking changes. The TextMap data structure will initially be a newtype around HashMap or Map, and a compile-time flag will allow users to select the desired implementation: HashMap for better performance or Map for better security.

One limitation of this current fix is that the selection affects the whole program. One may want to prioritize security on a public API while prioritizing performance on an internal or administrative API, but this is not possible when using the compile-time flag.

Entry List

There are a few limitations caused by using HashMap or Map to represent JSON objects.

One limitation is that these data structure cannot handle objects with duplicate keys. JSON is based on JavaScript and is unfortunately quite poorly defined. It does not disallow duplicate keys, but they are very rarely used in practice and generally indicate poor design. This limitation is therefore generally ignored.

Another limitation is that these data structures do not allow the developer to specify the order of entries. JSON is generally used as a format that can be easily parsed by machines, and the order of object entries generally does not matter. YAML is a format for humans, however, and sometimes the order of entries is important!

Currently, I do not know of a Haskell library that supports writing YAML with fixed-order object entries. When that is a requirement, options are to write a new library, mangle the YAML output of an existing library to fix the order, or to use a different tool. Unfortunately, this issue sometimes results in having to use a different language altogether.

Both of these limitations would be resolved by representing JSON objects as a list of entries (implemented using Vector for efficiency):

type ObjectEntries = Vector (Text, Value)

When creating a Value to be encoded as JSON/YAML, the developer specifies the order of entries, and even duplicate entries are possible. When decoding JSON/YAML, the Object represents the source order, and duplicate entries are not lost. It is then up to the developer to choose how to use this data according to the requirements of the application. The data can be read into a HashMap for optimal search performance, it can be read into a Map for better security, etc.

The challenging part of this design is type conversion. The aeson library provides a FromJSON type class, and instances convert from a JSON representation to a specific data type. This conversion usually involves search of object keys, as in the following example from the documentation:

instance FromJSON Coord where
  parseJSON = withObject "Coord" $ \v -> Coord
    <$> v .: "x"
    <*> v .: "y"

Using ObjectEntries directly would result in linear time complexity. When writing manual instances, it would be possible to specify the data structure to use. As an example design, a type class could be used so that operators work with any data type (with an instance).

class Object o where
  lookup :: Text -> o -> Maybe Value

(.:) :: (FromJSON a, Object o) => o -> Text -> Parser a

For example, the instances for ObjectEntries (O(n)) and HashMap (O(1) best case and O(n) worst case) could be implemented as follows:

instance Object ObjectEntries where
  lookup k = fmap snd . Vector.find ((== k) . fst)

instance Object (HashMap Text Value) where
  lookup = HashMap.lookup

The withObject function could be renamed to withObjectEntries to indicate that it works with a list (Vector) by default:

withObjectEntries
  :: String
  -> (ObjectEntries -> Parser a)
  -> Value
  -> Parser a

If desired, helper functions could be use to perform conversions. For example, an asHashMap function could load the ObjectEntries into a HashMap as follows:

asHashMap
  :: (HashMap Text Value -> Parser a)
  -> ObjectEntries
  -> Parser a
asHashMap f = f . HashMap.fromList . Vector.toList

The example instance definition could continue to use a HashMap for lookup when implemented as follows:

instance FromJSON Coord where
  parseJSON = withObjectEntries "Coord" . asHashMap $ \v -> Coord
    <$> v .: "x"
    <*> v .: "y"

By exposing the Object type class, users would even be able to implement support for data types that are not provided in the library.

What about Generic instances? Perhaps the library could provide a suitable default (prioritize security?) and expose a compile-time flag that allows users to select an alternative, much like the current fix. The flag would select which conversion function to use (asHashMap to use HashMap, etc.). Use of the compile-time flag results in the same limitation as the current fix, but at least in this case it would be possible to write some instances manually to select a different data structure.

Ordered Map

While it would not support duplicate keys, use of a map that preserves the insertion order would resolve the order issue. This is what Python does. In Python, the entries are stored in insertion order, and the sparse table of hash table indexes is separate. (This design was suggested in 2012, released in Python 3.6 and made an official part of the language from Python 3.7.) In Python, this change actually resulted in both smaller memory usage and better iteration performance!

I do not know of a Haskell library that provides such a data structure. Specifically, I want a data structure that provides the same API and performance of HashMap where the data structure is traversed in order of insertion, with toList returning a list of entries in order of insertion. I would like to develop such a library myself if I can find the time to do so.

Ordered Map Wrapper

Another idea is to provide an OrderedMap data type that wraps an existing map data structure such as HashMap or Map and uses a DList (from the dlist package) or similar data structure to keep track of the insertion order. While this would likely not perform as well as a standalone data type, it would allow the user to choose which type of map to use. The performance should not be too bad for the target use case, where entries are inserted and then iterated (without updates or deletions).

Yesterday, I wrote a quick prototype to explore the design. In the prototype, I use the following definition:

data OrderedMap t k v
  = OrderedMap
    { omMap  :: !(t k v)
    , omList :: !(DList (k ,v))
    }

As an example of the API, consider the insert function. The OrderedMap implementation uses a function that checks for membership and does the insertion for the wrapped map type. In the target use case, insertions are for new keys, in which case snoc can be used on the DList (O(1)). In cases where the insertion overwrites a value, the DList has to be modified, requiring a linear (O(n)) traversal.

insert
  :: Eq k
  => (k -> v -> t k v -> (Bool, t k v))
  -> k
  -> v
  -> OrderedMap t k v
  -> OrderedMap t k v
insert mapMemberInsert key value omap =
    let (isMember, newMap) = mapMemberInsert key value $ omMap omap
        g
          | isMember  = dlistAdjust (const value) key
          | otherwise = flip DList.snoc (key, value)
    in  OrderedMap
          { omMap  = newMap
          , omList = g $ omList omap
          }

-- helper function
dlistAdjust
  :: Eq k
  => (v -> v)
  -> k
  -> DList (k, v)
  -> DList (k, v)
dlistAdjust f key = DList.map $ \(k, v) -> (k, if k == key then f v else v)

Separate modules provide implementations for specific map types. For example, a type synonym for an OrderedMap that wraps the Map type is as follows:

type OrderedMap = OM.OrderedMap Map

The insert function uses the insertLookupWithKey function to perform the membership test and insertion in one O(log n) traversal of the data structure:

insert
  :: Ord k
  => k
  -> v
  -> OrderedMap k v
  -> OrderedMap k v
insert = OM.insert memberInsert
  where
    memberInsert :: Ord k => k -> v -> Map k v -> (Bool, Map k v)
    memberInsert key value
      = first isJust
      . Map.insertLookupWithKey useNewValue key value

    useNewValue :: (k -> v -> v -> v)
    useNewValue _key newValue _oldValue = newValue

I would like to implement some more modules and run some benchmarks to see how bad the performance is.