# Aeson Object Design (Part 2)

I posted a brief overview of my entry list design to the aeson issue, and Brendan Hay commented:

> For my own codebases I’m not convinced that `(.:)` introducing a linear scan by way of `Vector.find` would even be slower, at least, as the deserialisation routines (for say, a record) never perform repeated lookup of the same object key. As mentioned in the `Z-Data` code I linked above you’re probably paying more to materialise a `{Hash,}Map` than what the linear scan for a single key introduces? It would naturally require actual benchmarks to prove that I am not, in fact, a moron for claiming this.

The `Z-Data` comment is the following (with a typo fixed to avoid triggering my typo checker):

> The `Object`’s payload is a key-value vector instead of a map, which parsed directly from JSON document. This design choice has following advantages:
>
> - Allow different strategies handling duplicated keys.
> - Allow different `Map` type to do further parsing, e.g. `Z.Data.Vector.FlatMap`
> - Roundtrip without touching the original key-value order.
> - Save time if constructing map is not necessary, e.g. using a linear scan to find a key if only that key is needed.

This is an interesting idea that I would like to explore. I explore it using mathematics in this post, but the calculations are just estimations used to provide some intuition about the costs. It would indeed be wise to confirm the intuition using benchmarks.

I would like to consider two scenarios:

- In the *normal* scenario, every property of an object is used exactly once.
- In the *attack* scenario, expected properties are used exactly once while additional properties with hash collisions are never used.

I will use the following variables:

- \(n\) is the number of expected properties
- \(m\) is the number of additional properties in the attack scenario

In general, the total cost is as follows:

\[ {cost}_{total} = {cost}_{construction} + {cost}_{lookup} \]

The lookup cost is the sum of the lookup costs for each expected property:

\[ {cost}_{lookup} = \sum_{i=1}^n {cost}_i \]

## Normal Scenario

### Entry List

When I first worked on the entry list design, I used a list instead of a `Vector`:

`type ObjectEntries = [(Text, Value)]`

I then found Brendan’s first comment and realized that a `Vector` might be better, so I changed the type definition to the following:

`type ObjectEntries = Vector (Text, Value)`
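To make the linear-scan idea concrete, here is a hypothetical sketch (not aeson’s actual code) of a `(.:)`-style single-key lookup. I use `String` keys and a plain list to keep the sketch dependency-free; the real design would use `Text` and `Vector`, with `Vector.find` playing the role of `Data.List.find`:

```haskell
import Data.List (find)

-- Stand-in for aeson's Value type, just for illustration.
data Value = Null | Bool Bool | String String
  deriving (Eq, Show)

-- The list variant of the entry-list design; the Vector variant would
-- use Data.Vector.find in place of Data.List.find.
type ObjectEntries = [(String, Value)]

-- Linear scan for a single key: O(k), where k is the key's position.
lookupEntry :: String -> ObjectEntries -> Maybe Value
lookupEntry key = fmap snd . find ((== key) . fst)
```

For example, `lookupEntry "b" [("a", Null), ("b", Bool True)]` yields `Just (Bool True)` after scanning past one entry.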

I am not sure which version would be best, and I would like to see some benchmarks of both performance and memory usage. Looking at the Aeson source, I found that it parses JSON objects to `[(Text, Value)]` and then converts to a `HashMap`. It does *not* parse JSON objects directly into a `HashMap` as I figured it would. For this exploration, I will use the list version of `ObjectEntries` and not include the cost of that list construction, since it is done no matter what data type is used for searching.

\[ {cost}_{construction} = 0 \]

In the normal scenario, every property of an object is used exactly once; assuming the \(i\)-th lookup scans \(i\) entries before finding its key, the cost reduces to the following:

\[ {cost}_{lookup} = \sum_{i=1}^n i = \frac{n (n + 1)}{2} \]

The total cost is as follows:

\[ {cost}_{total} = \frac{n (n + 1)}{2} \]

### Map

The construction cost for a `Map` is that of building it from the entry list:

\[ {cost}_{construction} = n \log_2 n \]

The lookup cost is as follows:

\[ {cost}_{lookup} = \sum_{i=1}^n \log_2 n = n \log_2 n \]

It is interesting to note that the construction cost and lookup cost are approximately the same. The total cost is as follows:

\[ {cost}_{total} = n \log_2 n + n \log_2 n = 2n \log_2 n \]

### HashMap

The construction cost for a `HashMap` is that of building it from the entry list. The documentation lists this as \(O(n \log n)\) but notes that the large base makes \(O(\log n)\) operations run in “constant time” in practice, so I will be optimistic and use the following construction cost:

\[ {cost}_{construction} = n \]

The lookup cost for a single property is \(O(\log n)\), which runs in “constant time” in practice, so the lookup cost is as follows:

\[ {cost}_{lookup} = \sum_{i=1}^n 1 = n \]

The total cost is as follows:

\[ {cost}_{total} = n + n = 2n \]

### Comparison

Here are some estimated numbers of operations for various sizes of objects:

| \(n\) | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| List | 1 | 3 | 10 | 36 | 136 | 528 | 2,080 | 8,256 |
| Map | 2 | 4 | 16 | 48 | 128 | 320 | 768 | 1,792 |
| HashMap | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
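These rows follow directly from the totals above and can be reproduced with a small sketch. One assumption on my part: the per-operation \(\log_2 n\) cost is floored at 1, which is how the \(n = 1\) column gets 2 rather than 0:

```haskell
-- Estimated total operation counts from the formulas above.
listCost, mapCost, hashMapCost :: Double -> Double
listCost    n = n * (n + 1) / 2              -- linear scans: 1 + 2 + ... + n
mapCost     n = 2 * n * max 1 (logBase 2 n)  -- construction + lookups
hashMapCost n = 2 * n                        -- construction + lookups
```

For example, `map (round . listCost) [1, 2, 4, 8]` gives `[1, 3, 10, 36]`, matching the List row.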

With these estimations, the `HashMap` is only outperformed on objects with a single property, and the performance difference in this case is negligible. The total cost of using a `HashMap` is linear and *greatly* outperforms the other options. The entry list outperforms `Map` when there are fewer than 15 properties, but not by much.

In the following graph, created using Desmos, the list line is red, the `Map` line is green, and the `HashMap` line is blue.

These estimations give equal weight to all operations, while in reality some operations are more expensive than others, even when they run in constant time. The exact numbers above are pretty meaningless, but the shapes of the lines in the graph help form an intuition about what to expect.

## Attack Scenario

### Entry List

In the attack scenario, the worst case is that the expected properties are at the end, so that every lookup must linearly traverse all of the other properties:

\[ {cost}_{lookup} = \sum_{i=1}^n(n+m) = n (n+m) \]

The total cost is as follows:

\[ {cost}_{total} = n (n+m) \]

### Map

In the attack scenario, the other properties increase the cost of construction:

\[ {cost}_{construction} = (n+m) \log_2(n+m) \]

The lookup cost is as follows:

\[ {cost}_{lookup} = \sum_{i=1}^n \log_2(n+m) = n \log_2(n+m) \]

The total cost is as follows:

\[ {cost}_{total} = (n+m) \log_2(n+m) + n \log_2(n+m) = (2n+m) \log_2(n+m) \]

### HashMap

In the attack scenario, the other properties have hash collisions, resulting in linear insertion when the `HashMap` is constructed:

\[ {cost}_{construction} = n + \sum_{i=1}^m i = n + \frac{m(m+1)}{2} \]

Note that this quadratic expression makes just constructing the `HashMap` sufficient for denial of service. This is demonstrated in the exploit, which just parses the JSON without performing any lookups.
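The quadratic construction cost can be illustrated with a toy simulation. The bucket layout here is an assumption for the sketch, not `unordered-containers` internals: when every key hashes to the same bucket, stored as a list, each insert scans the whole bucket for a duplicate before adding the key:

```haskell
-- Total key comparisons for m inserts whose keys all collide into one
-- bucket: the i-th insert scans the i - 1 keys already in the bucket.
bucketInsertComparisons :: Int -> Int
bucketInsertComparisons m = go [] [1 .. m] 0
  where
    go _      []       acc = acc
    go bucket (k : ks) acc = go (k : bucket) ks (acc + length bucket)
```

This counts \(\sum_{i=1}^{m} (i - 1) = \frac{m(m-1)}{2}\) comparisons, quadratic in \(m\) just like the \(\frac{m(m+1)}{2}\) estimate above.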

Only the expected properties are used. In the worst case, one of the expected properties collides with the other properties:

\[ {cost}_{lookup} = \sum_{i=1}^{n-1} 1 + (m + 1) = (n - 1) + (m + 1) = n + m \]

The total cost is as follows:

\[ {cost}_{total} = n + \frac{m(m+1)}{2} + n + m = 2n + m + \frac{m(m+1)}{2} \]

### Comparison

For the comparison, I will use a constant number of other properties in the attack: \(m = 100,000\). This is a bit less than the number of properties used in the exploit (109,100).

Here are some estimated numbers of operations for various sizes of objects:

| \(n\) | 1 | 2 | 4 | 8 |
| --- | --- | --- | --- | --- |
| List | 100,001 | 200,004 | 400,016 | 800,064 |
| Map | 1,660,999 | 1,661,033 | 1,661,103 | 1,661,241 |
| HashMap | 5,000,150,002 | 5,000,150,004 | 5,000,150,008 | 5,000,150,016 |
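These values, for \(n = 1, 2, 4, 8\) and \(m = 100{,}000\), can likewise be reproduced from the formulas above (a sketch; the `Map` row matches after rounding the \(\log_2\) terms):

```haskell
-- Estimated attack-scenario totals from the formulas above.
attackList, attackMap, attackHashMap :: Double -> Double -> Double
attackList    n m = n * (n + m)
attackMap     n m = (2 * n + m) * logBase 2 (n + m)
attackHashMap n m = 2 * n + m + m * (m + 1) / 2
```

For example, `round (attackHashMap 1 100000)` is `5000150002`, dominated by the quadratic construction term.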

With these estimations, the list outperforms `Map` when there are fewer than 17 properties, but `Map` performs much better as the number of properties increases. The `HashMap` has a high construction cost, with relatively small lookup cost.

In the following graph, created using Desmos, the list line is red, the `Map` line is green, and the `HashMap` line is blue.

In the following graph, created using Desmos, the list line is red and the `Map` line is green. The `HashMap` line is left out to illustrate how the list and `Map` lines compare.

As with the normal scenario, these estimations give equal weight to all operations, while in reality some operations are more expensive than others, even when they run in constant time. The exact numbers above are pretty meaningless, but the shapes of the lines in the graph help form an intuition about what to expect. In particular, these estimations illustrate how using a `Map` avoids the quadratic cost of the `HashMap`.