Aeson Object Design (Part 2)
I posted a brief overview of my entry list design to the aeson issue, and Brendan Hay commented:
For my own codebases I’m not convinced that `(.:)` introducing a linear scan by way of `Vector.find` would even be slower, at least, as the deserialisation routines (for say, a record) never perform repeated lookup of the same object key. As mentioned in the `Z-Data` code I linked above you’re probably paying more to materialise a `{Hash,}Map` than what the linear scan for a single key introduces? It would naturally require actual benchmarks to prove that I am not, in fact, a moron for claiming this.
The `Z-Data` comment is the following (with a typo fixed to avoid triggering my typo checker):
The `Object`’s payload is a key-value vector instead of a map, which parsed directly from JSON document. This design choice has following advantages:
- Allow different strategies handling duplicated keys.
- Allow different `Map` type to do further parsing, e.g. `Z.Data.Vector.FlatMap`
- Roundtrip without touching the original key-value order.
- Save time if constructing map is not necessary, e.g. using a linear scan to find a key if only that key is needed.
This is an interesting idea that I would like to explore. In this post I explore it using mathematics, but the calculations are only rough estimates meant to provide some intuition about the costs. It would indeed be wise to confirm that intuition with benchmarks.
I would like to consider two scenarios:
- In the normal scenario, every property of an object is used exactly once.
- In the attack scenario, expected properties are used exactly once while additional properties with hash collisions are never used.
I will use the following variables:
- \(n\) is the number of expected properties
- \(m\) is the number of additional properties in the attack scenario
In general, the total cost is as follows:
\[ {cost}_{total} = {cost}_{construction} + {cost}_{lookup} \]
The lookup cost is the sum of the lookup costs for each expected property:
\[ {cost}_{lookup} = \sum_{i=1}^n {cost}_i \]
Normal Scenario
Entry List
When I first worked on the entry list design, I used a list instead of a `Vector`:
type ObjectEntries = [(Text, Value)]
I then found Brendan’s first comment and realized that a `Vector` might be better, so I changed the type definition to the following:
type ObjectEntries = Vector (Text, Value)
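To make the linear-scan idea concrete, here is a minimal sketch of a key lookup over both representations. The simplified `Value` type and the `lookupList`/`lookupVector` helpers are names of my own for illustration, not part of aeson:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Text   (Text)
import           Data.Vector (Vector)
import qualified Data.Vector as Vector

-- Simplified stand-in for aeson's Value, just for illustration.
data Value = Null | Bool Bool | String Text
  deriving (Eq, Show)

-- Linear-scan lookup over the list representation; Prelude's lookup
-- already does exactly this.
lookupList :: Text -> [(Text, Value)] -> Maybe Value
lookupList = lookup

-- Linear-scan lookup over the Vector representation, in the spirit of
-- Brendan's Vector.find suggestion.
lookupVector :: Text -> Vector (Text, Value) -> Maybe Value
lookupVector key = fmap snd . Vector.find ((== key) . fst)

main :: IO ()
main = do
  let entries = [("name", String "aeson"), ("compact", Bool True)]
  print (lookupList "compact" entries)                      -- Just (Bool True)
  print (lookupVector "missing" (Vector.fromList entries))  -- Nothing
```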
I am not sure which version would be best, and I would like to see some benchmarks of both performance and memory usage. Looking at the Aeson source, I found that it parses JSON objects to `[(Text, Value)]` and then converts to a `HashMap`. It does not parse JSON objects directly into a `HashMap` as I figured it would. For this exploration, I will use the list version of `ObjectEntries` and not include the cost of that list construction since it is done no matter what data type is used for searching.
\[ {cost}_{construction} = 0 \]
In the normal scenario, every property of an object is used exactly once. Looking up the property at position \(i\) costs \(i\) operations under the linear scan, so the lookup cost reduces to the following:
\[ {cost}_{lookup} = \sum_{i=1}^n i = \frac{n (n + 1)}{2} \]
The total cost is as follows:
\[ {cost}_{total} = \frac{n (n + 1)}{2} \]
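For example, an object with 8 expected properties costs an estimated \(\frac{8 \times 9}{2} = 36\) operations.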
Map
The construction cost for a `Map` is that of building it from the entry list:
\[ {cost}_{construction} = n \log_2 n \]
The lookup cost is as follows:
\[ {cost}_{lookup} = \sum_{i=1}^n \log_2 n = n \log_2 n \]
It is interesting to note that the construction cost and lookup cost are approximately the same. The total cost is as follows:
\[ {cost}_{total} = n \log_2 n + n \log_2 n = 2n \log_2 n \]
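To ground the two terms, here is a sketch of the `Map` path; `toMap` and `lookupMap` are hypothetical helper names, and first-key-wins is only one possible duplicate-key strategy:

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Text       (Text)

-- Building the Map from the parsed entry list is the O(n log n)
-- construction cost; fromListWith also lets us pick a duplicate-key
-- strategy (here the first occurrence wins).
toMap :: [(Text, v)] -> Map Text v
toMap = Map.fromListWith (\_new old -> old)

-- Each subsequent lookup is O(log n).
lookupMap :: Text -> Map Text v -> Maybe v
lookupMap = Map.lookup
```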
HashMap
The construction cost for a `HashMap` is that of building it from the entry list. The documentation lists this as \(O(n \log n)\) but notes that the large base makes \(O(\log n)\) operations run in “constant time” in practice, so I will be optimistic and use the following construction cost:
\[ {cost}_{construction} = n \]
The lookup cost for a single property is \(O(\log n)\), which runs in “constant time” in practice, so the lookup cost is as follows:
\[ {cost}_{lookup} = \sum_{i=1}^n 1 = n \]
The total cost is as follows:
\[ {cost}_{total} = n + n = 2n \]
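The corresponding sketch for the `HashMap` path, again with hypothetical helper names:

```haskell
import           Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import           Data.Text           (Text)

-- Building the HashMap from the entry list: effectively O(n) with a
-- well-behaved hash function.  Note that fromList keeps the last
-- occurrence of a duplicate key.
toHashMap :: [(Text, v)] -> HashMap Text v
toHashMap = HashMap.fromList

-- Each subsequent lookup is effectively O(1).
lookupHashMap :: Text -> HashMap Text v -> Maybe v
lookupHashMap = HashMap.lookup
```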
Comparison
Here are some estimated number of operations for various sizes of objects:
| \(n\) | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| List | 1 | 3 | 10 | 36 | 136 | 528 | 2,080 | 8,256 |
| Map | 2 | 4 | 16 | 48 | 128 | 320 | 768 | 1,792 |
| HashMap | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
With these estimations, the `HashMap` is only outperformed on objects with a single property, and the performance difference in this case is negligible. The total cost of using a `HashMap` is linear and greatly outperforms the other options. The entry list outperforms `Map` when there are fewer than 15 properties, but not by much.
In the following graph, created using Desmos, the list line is red, the `Map` line is green, and the `HashMap` line is blue.
These estimations give equal weight to all operations, while in reality some operations are more expensive than others even when they run in constant time. The exact numbers above are pretty meaningless, but the shapes of the lines in the graph help form an intuition about what to expect.
Attack Scenario
Entry List
In the attack scenario, the worst-case is that the expected properties are at the end, so that every lookup must linearly traverse all of the other properties:
\[ {cost}_{lookup} = \sum_{i=1}^n(n+m) = n (n+m) \]
The total cost is as follows:
\[ {cost}_{total} = n (n+m) \]
Map
In the attack scenario, the other properties increase the cost of construction:
\[ {cost}_{construction} = (n+m) \log_2(n+m) \]
The lookup cost is as follows:
\[ {cost}_{lookup} = \sum_{i=1}^n \log_2(n+m) = n \log_2(n+m) \]
The total cost is as follows:
\[ {cost}_{total} = (n+m) \log_2(n+m) + n \log_2(n+m) = (2n+m) \log_2(n+m) \]
HashMap
In the attack scenario, the other properties have hash collisions, resulting in linear insertion when the `HashMap` is constructed:
\[ {cost}_{construction} = n + \sum_{i=1}^m i = n + \frac{m(m+1)}{2} \]
Note that this quadratic expression makes just constructing the `HashMap` sufficient for denial of service. This is demonstrated in the exploit, which just parses the JSON without performing any lookups.
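For example, with the \(m = 100,000\) other properties used in the comparison below, construction alone accounts for roughly \(\frac{100,000 \times 100,001}{2} \approx 5 \times 10^9\) operations.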
Only the expected properties are used. In the worst case, one of the expected properties collides with the other properties:
\[ {cost}_{lookup} = \sum_{i=1}^{n-1} 1 + m + 1 = n - 1 + m + 1 = n + m \]
The total cost is as follows:
\[ {cost}_{total} = n + \frac{m(m+1)}{2} + n + m = 2n + m + \frac{m(m+1)}{2} \]
Comparison
For the comparison, I will use a constant number of other properties in the attack: \(m = 100,000\). This is a bit less than the number of properties used in the exploit (109,100).
Here are some estimated number of operations for various sizes of objects:
| \(n\) | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| List | 100,001 | 200,004 | 400,016 | 800,064 |
| Map | 1,660,999 | 1,661,033 | 1,661,103 | 1,661,241 |
| HashMap | 5,000,150,002 | 5,000,150,004 | 5,000,150,008 | 5,000,150,016 |
With these estimations, the list outperforms `Map` when there are fewer than 17 properties, but `Map` performs much better as the number of properties increases. The `HashMap` has a high creation cost, with a relatively small lookup cost.
In the following graph, created using Desmos, the list line is red, the `Map` line is green, and the `HashMap` line is blue.
In the following graph, created using Desmos, the list line is red and the `Map` line is green. The `HashMap` line is left out to illustrate how the list and `Map` lines compare.
As with the normal scenario, these estimations give equal weight to all operations, while in reality some operations are more expensive than others even when they run in constant time. The exact numbers above are pretty meaningless, but the shapes of the lines in the graph help form an intuition about what to expect. In particular, these estimations illustrate how using a `Map` avoids the quadratic cost of the `HashMap`.