Aeson Object Design (Part 2)
I posted a brief overview of my entry list design to the aeson issue, and Brendan Hay commented:
For my own codebases I’m not convinced that `(.:)` introducing a linear scan by way of `Vector.find` would even be slower, at least, as the deserialisation routines (for say, a record) never perform repeated lookup of the same object key. As mentioned in the `Z-Data` code I linked above you’re probably paying more to materialise a `{Hash,}Map` than what the linear scan for a single key introduces? It would naturally require actual benchmarks to prove that I am not, in fact, a moron for claiming this.
The `Z-Data` comment is the following (with a typo fixed to avoid triggering my typo checker):
The `Object`’s payload is a key-value vector instead of a map, which parsed directly from JSON document. This design choice has following advantages:
- Allow different strategies handling duplicated keys.
- Allow different `Map` type to do further parsing, e.g. `Z.Data.Vector.FlatMap`
- Roundtrip without touching the original key-value order.
- Save time if constructing map is not necessary, e.g. using a linear scan to find a key if only that key is needed.
This is an interesting idea that I would like to explore. In this post I explore it using mathematics, but the calculations are only rough estimates meant to provide some intuition about the costs. It would indeed be wise to confirm that intuition with benchmarks.
I would like to consider two scenarios:
- In the normal scenario, every property of an object is used exactly once.
- In the attack scenario, expected properties are used exactly once while additional properties with hash collisions are never used.
I will use the following variables:
- \(n\) is the number of expected properties
- \(m\) is the number of additional properties in the attack scenario
In general, the total cost is as follows:
\[ {cost}_{total} = {cost}_{construction} + {cost}_{lookup} \]
The lookup cost is the sum of the lookup costs for each expected property:
\[ {cost}_{lookup} = \sum_{i=1}^n {cost}_i \]
Normal Scenario
Entry List
When I first worked on the entry list design, I used a list instead of a `Vector`:
type ObjectEntries = [(Text, Value)]
I then found Brendan’s first comment and realized that a `Vector` might be better, so I changed the type definition to the following:
type ObjectEntries = Vector (Text, Value)
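To make the linear-scan idea concrete, here is a minimal sketch of a key lookup over both representations. The simplified `Value` type and the `lookupList`/`lookupVector` helpers are names of my own for illustration, not part of aeson:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Text   (Text)
import           Data.Vector (Vector)
import qualified Data.Vector as Vector

-- Simplified stand-in for aeson's Value, just for illustration.
data Value = Null | Bool Bool | String Text
  deriving (Eq, Show)

-- Linear-scan lookup over the list representation; Prelude's lookup
-- already does exactly this.
lookupList :: Text -> [(Text, Value)] -> Maybe Value
lookupList = lookup

-- Linear-scan lookup over the Vector representation, in the spirit of
-- Brendan's Vector.find suggestion.
lookupVector :: Text -> Vector (Text, Value) -> Maybe Value
lookupVector key = fmap snd . Vector.find ((== key) . fst)

main :: IO ()
main = do
  let entries = [("name", String "aeson"), ("compact", Bool True)]
  print (lookupList "compact" entries)                      -- Just (Bool True)
  print (lookupVector "missing" (Vector.fromList entries))  -- Nothing
```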
I am not sure which version would be best, and I would like to see some benchmarks of both performance and memory usage. Looking at the Aeson source, I found that it parses JSON objects to `[(Text, Value)]` and then converts to a `HashMap`. It does not parse JSON objects directly into a `HashMap` as I figured it would. For this exploration, I will use the list version of `ObjectEntries` and not include the cost of that list construction since it is done no matter what data type is used for searching.
\[ {cost}_{construction} = 0 \]
In the normal scenario, every property of an object is used exactly once. Looking up the property at position \(i\) costs \(i\) operations under the linear scan, so the lookup cost reduces to the following:
\[ {cost}_{lookup} = \sum_{i=1}^n i = \frac{n (n + 1)}{2} \]
The total cost is as follows:
\[ {cost}_{total} = \frac{n (n + 1)}{2} \]
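For example, an object with 8 expected properties costs an estimated \(\frac{8 \times 9}{2} = 36\) operations.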
Map
The construction cost for a `Map` is that of building it from the entry list:
\[ {cost}_{construction} = n \log_2 n \]
The lookup cost is as follows:
\[ {cost}_{lookup} = \sum_{i=1}^n \log_2 n = n \log_2 n \]
It is interesting to note that the construction cost and lookup cost are approximately the same. The total cost is as follows:
\[ {cost}_{total} = n \log_2 n + n \log_2 n = 2n \log_2 n \]
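To ground the two terms, here is a sketch of the `Map` path; `toMap` and `lookupMap` are hypothetical helper names, and first-key-wins is only one possible duplicate-key strategy:

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Text       (Text)

-- Building the Map from the parsed entry list is the O(n log n)
-- construction cost; fromListWith also lets us pick a duplicate-key
-- strategy (here the first occurrence wins).
toMap :: [(Text, v)] -> Map Text v
toMap = Map.fromListWith (\_new old -> old)

-- Each subsequent lookup is O(log n).
lookupMap :: Text -> Map Text v -> Maybe v
lookupMap = Map.lookup
```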
HashMap
The construction cost for a `HashMap` is that of building it from the entry list. The documentation lists this as \(O(n \log n)\) but notes that the large base makes \(O(\log n)\) operations run in “constant time” in practice, so I will be optimistic and use the following construction cost:
\[ {cost}_{construction} = n \]
The lookup cost for a single property is \(O(\log n)\), which runs in “constant time” in practice, so the lookup cost is as follows:
\[ {cost}_{lookup} = \sum_{i=1}^n 1 = n \]
The total cost is as follows:
\[ {cost}_{total} = n + n = 2n \]
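The corresponding sketch for the `HashMap` path, again with hypothetical helper names:

```haskell
import           Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import           Data.Text           (Text)

-- Building the HashMap from the entry list: effectively O(n) with a
-- well-behaved hash function.  Note that fromList keeps the last
-- occurrence of a duplicate key.
toHashMap :: [(Text, v)] -> HashMap Text v
toHashMap = HashMap.fromList

-- Each subsequent lookup is effectively O(1).
lookupHashMap :: Text -> HashMap Text v -> Maybe v
lookupHashMap = HashMap.lookup
```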
Comparison
Here are some estimated number of operations for various sizes of objects:
| \(n\) | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| List | 1 | 3 | 10 | 36 | 136 | 528 | 2,080 | 8,256 |
| Map | 2 | 4 | 16 | 48 | 128 | 320 | 768 | 1,792 |
| HashMap | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
With these estimations, the `HashMap` is only outperformed on objects with a single property, and the performance difference in this case is negligible. The total cost of using a `HashMap` is linear and greatly outperforms the other options. The entry list outperforms `Map` when there are fewer than 15 properties, but not by much.
In the following graph, created using Desmos, the list line is red, the `Map` line is green, and the `HashMap` line is blue.
These estimations give equal weight to all operations, while in reality some operations are more expensive than others even when they run in constant time. The exact numbers above are pretty meaningless, but the shapes of the lines in the graph help form an intuition about what to expect.
Attack Scenario
Entry List
In the attack scenario, the worst-case is that the expected properties are at the end, so that every lookup must linearly traverse all of the other properties:
\[ {cost}_{lookup} = \sum_{i=1}^n(n+m) = n (n+m) \]
The total cost is as follows:
\[ {cost}_{total} = n (n+m) \]
Map
In the attack scenario, the other properties increase the cost of construction:
\[ {cost}_{construction} = (n+m) \log_2(n+m) \]
The lookup cost is as follows:
\[ {cost}_{lookup} = \sum_{i=1}^n \log_2(n+m) = n \log_2(n+m) \]
The total cost is as follows:
\[ {cost}_{total} = (n+m) \log_2(n+m) + n \log_2(n+m) = (2n+m) \log_2(n+m) \]
HashMap
In the attack scenario, the other properties have hash collisions, resulting in linear insertion when the `HashMap` is constructed:
\[ {cost}_{construction} = n + \sum_{i=1}^m i = n + \frac{m(m+1)}{2} \]
Note that this quadratic expression makes just constructing the `HashMap` sufficient for denial of service. This is demonstrated in the exploit, which just parses the JSON without performing any lookups.
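For example, with the \(m = 100,000\) other properties used in the comparison below, construction alone accounts for roughly \(\frac{100,000 \times 100,001}{2} \approx 5 \times 10^9\) operations.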
Only the expected properties are used. In the worst case, one of the expected properties collides with the other properties:
\[ {cost}_{lookup} = \sum_{i=1}^{n-1} 1 + m + 1 = n - 1 + m + 1 = n + m \]
The total cost is as follows:
\[ {cost}_{total} = n + \frac{m(m+1)}{2} + n + m = 2n + m + \frac{m(m+1)}{2} \]
Comparison
For the comparison, I will use a constant number of other properties in the attack: \(m = 100,000\). This is a bit less than the number of properties used in the exploit (109,100).
Here are some estimated number of operations for various sizes of objects:
| \(n\) | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| List | 100,001 | 200,004 | 400,016 | 800,064 |
| Map | 1,660,999 | 1,661,033 | 1,661,103 | 1,661,241 |
| HashMap | 5,000,150,002 | 5,000,150,004 | 5,000,150,008 | 5,000,150,016 |
With these estimations, the list outperforms `Map` when there are fewer than 17 properties, but `Map` performs much better as the number of properties increases. The `HashMap` has a high creation cost, with a relatively small lookup cost.
In the following graph, created using Desmos, the list line is red, the `Map` line is green, and the `HashMap` line is blue.
In the following graph, created using Desmos, the list line is red and the `Map` line is green. The `HashMap` line is left out to illustrate how the list and `Map` lines compare.
As with the normal scenario, these estimations give equal weight to all operations, while in reality some operations are more expensive than others even when they run in constant time. The exact numbers above are pretty meaningless, but the shapes of the lines in the graph help form an intuition about what to expect. In particular, these estimations illustrate how using a `Map` avoids the quadratic cost of the `HashMap`.