I am writing a graph algorithm at the heart of one of the products we are developing at Simulation Dynamics. Part of the algorithm tracks whether a node or vertex has been observed yet. As the algorithm progresses, it marks each node or vertex it has finished processing so it knows it does not need to continue through that part of the graph.
This algorithm is part of the inner loop of the simulation and needs to be as fast as possible. Speed is a critical feature for us, so essentially anything is allowed if it gives us more speed. The only requirements are that it be written in .NET (preferably F#) and run on a modern x86-64 processor.
These graphs are almost always small, say 20 nodes and 30 vertices. We are modeling manufacturing processes, and the types of facilities that we simulate can be simplified down to these small graphs. This means we don't necessarily need to handle N nodes and M vertices; we need an algorithm that is as fast as possible for these small cases.
The F# Set
I love the F# Set<'T> collection. When I just need to process sets of values and perform comparisons between them, nothing beats the tried and true Set<'T>. For those curious, the F# Set and Map collections are backed by a variation on the AVL tree. We actually have Victor Baybekov to thank for some massive performance improvements he provided in this pull request.
Before I test out various approaches, I create a baseline performance test. I will generate 1,000,000 different indices in the range 0 to 50. For each of these indices, I will check whether it is in the Set or not. If it is not in the Set, I will add it. If it is in the Set, I will remove it. This simulates the type of behavior that the graph algorithm uses while traversing the graph. I will also use the BenchmarkDotNet library to measure my performance.
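Here is a minimal sketch of what that baseline benchmark could look like. The original listing isn't shown here, so the seed, member names, and exact handling of the index range are my own assumptions:

```fsharp
open System
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

type Benchmarks () =

    // 1,000,000 pseudo-random indices in the range 0 to 50 (seed assumed)
    let rng = Random 123
    let indices = Array.init 1_000_000 (fun _ -> rng.Next (0, 51))

    [<Benchmark>]
    member _.FSharpSet () =
        let mutable tracker = Set.empty
        for i in indices do
            // Toggle membership: add if absent, remove if present
            if tracker.Contains i then
                tracker <- tracker.Remove i
            else
                tracker <- tracker.Add i
        tracker

module Program =
    [<EntryPoint>]
    let main _ =
        BenchmarkRunner.Run<Benchmarks> () |> ignore
        0
```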
When I run this benchmark, I get the following result.
[BenchmarkDotNet results table]
We now have a baseline to compare against.
The .NET HashSet
The first thing someone will likely point out is, "Matthew, you are using an immutable collection that you are constantly updating and rebinding. Why not use a mutable collection?" Great question! Let's reach for the next most obvious collection in .NET, the HashSet. HashSet is found in the System.Collections.Generic namespace. Instead of using a tree to keep track of items, it uses a hash table. This should give us constant-time lookup, insertion, and removal on average.
We add a new test to our Benchmarks class…
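A sketch of the added test, under the same assumptions as the baseline (the member name is mine; `indices` is the pre-generated index array):

```fsharp
open System.Collections.Generic

// Added to the same Benchmarks class as the Set baseline
[<Benchmark>]
member _.HashSet () =
    let tracker = HashSet<int> ()
    for i in indices do
        // Toggle membership: add if absent, remove if present
        if tracker.Contains i then
            tracker.Remove i |> ignore
        else
            tracker.Add i |> ignore
    tracker
```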
… and re-run our benchmarks.
[BenchmarkDotNet results table]
We see that the HashSet is giving us a performance boost over the F# Set. This is expected since we don't have to re-arrange a tree when we perform insertions and deletions. This is NOT to say that Set is inferior to HashSet; please don't consider that the takeaway. Set can do many things that HashSet cannot. It just happens that in this instance, HashSet appears to be a better choice. Oh, but we aren't done yet…
Plain Old Arrays
One of my favorite talks on the internet is one by Scott Meyers, where he talks about the criticality of understanding memory hierarchies when writing high-performance code. My favorite moment is when he quotes a developer who writes high-frequency trading algorithms. The developer says, “I don’t care how fancy your data structure is; an array will beat it.”
I can map all nodes and vertices to int values ahead of time for my use case. This means that I can easily track whether I have visited a node or vertex by looking up a bool in an array, using the value for the node or vertex as an index into the array. Let's try that and see what we get. I add a new benchmark to my Benchmarks class to see how this approach performs.
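A sketch of what this benchmark might look like (again, the member name and array size are my assumptions; 64 slots comfortably covers indices in the range 0 to 50):

```fsharp
// Added to the same Benchmarks class
[<Benchmark>]
member _.BoolArray () =
    let tracker : array<bool> = Array.zeroCreate 64
    for i in indices do
        // Toggle the flag: mark if unseen, clear if already seen
        tracker.[i] <- not tracker.[i]
    tracker
```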
We can now see how using an array<bool> performs against a Set and HashSet.
[BenchmarkDotNet results table]
Alright, we see another speed boost. Using just an array, we've cut out the extra cycles the HashSet spends on hashing each lookup. Now we are just taking a pointer to the head of the array and offsetting it to perform the lookup. It's hard to get much faster than this… or is it?
Enter Data-Oriented Design
I’ve been on a quest to become a better developer, and my latest wanderings have brought me to the world of Data-Oriented Design. I love Mike Acton’s talk at CppCon, and I highly recommend it for anyone who hasn’t seen it. He mentions that when he sees a bool in a struct, he almost always knows that something has gone wrong. The reason is that a bool only contains 1 bit of information, but it takes up 8 bits of memory. This means that if you are moving many values containing bools through the cache, you are wasting a large amount of memory bandwidth.
At the time I watched it, I thought, “Okay, I see your point about wasting memory bandwidth, but what if you really do need to know whether something is true or not?” That sat in the back of my mind for a while, and I didn’t see a solution. The other day, though, Jérémie Chassaing gave an excellent talk on writing high-performance F# with no memory allocation. In it, he showed an elegant trick for storing two int32 values in a single int64. You can watch the bit here. He holds one of the int32 values in the bottom 32 bits of the int64 and the other in the top 32 bits. That may seem silly, but it gives the compiler a unique optimization opportunity it wouldn’t have otherwise: it can store both of those int32 values in a single 64-bit register, and registers are the fastest form of memory in a CPU.
Now, why don’t we take that same trick and, instead of storing 2 int32 values in an int64, store 64 bools in an int64? Each bit of the int64 can correspond to a node or vertex in the graph. Instead of passing an array<bool> around, we can pass a single int64. Remember that for this use case, it is incredibly rare to see more than 20 nodes and 30 vertices, so 64 bits is plenty to store whether we have observed them or not.
Let’s create a new type for wrapping our int64 and provide some convenience methods for working with it. We will define it as a struct since that gives the .NET compiler a large number of options for optimizing it.
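Since the original listing isn't shown, here is a sketch of the type as the following description lays it out: a struct record wrapping an int64, with Create, IsSet, Set, and UnSet members built from shifts and bitwise operations.

```fsharp
[<Struct>]
type Int64Tracker =
    {
        Value : int64
    }
    /// A new tracker with all 64 bits set to 0
    static member Create () = { Value = 0L }

    /// Check whether the bit at the given position is 1
    member this.IsSet (position: int) =
        (this.Value &&& (1L <<< position)) <> 0L

    /// Return a new tracker with the bit at the given position set to 1
    member this.Set (position: int) =
        { Value = this.Value ||| (1L <<< position) }

    /// Return a new tracker with the bit at the given position cleared to 0
    member this.UnSet (position: int) =
        { Value = this.Value &&& ~~~(1L <<< position) }
```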
Let’s unpack this. The Value field is used to track whether something has been observed or not. The Create method gives us a new Int64Tracker where all the bits are set to 0. The IsSet method takes a position and checks whether the bit at that position is set. It does this by taking a value of 1L, which puts a 1 in the bottom bit, and left-shifting it to the desired position. It then does a bitwise AND between Value and the bit-shifted value. This returns 0 for every location except possibly the position we are interested in. If there is a 1 in the desired position, it will AND with our shifted 1 and return a non-zero number. If there is a 0 in the position of interest, all the bits will be 0, and the resulting value will equal 0.
The Set method takes a position and sets the bit in that position to 1. It does this by taking a value of 1L, which has a 1 in the bottom bit, left-shifting it to the desired position, and then performing a bitwise OR with Value. If the bit in the given position was already 1, it will still be 1. If it was 0, it will be changed to 1. Here’s a quick video showing the operation.
The final method, UnSet, takes 1L and left-shifts it to the desired position. It then does a bitwise negation. This means that all the bits will be 1 except for the position that we want to turn into a 0. We then AND that mask with Value to get our updated Value. Here’s a video showing how this is done.
We now have a new type that allows us to track up to 64 different items, which is enough for our use case. Let’s add a benchmark.
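A sketch of the final benchmark, following the same toggle pattern as the earlier tests (member name assumed):

```fsharp
// Added to the same Benchmarks class
[<Benchmark>]
member _.Int64TrackerBench () =
    let mutable tracker = Int64Tracker.Create ()
    for i in indices do
        // Toggle the bit for this index: set if clear, clear if set
        if tracker.IsSet i then
            tracker <- tracker.UnSet i
        else
            tracker <- tracker.Set i
    tracker
```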
And we run our benchmarks to see how fast we are now.
[BenchmarkDotNet results table]
Excellent! We’ve shaved off another 20%. This may not seem like much, but let me note something important. The BoolArrayTracker takes an entire cache line to store the information for 64 items. The Int64Tracker only uses 8 bytes out of the 64 that are available in a cache line. This means that in the full algorithm, we can use the remaining 56 bytes to track other information and ensure all of that information is on the same cache line. This may seem silly at the moment but can have a significant impact on overall performance if we experience cache eviction in the middle of our tightest loop. More to come!
If you have critiques or thoughts on how to improve, I’d love to hear them. All the code can be found here. Feel free to make suggestions or pull requests.
If you liked this post and want to hear more about writing high-performance F# and Mathematical Programming, please subscribe to my list!