Part of my work is writing algorithms to analyze networks of nodes representing manufacturing systems. Each node can be one of four different types: Buffer, Constraint, Merge, and Split. A manufacturing system that we would want to simulate is typically made up of no more than 100 of these nodes. A natural way to encode these types is a Discriminated Union (DU). I also use Units of Measure to annotate the integers that serve as ids for these manufacturing nodes.
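Here is a minimal sketch of what that could look like (the `NodeId` measure name is illustrative):

```fsharp
// Unit of measure for tagging node ids (the NodeId name is illustrative)
[<Measure>] type NodeId

// Default (reference) encoding: each case is a heap-allocated object
type Node =
    | Buffer of int<NodeId>
    | Constraint of int<NodeId>
    | Merge of int<NodeId>
    | Split of int<NodeId>
```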
The default encoding for a DU in F# is as a reference type. This means that if you have a `Node array`, each array element will be a pointer to where the `Node` itself is stored in memory. If you need to quickly look up Nodes and check their types, you can run into a phenomenon known as Pointer Chasing. Pointer Chasing is when the CPU is trying to run your code but constantly has to look up new regions of memory because the data is spread out. As .NET developers, we tend not to think about this much, but it can become a severe problem in performance-sensitive code.
Fortunately, F# allows us to encode DUs as structs using the `[<Struct>]` attribute. Here is what that looks like:
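A sketch of the struct version follows. Note that F# requires every field of a struct DU to carry a unique name across all the cases; the `StructNode` name is used here only so the two encodings can live side by side:

```fsharp
// Struct encoding: the case tag and payload are stored inline in the array.
// Struct DUs require a unique field name for every case.
[<Struct>]
type StructNode =
    | Buffer of bufferId: int<NodeId>
    | Constraint of constraintId: int<NodeId>
    | Merge of mergeId: int<NodeId>
    | Split of splitId: int<NodeId>
```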
Now, if you have a `Node array`, the data for the node will be stored in the array itself, so you can eliminate the additional lookup. There is a serious downside to this, though. The F# compiler will allocate space for each possible case of the DU instead of only the space necessary for the case actually instantiated. This means that instead of taking up the space of just two `int`s (one to encode the case and one for the value), this struct `Node` will take up one `int` to encode the case and four more `int`s, one for each possible case. For a deeper explanation of this, I refer you to this excellent post by Bartosz Sypytkowski. If a DU has too many cases, the benefits of the struct layout will quickly be negated by this padding.
Alternative Encoding
I was curious whether there was another way to approach my problem. I like the elegance of the `match ... with` syntax in F#, and I am loath to give it up. Since my `Node` type is just encoding different possible `int` values, why not do some bit hacking? Now, I will be the first to say this is non-traditional F# code, but I’m curious, so why not perform the experiment?
I’ll define a new version of `Node` that will use a single `int` to hold both which case it represents and the value. I will use the last 4 bits of the `Value` field to encode which case the `Node` represents, and the top 28 bits will hold the id value. This cuts off some of the range that an `int` can express, but since our networks are never more than 1,000 nodes, there is no practical loss of modeling space.
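A sketch of this encoding; the `IntNode` name and the `Create` helper are illustrative:

```fsharp
// Bit-packed encoding: one int holds both the case code and the id
[<Struct>]
type IntNode =
    {
        Value : int
    }
    // Case codes stored in the low nibble of Value
    static member BufferIdCode     = 0
    static member ConstraintIdCode = 1
    static member MergeIdCode      = 2
    static member SplitIdCode      = 3
    // Pack the id into the top 28 bits and the case code into the low 4 bits
    static member Create (caseCode: int, id: int<NodeId>) =
        { Value = (int id <<< 4) ||| caseCode }
```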
The static members `BufferIdCode`, `ConstraintIdCode`, `MergeIdCode`, and `SplitIdCode` will be the values I use to encode the Node cases. To still use the `match ... with` syntax, I will need to define an Active Pattern for unpacking the case. Active Patterns are an elegant feature of F# for decomposing data into different forms. In this case, I will take the `int` held in my `Node` type, check which case it is, and then return the corresponding id.
I use a mask and a bitwise AND operation to get the last 4 bits (also known as a nibble) of the `Value` field, which gives me an `int`. I compare that `int` with the possible code values to figure out which type of node it is. I then bit shift the `Value` field to the right 4 bits to recover the id value it stores and multiply the result by the appropriate Unit of Measure to ensure type safety.
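A sketch of such an Active Pattern, assuming the `IntNode` type above. A multi-case pattern like this compiles down to a heap-allocated `Choice` value, which is where the extra allocations discussed below come from:

```fsharp
// Total Active Pattern: unpack the case code and the id from an IntNode
let (|Buffer|Constraint|Merge|Split|) (node: IntNode) =
    let code = node.Value &&& 0xF               // low nibble = case code
    let id   = (node.Value >>> 4) * 1<NodeId>   // top 28 bits = id, re-tagged with the measure
    if   code = IntNode.BufferIdCode     then Buffer id
    elif code = IntNode.ConstraintIdCode then Constraint id
    elif code = IntNode.MergeIdCode      then Merge id
    elif code = IntNode.SplitIdCode      then Split id
    else failwith "Unknown node case code"
```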
There is a downside to this technique, though. Active Patterns will cause additional allocations and trigger additional Garbage Collection (GC). Fortunately, the F# team recognized this and made it possible for a Partial Active Pattern to return a struct `ValueOption` instead. The syntax is a little odd due to a limitation in the compiler, but it works for our purposes.
Here is the equivalent approach using Partial Active Patterns that return a `ValueOption` to reduce GC pressure. Compared to our first Active Pattern, the downside is that we have to define a separate pattern for each case we want to match against.
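A sketch of the struct-returning patterns; the `[<return: Struct>]` attribute is the odd syntax that tells the compiler to return a `ValueOption` rather than a heap-allocated `Option`:

```fsharp
// Struct-returning Partial Active Patterns: one per case, no allocation
[<return: Struct>]
let (|BufferId|_|) (node: IntNode) =
    if node.Value &&& 0xF = IntNode.BufferIdCode then
        ValueSome ((node.Value >>> 4) * 1<NodeId>)
    else
        ValueNone

[<return: Struct>]
let (|ConstraintId|_|) (node: IntNode) =
    if node.Value &&& 0xF = IntNode.ConstraintIdCode then
        ValueSome ((node.Value >>> 4) * 1<NodeId>)
    else
        ValueNone

[<return: Struct>]
let (|MergeId|_|) (node: IntNode) =
    if node.Value &&& 0xF = IntNode.MergeIdCode then
        ValueSome ((node.Value >>> 4) * 1<NodeId>)
    else
        ValueNone

[<return: Struct>]
let (|SplitId|_|) (node: IntNode) =
    if node.Value &&& 0xF = IntNode.SplitIdCode then
        ValueSome ((node.Value >>> 4) * 1<NodeId>)
    else
        ValueNone
```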
Benchmark Setup
We will now set up two types of tests. The first set of tests will randomly jump around a `Node array`, check the Node type, and perform different work based on the case. This is most similar to the workloads I encounter when writing algorithms for simulating manufacturing systems. Out of curiosity, I will also include tests that perform a linear traversal of a `Node array` while doing the same work as the random-access tests. This should illustrate the difference in performance between a predictable access pattern and a random one. The branch predictor in the CPU will have a more challenging time with the random access, and we expect it to be slower.
To see the impact of the Active Pattern on memory allocation and GC, we will include the `[<MemoryDiagnoser>]` attribute on a `Benchmarks` class that holds our tests. This will tell BenchmarkDotNet to monitor how much allocation is occurring. We should see the Active Pattern approach incur more GC activity. We also include the `BranchMispredictions` and `CacheMisses` hardware counters to see how well the CPU can optimize our code. The ideal code has 0 Branch Mispredictions. Whenever we mispredict a branch, we can lose 20 to 30 cycles' worth of work, depending on the CPU. Cache Misses occur when our data is not in the cache, and the CPU has to go out to memory to retrieve it. The CPU will do its best to predict what data it needs and fetch it ahead of time. When it guesses wrong, we can incur a severe performance penalty.
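A sketch of that setup, assuming the types and patterns defined above (the seed and the exact `lookupCount` and `loopCount` values are illustrative guesses):

```fsharp
open System
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Diagnosers

[<MemoryDiagnoser>]
[<HardwareCounters(HardwareCounter.BranchMispredictions,
                   HardwareCounter.CacheMisses)>]
type Benchmarks () =
    // Fixed seed so every encoding sees the same data
    let rng = Random 123
    let nodeCount = 100
    let lookupCount = 100
    let loopCount = 100_000

    // The same random cases and ids, stored in each of the three encodings
    let caseCodes = [| for _ in 1 .. nodeCount -> rng.Next 4 |]

    let nodes =
        caseCodes
        |> Array.mapi (fun i code ->
            let id = i * 1<NodeId>
            match code with
            | 0 -> Node.Buffer id
            | 1 -> Node.Constraint id
            | 2 -> Node.Merge id
            | _ -> Node.Split id)

    let structNodes =
        caseCodes
        |> Array.mapi (fun i code ->
            let id = i * 1<NodeId>
            match code with
            | 0 -> StructNode.Buffer id
            | 1 -> StructNode.Constraint id
            | 2 -> StructNode.Merge id
            | _ -> StructNode.Split id)

    let intNodes =
        caseCodes
        |> Array.mapi (fun i code -> IntNode.Create (code, i * 1<NodeId>))

    // Pre-computed random indices so every test uses the same lookup order
    let lookupIndexes = [| for _ in 1 .. lookupCount -> rng.Next nodeCount |]
```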
At the beginning of the `Benchmarks` class, we declare a random number generator `rng`, which we will use to produce random nodes and indices to look up. We have a `nodeCount` of 100, which is the number of nodes we will generate. The `lookupCount` is the number of node lookups we will perform during each test loop. The `loopCount` is the number of loops we will perform in each test. `nodes` is an array of randomly generated nodes for our test.
Our first test performs random lookups in the `nodes` array. It matches against the case of the DU and then adds a different amount to an accumulator based on the case. This is to simulate some amount of work being done for each node that was looked up.
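A sketch of that benchmark:

```fsharp
[<Benchmark>]
member _.DuEncodingRandomAccess () =
    let mutable acc = 0
    for _ in 1 .. loopCount do
        for idx in lookupIndexes do
            // Different work per case to simulate case-dependent processing
            match nodes.[idx] with
            | Node.Buffer _     -> acc <- acc + 1
            | Node.Constraint _ -> acc <- acc + 2
            | Node.Merge _      -> acc <- acc + 3
            | Node.Split _      -> acc <- acc + 4
    acc
```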
The second test does the same work, but the `Node` type is the struct representation instead of the reference-based one.
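A sketch, identical to the previous benchmark except for the array it reads:

```fsharp
[<Benchmark>]
member _.StructDuEncodingRandomAccess () =
    let mutable acc = 0
    for _ in 1 .. loopCount do
        for idx in lookupIndexes do
            match structNodes.[idx] with
            | StructNode.Buffer _     -> acc <- acc + 1
            | StructNode.Constraint _ -> acc <- acc + 2
            | StructNode.Merge _      -> acc <- acc + 3
            | StructNode.Split _      -> acc <- acc + 4
    acc
```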
The third and fourth tests also perform the same work but with the Active Pattern and Partial Active Pattern approaches. Remember, the Active Pattern allocates objects, while the Partial Active Pattern returns a struct and therefore adds no GC overhead.
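Sketches of both, assuming the patterns defined earlier (note the catch-all case the partial patterns require):

```fsharp
[<Benchmark>]
member _.IntEncodingWithActivePatternRandomAccess () =
    let mutable acc = 0
    for _ in 1 .. loopCount do
        for idx in lookupIndexes do
            // The total Active Pattern allocates a Choice on every match
            match intNodes.[idx] with
            | Buffer _     -> acc <- acc + 1
            | Constraint _ -> acc <- acc + 2
            | Merge _      -> acc <- acc + 3
            | Split _      -> acc <- acc + 4
    acc

[<Benchmark>]
member _.IntEncodingWithPartialActivePatternRandomAccess () =
    let mutable acc = 0
    for _ in 1 .. loopCount do
        for idx in lookupIndexes do
            // The struct-returning partial patterns do not allocate,
            // but they are checked one at a time and need a catch-all
            match intNodes.[idx] with
            | BufferId _     -> acc <- acc + 1
            | ConstraintId _ -> acc <- acc + 2
            | MergeId _      -> acc <- acc + 3
            | SplitId _      -> acc <- acc + 4
            | _ -> ()
    acc
```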
I also create four additional tests which perform the same amount of work as the above four, but iterate through the arrays in order instead of randomly jumping around. This will show us the performance difference between random versus sequential access.
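For completeness, here is a sketch of one of the linear versions; the only change is iterating the array in order:

```fsharp
[<Benchmark>]
member _.DuEncodingLinearAccess () =
    let mutable acc = 0
    for _ in 1 .. loopCount do
        // Sequential traversal: predictable for the branch predictor and prefetcher
        for node in nodes do
            match node with
            | Node.Buffer _     -> acc <- acc + 1
            | Node.Constraint _ -> acc <- acc + 2
            | Node.Merge _      -> acc <- acc + 3
            | Node.Split _      -> acc <- acc + 4
    acc
```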
Results
Note: Since I am measuring hardware counters, I have to run the terminal as admin; otherwise, I won’t have access to the data. If you want to test this yourself, you need to do the same.
When I run the tests, I get the following table:
| Method | Mean | Error | StdDev | BranchMispredictions/Op | CacheMisses/Op | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|
| DuEncodingRandomAccess | 83.09 ms | 1.122 ms | 0.995 ms | 7,414,534 | 400,225 | - | 175 B |
| StructDuEncodingRandomAccess | 83.56 ms | 0.766 ms | 0.640 ms | 7,415,626 | 409,418 | - | 175 B |
| IntEncodingWithActivePatternRandomAccess | 134.88 ms | 2.650 ms | 4.845 ms | 8,126,171 | 1,070,592 | 28500.0000 | 240,000,358 B |
| IntEncodingWithPartialActivePatternRandomAccess | 86.43 ms | 0.841 ms | 0.787 ms | 7,995,620 | 406,096 | - | 191 B |
| DuEncodingLinearAccess | 14.86 ms | 0.090 ms | 0.084 ms | 5,073 | 7,701 | - | 22 B |
| StructDuEncodingLinearAccess | 17.35 ms | 0.151 ms | 0.142 ms | 119,799 | 13,508 | - | 36 B |
| IntEncodingWithActivePatternLinearAccess | 74.67 ms | 1.018 ms | 0.903 ms | 167,078 | 677,829 | 28571.4286 | 240,000,191 B |
| IntEncodingWithPartialActivePatternLinearAccess | 22.83 ms | 0.372 ms | 0.348 ms | 8,151 | 33,225 | - | 45 B |
We see that the normal reference-based DU encoding gives us the best performance for both access patterns. This honestly surprised me. I would have thought that the int encoding would have yielded better results. There is some serious voodoo going on in the F# compiler. I wanted to dig into this more, but my work demands that I cut myself off here. I would like to get this info out for others to look at, since I have not been able to find other blog posts that deal with this.
I welcome feedback and critique. You can find the code here. Let me know if I missed something obvious. Eventually, I hope to have the time to dig deeper into this. In the meantime, I plan to stick with the default DU until I can figure out whether I can make something faster for my use case. Let me know if you have ideas for going faster or other things I should test. I can be found on Twitter @McCrews, or you can email matthew@crews.email.