This article is for the 2022 F# Advent Calendar
Full code for article found here
While building the Aidos engine for Discrete Rate Simulation, I often have to build custom collections to meet the project’s performance requirements. Don’t get me wrong, many of the collections built into .NET are great, but they are general-purpose. General-purpose collections must meet the requirements of many use cases. I have a single use case in mind for my work, and performance is one of the critical features. If our engine isn’t orders of magnitude faster than the competition, we don’t have a compelling product.
In Aidos, I often need to track items that have changed during a time step of the simulation. I also track entities with an int that has been annotated with a Unit of Measure. This means an entity ends up being an int<'EntityType>. I also cannot have duplicates for my use case, so I need to maintain a distinct set of entities that have changed. One way to do this would be to use a HashSet. HashSet is built into .NET and provides O(1) insertion, which you would think would be ideal for this use case. The downside to a HashSet is that its memory will be allocated on the heap. If you have to create a HashSet for every iteration of a hot loop, this can cause excess GC pressure.
Instead, what I use is a custom BitSet. The BitSet is a struct that wraps an array of uint64, which acts as a bit array that I manually manage. The .NET runtime has a BitArray class, but it does not provide the API I need for my use cases: I need to iterate through all the set bits and call a function with the index of each set bit as an argument.
One of the advantages of the BitSet approach over HashSet is that the array used by the BitSet can be allocated from an ArrayPool, which means that BitSet will never increase GC pressure or take up room on the heap. The other is that it is faster than HashSet for iterating through the set bits using the iter function.
Here is what the definition of the BitSet type looks like.
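The sketch below captures the idea; the field name and the Add logic are illustrative, and the real type pools its array and tracks capacity:

```fsharp
// Minimal sketch of the idea; field name and Add logic are illustrative.
[<Struct>]
type BitSet<[<Measure>] 'Measure> =
    {
        // Each uint64 is a bucket of 64 flags; bit i of bucket b
        // represents the entity with value b * 64 + i.
        Buckets : uint64[]
    }

    member b.Add (item: int<'Measure>) =
        let i = int item
        let bucket = i >>> 6        // i / 64
        let offset = i &&& 0x3F     // i % 64
        b.Buckets.[bucket] <- b.Buckets.[bucket] ||| (1UL <<< offset)
```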
We also have a BitSet module where we define the functions for operating on a BitSet. Here I show just the iter function. iter loops through each set bit in the array and calls the lambda f with the index of the set bit as the argument.
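A sketch of what iter can look like; the published version may use bit intrinsics, but the shape is the same:

```fsharp
module BitSet =

    /// Calls f with the index of every set bit in the BitSet.
    let iter (f: int<'Measure> -> unit) (b: BitSet<'Measure>) =
        for bucketIdx in 0 .. b.Buckets.Length - 1 do
            let mutable bucket = b.Buckets.[bucketIdx]
            let mutable offset = 0
            while bucket <> 0UL do
                // test the lowest remaining bit, then shift it away
                if bucket &&& 1UL = 1UL then
                    let index = bucketIdx * 64 + offset
                    f (LanguagePrimitives.Int32WithMeasure<'Measure> index)
                bucket <- bucket >>> 1
                offset <- offset + 1
```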
I set up a benchmark to compare the performance of HashSet and BitSet. I define a unit of measure, Chicken, which I will use as the entity type to annotate my int values. I add 10 int<Chicken> values in a range up to 99<Chicken> to both of these collections. This range of values is a good representation of the use case that I am concerned with. I then write a benchmark measuring the time it takes to iterate through the values in each collection. When I run the benchmark using BenchmarkDotNet, I get the following result:
We see that the iter function for BitSet is approximately 3x faster than HashSet for iterating through the values.
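The benchmark setup can be sketched roughly as follows. BitSet.create and BitSet.add are hypothetical stand-ins for however the real collection is constructed, and the values are illustrative:

```fsharp
open System.Collections.Generic
open BenchmarkDotNet.Attributes

[<Measure>] type Chicken

type IterationBenchmarks () =
    // Ten distinct illustrative values in the benchmarked range.
    let values : int<Chicken>[] =
        [| 3; 11; 24; 30; 42; 57; 61; 78; 85; 99 |]
        |> Array.map LanguagePrimitives.Int32WithMeasure<Chicken>

    let hashSet = HashSet values
    let bitSet =
        // Hypothetical construction API for the sketch.
        let b = BitSet.create 100
        values |> Array.iter (fun v -> BitSet.add v b)
        b

    [<Benchmark(Baseline = true)>]
    member _.HashSet () =
        let mutable acc = 0<Chicken>
        for v in hashSet do
            acc <- acc + v
        acc

    [<Benchmark>]
    member _.BitSetIter () =
        // A mutable local cannot be captured by a lambda, so use a ref cell.
        let acc = ref 0<Chicken>
        bitSet |> BitSet.iter (fun v -> acc.Value <- acc.Value + v)
        acc.Value
```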
Now, you may be thinking that BitSet is great, but there is a downside to this approach. The iter function takes a lambda as one of its arguments. Whenever the BitSet encounters a set bit, it then calls the lambda with the index of the set bit. Lambdas are intrinsic to programming in F#, but they have limitations. One of those limitations is that they cannot capture ByRefLike types. Most of the time, this is not a big deal. F# developers are not often known as hardcore performance programmers, so most F# developers will not run into this problem.
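To make the limitation concrete, here is an illustrative function (sumFromSpan is not from the article) that tries to use iter against a span:

```fsharp
// Illustrative: this does NOT compile. The lambda passed to iter would
// have to capture 'data', but byref-like types such as Span<'T> cannot
// be captured by a closure.
let sumFromSpan (bitSet: BitSet<Chicken>) (data: System.Span<int>) =
    let total = ref 0
    bitSet
    |> BitSet.iter (fun chicken ->
        // error: 'data' is a byref-like value captured by this lambda
        total.Value <- total.Value + data.[int chicken])
    total.Value
```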
I, on the other hand, work with ByRefLike types all the time. They can be incredibly powerful for increasing your program's performance and decreasing memory allocations. Now, a word of caution. You probably don't need this. You can lead a very happy life as an F# developer and never worry about this. This limitation only becomes an issue when you are trying to eke out every bit of performance you can, and you are likely not in that scenario. I happen to be in an odd situation because I work for a company with an F# dev team, and I'm tasked with writing libraries for others to use that must be fast. For strategic reasons, we constrain ourselves to F#, so calling out to C/C++/Rust is not an option. You will appreciate what we cover next if you find yourself in a similar situation.
I have asked about relaxing some of the compiler restrictions around lambdas capturing Span<'T>, but the effort would be large. The more I dug into how the F# compiler and the CLR interact, the more my appreciation for the complexity of the problem grew. This is also not the most important feature for the growth of F#, so I'm not going to push for it. I hope to get good enough to contribute it to the F# compiler someday 😊.
So how do we get around this limitation? BitSet is intended for these hot loops where we likely want to be able to use some stack-allocated memory. This means we must be able to work with Span<'T> at the same time. The simple solution is to expose a new way of iterating through the set bits in the BitSet. We can implement IEnumerable<'T> for BitSet and use a for...in...do loop.
The easiest way to implement IEnumerable<'T> for BitSet is to define a BitSetEnumerator, which takes the logic used in the iter function but exposes it in the way the IEnumerable<'T> interface expects. Let's see what that looks like:
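The listing below is a sketch of the shape rather than the published code; the walk logic mirrors iter. Note that MoveNext calls itself to advance to the next bucket, which will matter later:

```fsharp
open System.Collections
open System.Collections.Generic
open System.Numerics

// Sketch of a reference-type enumerator over the set bits.
type BitSetEnumerator<[<Measure>] 'Measure> (buckets: uint64[]) =
    let mutable bucketIdx = 0
    let mutable curBucket = if buckets.Length > 0 then buckets.[0] else 0UL
    let mutable curIdx = -1

    member _.Current : int<'Measure> =
        LanguagePrimitives.Int32WithMeasure<'Measure> curIdx

    member this.MoveNext () =
        if curBucket <> 0UL then
            // isolate and clear the lowest set bit of the current bucket
            let offset = BitOperations.TrailingZeroCount curBucket
            curIdx <- bucketIdx * 64 + offset
            curBucket <- curBucket ^^^ (1UL <<< offset)
            true
        elif bucketIdx + 1 < buckets.Length then
            // advance to the next bucket and try again (note the recursion)
            bucketIdx <- bucketIdx + 1
            curBucket <- buckets.[bucketIdx]
            this.MoveNext ()
        else
            false

    member _.Reset () =
        bucketIdx <- 0
        curBucket <- if buckets.Length > 0 then buckets.[0] else 0UL
        curIdx <- -1

    interface IEnumerator<int<'Measure>> with
        member this.Current = this.Current
    interface IEnumerator with
        member this.Current = box this.Current
        member this.MoveNext () = this.MoveNext ()
        member this.Reset () = this.Reset ()
    interface System.IDisposable with
        member _.Dispose () = ()
```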
The BitSetEnumerator type defines three members for fulfilling the enumerator contract: MoveNext, Current, and Reset. You can see how the BitSetEnumerator fulfills the IEnumerator<'T> interface at the bottom. The type uses the same bit-shifting logic iter uses but breaks it up to support the methods that IEnumerator<'T> requires.
We can then have the BitSet collection implement the IEnumerable<'T> interface by returning an instance of the BitSetEnumerator when calling the GetEnumerator method.
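Assuming BitSet is a struct record wrapping a uint64 array, as described above, the implementation can look like this sketch:

```fsharp
open System.Collections
open System.Collections.Generic

[<Struct>]
type BitSet<[<Measure>] 'Measure> =
    { Buckets : uint64[] }

    interface IEnumerable<int<'Measure>> with
        member b.GetEnumerator () =
            new BitSetEnumerator<'Measure> (b.Buckets)
            :> IEnumerator<int<'Measure>>

    interface IEnumerable with
        member b.GetEnumerator () =
            new BitSetEnumerator<'Measure> (b.Buckets)
            :> IEnumerator
```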
When we add this approach to the benchmarks, we get the following result:
The IEnumerable<'T> approach is twice as slow as using a HashSet. This is less than ideal. It is also allocating on the heap, because the interface necessitates the creation of an object on the heap. We've negated most, if not all, of the benefits we hoped to get from BitSet. What can we do?
Ducks All The Way Down
There's a feature of .NET that I don't hear about much but is important in this scenario. .NET will use duck-typing to implement C# foreach loops and their equivalents. The compiler will look at the type and see if it has a GetEnumerator method that returns a type with a Current property and a MoveNext method. Well, the for...in...do loop in F# is the equivalent of the C# foreach loop.
What if, instead of implementing IEnumerable<'T>, we rely on the .NET duck-typing approach? We can change our enumerator to be a struct so that it doesn't allocate any memory on the heap, and we'll avoid the overhead of an interface. Here's what the BitSetEnumerator looks like as a struct with only the necessary pieces for duck-typing.
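A sketch of the struct version, with illustrative field names; it keeps only Current and MoveNext, which is all the duck-typed loop needs:

```fsharp
open System.Numerics

[<Struct>]
type BitSetEnumerator<[<Measure>] 'Measure> =
    val mutable BucketIdx : int
    val mutable CurBucket : uint64
    val mutable CurIdx : int
    val Buckets : uint64[]

    new (buckets: uint64[]) =
        { Buckets = buckets
          BucketIdx = 0
          CurBucket = (if buckets.Length > 0 then buckets.[0] else 0UL)
          CurIdx = -1 }

    member this.Current : int<'Measure> =
        LanguagePrimitives.Int32WithMeasure<'Measure> this.CurIdx

    member this.MoveNext () =
        if this.CurBucket <> 0UL then
            let offset = BitOperations.TrailingZeroCount this.CurBucket
            this.CurIdx <- this.BucketIdx * 64 + offset
            this.CurBucket <- this.CurBucket ^^^ (1UL <<< offset)
            true
        elif this.BucketIdx + 1 < this.Buckets.Length then
            this.BucketIdx <- this.BucketIdx + 1
            this.CurBucket <- this.Buckets.[this.BucketIdx]
            this.MoveNext ()    // still recursive
        else
            false
```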
Things look a bit different since BitSetEnumerator is now a struct and therefore requires different approaches to handling the internal data. We also change the BitSet type to only have a GetEnumerator() method instead of implementing IEnumerable<'T>.
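On the BitSet type itself, duck-typing needs only one member; a sketch, assuming the same struct record shape as before:

```fsharp
[<Struct>]
type BitSet<[<Measure>] 'Measure> =
    { Buckets : uint64[] }

    // Duck-typing: a GetEnumerator method returning a type with
    // Current and MoveNext is all a for...in...do loop needs.
    member b.GetEnumerator () =
        BitSetEnumerator<'Measure> b.Buckets
```

A `for chicken in bitSet do ...` loop now compiles against this method with no interface dispatch and no heap allocation.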
When we benchmark this approach, we get the following:
This is much better. Our performance is almost that of a HashSet. Something to be aware of: the duck-typing approach and IEnumerable<'T> are not mutually exclusive. If you implement both, .NET will pick the faster approach in the testing I have done. In the production code, we include both because IEnumerable<'T> is necessary for using the BitSet with APIs that expect a sequence.
Inline All The Things (When it helps)
You have probably noticed that our loop-based approach's performance is still not near that of the iter method. That's to be expected: the for-loop approach adds overhead to the iteration. The F# compiler has some special transforms that it does for arrays, which make using a for...in...do loop over the elements of an array incredibly fast, but most other collections do not get that special treatment.
There is something we can do to get a little more performance, though. Right now, each time the MoveNext method is called, it creates a new stack frame. This adds overhead to the loop when it has to copy data for each instance of the stack frame. If we could inline the logic of the MoveNext method, we could reduce the number of stack frames created and potentially get a performance boost.
If you try to add the inline keyword to the MoveNext method of BitSetEnumerator, you will have a problem. The compiler will give you an error that looks something like this:
That looks like a lot of garbage, but the important part is near the end. It reports an error on line 55 of our DuckTyping.fs, which mentions "perhaps because a recursive value was marked inline." That's the clue we need. The MoveNext method is recursive at the moment, so the inlining logic of the F# compiler cannot work. What we need to do is remove this recursion. When we remove the recursion from the MoveNext method, we get the following:
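A sketch of the non-recursive MoveNext on the struct enumerator; the field names are illustrative, and the published code differs in detail:

```fsharp
    // The recursion is replaced with a loop, so the member can be inlined.
    member inline this.MoveNext () =
        let mutable found = false
        let mutable searching = true
        while searching do
            if this.CurBucket <> 0UL then
                let offset =
                    System.Numerics.BitOperations.TrailingZeroCount this.CurBucket
                this.CurIdx <- this.BucketIdx * 64 + offset
                this.CurBucket <- this.CurBucket ^^^ (1UL <<< offset)
                found <- true
                searching <- false
            elif this.BucketIdx + 1 < this.Buckets.Length then
                // advance to the next bucket and keep looping
                this.BucketIdx <- this.BucketIdx + 1
                this.CurBucket <- this.Buckets.[this.BucketIdx]
            else
                searching <- false
        found
```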
The logic for moving to the next bucket and checking for values has gotten more complex, but it no longer recurses. This allows us to use the inline keyword to get the F# compiler to inline this logic where it is used, which will reduce the number of stack frames used in our loop. Let's see what the performance of this version is:
Inlining is now faster than HashSet but still slower than iter. This is a win for me because there's now no performance downside to BitSet compared to HashSet for this scenario. Would I like to be able to match the performance of iter? Yes, absolutely, but this is already nowhere near the bottleneck of our engine, so I moved on to other problems.
You've learned a little about implementing IEnumerable<'T> for custom collections that you write and how to use the duck-typing of the foreach loop in .NET to get better performance. We've also shown that we can do better still by using the inline keyword to remove stack frames.
I recommend that you stick with the built-in looping functions provided by F#: iter, iteri, etc. They are highly optimized and will give you great performance out of the box. You should only consider other options in the rare cases where the need to capture a Span<'T> or another restriction forces you to use other looping constructs. I hope you find this helpful. Please feel free to reach out with any questions or critiques 😊.