This article is for the 2022 F# Advent Calendar
Full code for article found here
While building the Aidos engine for Discrete Rate Simulation, I often have to build custom collections to meet the project’s performance requirements. Don’t get me wrong, many of the collections built into .NET are great, but they are general-purpose. General-purpose collections must meet the requirements of many use cases. I have a single use case in mind for my work, and performance is one of the critical features. If our engine isn’t orders of magnitude faster than the competition, we don’t have a compelling product.
In Aidos, I often need to track items that have changed during a time step of the simulation. I also track entities with an int
that has been annotated with a Unit of Measure. This means an entity ends up being an int<'EntityType>
. I also cannot have duplicates for my use case, so I need to maintain a distinct set of entities that have changed. One way to do this would be to use a HashSet
. HashSet is built into .NET and provides O(1)
insertion, which you would think would be ideal for this use case. The downside to a HashSet
is that its memory will be allocated on the heap. If you have to create a HashSet
for every iteration of a hot loop, this can cause excess GC pressure.
Instead, what I use is a custom BitSet
. BitSet
is a struct that wraps an array of uint64,
which acts as a bit array that I manually manage. The .NET runtime has a BitArray
class, but it does not provide the API I need for my use cases. I need to iterate through all the set bits and call a function with the index of the set bit as an argument.
One of the advantages of the BitSet
approach over HashSet
is that the array used by the BitSet
can be allocated from an ArrayPool
, which means that BitSet
will never increase GC pressure or take up room on the heap. The other is that it is faster than HashSet
for iterating through the set bits using the BitSet.iter
function.
Here is the definition of the BitSet
type:
|
|
We also have a BitSet
module where we define the functions for operating on BitSet
. Here I show just the iter
function. iter
loops through each set bit in the array and calls the lambda f
with the index of the set bit as the argument.
|
|
I set up a benchmark to compare the performance of HashSet
and BitSet
. I define a unit of measure, Chicken
, which I will use as the entity type to annotate my int
values. I add 10 int<Chicken>
between the values of 0<Chicken>
and 99<Chicken>
to both of these collections. This range of values is a good representation of the use case that I am concerned with. I then write a benchmark for measuring the time to iterate through the values in both collections to see how long it takes for both collections. When I run the benchmark using BenchmarkDotNet
I get the following result:
|
|
We see that the iter
function for BitSet
is approximately 3x faster than HashSet
for iterating through the values.
The Problem
Now, you may be thinking that BitSet
is great, but there is a downside to this approach. The iter
function takes a lambda as one of its arguments. Whenever the BitSet
encounters a set bit, it then calls the lambda with the index of the set bit. Lambdas are intrinsic to programming in F#, but they have limitations. One of those limitations is that they cannot capture Span<'T>
or ByRefLike
types. Most of the time, this is not a big deal. F# developers are not often known as hardcore performance programmers, so most F# developers will not run into this problem.
I, on the other hand, work with Span<'T>
and ByRefLike
types all the time. They can be incredibly powerful for increasing your program’s performance and decreasing memory allocations. Now, a word of caution. You probably don’t need this. You can lead a very happy life as an F# developer, and never worry about this. This limitation only becomes an issue when you are trying to eke out every bit of performance you can, and you are likely not in that scenario. I happen to be in an odd situation because I work for a company with an F# dev team, and I’m tasked with writing libraries for others to use that must be fast. For strategic reasons, we constrain ourselves to F#, so calling out to C/C++/Rust is not an option. You will appreciate what we cover next if you find yourself in a similar situation.
I have asked about relaxing some of the compiler restrictions around lambdas and capturing Span<'T>
, but the effort would be large. The more I dug into how the F# compiler and the CLR interact, my appreciation for the complexity of the problem grows. This is also not the most important feature for the growth of F#, so I’m not going to push for it. I hope to get good enough to contribute it to the F# compiler someday 😊.
The Solution
So how do we get around this limitation? BitSet
is intended for these hot loops where we likely want to be able to use some stack-allocated memory. This means we must be able to work with BitSet
and Span<'T>
simultaneously. The simple solution is to expose a new way of iterating through the set bits in the BitSet
. We can implement IEnumerable<'T>
for BitSet
and use a for...in...do
loop.
The easiest way to implement IEnumerable<'T>
for BitSet
is to define a BitSetEnumerator
, which takes the logic used in the iter
function but exposes it in a way that the IEnumerable<'T>
interface expects. Let’s see what that looks like:
|
|
The BitSetEnumerator
type defines three methods for fulfilling the IEnumerable<'T>
contract: Current
, MoveNext
, and Reset
. You can see how the BitSetEnumerator
fulfills the IEnumerable<'T>
interface at the bottom. The type uses the same bit-shifting logic iter
uses but breaks it up to support the methods that IEnumerable<'T>
expects.
We can then have the BitSet
collection implement the IEnumerable<'T>
interface by returning an instance of the BitSetEnumerator
when calling the GetEnumerator
method.
|
|
When we add this approach to the benchmarks, we get the following result:
|
|
The IEnumerable<'T>
approach is twice as slow as using a HashSet
. This is less than ideal. It is also allocating on the heap. This is because the interface necessitates the creation of an object on the heap. We’ve negated most, if not all, of the benefits we hope to get from BitSet
. What can we do?
Ducks All The Way Down
There’s a feature of .NET that I don’t hear about much but is important in this scenario. The .NET runtime will use duck-typing to implement C# foreach
loops and their equivalents. The runtime will look at the type and see if it has a GetEnumerator
method that returns a type with the Current
field and the MoveNext
method. Well, the for...in...do
loop in F# is the equivalent to the C# foreach
loop.
What if instead of implementing IEnumerable<'T>
we rely on the .NET duck-typing approach? We can change our enumerator to be a struct
so that it doesn’t allocate any memory on the heap, and we’ll avoid the overhead of an interface.
Here’s what the BitSetEnumerator
looks like as a struct with only the necessary pieces for duck-typing.
|
|
Things look a bit different since BitSetEnumerator
is now a struct and therefore requires different approaches to handling the internal data.
We also change the BitSet
type to only have a GetEnumerator()
method instead of implementing IEnumerable<'T>
.
|
|
When we benchmark this approach, we get the following:
|
|
This is much better. Our performance is almost that of a HashSet
. Something to be aware of, the duck-typing approach and the IEnumerable<'T>
are not mutually exclusive. If you implement both, the runtime will pick the faster approach in the testing I have done. In the production code, we include both because the IEnumerable<'T>
is necessary for using the BitSet
with the Seq
module.
Inline All The Things (When it helps)
You have probably noticed that our loop-based approach’s performance is still not near the iter
method. That’s to be expected. The for-loop approach adds overhead to the iteration. The F# compiler has some special transforms that it does for arrays which makes using a for...in...do
loop over the elements of an array incredibly fast, but most other collections do not get that special treatment.
There is something we can do to get a little more performance, though. Right now, each time the MoveNext
method is called, it creates a new stack frame. This adds overhead to the loop when it has to copy data for each instance of the stack frame. If we could inline the logic of the MoveNext
method, we could reduce the number of stack frames created and potentially get a performance boost.
If you try to add the inline
keyword to Current
and MoveNext
on BitSetEnumerator
, you will have a problem. The compiler will give you an error that looks something like this:
|
|
That looks like a lot of garbage, but the important part is near the end. It reports an error on line 55 of our DuckTyping.fs
, which mentions “perhaps because a recursive value was marked inline
.” That’s the clue we need. The MoveNext
method is recursive at the moment, so the inlining logic of the F# compiler cannot work. What we need to do is remove this recursion. When we remove the recursion from the MoveNext
method, we get the following:
|
|
The logic for moving to the next bucket and checking for values has gotten more complex, but it no longer recurses. This allows us to use the inline
keyword to get the F# compiler to inline this logic where it is used. This will reduce the number of stack frames used in our loop. Let’s see what the performance of this version is:
|
|
Inlining is now faster than HashSet
but still slower than Iter
. This is a win for me because there’s now no performance downside to BitSet
compared to HashSet
for this scenario. Would I like to be able to match the performance of Iter
? Yes, absolutely, but this is already nowhere near the bottleneck of our engine, so I moved on to other problems.
Conclusion
You’ve learned a little about implementing IEnumerable<'T>
for custom collections that you write and how to use the duck-typing of the foreach
loop in .NET to get even better performance. We’ve also shown that we can perform even better using the inline
keyword to remove stack frames.
I recommend that you stick with the built-in looping functions provided by F#: map
, iter
, iteri
, etc. They are highly optimized and will give you great performance out of the box. In rare cases, you should consider other options where the need to capture a Span<'T>
or another restriction forces you to use other looping constructs. I hope you find this helpful. Please feel free to reach out with any questions or critiques 😊.