Medo's Home Page

Modulo or Bitwise

I had an interesting thing said to me: “Did you know that modulo is much less efficient than bitwise comparison?” As someone who spent time I painstakingly went through all E-series resistor values to find those that would make my voltage divider be power of 2, I definitely saw that in action. But, that got me thinking. While 8-bit PIC microcontroller doesn’t have a hardware divider and thus any modulo is a torture, what about modern computers? How much slower do they get?

Quick search brought me a few hits and one conclusive StackOverflow answer. Searching a bit more brought me to another answer where they even did measurements. And difference was six-fold. But I was left with a bit of nagging as both of these were 10+ years old. What is a difference you might expect on a modern CPU? And, more importantly for me, what are differences in C#?

Well, I quickly ran some benchmarks and results are below.

Test	Parallel	Mean	StDev
(i % 4) == 0	No	202.3 us	0.24 us
(i & 0b11) == 0	No	201.9 us	0.12 us
(i % 4) == 0	CpuCount	206.4 us	7.78 us
(i & 0b11) == 0	CpuCount	196.5 us	5.63 us
(i % 4) == 0	CpuCount*2	563.9 us	7.90 us
(i & 0b11) == 0	CpuCount*2	573.9 us	6.52 us

My expectations were not only wrong but slightly confusing too.

As you can see from table above, I did 3 tests, single threaded, default parallel for, and then parallel for loop with CPU overcommitment. Single threaded test is where I saw what I expected but not in amount expected. Bitwise was quite consistently winning but by ridiculous margins. Unless I was doing something VERY specific, there is no chance I would care about the difference.

If we run test in Parallel.For, difference becomes slightly more obvious. And had I stayed just on those two, I would have said that assumption holds for modern CPUs too.

However, once I overcommitted CPU resources, suddely modulo was actually better. And that is something that’s hard to explain if we take assumption that modulo just uses divide to be true.

So, I decided to sneak a bit larger peek - into .NET CLR. And I discovered that bitwise operation was fully omitted while modulo operation was still there. However, then runtime smartly decided to remove both. Thus, I was testing nothing vs. almost nothing.

Ok, after I placed a strategic extra instructions to prevent optimization, I got the results below.

Test	Parallel	Mean	StDev
(i % 4) == 0	No	203.1 us	0.16 us
(i & 0b11) == 0	No	202.9 us	0.06 us
(i % 4) == 0	CpuCount	1,848.6 us	13.13 us
(i & 0b11) == 0	CpuCount	1,843.9 us	6.76 us
(i % 4) == 0	CpuCount*2	1,202.7 us	7.32 us
(i & 0b11) == 0	CpuCount*2	1,201.6 us	6.75 us

And yes, bitwise is indeed faster than modulo but by really low margin. The only thing new test “fixed” was that discrepancy in speed when you have too many threads.

Just to make extra sure that the compiler wasn’t doing “funny stuff”, I decompiled both to IL.

ldloc.1
ldc.i4.4
rem
ldc.i4.0
ceq

ldloc.1
ldc.i4.3
and
ldc.i4.0
ceq

Pretty much exactly the same, the only difference being usage of and for bitwise check while rem was used for modulo. In modern CPUs these two instructions seem pretty much equivalent. And when I say modern, I use that lossely since I saw the same going back a few generations .

Interestingly, just in case runtime changed those to the same code, I also checked modulo 10 just to confirm. That one was actually faster than modulo 4. That leads me to believe there are some nice optimizations happening here. But I still didn’t know if this was .NET framework or really something CPU does.

As a last resort, I went down to C and compiled it with -O0 -S. Unfortunately, even with -O0, if you use % 4, it will be converted for bitwise. Thus, I checked it against % 5.

Bitwise check compiled down to just 3 instructions (or just one if we exclude load and check).

movl	-28(%rbp), %eax
andl	$3, %eax
testl	%eax, %eax

But modulo went crazy route.

movl	-28(%rbp), %ecx
movslq	%ecx, %rax
imulq	$1717986919, %rax, %rax
shrq	$32, %rax
movl	%eax, %edx
sarl	%edx
movl	%ecx, %eax
sarl	$31, %eax
subl	%eax, %edx
movl	%edx, %eax
sall	$2, %eax
addl	%edx, %eax
subl	%eax, %ecx
movl	%ecx, %edx
testl	%edx, %edx

It converted division into multiplication and gets to remainder that way. All in all, quite impressive optimization. And yes, this occupies more memory so there are other consequences to the performance (e.g. uses more cache memory).

So, if you are really persistent with testint, difference does exist. It’s not six-fold but it can be noticeable.

At the end, do I care? Not really. Unless I am working on microcontrollers, I won’t stop using modulo where it makes sense. It makes intent much more clear and that, to me, is worth it. Even better, compilers will just take care of this for you.

So, while modulo is less efficient, stories of its slowness have been exaggerated a bit.

PS: If you want to run my tests on your system, files are available.

Never Gonna BOM You Up

.NET supported Unicode from its very beginning. Pretty much anything you might need for Unicode manipulation is there. Yes, as early adopters, they made a bet on UTF-16 that didn’t pay off since rest of the world has moved toward UTF-8 as an (almost) exclusive encoding. However, if we ignore a bit higher memory footprint, C# strings made Unicode as easy as it gets.

And, while UTF-8 is not a native encoding for its strings, C# is no slouch and has a convenient Encoding.UTF8 static property allowing for easy conversion. However, if you do use that Encoding.UTF8.GetBytes() function, you will get a bit extra.

That something extra is Byte order mark. Its intention is noble - to help detect endianess. However, its usage for UTF-8 is of dubious help since 8-bit encoding doesn’t really have issues with endianness to start with. Unicode specification itself does allows for one but doesn’t recommend it. It merely acknowledges it might happen as a side-effect of data conversion from other unicode encodings that do have endianness.

So, in theory, UTF-8 with BOM should be perfectly acceptable. In practice, only Microsoft really embraced UTF-8 BOM. Pretty much everybody else decided to have UTF-8 without BOM as that allowed for full compatibility with 7-bit ASCII.

With time, .NET/C# stopped being Windows-only and, by today, became really good multiplatform solution. And now, helper function that ought to simplify things is actually producing output that will annoy many command-line tools that don’t expect it. If you read the documentation, solution exists - just create your own UTF-8 converter instance.

private static readonly Encoding Utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

Now you can call Utf8.GetBytes() instead and you will get expected result on all platforms, including Windows - no BOM, no problems.

So, one could argue that Encoding.UTF8 default should be changed to what is more appropriate value. I mean, .NET is multiplatform and the current default doesn’t work everywhere. One could argue but this default is not changing, ever.

When any project starts, decisions must be made. And you won’t know for a while if those decisions were good. On the other hand, people will start depending on whatever behavior you selected.

In the case of BOM, it might be that developer got so used to having those three extra bytes that, instead checking the file content, they simply use <=3 as a signal file is empty. Or they have a script that takes output of some C# application and just strips the first three bytes blindly before moving it to non-BOM friendly input. Or any other decision somebody made in project years ago. It doesn’t really matter how bad someones code is. What matters is that code is currently working and new C# release shouldn’t silently break somebody’s code.

So, I am reasonably sure that Microsoft won’t ever change this default. And, begrudgingly, I agree with that. Some bad choices are simply meant to stay around.

PS: And don’t let me start talking about GUIDs and their binary format…

CoreCompile into the Ages

For one project of mine I started having a curious issue. After adding a few, admittedly a bit complicated, classes my compile times under Linux shot to eternity. But that was only when running with dotnet command line tools. In Visual Studio under Windows, all worked just fine.

Under dotnet I would just see CoreCompile step counting seconds, and then minutes. I tried increasing log level - nothing. I tried not cleaning stuff, i.e. using cached files - nothing. So, I tried cleaning up my .csproj file - hm… things improved, albeit just a bit.

A bit of triage later and I was reasonably sure that .NET code analyzer are the culprit. Reason why changes to .csproj reduced the time was because I had AnalysisMode set quite high. Default AnalysisMode simply checks less.

While disabling .NET analyzers altogether was out of question, I was quite OK with not running them all the time. So, until .NET under Linux gets a bit more performant, I simply included EnableNETAnalyzers=false in my build scripts.

  -p:EnableNETAnalyzers=false

Another problem solved.

Custom StringBuilder Pool

In my last post I grumbled about ObjectPool being a separate package. That was essentially the single downside to use it. So, how hard is to implement our own StringBuilder pool?

Well, not that hard. The whole thing can be something like this:

internal static class StringBuilderPool {

    private static readonly ConcurrentQueue<StringBuilder> Pool = new();

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static StringBuilder Get() {
        return Pool.TryDequeue(out StringBuilder? sb) ? sb : new StringBuilder(4096);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool Return(StringBuilder sb) {
        sb.Length = 0;
        Pool.Enqueue(sb);
        return true;
    }

}

In our Get method we check if we have any stored StringBuilder. If yes, we just return the same. If no, we create a new instance.

In the Return method we just add the returned instance to the queue.

Now, this is not exactly an ObjectPool equivalent. For example, it doesn’t limit the pool size. And it will keep large objects around forever. However, for my case it was good enough and unlikely to cause any problems.

And performance… Well, performance is promising, to say the least:

Test	Mean	Error	StdDev	Gen0	Gen1	Allocated
StringBuilder (small)	15.762 ns	0.3650 ns	0.4057 ns	0.0181	-	152 B
ObjectPool (small)	17.257 ns	0.0616 ns	0.0576 ns	0.0057	-	48 B
Custom pool (small)	16.864 ns	0.0192 ns	0.0150 ns	0.0057	-	48 B
Concatenation (small)	9.716 ns	0.1634 ns	0.1528 ns	0.0105	-	88 B
StringBuilder (medium)	58.125 ns	0.6429 ns	0.6013 ns	0.0526	-	440 B
ObjectPool (medium)	23.226 ns	0.0517 ns	0.0484 ns	0.0115	-	96 B
Custom pool (medium)	23.660 ns	0.2515 ns	0.1963 ns	0.0115	-	96 B
Concatenation (medium)	66.353 ns	1.3307 ns	1.2447 ns	0.0793	-	664 B
StringBuilder (large)	190.293 ns	0.7781 ns	0.6498 ns	0.2496	0.0010	2088 B
ObjectPool (large)	92.556 ns	0.9281 ns	0.8228 ns	0.0755	-	632 B
Custom pool (large)	91.470 ns	0.5478 ns	0.5124 ns	0.0755	-	632 B
Concatenation (large)	1,430.599 ns	11.5971 ns	10.8479 ns	4.0169	0.0057	33600 B

Pretty much its on-par with ObjectPool implementation. Honestly, results are close enough to be equivalent for all practical purposes.

So, if you don’t want to pull the whole Microsoft.Extensions.ObjectPool just for caching a few StringBuilder instances, consider rolling your own.

To Pool or Not to Pool

For a project of mine I “had” to do a lot of string concatenations. Easy solution was just to have a string builder and go wild. But I wondered, does it make sense to use ObjectPool (found in Microsoft.Extensions.ObjectPool package). Thus, I decided to do a few benchmarks.

For my use case, “small” was just appending 3 items to a StringBuilder. The “medium” is does total of 21 appends. And finally, “large” does 201 appends. And no, there is no real reason why I used those exact numbers other than loop ended up being nice. :)

After all this, benchmark results (courtesy of BenchmarkDotNet):

Test	Mean	Error	StdDev	Gen0	Gen1	Allocated
StringBuilder (small)	16.295 ns	0.1240 ns	0.1160 ns	0.0181	-	152 B
StringBuilder Pool (small)	17.958 ns	0.3125 ns	0.2609 ns	0.0057	-	48 B
StringBuilder (medium)	87.052 ns	1.5177 ns	1.4197 ns	0.0832	0.0001	696 B
StringBuilder Pool (medium)	31.245 ns	0.1815 ns	0.1417 ns	0.0181	-	152 B
StringBuilder (large)	304.724 ns	1.6736 ns	1.3975 ns	0.4520	0.0029	3784 B
StringBuilder Pool (large)	172.615 ns	1.5325 ns	1.4335 ns	0.1471	-	1232 B

As you can see, if you are doing just a few appends, it’s probably not worth messing with ObjectPool. Not that you should use StringBuilder either. If you are adding 4 or fewer strings, you might as well concatenate them - it’s actually more performant.

However, if you are adding 5 or more strings together, pool is no worse than instantiating a new StringBuilder. So, for pretty much any scenario where you would use StringBuilder, it pays off to pool it.

Is there a situation where you would avoid pool? Well, performance-wise, I would say probably no. I ran multiple tests and, on my computer, there was no situation where StringBuilder alone was better than either pool or concat. Yes, StringBuilder is performant at low number of appends, but string concatenation is better. As soon as you go over a few appends, ObjectPool actually makes sense.

However, an elephant in the room is ObjectPool’s dependency on external package. Call me old fashioned but there is a value in not depending on extra packages.

The final decision is, of course, dependant on you. But, if performance is important, I see no reason why not to use ObjectPool. I only wish it wasn’t an extra package.

For curious ones, code was as follows:

[Benchmark]
public string Large_WithoutPool() {
    var sb = new StringBuilder();
    sb.Append("Hello");
    for (var i = 0; i < 100; i++) {
        sb.Append(' ');
        sb.Append("World");
    }
    return sb.ToString();
}

[Benchmark]
public string Large_WithPool() {
    var sb = StringBuilderPool.Get();
    try {
        sb.Append("Hello");
        for (var i = 0; i < 100; i++) {
            sb.Append(' ');
            sb.Append("World");
        }
        return sb.ToString();
    } finally {
        sb.Length = 0;
        StringBuilderPool.Return(sb);
    }
}

And yes, I also tested just a simple string concatenation (quite optimized for smaller number of concatenations):

Test	Mean	Error	StdDev	Gen0	Gen1	Allocated
Concatenation (small)	9.820 ns	0.2365 ns	0.2429 ns	0.0105	-	88 B
Concatenation (medium)	146.901 ns	1.6561 ns	1.2930 ns	0.2294	-	1920 B
Concatenation (large)	4,710.573 ns	43.5370 ns	96.4750 ns	15.2054	0.0458	127200 B

1 2 3 4 5 … 269