Long Division, Part 3
I ended part two of this series with an open question:
And of course I can’t help but wonder if the CLR is compiled with Visual C++, so doing arithmetic on 64-bit numbers in C# and other .NET languages ends up at the same runtime functions?
I don’t have a lot of experience in debugging the CLR myself, so I asked Brian Rasmussen if he might be interested in taking a look at it. He was kind enough to take the time to point me in the right direction.
A little digging showed that the CLR does in fact call some of these functions from the C runtime, but with a twist.
The C# function we looked at was this:
The JIT compiler produced the following 32-bit code:
We note the call to JIT_LDiv
, which is a helper function in clr.dll (the CLR
runtime, formerly mscorwks.dll). Let’s have a look at it:
This excerpt is from the Shared Source CLI, but the same code is present in clr.dll from .NET Framework 4.0.
The division in the return statement at line 28 generates a call to _alldiv
from the C runtime library.
Now, the twist is that this helper function, besides checking for division by zero and overflow, also optimizes division of two values that both fit in 32 bits (the division with casts at line 24).
This is one of the optimizations I used in WCRT as well, so I was interested in seeing how the CLR code performed. I removed the exception handling along with associated checks and ended up with this function, which I tested using the timing application from part 2:
Here is a summary of the results from my Athlon 64:
Function WCRT VC Diff
-------------------------------------------
(CLR runtime function)
HH JIT_LDiv 2125 4222 +49.7%
HL JIT_LDiv 4104 5128 +20.0%
LL JIT_LDiv 1312 1312 +0.0%
(C runtime function)
HH _alldiv 1709 3907 +56.2%
HL _alldiv 3260 4309 +24.3%
LL _alldiv 1204 2161 +44.2%
HH = 64-bit op 64-bit
HL = 64-bit op 32-bit
LL = 32-bit op 32-bit
In the WCRT column, JIT_LDiv
is compiled with _alldiv
from WCRT, and in the
VC column with _alldiv
from the VC CRT. So the first three rows compare the
speed of the CLR helper function when using those as fallback.
We can see that there is a noticeable improvement on HH and HL, even though the
check for two values that both fit in 32 bits is done twice (once in
JIT_LDiv
, and once in the WCRT version of _alldiv
).
The next three rows compare the C runtime function _alldiv
from WCRT and the
VC CRT. These results are the same as in part 2.
The special handling of two values that both fit in 32 bits is a big
improvement. On LL, JIT_LDiv
is almost as fast as the WCRT version of
_alldiv
. For HL and HH, JIT_LDiv
is slightly slower than the regular VC
_alldiv
due to the extra checks.
So, in conclusion, the CLR runtime is compiled with Visual C++, and when code runs under 32-bit, the CLR runtime will fall back to the 64-bit arithmetic functions from the C runtime library.
The CLR runtime contains optimizations for certain values (usually when both operands fit in 32 bits), but for other cases optimizing the C runtime functions appears to provide a direct improvement for the CLR as well.