Increasing the performance of D math code
An article about the new @fastmath
and __traits(target*)
features in LDC 1.1.0. The @fastmath
attribute relaxes floating point math constraints and is used by Mir to beat OpenBLAS and Eigen.
To avoid confusion: LLVM is doing the interesting work, LDC just adds a few pieces that allow LLVM to work its magic.
I’m afraid it won’t be a nice read if you are unfamiliar with some of peculiarities of the CPU’s floating point math, x86 assembly, or SIMD.
Numeric age for D
The available SIMD parallelism in the CPU is very powerful: with the AVX512 instruction set, the CPU packs 16 (!) 32-bit floats in one register and can do a SIMD multiplication-addition twice (!) per clock cycle. So programming for performance boils down to writing your code such that the compiler can generate code that maximally exploits SIMD.
Two weeks ago, Ilya Yaroshenko wrote the article “Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen”. The article presents performance benchmarks of his linear algebra additions to Mir, a generic numerical library written in D, demonstrating that his D code is fast and beats well-known assembly language implementations of the same functionality.
For very high performance, common implementations of basic math primitive functions are written in assembly language explicitly instead of trusting the compiler to generate equally fast machine code. In contrast, Ilya writes: “Mir GLAS (Generic Linear Algebra Subprograms) has a single generic kernel for all CPU targets, all floating point types, and all complex types. It is written completely in D, without any assembler blocks.” That’s quite an achievement: carefully optimized D code that performs as fast as carefully optimized assembly code. In this article I’ll present a few (simple) additions to LDC that enable LLVM’s magical optimizer to generate such fast assembly code for Ilya’s amazing D code.
Assembly code
I’ll show assembly code snippets generated by LDC using -c -output-s
. LDC version 1.1.0-beta2 or newer is needed. You can use the online compiler at ldc.acomirei.ru to look at the assembly code that LDC generates yourself, for the examples given here and for your own D code.
If you find that the generated assembly for your D code could be improved, submit an LDC issue on Github, like Ilya did!
Floating point math
The CPU’s floating point math is not the same as algebraic math, because of approximation errors due to the finite storage available for a floating point number after each calculation. The calculation a + (b + c)
may or may not be equal to (a + b) + c
, and (a + a) - a
may or may not be equal to a
. With fused multiply-add (FMA), a * b + c
is not the same as, well…, a * b + c
. This often prevents optimization of math code, unless the programmer tells the compiler that, really, it’s all the same to him.
Let’s look at a basic math function and see how we can improve the compiler-generated machine code.
Vector dot product
The following code calculates the dot-product of two vectors of arbitrary length:
double dot(double[] a, double[] b) {
double s = 0;
foreach (size_t i; 0 .. a.length) {
s += a[i] * b[i];
}
return s;
}
The algebraic equation for the dot product is
s = (a1 * b1) + (a2 * b2) + (a3 * b3) + …
but because of the inaccuracies of floating-point math, the D code calculation must be described as:
s = ((((0 + a1 * b1) + a2 * b2) + a3 * b3) + … )
The parens are important: they signify the order of operations, absolutely critical for the outcome. From this equation, we can see that the multiplications can be done in any order and in parallel, but that the additions must be done sequentially and in a very specific order.
Let’s see what assembly code LDC 1.1.0 generates (AT&T dialect). I’ll only show the meat of the dot function, in this case the 4-fold-unrolled inner loop:
ldc2 -c -output-s -O3 -release dot.d
LBB0_4:
movsd (%rcx,%rdi,8), %xmm1
mulsd (%rsi,%rdi,8), %xmm1
addsd %xmm0, %xmm1
movsd 8(%rcx,%rdi,8), %xmm0
mulsd 8(%rsi,%rdi,8), %xmm0
addsd %xmm1, %xmm0
movsd 16(%rcx,%rdi,8), %xmm1
mulsd 16(%rsi,%rdi,8), %xmm1
addsd %xmm0, %xmm1
movsd 24(%rcx,%rdi,8), %xmm0
mulsd 24(%rsi,%rdi,8), %xmm0
addsd %xmm1, %xmm0
addq $4, %rdi
cmpq %rdi, %rdx
jne LBB0_4
No vectorization: it’s just sequential multiply-add-multiply-add-multiply-add-…
But wait, I forgot to tell the compiler that my CPU is newer than the default core2
CPU (nota bene, at least on OS X, by default LDC compiles your code for a 10-year-old processor). In this case it doesn’t matter much:
ldc2 -c -output-s -O3 -release dot.d -mcpu=haswell
...
vmovsd 24(%rcx,%rdi,8), %xmm2
vmulsd 24(%rsi,%rdi,8), %xmm2, %xmm2
vaddsd %xmm1, %xmm0, %xmm0
vmovsd 32(%rcx,%rdi,8), %xmm1
vmulsd 32(%rsi,%rdi,8), %xmm1, %xmm1
vaddsd %xmm2, %xmm0, %xmm0
...
Don’t be fooled by the “v” seemingly implying “vector”. Those multiply and add instructions are mere single double-precision operations. The loop is unrolled 8-fold, and that’s all that changed. Without relaxing some constraints on the mathematics, it’s the best possible.
Fused multiply-add
One performance improvement is to allow the compiler to ‘fuse’ a*b+c
into one instruction fma(a,b,c)
. It will change the outcome of the calculation, but for many use-cases it won’t matter. Note: the result using FMA is not necessarily worse than before (it may well be more accurate), it’s just different and that’s why the compiler can’t use FMA per default.
We can instruct LLVM to allow “unsafe-fp-math” transformations such as FMA using the LDC attribute @llvmAttr("unsafe-fp-math", "true")
(you’re not recommended to use this low-level attribute, please read on).
import ldc.attributes : llvmAttr;
@llvmAttr("unsafe-fp-math", "true")
double dot(double[] a, double[] b) {
double s = 0;
foreach (size_t i; 0 .. a.length) {
s += a[i] * b[i];
}
return s;
}
ldc2 -c -output-s -O3 -release dot.d -mcpu=haswell
...
vmovsd (%rcx,%rdi,8), %xmm1
vfmadd132sd (%rsi,%rdi,8), %xmm0, %xmm1
vmovsd 8(%rcx,%rdi,8), %xmm0
vfmadd132sd 8(%rsi,%rdi,8), %xmm1, %xmm0
...
LLVM replaced the multiply and add instructions with non-vectorized fused multiply-add instructions, clearly a win. But the code is still not using the vectorized instruction set of the CPU.
Vectorization and @fastmath
For vectorization of our code, there is one last piece missing: telling LLVM that it’s OK to do whatever it wants with our floating point operations: e.g. the code does not have to support NaN or infinity, and it can change the order (!) of the add operations in our dot product. LLVM’s Fast Math Flags are used for this, and you can apply this to all floating point operations in a function using the (low-level, not recommended) LDC attribute @llvmFastMathFlag
. We combined @llvmAttr
and @llvmFastMathFlag
into the @fastmath
attribute.
I recommend the usage of @fastmath
instead of the other two low-level attributes; it is much more resilient against LLVM changes. The burder is on LDC to make @fastmath
work with whatever attributes LLVM will require in the future.
With @fastmath
the result becomes:
import ldc.attributes : fastmath;
@fastmath
double dot(double[] a, double[] b) {
double s = 0;
foreach (size_t i; 0 .. a.length) {
s += a[i] * b[i];
}
return s;
}
ldc2 -c -output-s -O3 -release dot.d -mcpu=haswell
LBB0_6:
vmovupd (%rsi,%rdi,8), %ymm4
vmovupd 32(%rsi,%rdi,8), %ymm5
vmovupd 64(%rsi,%rdi,8), %ymm6
vmovupd 96(%rsi,%rdi,8), %ymm7
vfmadd132pd (%rcx,%rdi,8), %ymm0, %ymm4
vfmadd132pd 32(%rcx,%rdi,8), %ymm1, %ymm5
vfmadd132pd 64(%rcx,%rdi,8), %ymm2, %ymm6
vfmadd132pd 96(%rcx,%rdi,8), %ymm3, %ymm7
vmovupd 128(%rsi,%rdi,8), %ymm0
vmovupd 160(%rsi,%rdi,8), %ymm1
vmovupd 192(%rsi,%rdi,8), %ymm2
vmovupd 224(%rsi,%rdi,8), %ymm3
vfmadd132pd 128(%rcx,%rdi,8), %ymm4, %ymm0
vfmadd132pd 160(%rcx,%rdi,8), %ymm5, %ymm1
vfmadd132pd 192(%rcx,%rdi,8), %ymm6, %ymm2
vfmadd132pd 224(%rcx,%rdi,8), %ymm7, %ymm3
addq $32, %rdi
addq $2, %rax
jne LBB0_6
We now have a nicely vectorized dot-product function: the vfmadd132pd
instruction performs a vectorized fused multiply-add operation, in this case it’s the equivalent of 4 non-vectorized operations. Each loop iteration processes 32 vector elements.
All D code, target independent
In the examples above, the D code did not specify the target architecture: all vectorization happens automatically. All credit must go to LLVM’s amazing backend that is able to analyze and optimize the code so well. So when you’d be compiling for a new processor with wider SIMD registers, all you’d have to do is recompile for that CPU and the D code would become even faster:
ldc2 -c -output-s -O3 -release dot.d -mcpu=knl
results in vfmadd231pd (%rcx,%rdi,8), %zmm4, %zmm0
instructions that process 8 vector elements at a time (“knl” stands for Intel’s Knights Landing).
If your target is an ARM processor with NEON SIMD tech, you’ll automatically get vectorized FMA code too:
ldc2 -c -output-s -O3 -release dot.d -mtriple=aarch64-none-linux-gnu -mattr=+neon
.LBB0_5:
ldp q2, q3, [x9, #-16]
ldp q4, q5, [x10, #-16]
add x9, x9, #32
add x10, x10, #32
sub x11, x11, #4
fmla v0.2d, v4.2d, v2.2d
fmla v1.2d, v5.2d, v3.2d
cbnz x11, .LBB0_5
Traits
The dot-product example above is a very simple function. For more complex algorithms, you can get better performance if you take the sizes of the CPU’s caches into account; the characteristics of the instruction and data caches are both important for performance tuning.
LDC 1.1.0 has two special __traits
that give compile-time information about the compile target architecture: __traits(targetHasFeature, "...")
and __traits(targetCPU)
. For example, __traits(targetHasFeature, "avx2")
returns true or false depending on whether the compile target supports AVX2 instructions or not.
Mir uses __traits(targetHasFeature, "...")
to decide what vector types (say, a vector of 4 or 8 floats) to use in its algorithms, but I don’t know the exact how-and-why (yet? Ilya fill us in on the details! ;-)
).
Feedback
Constructive feedback, positive and negative, is always appreciated. The best way to get in touch with me and the other LDC developers is via the digitalmars.D.ldc forum and our Gitter chat.
Links to discussions:
Reddit
D forum thread