An article about the new @fastmath and __traits(target*) features in LDC 1.1.0. The @fastmath attribute relaxes floating point math constraints and is used by Mir to beat OpenBLAS and Eigen.
To avoid confusion: LLVM is doing the interesting work, LDC just adds a few pieces that allow LLVM to work its magic.
I’m afraid it won’t be a nice read if you are unfamiliar with some of the peculiarities of CPU floating point math, x86 assembly, or SIMD.

Numeric age for D

The available SIMD parallelism in the CPU is very powerful: with the AVX512 instruction set, the CPU packs 16 (!) 32-bit floats in one register and can do a SIMD multiply-add twice (!) per clock cycle. That amounts to 16 × 2 × 2 = 64 single-precision floating point operations per cycle. So programming for performance boils down to writing your code such that the compiler can generate machine code that maximally exploits this SIMD potential.
Two weeks ago, Ilya Yaroshenko wrote the article “Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen”. The article presents performance benchmarks of his linear algebra additions to Mir, a generic numerical library written in D, demonstrating that his D code is fast and beats well-known assembly language implementations of the same functionality.
For very high performance, common implementations of basic math primitives are explicitly written in assembly language, instead of trusting the compiler to generate equally fast machine code. In contrast, Ilya writes: “Mir GLAS (Generic Linear Algebra Subprograms) has a single generic kernel for all CPU targets, all floating point types, and all complex types. It is written completely in D, without any assembler blocks.” That’s quite an achievement: carefully optimized D code that performs as fast as carefully optimized assembly code. In this article I’ll present a few (simple) additions to LDC that enable LLVM’s magical optimizer to generate such fast assembly code for Ilya’s amazing D code.

Assembly code

I’ll show assembly code snippets generated by LDC using -c -output-s; you’ll need LDC version 1.1.0-beta2 or newer. You can use the online compiler at ldc.acomirei.ru to look for yourself at the assembly code that LDC generates, both for the examples given here and for your own D code.
If you find that the generated assembly for your D code could be improved, submit an LDC issue on GitHub, like Ilya did!

Floating point math

The CPU’s floating point math is not the same as algebraic math: because only finite storage is available for a floating point number, the result of each calculation must be rounded. The calculation a + (b + c) may or may not be equal to (a + b) + c, and (a + a) - a may or may not be equal to a. With fused multiply-add (FMA), a * b + c is not the same as, well…, a * b + c. This often prevents optimization of math code, unless the programmer tells the compiler that, really, it’s all the same to him.
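
For example, this small program (with values hand-picked to make the rounding visible; compiled with LDC on x86_64, where double arithmetic is performed in double precision) prints two different results for what algebra says is the same sum:

import std.stdio : writeln;

void main() {
    double a = 1e16;
    double b = -1e16;
    double c = 1.0;
    writeln((a + b) + c); // 1: a and b cancel exactly, then c is added
    writeln(a + (b + c)); // 0: c is rounded away in b + c, because |b| is too large
}
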
Let’s look at a basic math function and see how we can improve the compiler-generated machine code.

Vector dot product

The following code calculates the dot-product of two vectors of arbitrary length:

double dot(double[] a, double[] b) {
    double s = 0;
    foreach (size_t i; 0 .. a.length) {
        s += a[i] * b[i];
    }
    return s;
}

The algebraic equation for the dot product is

s = (a1 * b1) + (a2 * b2) + (a3 * b3) + …

but because of the inaccuracies of floating point math, the calculation performed by the D code must be described as:

s = ((((0 + a1 * b1) + a2 * b2) + a3 * b3) + … )

The parentheses are important: they signify the order of operations, which is absolutely critical for the outcome. From this equation we can see that the multiplications can be done in any order and in parallel, but the additions must be performed sequentially, in one specific order.

Let’s see what assembly code LDC 1.1.0 generates (AT&T dialect). I’ll only show the meat of the dot function, in this case the 4-fold-unrolled inner loop:

ldc2 -c -output-s -O3 -release dot.d

LBB0_4:
  movsd (%rcx,%rdi,8), %xmm1
  mulsd (%rsi,%rdi,8), %xmm1
  addsd %xmm0, %xmm1
  movsd 8(%rcx,%rdi,8), %xmm0
  mulsd 8(%rsi,%rdi,8), %xmm0
  addsd %xmm1, %xmm0
  movsd 16(%rcx,%rdi,8), %xmm1
  mulsd 16(%rsi,%rdi,8), %xmm1
  addsd %xmm0, %xmm1
  movsd 24(%rcx,%rdi,8), %xmm0
  mulsd 24(%rsi,%rdi,8), %xmm0
  addsd %xmm1, %xmm0
  addq  $4, %rdi
  cmpq  %rdi, %rdx
  jne LBB0_4

No vectorization: it’s just sequential multiply-add-multiply-add-multiply-add-…

But wait, I forgot to tell the compiler that my CPU is newer than the default core2 target (nota bene: at least on OS X, LDC by default compiles your code for a 10-year-old processor; ldc2 -mcpu=help prints the list of recognized CPU names). In this case it doesn’t matter much:

ldc2 -c -output-s -O3 -release dot.d -mcpu=haswell

...
  vmovsd  24(%rcx,%rdi,8), %xmm2
  vmulsd  24(%rsi,%rdi,8), %xmm2, %xmm2
  vaddsd  %xmm1, %xmm0, %xmm0
  vmovsd  32(%rcx,%rdi,8), %xmm1
  vmulsd  32(%rsi,%rdi,8), %xmm1, %xmm1
  vaddsd  %xmm2, %xmm0, %xmm0
...

Don’t be fooled by the “v” seemingly implying “vector”: those multiply and add instructions are mere scalar double-precision operations. The loop is now unrolled 8-fold, and that’s all that changed. Without relaxing some of the constraints on the mathematics, this is the best the compiler can do.

Fused multiply-add

One performance improvement is to allow the compiler to ‘fuse’ a * b + c into the single instruction fma(a, b, c), which computes the multiply and add with only one rounding step at the end. This changes the outcome of the calculation, but for many use cases it won’t matter. Note: the result using FMA is not necessarily worse than before (it may well be more accurate); it’s just different, and that’s why the compiler can’t use FMA by default.
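
You can observe the single rounding with std.math.fma. A small sketch (the values are hand-picked so that the difference shows; the exact outputs assume x86_64):

import std.math : fma;
import std.stdio : writefln;

void main() {
    double a = 1.0 + 0x1p-27;    // 1 + 2^-27, exactly representable
    double p = a * a;            // exact square is 1 + 2^-26 + 2^-54; it rounds to 1 + 2^-26
    double err = fma(a, a, -p);  // single rounding: recovers the error of the multiplication
    writefln("p   = %a", p);     // 0x1.0000004p+0 (1 + 2^-26)
    writefln("err = %a", err);   // 0x1p-54: fma(a, a, -p) differs from a * a - p, which is 0
}
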

We can instruct LLVM to allow “unsafe-fp-math” transformations such as FMA fusion using the LDC attribute @llvmAttr("unsafe-fp-math", "true") (using this low-level attribute directly is not recommended; please read on).

import ldc.attributes : llvmAttr;
@llvmAttr("unsafe-fp-math", "true")
double dot(double[] a, double[] b) {
    double s = 0;
    foreach (size_t i; 0 .. a.length) {
        s += a[i] * b[i];
    }
    return s;
}

ldc2 -c -output-s -O3 -release dot.d -mcpu=haswell

...
  vmovsd  (%rcx,%rdi,8), %xmm1
  vfmadd132sd (%rsi,%rdi,8), %xmm0, %xmm1
  vmovsd  8(%rcx,%rdi,8), %xmm0
  vfmadd132sd 8(%rsi,%rdi,8), %xmm1, %xmm0
...

LLVM replaced the separate multiply and add instructions with scalar fused multiply-add instructions, clearly a win. But the code is still not using the vector instructions of the CPU.

Vectorization and @fastmath

For vectorization of our code, there is one last piece missing: telling LLVM that it’s OK to do whatever it wants with our floating point operations: e.g. the code does not have to support NaN or infinity, and LLVM may change the order (!) of the add operations in our dot product. LLVM’s Fast Math Flags are used for this, and you can apply them to all floating point operations in a function using the (low-level, not recommended) LDC attribute @llvmFastMathFlag. We have combined @llvmAttr and @llvmFastMathFlag into the @fastmath attribute.

I recommend using @fastmath instead of the two low-level attributes; it is much more resilient to LLVM changes. The burden is on LDC to make @fastmath work with whatever attributes LLVM may require in the future.
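
For the curious: at the time of writing, @fastmath is essentially just an alias that bundles the two low-level attributes. A sketch of its definition, closely following ldc.attributes (check the module source for the authoritative version):

import std.meta : AliasSeq;
import ldc.attributes : llvmAttr, llvmFastMathFlag;

// @fastmath applies the LLVM function attribute plus the "fast" fast-math
// flag on every floating point instruction in the function.
alias fastmath = AliasSeq!(
    llvmAttr("unsafe-fp-math", "true"),
    llvmFastMathFlag("fast")
);
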

With @fastmath the result becomes:

import ldc.attributes : fastmath;
@fastmath
double dot(double[] a, double[] b) {
    double s = 0;
    foreach (size_t i; 0 .. a.length) {
        s += a[i] * b[i];
    }
    return s;
}

ldc2 -c -output-s -O3 -release dot.d -mcpu=haswell

LBB0_6:
  vmovupd (%rsi,%rdi,8), %ymm4
  vmovupd 32(%rsi,%rdi,8), %ymm5
  vmovupd 64(%rsi,%rdi,8), %ymm6
  vmovupd 96(%rsi,%rdi,8), %ymm7
  vfmadd132pd (%rcx,%rdi,8), %ymm0, %ymm4
  vfmadd132pd 32(%rcx,%rdi,8), %ymm1, %ymm5
  vfmadd132pd 64(%rcx,%rdi,8), %ymm2, %ymm6
  vfmadd132pd 96(%rcx,%rdi,8), %ymm3, %ymm7
  vmovupd 128(%rsi,%rdi,8), %ymm0
  vmovupd 160(%rsi,%rdi,8), %ymm1
  vmovupd 192(%rsi,%rdi,8), %ymm2
  vmovupd 224(%rsi,%rdi,8), %ymm3
  vfmadd132pd 128(%rcx,%rdi,8), %ymm4, %ymm0
  vfmadd132pd 160(%rcx,%rdi,8), %ymm5, %ymm1
  vfmadd132pd 192(%rcx,%rdi,8), %ymm6, %ymm2
  vfmadd132pd 224(%rcx,%rdi,8), %ymm7, %ymm3
  addq  $32, %rdi
  addq  $2, %rax
  jne LBB0_6

We now have a nicely vectorized dot-product function: each vfmadd132pd instruction performs a vectorized fused multiply-add, the equivalent of 4 scalar operations (a 256-bit ymm register holds 4 doubles). With 8 of these instructions per iteration, each loop iteration processes 32 vector elements.
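
In effect, the fast-math flags allow LLVM to rewrite the summation using independent partial sums (the assembly above keeps 32 of them, in ymm0 through ymm7), which is exactly the reordering that the sequential addition chain forbade. A hypothetical hand-written version of that rewrite, simplified to 4 partial sums, looks like this:

double dotReassociated(double[] a, double[] b) {
    // Four independent partial sums: they can live in the four lanes of a
    // single 256-bit register and be updated in parallel by one vector FMA.
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= a.length; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3); // horizontal reduction at the end
    foreach (j; i .. a.length)        // scalar loop for leftover elements
        s += a[j] * b[j];
    return s;
}
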

All D code, target independent

In the examples above, the D code did not specify the target architecture: all vectorization happens automatically. All credit must go to LLVM’s amazing backend, which is able to analyze and optimize the code so well. So when a processor with wider SIMD registers comes along, all you have to do is recompile for that CPU and the D code becomes even faster:
ldc2 -c -output-s -O3 -release dot.d -mcpu=knl results in vfmadd231pd (%rcx,%rdi,8), %zmm4, %zmm0 instructions that process 8 vector elements at a time (“knl” stands for Intel’s Knights Landing).
If your target is an ARM processor with NEON SIMD support, you automatically get vectorized FMA code too:

ldc2 -c -output-s -O3 -release dot.d -mtriple=aarch64-none-linux-gnu -mattr=+neon

.LBB0_5:
  ldp q2, q3, [x9, #-16]
  ldp q4, q5, [x10, #-16]
  add x9, x9, #32
  add x10, x10, #32
  sub x11, x11, #4
  fmla  v0.2d, v4.2d, v2.2d
  fmla  v1.2d, v5.2d, v3.2d
  cbnz  x11, .LBB0_5

Traits

The dot-product example above is a very simple function. For more complex algorithms, you can get better performance if you take the sizes of the CPU’s caches into account; the characteristics of the instruction and data caches are both important for performance tuning.

LDC 1.1.0 has two special __traits that give compile-time information about the compilation target: __traits(targetHasFeature, "...") and __traits(targetCPU). For example, __traits(targetHasFeature, "avx2") returns true or false depending on whether the compilation target supports AVX2 instructions, and __traits(targetCPU) returns the name of the targeted CPU as a string (e.g. "haswell").

Mir uses __traits(targetHasFeature, "...") to decide which vector types (say, a vector of 4 or 8 floats) to use in its algorithms, but I don’t know the exact how-and-why (yet? Ilya, fill us in on the details! ;-)).
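
Along those lines, here is a minimal sketch of compile-time dispatch on a target feature (the alias name is hypothetical, and this is not necessarily how Mir does it):

import core.simd;

// Pick a SIMD vector type at compile time, based on the target's features.
static if (__traits(targetHasFeature, "avx"))
    alias DoubleVec = double4; // a 256-bit AVX register holds 4 doubles
else
    alias DoubleVec = double2; // a 128-bit SSE2 register holds 2 doubles

pragma(msg, "Compiling for CPU: ", __traits(targetCPU));
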

Feedback

Constructive feedback, positive and negative, is always appreciated. The best way to get in touch with me and the other LDC developers is via the digitalmars.D.ldc forum and our Gitter chat.

Links to discussions:
Reddit
D forum thread