Given the following test program:
  #include <math.h>

  void test(float *a, float *b, int n) {
    for (int i = 0; i < n; i++)
      b[i] = sinf(a[i]);
  }
If we tell the compiler we have a vector-library available and compile it as follows:
$ clang -O2 --target=x86_64-unknown-linux -march=btver2 -mllvm -vector-library=SVML -S test.c
The loop will be vectorized with a vectorization factor of 8, and the call to sinf will be widened to a vector library call (__svml_sinf8):
  .LBB0_6:                              # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vmovups (%r12,%r13,4), %ymm0
        vmovups 32(%r12,%r13,4), %ymm1
        vmovups 64(%r12,%r13,4), %ymm3
        vmovups 96(%r12,%r13,4), %ymm2
        vmovups %ymm1, (%rsp)           # 32-byte Spill
        vmovups %ymm3, 32(%rsp)         # 32-byte Spill
        vmovups %ymm2, 96(%rsp)         # 32-byte Spill
        callq   __svml_sinf8
        vmovups %ymm0, 64(%rsp)         # 32-byte Spill
        vmovups (%rsp), %ymm0           # 32-byte Reload
        callq   __svml_sinf8
        vmovups %ymm0, (%rsp)           # 32-byte Spill
        vmovups 32(%rsp), %ymm0         # 32-byte Reload
        callq   __svml_sinf8
        vmovups %ymm0, 32(%rsp)         # 32-byte Spill
        vmovups 96(%rsp), %ymm0         # 32-byte Reload
        callq   __svml_sinf8
        vmovups 64(%rsp), %ymm1         # 32-byte Reload
        vmovups (%rsp), %ymm3           # 32-byte Reload
        vmovups 32(%rsp), %ymm2         # 32-byte Reload
        vmovups %ymm1, (%r14,%r13,4)
        vmovups %ymm3, 32(%r14,%r13,4)
        vmovups %ymm2, 64(%r14,%r13,4)
        vmovups %ymm0, 96(%r14,%r13,4)
        addq    $32, %r13
        cmpq    %r13, %rbx
        jne     .LBB0_6
However, as can be seen, the generated code is poor: it contains a large number of spills and reloads. The reason is that the loop vectorizer has chosen an interleave count (aka unroll factor) of 4.
In general, the heuristic tries to create parallel instances of the loop to expose ILP without causing spilling, basing its decision on the number of registers used in the loop and the number of registers available. However, due to the way the instances are interleaved, the vector call forces the registers of the other instances to be spilled, defeating the heuristic.
This patch changes the heuristic to use an interleave count of 1 when a call will be vectorized to a library call. The test above now generates:
  .LBB0_6:                              # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vmovups (%r12,%r13,4), %ymm0
        callq   __svml_sinf8
        vmovups %ymm0, (%r14,%r13,4)
        addq    $8, %r13
        cmpq    %r13, %rbx
        jne     .LBB0_6