This changes the way we treat widening of induction variables.
In the existing code, whenever we need a widened IV, we widen the scalar IV on the fly by splatting it and adding the step vector.
Instead, we can create a real vector IV, which tends to save a couple of instructions per iteration. This patch only changes the behavior in the most basic case: integer primary IVs with a constant step. If this looks sensible, I'll try to follow up with the other cases.
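Concretely, the difference at the IR level looks roughly like this (a hand-written sketch, not LV's exact output; value names like %vec.iv and %vec.ind are illustrative). Without the patch, the widened IV is rebuilt from the scalar IV on every iteration:

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  ; splat the scalar IV into all lanes...
  %iv.trunc = trunc i64 %index to i32
  %iv.bcast = insertelement <4 x i32> undef, i32 %iv.trunc, i32 0
  %iv.splat = shufflevector <4 x i32> %iv.bcast, <4 x i32> undef, <4 x i32> zeroinitializer
  ; ...and add the step vector to get the widened IV
  %vec.iv = add <4 x i32> %iv.splat, <i32 0, i32 1, i32 2, i32 3>
  ...
  %index.next = add i64 %index, 4

With the patch, the widened IV becomes a phi of its own, advanced by a splat of VF * step:

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  ; the widened IV is a real vector phi
  %vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  ...
  %vec.ind.next = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
  %index.next = add i64 %index, 4

The per-iteration splat/shuffle disappears from the loop body; the cost is one extra vector phi plus a vector add, which in the AVX example below lowers to a single vpaddd.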
It seems to be more or less performance neutral, but for basic cases the code looks better, so I have the feeling this is a step in the right direction.
To take the most trivial example:
void vec(unsigned int *a, unsigned int k) {
#pragma clang loop vectorize_width(4) interleave_count(1)
#pragma nounroll
  for (unsigned int i = 0; i < k; ++i)
    a[i] = i;
}

For AVX, without this patch, we get:
# BB#5:
xorl %ecx, %ecx
vmovdqa .LCPI0_0(%rip), %xmm0 # xmm0 = [0,1,2,3]
.p2align 4, 0x90
.LBB0_6: # =>This Inner Loop Header: Depth=1
vmovd %ecx, %xmm1
vpshufd $0, %xmm1, %xmm1 # xmm1 = xmm1[0,0,0,0]
vpaddd %xmm0, %xmm1, %xmm1
vmovdqu %xmm1, (%rdi,%rcx,4)
addq $4, %rcx
cmpq %rcx, %rdx
jne .LBB0_6
And with this patch:
# BB#5: # %vector.body.preheader
vmovdqa .LCPI0_0(%rip), %xmm1 # xmm1 = [0,1,2,3]
vmovdqa .LCPI0_1(%rip), %xmm0 # xmm0 = [4,4,4,4]
movq %rdi, %rcx
movq %r8, %rdx
.p2align 4, 0x90
.LBB0_6: # %vector.body
# =>This Inner Loop Header: Depth=1
vmovdqu %xmm1, (%rcx)
vpaddd %xmm0, %xmm1, %xmm1
addq $16, %rcx
addq $-4, %rdx
jne .LBB0_6

As this example shows, when we actually need the scalar IV, e.g. for a scalar GEP, InstCombine seems to clean things up nicely, so it doesn't look like LV needs to consider that.
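To illustrate what that case looks like (again a hand-written sketch with illustrative names, typed-pointer syntax, not actual output): in the example loop the store address needs the scalar IV while the stored values need the widened one, so both phis simply stay live side by side, and InstCombine folds away any leftover redundancy between them:

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  ; scalar use of the IV: the address comes from the scalar phi
  %gep = getelementptr inbounds i32, i32* %a, i64 %index
  %vptr = bitcast i32* %gep to <4 x i32>*
  ; vector use of the IV: the stored values come from the vector phi
  store <4 x i32> %vec.ind, <4 x i32>* %vptr, align 4
  %vec.ind.next = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
  %index.next = add i64 %index, 4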
Other views (especially on when this may be a bad thing) are welcome.
> step is always a SCEV now.

You mean a step which is a constant SCEV?