This changes the way we treat widening of induction variables.
In the existing code, whenever we need a widened IV, we widen the scalar IV on the fly by splatting it and adding the step vector.
Instead, we can create a real vector IV, which tends to save a couple of instructions per iteration. This patch only changes the behavior in the most basic case: integer primary IVs with a constant step. If this looks sensible, I'll try to follow up with the other cases.
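To make the difference concrete, here is a rough hand-written IR sketch for VF=4 (illustrative only; the value names, and the %iv operand standing in for the scalar IV truncated to the element type, are mine, not the vectorizer's actual output). Today the widened IV is recomputed inside the vector body on every iteration:

  %splatinsert = insertelement <4 x i32> undef, i32 %iv, i32 0
  %splat       = shufflevector <4 x i32> %splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %widened.iv  = add <4 x i32> %splat, <i32 0, i32 1, i32 2, i32 3>

With the patch, the widened IV becomes a real recurrence: a phi that starts at <0,1,2,3> and is bumped by VF * step each iteration:

  %vec.ind      = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ],
                                [ %vec.ind.next, %vector.body ]
  ...
  %vec.ind.next = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>

The per-iteration splat goes away and only the single vector add for the bump remains in the loop body, which is where the couple of saved instructions come from.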
It seems to be more or less performance-neutral, but for basic cases the generated code looks better, so I have the feeling this is a step in the right direction.
To take the most trivial example:
void vec(unsigned int *a, unsigned int k) {
  #pragma clang loop vectorize_width(4) interleave_count(1)
  #pragma nounroll
  for (unsigned int i = 0; i < k; ++i)
    a[i] = i;
}
For AVX, without this patch, we get:
# BB#5:
        xorl    %ecx, %ecx
        vmovdqa .LCPI0_0(%rip), %xmm0   # xmm0 = [0,1,2,3]
        .p2align        4, 0x90
.LBB0_6:                                # =>This Inner Loop Header: Depth=1
        vmovd   %ecx, %xmm1
        vpshufd $0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
        vpaddd  %xmm0, %xmm1, %xmm1
        vmovdqu %xmm1, (%rdi,%rcx,4)
        addq    $4, %rcx
        cmpq    %rcx, %rdx
        jne     .LBB0_6
And with this patch:
# BB#5:                                 # %vector.body.preheader
        vmovdqa .LCPI0_0(%rip), %xmm1   # xmm1 = [0,1,2,3]
        vmovdqa .LCPI0_1(%rip), %xmm0   # xmm0 = [4,4,4,4]
        movq    %rdi, %rcx
        movq    %r8, %rdx
        .p2align        4, 0x90
.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vmovdqu %xmm1, (%rcx)
        vpaddd  %xmm0, %xmm1, %xmm1
        addq    $16, %rcx
        addq    $-4, %rdx
        jne     .LBB0_6
As this example shows, when we actually need the scalar IV as well, e.g. for a scalar GEP, InstCombine seems to clean things up nicely, so it doesn't look like LV needs to handle that case specially.
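Purely for illustration, a plausible shape for the vector body of the loop above would be something like the following (a hand-written sketch for VF=4; I'm assuming the canonical scalar induction %index is kept alongside the new vector IV, and the value names are mine, not actual LV output):

  vector.body:
    %index   = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ],
                             [ %vec.ind.next, %vector.body ]
    %gep  = getelementptr inbounds i32, i32* %a, i64 %index
    %cast = bitcast i32* %gep to <4 x i32>*
    store <4 x i32> %vec.ind, <4 x i32>* %cast, align 4
    %index.next   = add i64 %index, 4
    %vec.ind.next = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
    %done = icmp eq i64 %index.next, %n.vec
    br i1 %done, label %middle.block, label %vector.body

In the final assembly above, the scalar address computation has already been turned into a plain pointer bump (addq $16, %rcx), so carrying the scalar IV alongside the vector one doesn't appear to cost anything in practice.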
Other views (especially on when this may be a bad thing) are welcome.