This is an archive of the discontinued LLVM Phabricator instance.

[LV] For some induction variables, use vector phis instead of widening the scalar in the loop body
ClosedPublic

Authored by mkuper on May 16 2016, 5:56 PM.

Details

Summary

This changes the way we treat widening of induction variables.

In the existing code, whenever we need a widened IV, we widen the scalar IV on the fly, by splatting it and adding the step vector.
Instead, we can create a real vector IV, which tends to save a couple of instructions per iteration. This patch only changes the behavior in the most basic case - integer primary IVs with a constant step. If this looks sensible, I'll try to follow up with the other cases.
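
To make this concrete, here is a rough before/after sketch of the IR shape for the C loop below -- illustrative only: the value names, the <4 x i32> type, and the VF of 4 are assumptions for this example, and the exact IR the vectorizer emits differs in detail.

Before, the loop body rebuilds the wide IV from the scalar IV every iteration:

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  ; splat the scalar IV, then add the step vector
  %trunc = trunc i64 %index to i32
  %splatinsert = insertelement <4 x i32> undef, i32 %trunc, i32 0
  %splat = shufflevector <4 x i32> %splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %wide.iv = add <4 x i32> %splat, <i32 0, i32 1, i32 2, i32 3>
  ...
  %index.next = add i64 %index, 4

After, a real vector phi carries the widened IV across iterations:

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  ...
  %vec.ind.next = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
  %index.next = add i64 %index, 4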

It seems to be more or less performance-neutral, but for basic cases the code looks better, so I have the feeling this is a step in the right direction.
To take the most trivial example:

void vec(unsigned int *a, unsigned int k) {
#pragma clang loop vectorize_width(4) interleave_count(1)
#pragma nounroll
  for(unsigned int i = 0; i < k; ++i)
    a[i] = i;
}

For AVX, without this patch, we get:

# BB#5:
	xorl	%ecx, %ecx
	vmovdqa	.LCPI0_0(%rip), %xmm0   # xmm0 = [0,1,2,3]
	.p2align	4, 0x90
.LBB0_6:                                # =>This Inner Loop Header: Depth=1
	vmovd	%ecx, %xmm1
	vpshufd	$0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
	vpaddd	%xmm0, %xmm1, %xmm1
	vmovdqu	%xmm1, (%rdi,%rcx,4)
	addq	$4, %rcx
	cmpq	%rcx, %rdx
	jne	.LBB0_6

And with this patch:

# BB#5:                                 # %vector.body.preheader
	vmovdqa	.LCPI0_0(%rip), %xmm1   # xmm1 = [0,1,2,3]
	vmovdqa	.LCPI0_1(%rip), %xmm0   # xmm0 = [4,4,4,4]
	movq	%rdi, %rcx
	movq	%r8, %rdx
	.p2align	4, 0x90
.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovdqu	%xmm1, (%rcx)
	vpaddd	%xmm0, %xmm1, %xmm1
	addq	$16, %rcx
	addq	$-4, %rdx
	jne	.LBB0_6

As this example shows, when we actually need the scalar IV, e.g. for a scalar GEP, InstCombine seems to clean things up nicely, so it doesn't look like LV needs to consider that.
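
For illustration, a hedged sketch of how the two IVs can coexist after cleanup, again with assumed names and types mirroring the example above: the scalar address computation keeps using the scalar IV, while the stored value comes straight from the vector phi.

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  ; the scalar GEP uses the scalar IV directly
  %gep = getelementptr inbounds i32, i32* %a, i64 %index
  %bc = bitcast i32* %gep to <4 x i32>*
  ; the stored value is the vector IV itself
  store <4 x i32> %vec.ind, <4 x i32>* %bc, align 4
  %vec.ind.next = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
  %index.next = add i64 %index, 4
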
Other views (especially on when this may be a bad thing) are welcome.

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 57415.May 16 2016, 5:56 PM
mkuper retitled this revision from to [LV] For some induction variables, use vector phis instead of widening the scalar in the loop body.
mkuper updated this object.
mkuper added reviewers: delena, jmolloy, danielcdh.
mkuper added subscribers: llvm-commits, wmi, Ayal, davidxl.

A test case explicitly testing this should probably be added.

I tried the patch with the following program:

__attribute__((noinline)) long long hot() {
  long long x = 0;

#pragma clang loop vectorize_width(4) interleave_count(1)
#pragma nounroll
  for (int i = 0; i < 1000; i++) {
    x += i ^ 2;  // note: ^ is XOR, not exponentiation
  }

  return x;
}

It improves performance by about 10%. However, the generated code is still not optimal -- there are unnecessary vector IV copies that could be moved out of the loop or removed. Perhaps a follow-up patch to address that?

LBB0_1: # =>This Inner Loop Header: Depth=1

movdqa  %xmm4, %xmm5                                <---- here
paddd   %xmm1, %xmm5
pxor    %xmm2, %xmm4
pshufd  $78, %xmm4, %xmm6       # xmm6 = xmm4[2,3,0,1]
movdqa  %xmm6, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm6    # xmm6 = xmm6[0],xmm7[0],xmm6[1],xmm7[1]
movdqa  %xmm4, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm4    # xmm4 = xmm4[0],xmm7[0],xmm4[1],xmm7[1]
paddq   %xmm4, %xmm0
paddq   %xmm6, %xmm3
addl    $-4, %eax
movdqa  %xmm5, %xmm4                      <----- here
jne     .LBB0_1

Right, I'll add an explicit test (the test in induction_plus is *sort of* that, but not quite), thanks David!

Regarding the extra movdqa - I think the copy may be necessary.
The problem is that the loop body needs both to modify the current IV in place (because of two-address instructions) and to keep its value around so that it can generate the next IV.

GCC does something similar:

jmp	.L3
[...]
.L5:
	movdqa	%xmm4, %xmm1
.L3:
	movdqa	%xmm1, %xmm4
	pxor	%xmm6, %xmm1
	movdqa	%xmm5, %xmm2
	addl	$1, %eax
	cmpl	$250, %eax
	paddd	%xmm7, %xmm4
	pcmpgtd	%xmm1, %xmm2
	movdqa	%xmm1, %xmm3
	punpckhdq	%xmm2, %xmm1
	punpckldq	%xmm2, %xmm3
	paddq	%xmm3, %xmm0
	paddq	%xmm1, %xmm0
	jne	.L5

The difference is that the loop is rotated, so there is no copy on the first iteration, but other than that, we still have the same two movdqas per iteration.

In any case, the extra mov disappears once we have AVX and three-address instructions, so we no longer update the current IV destructively.

mkuper updated this revision to Diff 57666.May 18 2016, 12:50 PM

Added tests per David's suggestion.

wmi added a comment.May 20 2016, 10:57 AM

However, the generated code is still not optimal -- there are unnecessary vector IV copies that could be moved out of the loop or removed. Perhaps a follow-up patch to address that?

LBB0_1: # =>This Inner Loop Header: Depth=1

movdqa  %xmm4, %xmm5                                <---- here
paddd   %xmm1, %xmm5
pxor    %xmm2, %xmm4
pshufd  $78, %xmm4, %xmm6       # xmm6 = xmm4[2,3,0,1]
movdqa  %xmm6, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm6    # xmm6 = xmm6[0],xmm7[0],xmm6[1],xmm7[1]
movdqa  %xmm4, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm4    # xmm4 = xmm4[0],xmm7[0],xmm4[1],xmm7[1]
paddq   %xmm4, %xmm0
paddq   %xmm6, %xmm3
addl    $-4, %eax
movdqa  %xmm5, %xmm4                      <----- here
jne     .LBB0_1

I think the first movdqa can at least be promoted to the loop preheader.

The original generated code, with the loop preheader, is:

BB#0: # %entry

pxor    %xmm0, %xmm0
movdqa  .LCPI0_0(%rip), %xmm4   # xmm4 = [0,1,2,3]
movl    $1000, %eax             # imm = 0x3E8
movdqa  .LCPI0_1(%rip), %xmm1   # xmm1 = [4,4,4,4]
movdqa  .LCPI0_2(%rip), %xmm2   # xmm2 = [2,2,2,2]
pxor    %xmm3, %xmm3
.p2align        4, 0x90

.LBB0_1: # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqa  %xmm4, %xmm5
paddd   %xmm1, %xmm5
pxor    %xmm2, %xmm4
pshufd  $78, %xmm4, %xmm6       # xmm6 = xmm4[2,3,0,1]
movdqa  %xmm6, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm6    # xmm6 = xmm6[0],xmm7[0],xmm6[1],xmm7[1]
movdqa  %xmm4, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm4    # xmm4 = xmm4[0],xmm7[0],xmm4[1],xmm7[1]
paddq   %xmm4, %xmm0
paddq   %xmm6, %xmm3
addl    $-4, %eax
movdqa  %xmm5, %xmm4
jne     .LBB0_1

It is safe to move the "movdqa %xmm4, %xmm5" at the start of .LBB0_1 to the
end of all its predecessors: the end of BB#0 and the end of .LBB0_1 itself.

BB#0: # %entry

pxor    %xmm0, %xmm0
movdqa  .LCPI0_0(%rip), %xmm4   # xmm4 = [0,1,2,3]
movl    $1000, %eax             # imm = 0x3E8
movdqa  .LCPI0_1(%rip), %xmm1   # xmm1 = [4,4,4,4]
movdqa  .LCPI0_2(%rip), %xmm2   # xmm2 = [2,2,2,2]
pxor    %xmm3, %xmm3
movdqa  %xmm4, %xmm5  ==> promoted to preheader
.p2align        4, 0x90

.LBB0_1: # %vector.body
                                        # =>This Inner Loop Header: Depth=1
paddd   %xmm1, %xmm5
pxor    %xmm2, %xmm4
pshufd  $78, %xmm4, %xmm6       # xmm6 = xmm4[2,3,0,1]
movdqa  %xmm6, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm6    # xmm6 = xmm6[0],xmm7[0],xmm6[1],xmm7[1]
movdqa  %xmm4, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm4    # xmm4 = xmm4[0],xmm7[0],xmm4[1],xmm7[1]
paddq   %xmm4, %xmm0
paddq   %xmm6, %xmm3
addl    $-4, %eax
movdqa  %xmm5, %xmm4
movdqa  %xmm4, %xmm5  ==> apparently redundant and will be deleted
jne     .LBB0_1

I think this is actually a weakness in register coalescing. I already
have a similar testcase which probably has the same cause. It is good
to have another one now; it shows the problem may be more general than
I thought, and justifies more work to improve it. I will file a separate
bug for it.

Thanks,
Wei.

wmi added a comment.May 20 2016, 1:16 PM

Looks like Phabricator mangled my mail and made it hard to read.

Filed a bug for the partially redundant mov problem:
https://llvm.org/bugs/show_bug.cgi?id=27827

Thanks,
Wei.

delena edited edge metadata.May 31 2016, 10:33 AM

Some minor comments. Can you show IR before and after your change?

lib/Transforms/Vectorize/LoopVectorize.cpp
427 ↗(On Diff #57666)

Step is always a SCEV now. You mean a step which is a constant SCEV.

2118 ↗(On Diff #57666)

Check (Step != 0) here.

4110 ↗(On Diff #57666)

do we need this code if VF==1?

test/Transforms/LoopVectorize/X86/gather_scatter.ll
98 ↗(On Diff #57666)

I propose to remove variable name from this test.

Thanks, Elena!

lib/Transforms/Vectorize/LoopVectorize.cpp
427 ↗(On Diff #57666)

Right, thanks!

2118 ↗(On Diff #57666)

There's already an assert that this is non-null before each callsite. Do you want another assert here?

4110 ↗(On Diff #57666)

We certainly need the loop below the getBroadcastInstrs call (note that it loops up to UF, not VF).
As for getBroadcastInstrs itself - InnerLoopUnroller::getBroadcastInstrs() is a no-op, which exists, I think, precisely so that we don't need to special-case VF==1.

test/Transforms/LoopVectorize/X86/gather_scatter.ll
98 ↗(On Diff #57666)

Sure.

delena added inline comments.May 31 2016, 1:03 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2118 ↗(On Diff #57666)

Probably yes..

2121 ↗(On Diff #57666)

I assume that loop-invariant code will be hoisted anyway.

4295 ↗(On Diff #57666)

This line should be moved down, under the "if (VF==1)".

4306 ↗(On Diff #57666)

You don't need the assert here; you are under the "if".

mkuper added inline comments.May 31 2016, 1:26 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2121 ↗(On Diff #57666)

Probably, but since we know this belongs in the preheader, I'd prefer to put it there to begin with, if it's easy - and it seems like it is.
Or are you afraid this may be wrong?

(The reason I'm doing the saveIP/restoreIP dance instead of just creating a new IRBuilder is that getStepVector expects Builder to point at the right location. I can refactor this to have getStepVector accept an IRBuilder parameter instead, if you think that'll look better.)
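
For reference, a hedged sketch of the preheader code being discussed, assuming a non-constant scalar start value %start (the names are made up, and the committed code may spell this differently):

vector.ph:
  ; splat the start value and add the step vector once, outside the loop
  %splatinsert = insertelement <4 x i32> undef, i32 %start, i32 0
  %splat = shufflevector <4 x i32> %splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %splat, <i32 0, i32 1, i32 2, i32 3>
  br label %vector.body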

4295 ↗(On Diff #57666)

Right, thanks.

4306 ↗(On Diff #57666)

Right, this was just for equivalence with the other callsite. I'll sink the assert into the function, like you suggested above.

mkuper updated this revision to Diff 59148.May 31 2016, 4:13 PM
mkuper edited edge metadata.

Updated with Elena's comments.

delena accepted this revision.May 31 2016, 11:20 PM
delena edited edge metadata.
This revision is now accepted and ready to land.May 31 2016, 11:20 PM
This revision was automatically updated to reflect the committed changes.