This is an archive of the discontinued LLVM Phabricator instance.

Differential D20315

[LV] For some induction variables, use vector phis instead of widening the scalar in the loop body
ClosedPublic

Authored by mkuper on May 16 2016, 5:56 PM.

Details

Reviewers

anemet
delena
danielcdh
jmolloy
hfinkel

Commits

rG3a3c64d23e3d: [LV] For some IVs, use vector phis instead of widening in the loop body
rL271410: [LV] For some IVs, use vector phis instead of widening in the loop body

Summary

This changes the way we treat widening of induction variables.

In the existing code, whenever we need a widened IV, we widen the scalar IV on the fly, by splatting it and adding the step vector.
Instead, we can create a real vector IV, which tends to save a couple of instructions per iteration. This patch only changes the behavior in the most basic case - integer primary IVs with a constant step. If this looks sensible, I'll try to follow-up with the other cases.

It seems to be more or less performance neutral, but for basic cases the code looks better, so I have the feeling this is a step in the right direction.
To take the most trivial example:

void vec(unsigned int *a, unsigned int k) {
#pragma clang loop vectorize_width(4) interleave_count(1)
#pragma nounroll
  for(unsigned int i = 0; i < k; ++i)
    a[i] = i;
}

For AVX, without this patch, we get:

# BB#5:
	xorl	%ecx, %ecx
	vmovdqa	.LCPI0_0(%rip), %xmm0   # xmm0 = [0,1,2,3]
	.p2align	4, 0x90
.LBB0_6:                                # =>This Inner Loop Header: Depth=1
	vmovd	%ecx, %xmm1
	vpshufd	$0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
	vpaddd	%xmm0, %xmm1, %xmm1
	vmovdqu	%xmm1, (%rdi,%rcx,4)
	addq	$4, %rcx
	cmpq	%rcx, %rdx
	jne	.LBB0_6

And with this patch:

# BB#5:                                 # %vector.body.preheader
	vmovdqa	.LCPI0_0(%rip), %xmm1   # xmm1 = [0,1,2,3]
	vmovdqa	.LCPI0_1(%rip), %xmm0   # xmm0 = [4,4,4,4]
	movq	%rdi, %rcx
	movq	%r8, %rdx
	.p2align	4, 0x90
.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovdqu	%xmm1, (%rcx)
	vpaddd	%xmm0, %xmm1, %xmm1
	addq	$16, %rcx
	addq	$-4, %rdx
	jne	.LBB0_6

As this example shows, when we actually need the scalar IV, e.g. for a scalar GEP, InstCombine seems to clean things up nicely, so it doesn't look like LV needs to consider that.
Other views (especially on when this may be a bad thing) are welcome.
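
As an illustration only (this is not code from the patch), the two widening strategies from the summary can be modelled in plain C++ with a 4-lane array standing in for a vector register. The names `Vec`, `addLanes`, `widenScalarEachIteration` and `useVectorIV` are made up for this sketch, and it assumes a trip count that is a multiple of VF and a start of 0 with step 1, as in the `vec()` example above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr unsigned VF = 4;             // vectorization factor (lanes per vector)
using Vec = std::array<uint32_t, VF>;  // hypothetical stand-in for <4 x i32>

// Lane-wise add, modelling vpaddd.
Vec addLanes(const Vec &a, const Vec &b) {
  Vec r{};
  for (unsigned lane = 0; lane < VF; ++lane)
    r[lane] = a[lane] + b[lane];
  return r;
}

// Without the patch: every iteration splats the scalar IV and adds <0,1,2,3>.
std::vector<uint32_t> widenScalarEachIteration(uint32_t n) {
  assert(n % VF == 0 && "sketch only handles whole vector iterations");
  std::vector<uint32_t> out(n);
  const Vec stepVector = {0, 1, 2, 3};
  for (uint32_t i = 0; i < n; i += VF) {
    Vec splat;
    splat.fill(i);                              // vmovd + vpshufd
    Vec widened = addLanes(splat, stepVector);  // vpaddd
    for (unsigned lane = 0; lane < VF; ++lane)
      out[i + lane] = widened[lane];
  }
  return out;
}

// With the patch: a vector IV is initialised once and bumped by <4,4,4,4>.
std::vector<uint32_t> useVectorIV(uint32_t n) {
  assert(n % VF == 0 && "sketch only handles whole vector iterations");
  std::vector<uint32_t> out(n);
  Vec vecInd = {0, 1, 2, 3};             // built once, outside the loop
  const Vec splatVF = {VF, VF, VF, VF};  // loop-invariant increment
  for (uint32_t i = 0; i < n; i += VF) {
    for (unsigned lane = 0; lane < VF; ++lane)
      out[i + lane] = vecInd[lane];      // store the current vector IV
    vecInd = addLanes(vecInd, splatVF);  // the only per-iteration vector op
  }
  return out;
}
```

Both functions fill the output with 0, 1, 2, ...; the second does it with one vector add per iteration instead of a splat plus an add, which matches the shape of the two AVX listings.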

Diff Detail

Event Timeline

mkuper updated this revision to Diff 57415. May 16 2016, 5:56 PM

mkuper retitled this revision from to [LV] For some induction variables, use vector phis instead of widening the scalar in the loop body.

mkuper updated this object.

mkuper added reviewers: delena, jmolloy, danielcdh.

mkuper added subscribers: llvm-commits, wmi, Ayal, davidxl.

Herald added a subscriber: mzolotukhin. May 16 2016, 5:56 PM

Test case explicitly testing this should probably be added.

I tried the patch with the following program:

__attribute__((noinline)) long long hot() {
  long long x = 0;

#pragma clang loop vectorize_width(4) interleave_count(1)
#pragma nounroll
  for (int i = 0; i < 1000; i++) {
    x += i ^ 2;  // ^ is XOR here (the pxor in the listing below), not a power
  }

  return x;
}

It improves performance by about 10%. However, the generated code is still not optimal: there is an unnecessary copy of the vector IV that could be moved out of the loop or removed. Perhaps a follow-up patch could address that?

LBB0_1: # =>This Inner Loop Header: Depth=1

movdqa  %xmm4, %xmm5                                <---- here
paddd   %xmm1, %xmm5
pxor    %xmm2, %xmm4
pshufd  $78, %xmm4, %xmm6       # xmm6 = xmm4[2,3,0,1]
movdqa  %xmm6, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm6    # xmm6 = xmm6[0],xmm7[0],xmm6[1],xmm7[1]
movdqa  %xmm4, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm4    # xmm4 = xmm4[0],xmm7[0],xmm4[1],xmm7[1]
paddq   %xmm4, %xmm0
paddq   %xmm6, %xmm3
addl    $-4, %eax
movdqa  %xmm5, %xmm4                      <----- here
jne     .LBB0_1

Right, I'll add an explicit test (the test in induction_plus is *sort of* that, but not quite), thanks David!

Regarding the extra movdqa - I think the copy may be necessary.
The problem is that the loop body needs both to modify the current IV (because of two-address instructions) and keep it so that it can generate the new IV.

GCC does something similar:

jmp	.L3
[...]
.L5:
	movdqa	%xmm4, %xmm1
.L3:
	movdqa	%xmm1, %xmm4
	pxor	%xmm6, %xmm1
	movdqa	%xmm5, %xmm2
	addl	$1, %eax
	cmpl	$250, %eax
	paddd	%xmm7, %xmm4
	pcmpgtd	%xmm1, %xmm2
	movdqa	%xmm1, %xmm3
	punpckhdq	%xmm2, %xmm1
	punpckldq	%xmm2, %xmm3
	paddq	%xmm3, %xmm0
	paddq	%xmm1, %xmm0
	jne	.L5

The difference is that the loop is rotated, so there is no copy on the first iteration, but other than that, we still have the same two movdqas per iteration.

In any case, the extra mov disappears once we have AVX and three-address instructions, so we no longer update the current IV destructively.
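
A rough model of the point about two-address instructions, written as a sketch rather than anything taken from the compiler: if every operation overwrites its first operand, one loop iteration needs a copy to keep the IV alive across the xor, whereas a three-operand form does not. The function names below are invented for the illustration; `step` and `mask` correspond to the [4,4,4,4] and [2,2,2,2] constants in the listings.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec = std::array<uint32_t, 4>;  // hypothetical stand-in for an xmm register

// Two-address (SSE-style) operations: the destination is also a source,
// so its previous contents are lost.
void paddd(Vec &dst, const Vec &src) {
  for (unsigned lane = 0; lane < 4; ++lane) dst[lane] += src[lane];
}
void pxor(Vec &dst, const Vec &src) {
  for (unsigned lane = 0; lane < 4; ++lane) dst[lane] ^= src[lane];
}

// One iteration with destructive operations. Returns iv ^ mask and
// advances iv by step. Two register copies are needed.
Vec iterationTwoAddress(Vec &iv, const Vec &step, const Vec &mask) {
  Vec next = iv;      // movdqa: keep a copy, because pxor will clobber iv
  paddd(next, step);  // next = iv + step
  pxor(iv, mask);     // iv now holds iv ^ mask; the old IV is gone
  Vec result = iv;
  iv = next;          // movdqa: move the new IV back into its register
  return result;
}

// Three-address (AVX-style) operations write a separate destination.
Vec vpaddd(const Vec &a, const Vec &b) {
  Vec r = a;
  paddd(r, b);
  return r;
}
Vec vpxor(const Vec &a, const Vec &b) {
  Vec r = a;
  pxor(r, b);
  return r;
}

// The same iteration without any explicit copy of the IV.
Vec iterationThreeAddress(Vec &iv, const Vec &step, const Vec &mask) {
  Vec result = vpxor(iv, mask);  // iv is left intact
  iv = vpaddd(iv, step);         // destination may be the IV register itself
  return result;
}
```

Both variants compute the same values; only the number of register copies differs, which is the effect visible in the SSE and AVX listings.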

Added tests per David's suggestion.

mssimpso added a subscriber: mssimpso. May 18 2016, 2:46 PM

davidxl added a reviewer: hfinkel. May 19 2016, 10:38 AM

I think the first movdqa can be at least promoted to the loop preheader.

The original generated code with loop header is:

# BB#0:                                 # %entry
pxor    %xmm0, %xmm0
movdqa  .LCPI0_0(%rip), %xmm4   # xmm4 = [0,1,2,3]
movl    $1000, %eax             # imm = 0x3E8
movdqa  .LCPI0_1(%rip), %xmm1   # xmm1 = [4,4,4,4]
movdqa  .LCPI0_2(%rip), %xmm2   # xmm2 = [2,2,2,2]
pxor    %xmm3, %xmm3
.p2align        4, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqa  %xmm4, %xmm5
paddd   %xmm1, %xmm5
pxor    %xmm2, %xmm4
pshufd  $78, %xmm4, %xmm6       # xmm6 = xmm4[2,3,0,1]
movdqa  %xmm6, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm6    # xmm6 = xmm6[0],xmm7[0],xmm6[1],xmm7[1]
movdqa  %xmm4, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm4    # xmm4 = xmm4[0],xmm7[0],xmm4[1],xmm7[1]
paddq   %xmm4, %xmm0
paddq   %xmm6, %xmm3
addl    $-4, %eax
movdqa  %xmm5, %xmm4
jne     .LBB0_1

It is safe to move "movdqa %xmm4, %xmm5" from the start of LBB0_1 to the
end of all its predecessors: the end of BB#0 and the end of LBB0_1.

# BB#0:                                 # %entry
pxor    %xmm0, %xmm0
movdqa  .LCPI0_0(%rip), %xmm4   # xmm4 = [0,1,2,3]
movl    $1000, %eax             # imm = 0x3E8
movdqa  .LCPI0_1(%rip), %xmm1   # xmm1 = [4,4,4,4]
movdqa  .LCPI0_2(%rip), %xmm2   # xmm2 = [2,2,2,2]
pxor    %xmm3, %xmm3
movdqa  %xmm4, %xmm5  ==> promoted to preheader
.p2align        4, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
paddd   %xmm1, %xmm5
pxor    %xmm2, %xmm4
pshufd  $78, %xmm4, %xmm6       # xmm6 = xmm4[2,3,0,1]
movdqa  %xmm6, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm6    # xmm6 = xmm6[0],xmm7[0],xmm6[1],xmm7[1]
movdqa  %xmm4, %xmm7
psrad   $31, %xmm7
punpckldq       %xmm7, %xmm4    # xmm4 = xmm4[0],xmm7[0],xmm4[1],xmm7[1]
paddq   %xmm4, %xmm0
paddq   %xmm6, %xmm3
addl    $-4, %eax
movdqa  %xmm5, %xmm4
movdqa  %xmm4, %xmm5  ==> apparently redundant and will be deleted.
jne     .LBB0_1

I think this is actually a weakness in register coalescing. I already
have a similar testcase which probably has the same cause. It is good
to have another one now. It shows the problem may be more general than
I thought and justifies more work to improve it. I will file a separate
bug for it.

Thanks,
Wei.

It looks like Phabricator mangled my mail and made it hard to read.

Filed a bug for the partially redundant mov problem:
https://llvm.org/bugs/show_bug.cgi?id=27827

Thanks,
Wei.

Ping?

mkuper added a reviewer: anemet. May 26 2016, 4:05 PM

ping * 2

Some minor comments. Can you show IR before and after your change?

lib/Transforms/Vectorize/LoopVectorize.cpp
427	step is always a SCEV now. You mean step which is a constant SCEV.
2105	Check (Step != 0) here.
4097	do we need this code if VF==1?
test/Transforms/LoopVectorize/X86/gather_scatter.ll
98	I propose to remove variable name from this test.

Thanks, Elena!

lib/Transforms/Vectorize/LoopVectorize.cpp
427	Right, thanks!
2105	There's already an assert that this is non-null before each callsite. Do you want another assert here?
4097	We certainly need the loop below the getBroadcastInstrs call (note that it loops until UF, not VF). As to getBroadcastInstrs itself - InnerLoopUnroller::getBroadcastInstrs() is a nop, which exists, I think, precisely so that we don't need to special-case VF==1.
test/Transforms/LoopVectorize/X86/gather_scatter.ll
98	Sure.
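
To make the UF-versus-VF distinction in the reply above concrete, here is a small model of how one vector IV can feed several unrolled parts, loosely following the loop in widenInductionVariable from the diff. It is a sketch under assumptions: unit step, VF = 4, UF = 2, a trip count that is a multiple of VF * UF, and invented names (`Vec`, `interleavedVectorIV`).

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr unsigned VF = 4;  // lanes per vector
constexpr unsigned UF = 2;  // unroll (interleave) factor: vectors per iteration
using Vec = std::array<uint32_t, VF>;  // hypothetical stand-in for <4 x i32>

// Returns the IV values in the order the unrolled vector loop would see them.
std::vector<uint32_t> interleavedVectorIV(uint32_t start, uint32_t n) {
  assert(n % (VF * UF) == 0 && "sketch only handles whole unrolled iterations");
  std::vector<uint32_t> out;
  out.reserve(n);

  // Initial value of the vector IV: splat(start) + <0,1,2,3>.
  Vec vecInd;
  for (unsigned lane = 0; lane < VF; ++lane)
    vecInd[lane] = start + lane;

  for (uint32_t i = 0; i < n; i += VF * UF) {
    std::array<Vec, UF> entry;  // one widened value per unrolled part
    Vec last = vecInd;
    for (unsigned part = 0; part < UF; ++part) {
      entry[part] = last;
      for (unsigned lane = 0; lane < VF; ++lane)
        last[lane] += VF;       // "step.add": advance by one vector's worth
    }
    for (unsigned part = 0; part < UF; ++part)
      for (unsigned lane = 0; lane < VF; ++lane)
        out.push_back(entry[part][lane]);
    vecInd = last;              // value carried around the back edge
  }
  return out;
}
```

The inner loop runs UF times regardless of VF, so it is still needed when VF is 1 and the "vector" degenerates to a single lane.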

delena added inline comments. May 31 2016, 1:03 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
2105	Probably yes..
2108	I assume that loop invariant code will be hoisted anyway.
4282	this line should be moved down, under the if (VF==1)
4293	you don't need assert here, you are under the "if"

mkuper added inline comments. May 31 2016, 1:26 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
2108	Probably, but since we know this belongs in the preheader, I'd prefer to put it there to begin with, if it's easy - and it seems like it is. Or are you afraid this may be wrong? (The reason I'm doing the saveIP/restoreIP dance instead of just creating a new IRBuilder is that getStepVector expects Builder to point at the right location. I can refactor this to have getStepVector accept an IRBuilder parameter instead, if you think that'll look better.)
4282	Right, thanks.
4293	Right, this was just for equivalence with the other callsite. I'll sink the assert into the function, like you suggested above.
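
The saveIP/restoreIP pattern discussed for line 2108 can be pictured with a toy builder. Everything in this sketch (`ToyBuilder`, `InsertPointScope`, `emitVectorIV`, the instruction strings) is hypothetical and only mimics the shape of the idea: temporarily redirect where instructions are emitted, then put the insertion point back. LLVM's own IRBuilder also offers an RAII helper, InsertPointGuard, for the same purpose.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Block = std::vector<std::string>;  // a basic block as a list of names

// Hypothetical, heavily simplified builder: it appends to the current block.
struct ToyBuilder {
  Block *block = nullptr;
  Block *saveIP() const { return block; }
  void restoreIP(Block *ip) { block = ip; }
  void setInsertPoint(Block &b) { block = &b; }
  void emit(const std::string &inst) { block->push_back(inst); }
};

// RAII wrapper: remembers the insertion point and restores it on scope exit.
class InsertPointScope {
  ToyBuilder &builder;
  Block *saved;

public:
  explicit InsertPointScope(ToyBuilder &b) : builder(b), saved(b.saveIP()) {}
  ~InsertPointScope() { builder.restoreIP(saved); }
  InsertPointScope(const InsertPointScope &) = delete;
  InsertPointScope &operator=(const InsertPointScope &) = delete;
};

// Emits the loop-invariant setup into the preheader and the rest wherever
// the builder was pointing before the call.
void emitVectorIV(ToyBuilder &b, Block &preheader) {
  {
    InsertPointScope scope(b);
    b.setInsertPoint(preheader);
    b.emit("splat start");
    b.emit("add step vector");
  }  // the original insertion point is restored here
  b.emit("vec.ind = phi");
  b.emit("step.add");
}
```

Placing the initial value in the preheader directly, rather than relying on a later pass to hoist it, is the choice argued for in the reply above; the scope object just makes the restore hard to forget.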

Updated with Elena's comments.

delena accepted this revision. May 31 2016, 11:20 PM

delena edited edge metadata.

This revision is now accepted and ready to land. May 31 2016, 11:20 PM

Closed by commit rL271410: [LV] For some IVs, use vector phis instead of widening in the loop body (authored by mkuper). Jun 1 2016, 10:23 AM

This revision was automatically updated to reflect the committed changes.
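
The TruncType path in the diff below truncates the start value and the step once and then steps a narrow IV, instead of truncating a wide IV on every iteration. The sketch below shows why the two orders give the same result, using 64-bit to 32-bit truncation; the function names are invented and the arithmetic is deliberately done on unsigned types so that wrap-around is well defined.

```cpp
#include <cassert>
#include <cstdint>

// Step a 64-bit IV, then truncate the result to 32 bits.
uint32_t truncEachIteration(uint64_t start, int64_t step, uint64_t iter) {
  uint64_t wide = start + static_cast<uint64_t>(step) * iter;
  return static_cast<uint32_t>(wide);
}

// Truncate start and step once, then step entirely in 32 bits.
uint32_t truncOnce(uint64_t start, int64_t step, uint64_t iter) {
  uint32_t narrowStep = static_cast<uint32_t>(step);
  uint32_t iv = static_cast<uint32_t>(start);
  for (uint64_t i = 0; i < iter; ++i)
    iv += narrowStep;  // wraps modulo 2^32, matching the truncated wide value
  return iv;
}
```

The equality holds because truncation keeps only the low bits, and the low bits of a sum depend only on the low bits of its operands. The same argument does not carry over to sext/zext, as the comment in the cast-handling code notes.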

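Revision Contents

Path	Size
lib/Transforms/Vectorize/LoopVectorize.cpp	100 lines
test/Transforms/LoopVectorize/PowerPC/vsx-tsvc-s173.ll	2 lines
test/Transforms/LoopVectorize/X86/ (file name not preserved)	6 lines
(path not preserved)	2 lines
(path not preserved)	2 lines
(path not preserved)	11 lines
(path not preserved)	8 lines
(path not preserved)	9 lines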

Diff 57415

lib/Transforms/Vectorize/LoopVectorize.cpp

[...]
protected:
virtual Value *getStepVector(Value *Val, int StartIdx, Value *Step);		virtual Value *getStepVector(Value *Val, int StartIdx, Value *Step);

/// This function adds (StartIdx, StartIdx + Step, StartIdx + 2*Step, ...)		/// This function adds (StartIdx, StartIdx + Step, StartIdx + 2*Step, ...)
/// to each vector element of Val. The sequence starts at StartIndex.		/// to each vector element of Val. The sequence starts at StartIndex.
/// Step is a SCEV. In order to get StepValue it takes the existing value		/// Step is a SCEV. In order to get StepValue it takes the existing value
/// from SCEV or creates a new using SCEVExpander.		/// from SCEV or creates a new using SCEVExpander.
virtual Value *getStepVector(Value *Val, int StartIdx, const SCEV *Step);		virtual Value *getStepVector(Value *Val, int StartIdx, const SCEV *Step);

		/// Create a vector induction variable based on an existing scalar one.
		/// Currently only works for integer primary induction variables with
		/// a constant (non-SCEV) step.
		delena: step is always a SCEV now. You mean step which is a constant SCEV.
		mkuper: Right, thanks!
		/// If TruncType is provided, instead of widening the original IV, we
		/// widen a version of the IV truncated to TruncType.
		void widenInductionVariable(const InductionDescriptor &II, VectorParts &Entry,
		IntegerType *TruncType = nullptr);

/// When we go over instructions in the basic block we rely on previous		/// When we go over instructions in the basic block we rely on previous
/// values within the current basic block or on loop invariant values.		/// values within the current basic block or on loop invariant values.
/// When we widen (vectorize) values we place them in the map. If the values		/// When we widen (vectorize) values we place them in the map. If the values
/// are not within the map, they have to be loop invariant, so we simply		/// are not within the map, they have to be loop invariant, so we simply
/// broadcast them into a vector.		/// broadcast them into a vector.
VectorParts &getVectorValue(Value *V);		VectorParts &getVectorValue(Value *V);

/// Try to vectorize the interleaved access group that \p Instr belongs to.		/// Try to vectorize the interleaved access group that \p Instr belongs to.
[...]
Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx,
const SCEV *StepSCEV) {		const SCEV *StepSCEV) {
const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout();		const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout();
SCEVExpander Exp(*PSE.getSE(), DL, "induction");		SCEVExpander Exp(*PSE.getSE(), DL, "induction");
Value *StepValue = Exp.expandCodeFor(StepSCEV, StepSCEV->getType(),		Value *StepValue = Exp.expandCodeFor(StepSCEV, StepSCEV->getType(),
&*Builder.GetInsertPoint());		&*Builder.GetInsertPoint());
return getStepVector(Val, StartIdx, StepValue);		return getStepVector(Val, StartIdx, StepValue);
}		}

		void InnerLoopVectorizer::widenInductionVariable(const InductionDescriptor &II,
		VectorParts &Entry,
		IntegerType *TruncType) {
		Value *Start = II.getStartValue();
		ConstantInt *Step = II.getConstIntStepValue();

		delena: Check (Step != 0) here.
		mkuper: There's already an assert that this is non-null before each callsite. Do you want another assert here?
		delena: Probably yes..
		// Construct the initial value of the vector IV in the vector loop preheader
		auto CurrIP = Builder.saveIP();
		Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator());
		delena: I assume that loop invariant code will be hoisted anyway.
		mkuper: Probably, but since we know this belongs in the preheader, I'd prefer to put it there to begin with, if it's easy - and it seems like it is. Or are you afraid this may be wrong? (The reason I'm doing the saveIP/restoreIP dance instead of just creating a new IRBuilder is that getStepVector expects Builder to point at the right location. I can refactor this to have getStepVector accept an IRBuilder parameter instead, if you think that'll look better.)
		if (TruncType) {
		Step = ConstantInt::getSigned(TruncType, Step->getSExtValue());
		Start = Builder.CreateCast(Instruction::Trunc, Start, TruncType);
		}
		Value *SplatStart = Builder.CreateVectorSplat(VF, Start);
		Value *SteppedStart = getStepVector(SplatStart, 0, Step);
		Builder.restoreIP(CurrIP);

		Value *SplatVF =
		ConstantVector::getSplat(VF, ConstantInt::get(Start->getType(), VF));
		// We may need to add the step a number of times, depending on the unroll
		// factor. The last of those goes into the PHI.
		PHINode *VecInd = PHINode::Create(SteppedStart->getType(), 2, "vec.ind",
		&*LoopVectorBody->getFirstInsertionPt());
		Value *LastInduction = VecInd;
		for (unsigned Part = 0; Part < UF; ++Part) {
		Entry[Part] = LastInduction;
		LastInduction = Builder.CreateAdd(LastInduction, SplatVF, "step.add");
		}

		VecInd->addIncoming(SteppedStart, LoopVectorPreHeader);
		VecInd->addIncoming(LastInduction, LoopVectorBody);
		}

Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx,		Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx,
Value *Step) {		Value *Step) {
assert(Val->getType()->isVectorTy() && "Must be a vector");		assert(Val->getType()->isVectorTy() && "Must be a vector");
assert(Val->getType()->getScalarType()->isIntegerTy() &&		assert(Val->getType()->getScalarType()->isIntegerTy() &&
"Elem must be an integer");		"Elem must be an integer");
assert(Step->getType() == Val->getType()->getScalarType() &&		assert(Step->getType() == Val->getType()->getScalarType() &&
"Step has wrong type");		"Step has wrong type");
// Create the types.		// Create the types.
[...]
void InnerLoopVectorizer::widenPHIInstruction(

// FIXME: The newly created binary instructions should contain nsw/nuw flags,		// FIXME: The newly created binary instructions should contain nsw/nuw flags,
// which can be found from the original scalar operations.		// which can be found from the original scalar operations.
switch (II.getKind()) {		switch (II.getKind()) {
case InductionDescriptor::IK_NoInduction:		case InductionDescriptor::IK_NoInduction:
llvm_unreachable("Unknown induction");		llvm_unreachable("Unknown induction");
case InductionDescriptor::IK_IntInduction: {		case InductionDescriptor::IK_IntInduction: {
assert(P->getType() == II.getStartValue()->getType() && "Types must match");		assert(P->getType() == II.getStartValue()->getType() && "Types must match");
		if (P != OldInduction || VF == 1) {
		Value *V = Induction;
// Handle other induction variables that are now based on the		// Handle other induction variables that are now based on the
// canonical one.		// canonical one.
Value *V = Induction;
if (P != OldInduction) {		if (P != OldInduction) {
V = Builder.CreateSExtOrTrunc(Induction, P->getType());		V = Builder.CreateSExtOrTrunc(Induction, P->getType());
V = II.transform(Builder, V, PSE.getSE(), DL);		V = II.transform(Builder, V, PSE.getSE(), DL);
V->setName("offset.idx");		V->setName("offset.idx");
}		}
Value *Broadcasted = getBroadcastInstrs(V);		Value *Broadcasted = getBroadcastInstrs(V);
		delena: do we need this code if VF==1?
		mkuper: We certainly need the loop below the getBroadcastInstrs call (note that it loops until UF, not VF). As to getBroadcastInstrs itself - InnerLoopUnroller::getBroadcastInstrs() is a nop, which exists, I think, precisely so that we don't need to special-case VF==1.
// After broadcasting the induction variable we need to make the vector		// After broadcasting the induction variable we need to make the vector
// consecutive by adding 0, 1, 2, etc.		// consecutive by adding 0, 1, 2, etc.
for (unsigned part = 0; part < UF; ++part)		for (unsigned part = 0; part < UF; ++part)
Entry[part] = getStepVector(Broadcasted, VF * part, II.getStep());		Entry[part] = getStepVector(Broadcasted, VF * part, II.getStep());
		} else {
		// Instead of re-creating the vector IV by splatting the scalar IV
		// in each iteration, we can make a new independent vector IV.
		assert(II.getConstIntStepValue() &&
		"Primary induction variable should have a const step");
		widenInductionVariable(II, Entry);
		}
return;		return;
}		}
case InductionDescriptor::IK_PtrInduction:		case InductionDescriptor::IK_PtrInduction:
// Handle the pointer induction variable case.		// Handle the pointer induction variable case.
assert(P->getType()->isPointerTy() && "Unexpected type.");		assert(P->getType()->isPointerTy() && "Unexpected type.");
// This is the normalized GEP that starts counting at zero.		// This is the normalized GEP that starts counting at zero.
Value *PtrInd = Induction;		Value *PtrInd = Induction;
PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType());		PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType());
[...]
case Instruction::BitCast: {
/// Optimize the special case where the source is a constant integer		/// Optimize the special case where the source is a constant integer
/// induction variable. Notice that we can only optimize the 'trunc' case		/// induction variable. Notice that we can only optimize the 'trunc' case
/// because: a. FP conversions lose precision, b. sext/zext may wrap,		/// because: a. FP conversions lose precision, b. sext/zext may wrap,
/// c. other casts depend on pointer size.		/// c. other casts depend on pointer size.

if (CI->getOperand(0) == OldInduction &&		if (CI->getOperand(0) == OldInduction &&
it->getOpcode() == Instruction::Trunc) {		it->getOpcode() == Instruction::Trunc) {
InductionDescriptor II =		InductionDescriptor II =
Legal->getInductionVars()->lookup(OldInduction);		Legal->getInductionVars()->lookup(OldInduction);
if (auto StepValue = II.getConstIntStepValue()) {		if (auto StepValue = II.getConstIntStepValue()) {
StepValue = ConstantInt::getSigned(cast<IntegerType>(CI->getType()),		IntegerType *TruncType = cast<IntegerType>(CI->getType());
StepValue->getSExtValue());		StepValue = ConstantInt::getSigned(TruncType, StepValue->getSExtValue());
		delena: this line should be moved down, under the if (VF==1)
		mkuper: Right, thanks.
		if (VF == 1) {
Value *ScalarCast = Builder.CreateCast(CI->getOpcode(), Induction,		Value *ScalarCast = Builder.CreateCast(CI->getOpcode(), Induction,
CI->getType());		CI->getType());
Value *Broadcasted = getBroadcastInstrs(ScalarCast);		Value *Broadcasted = getBroadcastInstrs(ScalarCast);
for (unsigned Part = 0; Part < UF; ++Part)		for (unsigned Part = 0; Part < UF; ++Part)
Entry[Part] = getStepVector(Broadcasted, VF * Part, StepValue);		Entry[Part] = getStepVector(Broadcasted, VF * Part, StepValue);
		} else {
		// Truncating a vector induction variable on each iteration
		// may be expensive. Instead, truncate the initial value, and create
		// a new, truncated, vector IV based on that.
		assert(II.getConstIntStepValue() &&
		delena: you don't need assert here, you are under the "if"
		mkuper: Right, this was just for equivalence with the other callsite. I'll sink the assert into the function, like you suggested above.
		"Primary induction variable should have a const step");
		widenInductionVariable(II, Entry, TruncType);
		}
addMetadata(Entry, &*it);		addMetadata(Entry, &*it);
break;		break;
}		}
}		}
/// Vectorize casts.		/// Vectorize casts.
Type *DestTy =		Type *DestTy =
(VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF);		(VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF);

[...]

test/Transforms/LoopVectorize/PowerPC/vsx-tsvc-s173.ll

[...]
for.end: ; preds = %for.body3
%cmp = icmp slt i32 %inc11, %mul		%cmp = icmp slt i32 %inc11, %mul
br i1 %cmp, label %for.cond1.preheader, label %for.end12		br i1 %cmp, label %for.cond1.preheader, label %for.end12

for.end12: ; preds = %for.end, %entry		for.end12: ; preds = %for.end, %entry
ret i32 0		ret i32 0

; CHECK-LABEL: @s173		; CHECK-LABEL: @s173
; CHECK: load <4 x float>, <4 x float>*		; CHECK: load <4 x float>, <4 x float>*
; CHECK: add i64 %index, 16000		; CHECK: add nsw i64 %.lhs, 16000
; CHECK: ret i32 0		; CHECK: ret i32 0
}		}

attributes #0 = { nounwind }		attributes #0 = { nounwind }

test/Transforms/LoopVectorize/X86/gather_scatter.ll

	[...]
	; out[i] = in[i].b + (float) 0.5;			; out[i] = in[i].b + (float) 0.5;
	; }			; }
	; }			; }
	;}			;}

	%struct.In = type { float, float }			%struct.In = type { float, float }

	;AVX512-LABEL: @foo2			;AVX512-LABEL: @foo2
	;AVX512: getelementptr %struct.In, %struct.In* %in, <16 x i64> %induction, i32 1			;AVX512: getelementptr %struct.In, %struct.In* %in, <16 x i64> %vec.ind, i32 1
				delena: I propose to remove variable name from this test.
				mkuper: Sure.
	;AVX512: llvm.masked.gather.v16f32			;AVX512: llvm.masked.gather.v16f32
	;AVX512: llvm.masked.store.v16f32			;AVX512: llvm.masked.store.v16f32
	;AVX512: ret void			;AVX512: ret void
	define void @foo2(%struct.In* noalias %in, float* noalias %out, i32* noalias %trigger, i32* noalias %index) #0 {			define void @foo2(%struct.In* noalias %in, float* noalias %out, i32* noalias %trigger, i32* noalias %index) #0 {
	entry:			entry:
	%in.addr = alloca %struct.In*, align 8			%in.addr = alloca %struct.In*, align 8
	%out.addr = alloca float*, align 8			%out.addr = alloca float*, align 8
	%trigger.addr = alloca i32*, align 8			%trigger.addr = alloca i32*, align 8
	[...]
	; for (int i=0; i<SIZE; ++i) {			; for (int i=0; i<SIZE; ++i) {
	; if (trigger[i] > 0) {			; if (trigger[i] > 0) {
	; out[i].b = in[i].b + (float) 0.5;			; out[i].b = in[i].b + (float) 0.5;
	; }			; }
	; }			; }
	;}			;}

	;AVX512-LABEL: @foo3			;AVX512-LABEL: @foo3
	;AVX512: getelementptr %struct.In, %struct.In* %in, <16 x i64> %induction, i32 1			;AVX512: getelementptr %struct.In, %struct.In* %in, <16 x i64> %vec.ind, i32 1
	;AVX512: llvm.masked.gather.v16f32			;AVX512: llvm.masked.gather.v16f32
	;AVX512: fadd <16 x float>			;AVX512: fadd <16 x float>
	;AVX512: getelementptr %struct.Out, %struct.Out* %out, <16 x i64> %induction, i32 1			;AVX512: getelementptr %struct.Out, %struct.Out* %out, <16 x i64> %vec.ind, i32 1
	;AVX512: llvm.masked.scatter.v16f32			;AVX512: llvm.masked.scatter.v16f32
	;AVX512: ret void			;AVX512: ret void

	%struct.Out = type { float, float }			%struct.Out = type { float, float }

	define void @foo3(%struct.In* noalias %in, %struct.Out* noalias %out, i32* noalias %trigger) {			define void @foo3(%struct.In* noalias %in, %struct.Out* noalias %out, i32* noalias %trigger) {
	entry:			entry:
	%in.addr = alloca %struct.In*, align 8			%in.addr = alloca %struct.In*, align 8
	[...]

test/Transforms/LoopVectorize/cast-induction.ll

	; RUN: opt < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -dce -instcombine -S | FileCheck %s			; RUN: opt < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -dce -instcombine -S | FileCheck %s

	; rdar://problem/12848162			; rdar://problem/12848162

	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"			target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx10.8.0"			target triple = "x86_64-apple-macosx10.8.0"

	@a = common global [2048 x i32] zeroinitializer, align 16			@a = common global [2048 x i32] zeroinitializer, align 16

	;CHECK-LABEL: @example12(			;CHECK-LABEL: @example12(
	;CHECK: trunc i64			;CHECK: %vec.ind1 = phi <4 x i32>
	;CHECK: store <4 x i32>			;CHECK: store <4 x i32>
	;CHECK: ret void			;CHECK: ret void
	define void @example12() nounwind uwtable ssp {			define void @example12() nounwind uwtable ssp {
	br label %1			br label %1

	; <label>:1 ; preds = %1, %0			; <label>:1 ; preds = %1, %0
	%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]			%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]
	%2 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %indvars.iv			%2 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %indvars.iv
	[...]

test/Transforms/LoopVectorize/gcc-examples.ll

[...]
; <label>:1 ; preds = %1, %0
%exitcond = icmp eq i32 %lftr.wideiv, 512		%exitcond = icmp eq i32 %lftr.wideiv, 512
br i1 %exitcond, label %20, label %1		br i1 %exitcond, label %20, label %1

; <label>:20 ; preds = %1		; <label>:20 ; preds = %1
ret void		ret void
}		}

;CHECK-LABEL: @example12(		;CHECK-LABEL: @example12(
;CHECK: trunc i64		;CHECK: %vec.ind1 = phi <4 x i32>
;CHECK: store <4 x i32>		;CHECK: store <4 x i32>
;CHECK: ret void		;CHECK: ret void
define void @example12() nounwind uwtable ssp {		define void @example12() nounwind uwtable ssp {
br label %1		br label %1

; <label>:1 ; preds = %1, %0		; <label>:1 ; preds = %1, %0
%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]		%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]
%2 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %indvars.iv		%2 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %indvars.iv
[...]

test/Transforms/LoopVectorize/gep_with_bitcast.ll

	; RUN: opt -S -loop-vectorize -instcombine -force-vector-width=4 < %s | FileCheck %s			; RUN: opt -S -loop-vectorize -instcombine -force-vector-width=4 < %s | FileCheck %s

	target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

	; Vectorization of loop with bitcast between GEP and load			; Vectorization of loop with bitcast between GEP and load
	; Simplified source code:			; Simplified source code:
	;void foo (double** __restrict__ in, bool * __restrict__ res) {			;void foo (double** __restrict__ in, bool * __restrict__ res) {
	;			;
	; for (int i = 0; i < 4096; ++i)			; for (int i = 0; i < 4096; ++i)
	; res[i] = ((unsigned long long)in[i] == 0);			; res[i] = ((unsigned long long)in[i] == 0);
	;}			;}

	; CHECK-LABEL: @foo			; CHECK-LABEL: @foo
	; CHECK: vector.body			; CHECK: vector.body
	; CHECK: %0 = getelementptr inbounds double*, double** %in, i64 %index			; CHECK: %0 = phi
	; CHECK: %1 = bitcast double** %0 to <4 x i64>*			; CHECK: %2 = getelementptr inbounds double*, double** %in, i64 %0
	; CHECK: %wide.load = load <4 x i64>, <4 x i64>* %1, align 8			; CHECK: %3 = bitcast double** %2 to <4 x i64>*
	; CHECK: %2 = icmp eq <4 x i64> %wide.load, zeroinitializer			; CHECK: %wide.load = load <4 x i64>, <4 x i64>* %3, align 8
				; CHECK: %4 = icmp eq <4 x i64> %wide.load, zeroinitializer
	; CHECK: br i1			; CHECK: br i1

	define void @foo(double** noalias nocapture readonly %in, double** noalias nocapture readnone %out, i8* noalias nocapture %res) #0 {			define void @foo(double** noalias nocapture readonly %in, double** noalias nocapture readnone %out, i8* noalias nocapture %res) #0 {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]			%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
	%arrayidx = getelementptr inbounds double*, double** %in, i64 %indvars.iv			%arrayidx = getelementptr inbounds double*, double** %in, i64 %indvars.iv
	%tmp53 = bitcast double** %arrayidx to i64*			%tmp53 = bitcast double** %arrayidx to i64*
	%tmp54 = load i64, i64* %tmp53, align 8			%tmp54 = load i64, i64* %tmp53, align 8
	%cmp1 = icmp eq i64 %tmp54, 0			%cmp1 = icmp eq i64 %tmp54, 0
	%arrayidx3 = getelementptr inbounds i8, i8* %res, i64 %indvars.iv			%arrayidx3 = getelementptr inbounds i8, i8* %res, i64 %indvars.iv
	%frombool = zext i1 %cmp1 to i8			%frombool = zext i1 %cmp1 to i8
	store i8 %frombool, i8* %arrayidx3, align 1			store i8 %frombool, i8* %arrayidx3, align 1
	%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1			%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
	%exitcond = icmp eq i64 %indvars.iv.next, 4096			%exitcond = icmp eq i64 %indvars.iv.next, 4096
	br i1 %exitcond, label %for.end, label %for.body			br i1 %exitcond, label %for.end, label %for.body

	for.end:			for.end:
	ret void			ret void
	}			}
	No newline at end of file

test/Transforms/LoopVectorize/global_alias.ll

	; RUN: opt < %s -O1 -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -dce -instcombine -S | FileCheck %s			; RUN: opt < %s -O1 -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -dce -instcombine -S | FileCheck %s

	target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:64:128-a0:0:64-n32-S64"			target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:64:128-a0:0:64-n32-S64"

	%struct.anon = type { [100 x i32], i32, [100 x i32] }			%struct.anon = type { [100 x i32], i32, [100 x i32] }
	%struct.anon.0 = type { [100 x [100 x i32]], i32, [100 x [100 x i32]] }			%struct.anon.0 = type { [100 x [100 x i32]], i32, [100 x [100 x i32]] }

	@Foo = common global %struct.anon zeroinitializer, align 4			@Foo = common global %struct.anon zeroinitializer, align 4
	@Bar = common global %struct.anon.0 zeroinitializer, align 4			@Bar = common global %struct.anon.0 zeroinitializer, align 4

	@PB = external global i32*			@PB = external global i32*
	@PA = external global i32*			@PA = external global i32*


	;; === First, the tests that should always vectorize, wither statically or by adding run-time checks ===			;; === First, the tests that should always vectorize, wether statically or by adding run-time checks ===


	; /// Different objects, positive induction, constant distance			; /// Different objects, positive induction, constant distance
	; int noAlias01 (int a) {			; int noAlias01 (int a) {
	; int i;			; int i;
	; for (i=0; i<SIZE; i++)			; for (i=0; i<SIZE; i++)
	; Foo.A[i] = Foo.B[i] + a;			; Foo.A[i] = Foo.B[i] + a;
	; return Foo.A[a];			; return Foo.A[a];
	[... 358 lines not shown ...]
	; /// Different objects, negative induction, shortening slide			; /// Different objects, negative induction, shortening slide
	; int noAlias08 (int a) {			; int noAlias08 (int a) {
	; int i;			; int i;
	; for (i=0; i<SIZE-10; i++)			; for (i=0; i<SIZE-10; i++)
	; Foo.A[SIZE-i-1] = Foo.B[SIZE-i-10] + a;			; Foo.A[SIZE-i-1] = Foo.B[SIZE-i-10] + a;
	; return Foo.A[a];			; return Foo.A[a];
	; }			; }
	; CHECK-LABEL: define i32 @noAlias08(			; CHECK-LABEL: define i32 @noAlias08(
	; CHECK: sub <4 x i32>			; CHECK: sub nuw nsw <4 x i32>
	; CHECK: ret			; CHECK: ret

	define i32 @noAlias08(i32 %a) #0 {			define i32 @noAlias08(i32 %a) #0 {
	entry:			entry:
	%a.addr = alloca i32, align 4			%a.addr = alloca i32, align 4
	%i = alloca i32, align 4			%i = alloca i32, align 4
	store i32 %a, i32* %a.addr, align 4			store i32 %a, i32* %a.addr, align 4
	store i32 0, i32* %i, align 4			store i32 0, i32* %i, align 4
	[... 35 lines not shown ...]
	; /// Different objects, negative induction, widening slide			; /// Different objects, negative induction, widening slide
	; int noAlias09 (int a) {			; int noAlias09 (int a) {
	; int i;			; int i;
	; for (i=0; i<SIZE; i++)			; for (i=0; i<SIZE; i++)
	; Foo.A[SIZE-i-10] = Foo.B[SIZE-i-1] + a;			; Foo.A[SIZE-i-10] = Foo.B[SIZE-i-1] + a;
	; return Foo.A[a];			; return Foo.A[a];
	; }			; }
	; CHECK-LABEL: define i32 @noAlias09(			; CHECK-LABEL: define i32 @noAlias09(
	; CHECK: sub <4 x i32>			; CHECK: sub nuw nsw <4 x i32>
	; CHECK: ret			; CHECK: ret

	define i32 @noAlias09(i32 %a) #0 {			define i32 @noAlias09(i32 %a) #0 {
	entry:			entry:
	%a.addr = alloca i32, align 4			%a.addr = alloca i32, align 4
	%i = alloca i32, align 4			%i = alloca i32, align 4
	store i32 %a, i32* %a.addr, align 4			store i32 %a, i32* %a.addr, align 4
	store i32 0, i32* %i, align 4			store i32 0, i32* %i, align 4
	[... 265 lines not shown ...]
	; /// Same objects, negative induction, constant distance, just enough for vector size			; /// Same objects, negative induction, constant distance, just enough for vector size
	; int noAlias14 (int a) {			; int noAlias14 (int a) {
	; int i;			; int i;
	; for (i=0; i<SIZE; i++)			; for (i=0; i<SIZE; i++)
	; Foo.A[SIZE-i-1] = Foo.A[SIZE-i-5] + a;			; Foo.A[SIZE-i-1] = Foo.A[SIZE-i-5] + a;
	; return Foo.A[a];			; return Foo.A[a];
	; }			; }
	; CHECK-LABEL: define i32 @noAlias14(			; CHECK-LABEL: define i32 @noAlias14(
	; CHECK: sub <4 x i32>			; CHECK: sub nuw nsw <4 x i32>
	; CHECK: ret			; CHECK: ret

	define i32 @noAlias14(i32 %a) #0 {			define i32 @noAlias14(i32 %a) #0 {
	entry:			entry:
	%a.addr = alloca i32, align 4			%a.addr = alloca i32, align 4
	%i = alloca i32, align 4			%i = alloca i32, align 4
	store i32 %a, i32* %a.addr, align 4			store i32 %a, i32* %a.addr, align 4
	store i32 0, i32* %i, align 4			store i32 0, i32* %i, align 4
	[... 345 lines not shown ...]

test/Transforms/LoopVectorize/induction_plus.ll

	; RUN: opt < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -instcombine -S | FileCheck %s			; RUN: opt < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -S | FileCheck %s

	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"			target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx10.8.0"			target triple = "x86_64-apple-macosx10.8.0"

	@array = common global [1024 x i32] zeroinitializer, align 16			@array = common global [1024 x i32] zeroinitializer, align 16

	;CHECK-LABEL: @array_at_plus_one(			;CHECK-LABEL: @array_at_plus_one(
	;CHECK: add i64 %index, 12			;CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	;CHECK: trunc i64			;CHECK: %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %vector.ph ], [ %step.add, %vector.body ]
				;CHECK: %vec.ind1 = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ], [ %step.add2, %vector.body ]
				;CHECK: add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
				;CHECK: add nsw <4 x i64> %vec.ind, <i64 12, i64 12, i64 12, i64 12>
	;CHECK: ret i32			;CHECK: ret i32
	define i32 @array_at_plus_one(i32 %n) nounwind uwtable ssp {			define i32 @array_at_plus_one(i32 %n) nounwind uwtable ssp {
	%1 = icmp sgt i32 %n, 0			%1 = icmp sgt i32 %n, 0
	br i1 %1, label %.lr.ph, label %._crit_edge			br i1 %1, label %.lr.ph, label %._crit_edge

	.lr.ph: ; preds = %0, %.lr.ph			.lr.ph: ; preds = %0, %.lr.ph
	%indvars.iv = phi i64 [ %indvars.iv.next, %.lr.ph ], [ 0, %0 ]			%indvars.iv = phi i64 [ %indvars.iv.next, %.lr.ph ], [ 0, %0 ]
	%2 = add nsw i64 %indvars.iv, 12			%2 = add nsw i64 %indvars.iv, 12
	[... 11 lines not shown ...]
