
[SLP] Generalization of vectorization of CmpInst operands, NFC.
AbandonedPublic

Authored by ABataev on Feb 2 2017, 6:26 AM.

Details

Summary

This patch removes the CmpInst-specific code for vectorization of its operands and adds a general solution. It may improve compile time for code with many compare instructions.

Diff Detail

Event Timeline

ABataev created this revision.Feb 2 2017, 6:26 AM
mkuper added inline comments.Feb 2 2017, 11:42 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4860–4865

The code is not equivalent.

  1. A cmp doesn't have to feed a branch. You can have two icmps feeding an or, for instance, with the i1 being used by a branch. Why would we want to avoid vectorizing those? Or do we already fail to vectorize when the icmp feeds anything but a branch?
  2. On the flip side, you can have a branch that is fed by an "or (icmp, icmp)", or any other source of i1. I assume this will get filtered out in tryToVectorize() but it'd be good to have a test.

In any case, the comment above needs to be updated.

ABataev marked an inline comment as done.Feb 3 2017, 3:27 AM
ABataev added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
4860–4865

You're not quite right about this.

  1. Yes, of course. But that situation will be handled by the vectorizeRootInstruction() function. It calls canBeVectorized(), which traverses all sub-operations of the initial instruction (up to a tree height of RecursionMaxDepth). So the current version of the code actually does the same work twice: the first time when we perform the top-to-bottom analysis of all instructions in vectorizeChainsInBlock(), and the second time when we perform the bottom-to-top analysis in vectorizeRootInstruction().

This first top-to-bottom analysis breaks vectorization in some cases (for example, for future min/max reductions, and maybe in some cases for binops too).

  2. Your assumption is correct; this situation is handled by the tryToVectorize() function.

We already have tests for these (or similar) situations in Transforms/SLPVectorizer/X86/compare-reduce.ll, Transforms/SLPVectorizer/X86/horizontal-list.ll, and Transforms/SLPVectorizer/X86/in-tree-user.ll.
I'll update the comment.
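For context, the min/max reductions mentioned above are typically expressed in IR as icmp/select chains with no branch involved, so a top-to-bottom scan keyed on cmp-feeds-branch never considers them. A minimal hypothetical sketch (all value names are illustrative, not from the patch):

```llvm
; Hypothetical min-reduction tail: icmp/select pairs whose i1 results
; feed selects, not a branch, so a branch-condition-only root scan
; would never look at these compares.
%c1 = icmp slt i32 %a, %b
%m1 = select i1 %c1, i32 %a, i32 %b
%c2 = icmp slt i32 %m1, %d
%min = select i1 %c2, i32 %m1, i32 %d
```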

ABataev updated this revision to Diff 86948.Feb 3 2017, 4:02 AM
ABataev marked an inline comment as done.

Updated comment

mkuper added inline comments.Feb 3 2017, 3:49 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4860–4865

Re 2 - sounds good.
Re 1 - I'm not quite convinced. It will work if the cmp is in the middle of a tree and there's a root instruction somewhere below it, but that doesn't have to be the case.
For example, if you have:

%cmp = icmp eq i32 %foo, %bar
%cmp2 = icmp eq i32 %foo2, %bar2
%or = or i1 %cmp, %cmp2
br i1 %or, label %l1, label %l2

Then yes, you'll call vectorizeRootInstruction() on %or, and that's fine, because it can start from any binary operator and look through the cmps.
But what if the cmp doesn't feed a branch at all?

E.g.

%cmp = icmp eq i32 %foo, %bar
%z = zext i1 %cmp to i32

Or

%cmp = icmp eq i32 %foo, %bar
call @foo(i1 %cmp)

etc.

I agree that what we do now is pretty silly - we start from a fairly arbitrary subset of possible roots. But I don't think we should reduce that set further.

Or did I miss something about how this works?

ABataev added inline comments.Feb 6 2017, 1:26 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4860–4865
  1. Michael, the first example:
%cmp = icmp eq i32 %foo, %bar
%z = zext i1 %cmp to i32

still must feed some instruction like a return, a store, or something similar. Otherwise it is unused, will be removed from the code, and we don't need to perform any analysis of this CmpInst at all.

  2. Calls must also feed some other instructions. Of course, sometimes we may have standalone calls (for example, ones returning void or with an ignored result), but in that case the BinOps used as arguments of the CallInst won't be analyzed either. We just need to teach the vectorizer about this situation.
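To make that point concrete, here is a hedged sketch (the store and the %out name are hypothetical, not from the review) of how the zext example would actually have to appear, with the result reaching a user:

```llvm
%cmp = icmp eq i32 %foo, %bar
%z = zext i1 %cmp to i32
; The zext result cannot just dangle: it must reach a store, a return,
; or some other user, or the whole chain is dead and gets deleted.
store i32 %z, i32* %out
```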
mkuper edited edge metadata.Feb 6 2017, 2:49 PM

Sorry, I'm failing to communicate the example I have in mind.
Here it is, concretely:

declare void @bar(i1)

define void @foo(i32* %A, i32 %k, i32 %n) {
  %idx0 = getelementptr inbounds i32, i32* %A, i64 0
  %idx4 = getelementptr inbounds i32, i32* %A, i64 4
  %load0 = load i32, i32* %idx0, align 8
  %load4 = load i32, i32* %idx4, align 8
  %mul0 = mul i32 %load0, %k
  %mul4 = mul i32 %load4, %k
  %res = add i32 %mul0, %mul4
  %cmp = icmp eq i32 %res, %n
  call void @bar(i1 %cmp)
  ret void
}

With the current code, we get:
$ bin/opt -slp-vectorizer < ~/llvm/temp/cmpslp.ll -S -o - -slp-threshold=-10

declare void @bar(i1)

define void @foo(i32* %A, i32 %k, i32 %n) {
  %idx0 = getelementptr inbounds i32, i32* %A, i64 0
  %idx4 = getelementptr inbounds i32, i32* %A, i64 4
  %load0 = load i32, i32* %idx0, align 8
  %load4 = load i32, i32* %idx4, align 8
  %1 = insertelement <2 x i32> undef, i32 %k, i32 0
  %2 = insertelement <2 x i32> %1, i32 %k, i32 1
  %3 = insertelement <2 x i32> undef, i32 %load0, i32 0
  %4 = insertelement <2 x i32> %3, i32 %load4, i32 1
  %5 = mul <2 x i32> %2, %4
  %6 = extractelement <2 x i32> %5, i32 0
  %7 = extractelement <2 x i32> %5, i32 1
  %res = add i32 %6, %7
  %cmp = icmp eq i32 %res, %n
  call void @bar(i1 %cmp)
  ret void
}

The new code will not be able to vectorize this.

I agree with you that (a) what we do now is generally pretty bad, and (b) we handle this case more or less by accident.
But this patch is not NFC, and it has the potential to regress these kinds of cases.

Michael, I understand this.
What should I do then? Prepare a patch that vectorizes CallInst args first, and then an NFC patch for CmpInst? Or do you have something different in mind?

mkuper added a comment.Feb 7 2017, 1:08 PM

What should I do then?

Short term - maybe nothing?
Is this patch blocking anything? I understand this is part of the work to support min/max reductions, but why is it necessary? Can we go forward with that without regressing any existing cases?

Longer term - it would probably be good to come up with a saner, or at least more principled, way to do root selection that also doesn't cause us to look at instructions several times. I don't think adding more ad-hoc cases (CallInst) is the way to go. I'm fairly sure we can come up with other examples like this.

RKSimon resigned from this revision.Feb 8 2017, 3:41 AM
RKSimon added a subscriber: RKSimon.
ABataev abandoned this revision.Feb 10 2017, 7:26 AM