This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] Leave reduction operation to SLP
Abandoned · Public

Authored by junparser on Apr 29 2020, 3:01 AM.

Details

Summary

As the title says, we found that the transformation in the vector-combine pass breaks reduction handling in the SLP pass after https://reviews.llvm.org/D75689.
This patch conservatively matches the reduction pattern and returns true in isExtractExtractCheap.
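For context, the kind of scalar reduction chain we want to leave for SLP looks roughly like the ext_ext_reduction test below, before vector-combine touches it (illustrative sketch only; the value names are approximate and not copied verbatim from the test file):

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %vecext.0 = extractelement <4 x i32> %and, i32 0
  %vecext.1 = extractelement <4 x i32> %and, i32 1
  %add.1 = or i32 %vecext.1, %vecext.0
  %vecext.2 = extractelement <4 x i32> %and, i32 2
  %add.2 = or i32 %vecext.2, %add.1
  %vecext.3 = extractelement <4 x i32> %and, i32 3
  %add.3 = or i32 %vecext.3, %add.2
  ret i32 %add.3
}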

Test Plan: check-clang check-llvm

Diff Detail

Event Timeline

junparser created this revision. Apr 29 2020, 3:01 AM
Herald added a project: Restricted Project. Apr 29 2020, 3:01 AM

This needs a phase-ordering test.
Why shouldn't SLPVectorizer be taught about that pattern instead?

llvm/test/Transforms/VectorCombine/X86/extract-binop.ll
494

Which is now being transformed into:

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %1 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %2 = or <4 x i32> %1, %and
  %3 = extractelement <4 x i32> %2, i64 0
  %vecext.2 = extractelement <4 x i32> %and, i32 2
  %add.2 = or i32 %vecext.2, %3
  %vecext.3 = extractelement <4 x i32> %and, i32 3
  %add.3 = or i32 %vecext.3, %add.2
  ret i32 %add.3
}

This needs a phase-ordering test.

For now, vector-combine is executed before SLP in both the legacy PM and the new PM at O2, so either we handle it here or SLP needs to handle this kind of pattern.

Why shouldn't SLPVectorizer be taught about that pattern instead?

I think we can handle this in SLP; however, it may be much harder than doing it here. I'm not sure.

This needs a phase-ordering test.

For now, vector-combine is executed before SLP in both the legacy PM and the new PM at O2, so either we handle it here or SLP needs to handle this kind of pattern.

I'm not really sure what you mean.
We clearly have a phase-ordering issue, and we should have a test that shows it.

Why shouldn't SLPVectorizer be taught about that pattern instead?

I think we can handle this in SLP; however, it may be much harder than doing it here. I'm not sure.

It would be best to enhance SLP rather than adding potentially numerous, semi-arbitrary bailouts elsewhere, I think.

This needs a phase-ordering test.

For now, vector-combine is executed before SLP in both the legacy PM and the new PM at O2, so either we handle it here or SLP needs to handle this kind of pattern.

I'm not really sure what you mean.
We clearly have a phase-ordering issue, and we should have a test that shows it.

Thanks, now I know what you mean.

Why shouldn't SLPVectorizer be taught about that pattern instead?

I think we can handle this in SLP; however, it may be much harder than doing it here. I'm not sure.

It would be best to enhance SLP rather than adding potentially numerous, semi-arbitrary bailouts elsewhere, I think.

Makes sense, will check this later.

I agree that we need a phase ordering test - see llvm/test/Transforms/PhaseOrdering/X86/addsub.ll as an example test file that describes some expected handshake between vector-combine and other passes. Also, please commit the new test with complete baseline CHECK lines (use utils/update_test_checks.py), and then update the file to show the diffs in this review. Let me know if that's not clear.
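To illustrate (just a sketch, not the test that was ultimately committed), a phase-ordering test is a plain .ll file under llvm/test/Transforms/PhaseOrdering/X86/ whose RUN lines drive the full O2 pipeline through both pass managers, with CHECK lines generated by utils/update_test_checks.py:

; RUN: opt -O2 -S < %s | FileCheck %s
; RUN: opt -passes='default<O2>' -S < %s | FileCheck %s
;
; The scalar 'or' reduction from the vector-combine test would go here,
; and the generated CHECK lines show what the whole pipeline produces.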

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

lebedev.ri requested changes to this revision. May 6 2020, 12:08 AM
This revision now requires changes to proceed. May 6 2020, 12:08 AM
junparser updated this revision to Diff 263409. May 12 2020, 5:54 AM

Update the test case.

Hi @lebedev.ri, sorry for the late response.
I have checked the reduction tree building in SLPVectorizer; for now it only supports reductions built from the same operation. It seems we could revert the transformation when matching the reduction operation in matchAssociativeReduction. However, I don't think that is a good idea.
Also, the added phase-ordering test shows that this patch has minimal impact: except for the first case, which is vectorized by SLP, the other cases keep the same IR as before.

I'll defer to @spatel, although I semi-weakly insist that adding such cut-offs is weird.
OTOH, if this pass were taught to form such reductions, we would have caught this regression for free,
because after D79799 we would have ended up in an endless combine loop here.

I agree that we need a phase ordering test - see llvm/test/Transforms/PhaseOrdering/X86/addsub.ll as an example test file that describes some expected handshake between vector-combine and other passes. Also, please commit the new test with complete baseline CHECK lines (use utils/update_test_checks.py), and then update the file to show the diffs in this review. Let me know if that's not clear.

Almost there - it should be next to the example test, not where it is now.

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

I'm not sure what you have in mind here?
That *this* pass should also form such reductions?
Or that we should not disturb them after SLP formed them?
Or something else?

Hi @lebedev.ri, sorry for the late response.
I have checked the reduction tree building in SLPVectorizer; for now it only supports reductions built from the same operation. It seems we could revert the transformation when matching the reduction operation in matchAssociativeReduction. However, I don't think that is a good idea.

I'm having trouble parsing this response.
Are you saying that SLP cannot be taught about this, that it should not be taught about this, or something else?

llvm/test/Transforms/VectorCombine/X86/extract-binop.ll
4–5

Like @spatel has already said, please see llvm/test/Transforms/PhaseOrdering/X86/addsub.ll.
The phase-ordering test should be similar to that one, in that directory.

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

I'm not sure what you have in mind here?
That *this* pass should also form such reductions?
Or that we should not disturb them after SLP formed them?
Or something else?

The reduction logic is a complicated blob of code, so I don't think it belongs here. I'd split it off from SLP into its own pass, but it looks like a lot of untangling.
Currently, we're running this pass *before* SLP only. We could move this after SLP to make sure we are not disturbing reductions before SLP has a chance to recognize them...but I'm not sure if that would also now cause regressions. I don't have a good feel for how these passes are interacting.

What does it take to cause the infinite looping that you found?

Looking at that 1st test - if we allow iteration in this pass, we'll end up with:

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %1 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %2 = or <4 x i32> %1, %and
  %3 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 undef>
  %4 = or <4 x i32> %3, %2
  %5 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>
  %6 = or <4 x i32> %5, %4
  %7 = extractelement <4 x i32> %6, i64 0
  ret i32 %7
}

And nothing knows how to form the optimal reduction from that pattern. We could say that's the real problem - source code could be in that form originally, so we just miss the reassociation optimization opportunity.
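For reference, the "optimal reduction" form of this example would be a log2 shuffle reduction, roughly like the following (illustrative only; this is not output produced by any pass discussed here):

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %hi = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %or.1 = or <4 x i32> %and, %hi
  %odd = shufflevector <4 x i32> %or.1, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %or.2 = or <4 x i32> %or.1, %odd
  %r = extractelement <4 x i32> %or.2, i64 0
  ret i32 %r
}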

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

I'm not sure what you have in mind here?
That *this* pass should also form such reductions?
Or that we should not disturb them after SLP formed them?
Or something else?

The reduction logic is a complicated blob of code, so I don't think it belongs here. I'd split it off from SLP into its own pass, but it looks like a lot of untangling.
Currently, we're running this pass *before* SLP only. We could move this after SLP to make sure we are not disturbing reductions before SLP has a chance to recognize them...but I'm not sure if that would also now cause regressions. I don't have a good feel for how these passes are interacting.

What does it take to cause the infinite looping that you found?

No, I mean in the case where we would be forming reductions here,
because I think we'd then have two conflicting transforms,
and they would cause a traditional instcombine/dagcombine-esque infinite combine loop.

Looking at that 1st test - if we allow iteration in this pass, we'll end up with:

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %1 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %2 = or <4 x i32> %1, %and
  %3 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 undef>
  %4 = or <4 x i32> %3, %2
  %5 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>
  %6 = or <4 x i32> %5, %4
  %7 = extractelement <4 x i32> %6, i64 0
  ret i32 %7
}

And nothing knows how to form the optimal reduction from that pattern. We could say that's the real problem - source code could be in that form originally, so we just miss the reassociation optimization opportunity.

That would be the outcome I would prefer, yes.

I added slightly modified versions of the tests here with:
rG6211830fbabd
rG43017ceb7841

Because they are affected by a change that I split off from D79799:
rG81e9ede3a2db

Please rebase (although given what we've discussed here, I'm not sure how we want to solve the general problem of matching/transforming vector reductions).

I'll defer to @spatel, although I semi-weakly insist that adding such cut-offs is weird.
OTOH, if this pass were taught to form such reductions, we would have caught this regression for free,
because after D79799 we would have ended up in an endless combine loop here.

I agree that we need a phase ordering test - see llvm/test/Transforms/PhaseOrdering/X86/addsub.ll as an example test file that describes some expected handshake between vector-combine and other passes. Also, please commit the new test with complete baseline CHECK lines (use utils/update_test_checks.py), and then update the file to show the diffs in this review. Let me know if that's not clear.

Almost there - it should be next to the example test, not where it is now.

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

I'm not sure what you have in mind here?
That *this* pass should also form such reductions?
Or that we should not disturb them after SLP formed them?
Or something else?

Hi @lebedev.ri, sorry for the late response.
I have checked the reduction tree building in SLPVectorizer; for now it only supports reductions built from the same operation. It seems we could revert the transformation when matching the reduction operation in matchAssociativeReduction. However, I don't think that is a good idea.

I'm having trouble parsing this response.
Are you saying that SLP cannot be taught about this, that it should not be taught about this, or something else?

I'm saying that SLP should not be taught about this. I totally agree with @spatel about splitting the reduction transformation off from SLP and recognizing these forms in that separate pass.

I added slightly modified versions of the tests here with:
rG6211830fbabd
rG43017ceb7841

Because they are affected by a change that I split off from D79799:
rG81e9ede3a2db

Please rebase (although given what we've discussed here, I'm not sure how we want to solve the general problem of matching/transforming vector reductions).

Yes, this patch just avoids the transform. Can we handle such a form at the end of this pass (i.e., revert it to the reduction form)?

I added slightly modified versions of the tests here with:
rG6211830fbabd
rG43017ceb7841

Because they are affected by a change that I split off from D79799:
rG81e9ede3a2db

Please rebase (although given what we've discussed here, I'm not sure how we want to solve the general problem of matching/transforming vector reductions).

Yes, this patch just avoids the transform. Can we handle such a form at the end of this pass (i.e., revert it to the reduction form)?

It raises a question: are we comfortable transforming IR to the (still experimental) reduction intrinsics?
http://llvm.org/docs/LangRef.html#experimental-vector-reduction-intrinsics
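For reference, transforming the example to the reduction intrinsic would look something like this (the exact mangled intrinsic name is an assumption based on the naming in use at the time of this review):

declare i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32>)

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %r = call i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32> %and)
  ret i32 %r
}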

I'm finding a related problem with x86 horizontal math ops: -vector-combine does some locally profitable transforms, and SLP can't recognize the larger pattern anymore.
I think the quick solution will be to move this pass after SLP until we can make bigger changes, like recognizing reductions here (or possibly in instcombine, if we are ready to canonicalize to the intrinsics).

Alternate proposal: D80236

junparser abandoned this revision. May 24 2020, 8:56 PM