
[VectorCombine] fold extract-extract-op with different extraction indexes
ClosedPublic

Authored by spatel on Mar 5 2020, 8:56 AM.

Details

Summary

opcode (extelt V0, Ext0), (extelt V1, Ext1) --> extelt (opcode (splat V0, Ext0), V1), Ext1

The first part of this patch generalizes the cost calculation to accept different extraction indexes. The second part creates a shuffle+extract before feeding into the existing code to create a vector op+extract.

The patch conservatively uses "TargetTransformInfo::SK_PermuteSingleSrc" rather than "TargetTransformInfo::SK_Broadcast" (splat specifically from element 0) because we do not have a more general "SK_Splat" currently. That does not affect any of the current regression tests, but we might be able to find some cost model target specialization where that comes into play.

I suspect that we can expose some missing x86 horizontal op codegen with this transform, so I'm speculatively adding a debug flag to disable the binop variant of this transform to allow easier testing.

The test changes show that we're sensitive to cost model diffs (as we should be), so that means that patches like D74976 should have better coverage.

Diff Detail

Event Timeline

spatel created this revision.Mar 5 2020, 8:56 AM
Herald added a project: Restricted Project.Mar 5 2020, 8:56 AM

getShuffleCost already takes an Index value (default = 0) that I meant to use for this kind of thing (it's currently just used for subvector insertion/extraction), so either 'tweaking' SK_Broadcast to use Index or replacing it with an SK_Splat mode would be relatively painless.

Taking a step back, we have

; 0. scalar
%e0 = extractelement %x, C0
%e1 = extractelement %y, C1
%r = op %e0, %e1

Let's focus on the C0 != C1 problem; I see three general approaches there:

  1. move either one lane to another lane
    1. move lane C0 to lane C1
    2. move lane C1 to lane C0
  2. move both lanes into lane 0

In the current solution, why do we decide to only entertain the idea of replacing the costlier extract?
Don't we want to check all three variants? (There's some overlap if one of C0, C1 is already lane 0.)
What if the other extract is from lane 0? We'd then be able to use TTI::SK_Broadcast.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
123

s/1/one/, else it's ambiguous with 'first'

> Taking a step back, we have
>
> ; 0. scalar
> %e0 = extractelement %x, C0
> %e1 = extractelement %y, C1
> %r = op %e0, %e1
>
> Let's focus on the C0 != C1 problem; I see three general approaches there:
>
>   1. move either one lane to another lane
>     1. move lane C0 to lane C1
>     2. move lane C1 to lane C0
>   2. move both lanes into lane 0
>
> In the current solution, why do we decide to only entertain the idea of replacing the costlier extract?
> Don't we want to check all three variants? (There's some overlap if one of C0, C1 is already lane 0.)
> What if the other extract is from lane 0? We'd then be able to use TTI::SK_Broadcast.

Correct - I thought about reversing the operands if it would allow the shuffle to be a broadcast, but didn't try to implement it. My guess is that it won't change the costs for the examples shown here (since they don't change even if we incorrectly specify that the shuffle in the current patch is a broadcast).

Similarly, the option where we shuffle both operands to element 0 has potential to be the winner, but I'm skeptical that it would be beneficial on most x86 examples. So I agree that if we want to get the optimal sequence, we need to model all of those possibilities - and by not modeling all of the possibilities, we may be missing an optimization. We could also argue that we're still in IR, so we should select one of those forms as semi-canonical and let the backend handle the other transforms.

OK if I add a TODO comment for now?

lebedev.ri added inline comments.Mar 6 2020, 3:19 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
122

So essentially we are assuming that this is the optimal cost model.
Can we assert that?

int ExpensiveExtractCost = std::max(Extract0Cost, Extract1Cost);
(void)ExpensiveExtractCost;

// FIXME: introduce TargetTransformInfo::SK_Splat.
assert(TTI.getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy) + CheapExtractCost
       <=
       TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy) + ExpensiveExtractCost
);
lebedev.ri accepted this revision.Mar 6 2020, 4:12 AM
lebedev.ri marked an inline comment as done.

OK, after looking at this more, I think any further improvements (less pessimistic cost computations) will look really awkward without first abstracting away the splat shuffle cost. I think it would be okay to add FIXMEs for now.

@RKSimon thoughts?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
89–90

// FIXME: evaluate whether that always results in lowest cost

122

Err, that won't really do it, I guess, so let's not do that.

This revision is now accepted and ready to land.Mar 6 2020, 4:12 AM
RKSimon accepted this revision.Mar 6 2020, 6:10 AM

LGTM - You might want to add some test cases with 256/512-bit vectors (with indices coming from the same/different 128-bit subvectors) as well to get a broader range of extraction costs.

I'm not certain an SK_Splat will make much difference - shuffle masks with a single non-undef index would in most cases be even cheaper than that, especially on early SSE targets, so maybe we need to either consider another shuffle enum type or just provide the ability to compute costs from raw shuffle masks?

spatel added a comment.Mar 6 2020, 8:13 AM

> LGTM - You might want to add some test cases with 256/512-bit vectors (with indices coming from the same/different 128-bit subvectors) as well to get a broader range of extraction costs.

Yes, will add some larger vector tests.

> I'm not certain an SK_Splat will make much difference - shuffle masks with a single non-undef index would in most cases be even cheaper than that, especially on early SSE targets, so maybe we need to either consider another shuffle enum type or just provide the ability to compute costs from raw shuffle masks?

There's no clear answer on how far we should take the modeling. We've said in the past that we shouldn't try that hard -- because this is IR, we can never know exactly how it will be lowered for every target.
The idea of trying different options and picking a winner sounds like VPlan to me (I haven't kept up with its current state, so I need to review that).

This revision was automatically updated to reflect the committed changes.
spatel marked 3 inline comments as done.
jgorbe added a subscriber: jgorbe.Mar 16 2020, 6:02 PM

Hi,

We're seeing some large performance regressions on Eigen (http://eigen.tuxfamily.org/) running on Haswell and Skylake machines with this patch. Could you please revert it?

> Hi,
>
> We're seeing some large performance regressions on Eigen (http://eigen.tuxfamily.org/) running on Haswell and Skylake machines with this patch.

Those are being compiled with the correct -march= for the machine that actually runs the benchmarks, right?
Can you be more specific? Reproduction steps (which benchmark specifically), perhaps some regressed snippets?

> Could you please revert it?

> Could you please revert it?

I'd prefer more specific reproduction steps before we make changes here. Ideally, please file a bug report with a reduced example.
Also, as noted in the description/commit message, I expected that this patch could cause perf problems, so there are debug options to help narrow those down:

{-mllvm } -disable-binop-extract-shuffle / -disable-vector-combine

If we can use those (or even add more specific flags) to temporarily disable functionality while investigating, that would be more efficient than reverting.

Thanks for the note about the flags. We have tried both of them, and any (or both) of them being present causes the regression to go away, so that unblocks us while we work on producing a proper test case.

> Thanks for the note about the flags. We have tried both of them, and any (or both) of them being present causes the regression to go away, so that unblocks us while we work on producing a proper test case.

Sounds good. Note that "-disable-vector-combine" is the big hammer; it effectively disables this entire pass.
If the regressions only appeared with this patch and go away with "-disable-binop-extract-shuffle", the most likely source of the problem is the cost model, or this pass interacting badly with the SLP vectorizer.