This is an archive of the discontinued LLVM Phabricator instance.

Differential D82474

[VectorCombine] try to form vector compare and binop to eliminate scalar ops
ClosedPublic

Authored by spatel on Jun 24 2020, 9:07 AM.

Details

Reviewers

RKSimon
lebedev.ri
craig.topper

Commits

rGb6315aee5b42: [VectorCombine] try to form vector compare and binop to eliminate scalar ops

Summary

binop i1 (cmp Pred (ext X, Index0), C0), (cmp Pred (ext X, Index1), C1)
-->
vcmp = cmp Pred X, VecC
ext (binop vNi1 vcmp, (shuffle vcmp, Index1)), Index0

This is a larger pattern than the existing extractelement folds because we can't reasonably vectorize the sub-patterns with constants based on cost model calcs (it doesn't usually make sense to replace a single extracted scalar op with constant operand with a vector op). I salvaged as much of the existing logic as I could, but there might be better ways to share and reduce code.
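To make the rewrite in the summary concrete, here is a minimal stand-alone sketch that models both forms on a plain 4-lane array and can be used to check that they agree. It is not LLVM code: `Vec4`, `Bool4`, `scalarForm` and `vectorForm` are hypothetical names, the predicate is fixed to signed greater-than, the binop is fixed to xor, and undef lanes are modeled as zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for <4 x i32> and <4 x i1>; not LLVM types.
using Vec4 = std::array<int, 4>;
using Bool4 = std::array<bool, 4>;

// Scalar form: binop i1 (cmp Pred (ext X, Index0), C0), (cmp Pred (ext X, Index1), C1)
bool scalarForm(const Vec4 &X, std::size_t Index0, int C0, std::size_t Index1,
                int C1) {
  const bool Cmp0 = X[Index0] > C0; // models icmp sgt
  const bool Cmp1 = X[Index1] > C1;
  return Cmp0 != Cmp1;              // models xor i1
}

// Vector form: vcmp = cmp Pred X, VecC
//              ext (binop vNi1 vcmp, (shuffle vcmp, Index1)), Index0
bool vectorForm(const Vec4 &X, std::size_t Index0, int C0, std::size_t Index1,
                int C1) {
  assert(Index0 < 4 && Index1 < 4 && Index0 != Index1);

  // Only lanes Index0 and Index1 of the constant matter; the rest are
  // don't-care (undef in the IR), modeled here as 0.
  Vec4 VecC{};
  VecC[Index0] = C0;
  VecC[Index1] = C1;

  Bool4 VCmp{};
  for (std::size_t L = 0; L < 4; ++L)
    VCmp[L] = X[L] > VecC[L];

  // The shuffle moves lane Index1 into lane Index0; other lanes are unused.
  Bool4 Shuf{};
  Shuf[Index0] = VCmp[Index1];

  Bool4 Logic{};
  for (std::size_t L = 0; L < 4; ++L)
    Logic[L] = VCmp[L] != Shuf[L];

  return Logic[Index0];
}
```

Calling both functions with the inputs of the `icmp_xor_v4i32` test (indexes 3 and 1, constants 42 and -8) should give the same result for any `X`, since only the two extracted lanes influence the final bit.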

The motivating case from PR43745:
https://bugs.llvm.org/show_bug.cgi?id=43745
...is the special case of a 2-way reduction. We tried to get SLP to handle that particular pattern in D59710, but that caused crashing and regressions. This patch is more general, but hopefully safer.

The v2f64 test with SSE2 surprised me - the cost model accounting looks like this:
OldCost = 0 (free extract of f64 at index 0) + 1 (extract of f64 at index 1) + 2 (scalar fcmps) + 1 (and of bools) = 4
NewCost = 2 (vector fcmp) + 1 (shuffle) + 1 (vector 'and') + 1 (extract of bool) = 5

There's no code comment explanation for the more expensive vector fcmp in the cost model table, but I assume that's based on some ancient SSE2 implementation where it wasn't the cheapest:

static const CostTblEntry SSE2CostTbl[] = {
  { ISD::SETCC,   MVT::v2f64,   2 },
  { ISD::SETCC,   MVT::f64,     1 },
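
The SSE2 accounting above can be reproduced with a small stand-alone sketch. The struct and function names are hypothetical (the patch queries TargetTransformInfo rather than a fixed table); only the individual cost numbers and the `OldCost < NewCost` bail-out are taken from this revision.

```cpp
#include <cassert>

// Hypothetical container for the per-operation costs quoted in the summary.
struct FoldCosts {
  int Extract0;      // extract of element at Index0
  int Extract1;      // extract of element at Index1
  int ScalarCmp;     // one scalar compare
  int ScalarBinop;   // scalar logic op on i1
  int VectorCmp;     // one vector compare
  int Shuffle;       // single-source permute
  int VectorBinop;   // vector logic op on vNi1
  int ExtractResult; // extract of the final bool
};

int oldCost(const FoldCosts &C) {
  // Two extracts, two scalar compares, one scalar binop.
  return C.Extract0 + C.Extract1 + 2 * C.ScalarCmp + C.ScalarBinop;
}

int newCost(const FoldCosts &C) {
  // One vector compare, one shuffle, one vector binop, one extract.
  return C.VectorCmp + C.Shuffle + C.VectorBinop + C.ExtractResult;
}

// The patch bails out only when the old sequence is strictly cheaper,
// so a tie still forms the vector ops.
bool shouldFold(const FoldCosts &C) { return !(oldCost(C) < newCost(C)); }

// v2f64 with SSE2, using the numbers from the summary.
constexpr FoldCosts SSE2V2F64{0, 1, 1, 1, 2, 1, 1, 1};

void checkSummaryNumbers() {
  assert(oldCost(SSE2V2F64) == 4);
  assert(newCost(SSE2V2F64) == 5);
  assert(!shouldFold(SSE2V2F64));
}
```

With these numbers the fold is rejected, which is consistent with the SSE check lines for `fcmp_and_v2f64` staying scalar. If the vector compare cost were 1 instead of 2 (a hypothetical value, not one taken from the cost tables), both sides would total 4 and the tie would go to the vector form.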

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Jun 24 2020, 9:07 AM

Herald added a project: Restricted Project. Jun 24 2020, 9:07 AM

Herald added subscribers: hiraditya, mcrosier.

spatel mentioned this in D82602: [SelectionDAG] don't split branch on logic-of-vector-compares. Jun 26 2020, 5:14 AM

Yes, old x86 targets really struggle with some v2f64 ops - cmppd included.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
563	cast<FixedVectorType> (maybe a dyn_cast<>) ?

spatel marked an inline comment as done. Jun 28 2020, 9:26 AM

spatel added a subscriber: ctetreau.

spatel added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
563	Yes, I think dyn_cast with early return here - D82056 is trying to clean this up in similar existing code. cc @ctetreau

Patch updated:
Added bail out and regression test for scalable vectors.

LGTM - cheers

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
90	Maybe pull this out as a precommit NFC?
525	Maybe add an explanation that SLP can't handle this?

This revision is now accepted and ready to land. Jun 28 2020, 12:56 PM

lebedev.ri added inline comments. Jun 28 2020, 1:45 PM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
200–201	Looks like some of this should be a preparatory NFC cleanup
601	`++SomeNewStatistic;`

spatel marked an inline comment as done. Jun 29 2020, 6:04 AM

spatel added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
90	Yes, I'll do that. I did not do it in advance of this review because it introduces some redundancy for the existing code. So it is of questionable value without the new caller.

spatel mentioned this in rG3b95d8346d58: [VectorCombine] refactor - make helper function for extract to shuffle logic…. Jun 29 2020, 7:00 AM

spatel marked 4 inline comments as done. Jun 29 2020, 7:30 AM

Closed by commit rGb6315aee5b42: [VectorCombine] try to form vector compare and binop to eliminate scalar ops (authored by spatel). Jun 29 2020, 8:04 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp	87 lines
llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll	13 lines
llvm/test/Transforms/VectorCombine/X86/extract-cmp-binop.ll	80 lines
Diff 274131

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

Show All 29 Lines
#include "llvm/Transforms/Vectorize.h"		#include "llvm/Transforms/Vectorize.h"

using namespace llvm;		using namespace llvm;
using namespace llvm::PatternMatch;		using namespace llvm::PatternMatch;

#define DEBUG_TYPE "vector-combine"		#define DEBUG_TYPE "vector-combine"
STATISTIC(NumVecCmp, "Number of vector compares formed");		STATISTIC(NumVecCmp, "Number of vector compares formed");
STATISTIC(NumVecBO, "Number of vector binops formed");		STATISTIC(NumVecBO, "Number of vector binops formed");
		STATISTIC(NumVecCmpBO, "Number of vector compare + binop formed");
STATISTIC(NumShufOfBitcast, "Number of shuffles moved after bitcast");		STATISTIC(NumShufOfBitcast, "Number of shuffles moved after bitcast");
STATISTIC(NumScalarBO, "Number of scalar binops formed");		STATISTIC(NumScalarBO, "Number of scalar binops formed");
STATISTIC(NumScalarCmp, "Number of scalar compares formed");		STATISTIC(NumScalarCmp, "Number of scalar compares formed");

static cl::opt<bool> DisableVectorCombine(		static cl::opt<bool> DisableVectorCombine(
"disable-vector-combine", cl::init(false), cl::Hidden,		"disable-vector-combine", cl::init(false), cl::Hidden,
cl::desc("Disable all vector combine transforms"));		cl::desc("Disable all vector combine transforms"));

Show All 26 Lines	bool isExtractExtractCheap(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
unsigned PreferredExtractIndex);		unsigned PreferredExtractIndex);
void foldExtExtCmp(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		void foldExtExtCmp(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
Instruction &I);		Instruction &I);
void foldExtExtBinop(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		void foldExtExtBinop(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
Instruction &I);		Instruction &I);
bool foldExtractExtract(Instruction &I);		bool foldExtractExtract(Instruction &I);
bool foldBitcastShuf(Instruction &I);		bool foldBitcastShuf(Instruction &I);
bool scalarizeBinopOrCmp(Instruction &I);		bool scalarizeBinopOrCmp(Instruction &I);
		bool foldExtractedCmps(Instruction &I);
};		};

static void replaceValue(Value &Old, Value &New) {		static void replaceValue(Value &Old, Value &New) {
Old.replaceAllUsesWith(&New);		Old.replaceAllUsesWith(&New);
New.takeName(&Old);		New.takeName(&Old);
}		}

/// Determine which, if any, of the inputs should be replaced by a shuffle		/// Determine which, if any, of the inputs should be replaced by a shuffle
/// followed by extract from a different index.		/// followed by extract from a different index.
ExtractElementInst *VectorCombine::getShuffleExtract(		ExtractElementInst *VectorCombine::getShuffleExtract(
ExtractElementInst *Ext0, ExtractElementInst *Ext1,		ExtractElementInst *Ext0, ExtractElementInst *Ext1,
unsigned PreferredExtractIndex = InvalidIndex) const {		unsigned PreferredExtractIndex = InvalidIndex) const {
assert(isa<ConstantInt>(Ext0->getIndexOperand()) &&		assert(isa<ConstantInt>(Ext0->getIndexOperand()) &&
isa<ConstantInt>(Ext1->getIndexOperand()) &&		isa<ConstantInt>(Ext1->getIndexOperand()) &&
"Expected constant extract indexes");		"Expected constant extract indexes");

unsigned Index0 = cast<ConstantInt>(Ext0->getIndexOperand())->getZExtValue();		unsigned Index0 = cast<ConstantInt>(Ext0->getIndexOperand())->getZExtValue();
▲ Show 20 Lines • Show All 93 Lines • ▼ Show 20 Lines	if (Ext0->getOperand(0) == Ext1->getOperand(0) && Ext0Index == Ext1Index) {
// Handle the general case. Each extract is actually a different value:		// Handle the general case. Each extract is actually a different value:
// opcode (extelt V0, C0), (extelt V1, C1) --> extelt (opcode V0, V1), C		// opcode (extelt V0, C0), (extelt V1, C1) --> extelt (opcode V0, V1), C
OldCost = Extract0Cost + Extract1Cost + ScalarOpCost;		OldCost = Extract0Cost + Extract1Cost + ScalarOpCost;
NewCost = VectorOpCost + CheapExtractCost +		NewCost = VectorOpCost + CheapExtractCost +
!Ext0->hasOneUse() * Extract0Cost +		!Ext0->hasOneUse() * Extract0Cost +
!Ext1->hasOneUse() * Extract1Cost;		!Ext1->hasOneUse() * Extract1Cost;
}		}

ConvertToShuffle = getShuffleExtract(Ext0, Ext1, PreferredExtractIndex);		ConvertToShuffle = getShuffleExtract(Ext0, Ext1, PreferredExtractIndex);
if (ConvertToShuffle) {		if (ConvertToShuffle) {
if (IsBinOp && DisableBinopExtractShuffle)		if (IsBinOp && DisableBinopExtractShuffle)
return true;		return true;

// If we are extracting from 2 different indexes, then one operand must be		// If we are extracting from 2 different indexes, then one operand must be
// shuffled before performing the vector operation. The shuffle mask is		// shuffled before performing the vector operation. The shuffle mask is
// undefined except for 1 lane that is being translated to the remaining		// undefined except for 1 lane that is being translated to the remaining
// extraction lane. Therefore, it is a splat shuffle. Ex:		// extraction lane. Therefore, it is a splat shuffle. Ex:
// ShufMask = { undef, undef, 0, undef }		// ShufMask = { undef, undef, 0, undef }
▲ Show 20 Lines • Show All 305 Lines • ▼ Show 20 Lines	bool VectorCombine::scalarizeBinopOrCmp(Instruction &I) {
// Fold the vector constants in the original vectors into a new base vector.		// Fold the vector constants in the original vectors into a new base vector.
Constant *NewVecC = IsCmp ? ConstantExpr::getCompare(Pred, VecC0, VecC1)		Constant *NewVecC = IsCmp ? ConstantExpr::getCompare(Pred, VecC0, VecC1)
: ConstantExpr::get(Opcode, VecC0, VecC1);		: ConstantExpr::get(Opcode, VecC0, VecC1);
Value *Insert = Builder.CreateInsertElement(NewVecC, Scalar, Index);		Value *Insert = Builder.CreateInsertElement(NewVecC, Scalar, Index);
replaceValue(I, *Insert);		replaceValue(I, *Insert);
return true;		return true;
}		}

		/// Try to combine a scalar binop + 2 scalar compares of extracted elements of
		/// a vector into vector operations followed by extract. Note: The SLP pass
		/// may miss this pattern because of implementation problems.
		bool VectorCombine::foldExtractedCmps(Instruction &I) {
		// We are looking for a scalar binop of booleans.
		// binop i1 (cmp Pred I0, C0), (cmp Pred I1, C1)
		if (!I.isBinaryOp() || !I.getType()->isIntegerTy(1))
		return false;

		// The compare predicates should match, and each compare should have a
		// constant operand.
		// TODO: Relax the one-use constraints.
		Value *B0 = I.getOperand(0), *B1 = I.getOperand(1);
		Instruction *I0, *I1;
		Constant *C0, *C1;
		CmpInst::Predicate P0, P1;
		if (!match(B0, m_OneUse(m_Cmp(P0, m_Instruction(I0), m_Constant(C0)))) ||
		!match(B1, m_OneUse(m_Cmp(P1, m_Instruction(I1), m_Constant(C1)))) ||
		P0 != P1)
		return false;

		// The compare operands must be extracts of the same vector with constant
		// extract indexes.
		// TODO: Relax the one-use constraints.
		Value *X;
		uint64_t Index0, Index1;
		if (!match(I0, m_OneUse(m_ExtractElt(m_Value(X), m_ConstantInt(Index0)))) ||
		!match(I1, m_OneUse(m_ExtractElt(m_Specific(X), m_ConstantInt(Index1)))))
		return false;

		auto *Ext0 = cast<ExtractElementInst>(I0);
		auto *Ext1 = cast<ExtractElementInst>(I1);
		ExtractElementInst *ConvertToShuf = getShuffleExtract(Ext0, Ext1);
		if (!ConvertToShuf)
		return false;

		// The original scalar pattern is:
		// binop i1 (cmp Pred (ext X, Index0), C0), (cmp Pred (ext X, Index1), C1)
		CmpInst::Predicate Pred = P0;
		unsigned CmpOpcode = CmpInst::isFPPredicate(Pred) ? Instruction::FCmp
		: Instruction::ICmp;
		auto *VecTy = dyn_cast<FixedVectorType>(X->getType());
		if (!VecTy)
		return false;

		int OldCost = TTI.getVectorInstrCost(Ext0->getOpcode(), VecTy, Index0);
		OldCost += TTI.getVectorInstrCost(Ext1->getOpcode(), VecTy, Index1);
		OldCost += TTI.getCmpSelInstrCost(CmpOpcode, I0->getType()) * 2;
		OldCost += TTI.getArithmeticInstrCost(I.getOpcode(), I.getType());

		// The proposed vector pattern is:
		// vcmp = cmp Pred X, VecC
		// ext (binop vNi1 vcmp, (shuffle vcmp, Index1)), Index0
		int CheapIndex = ConvertToShuf == Ext0 ? Index1 : Index0;
		int ExpensiveIndex = ConvertToShuf == Ext0 ? Index0 : Index1;
		auto *CmpTy = cast<FixedVectorType>(CmpInst::makeCmpResultType(X->getType()));
		int NewCost = TTI.getCmpSelInstrCost(CmpOpcode, X->getType());
		NewCost +=
		TTI.getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, CmpTy);
		NewCost += TTI.getArithmeticInstrCost(I.getOpcode(), CmpTy);
		NewCost += TTI.getVectorInstrCost(Ext0->getOpcode(), CmpTy, CheapIndex);

		// Aggressively form vector ops if the cost is equal because the transform
		// may enable further optimization.
		// Codegen can reverse this transform (scalarize) if it was not profitable.
		if (OldCost < NewCost)
		return false;

		// Create a vector constant from the 2 scalar constants.
		SmallVector<Constant *, 32> CmpC(VecTy->getNumElements(),
		UndefValue::get(VecTy->getElementType()));
		CmpC[Index0] = C0;
		CmpC[Index1] = C1;
		Value *VCmp = Builder.CreateCmp(Pred, X, ConstantVector::get(CmpC));

		Value *Shuf = createShiftShuffle(VCmp, ExpensiveIndex, CheapIndex, Builder);
		Value *VecLogic = Builder.CreateBinOp(cast<BinaryOperator>(I).getOpcode(),
		VCmp, Shuf);
		Value *NewExt = Builder.CreateExtractElement(VecLogic, CheapIndex);
		replaceValue(I, *NewExt);
		++NumVecCmpBO;
		return true;
		}

/// This is the entry point for all transforms. Pass manager differences are		/// This is the entry point for all transforms. Pass manager differences are
/// handled in the callers of this function.		/// handled in the callers of this function.
bool VectorCombine::run() {		bool VectorCombine::run() {
if (DisableVectorCombine)		if (DisableVectorCombine)
return false;		return false;

bool MadeChange = false;		bool MadeChange = false;
for (BasicBlock &BB : F) {		for (BasicBlock &BB : F) {
// Ignore unreachable basic blocks.		// Ignore unreachable basic blocks.
if (!DT.isReachableFromEntry(&BB))		if (!DT.isReachableFromEntry(&BB))
continue;		continue;
// Do not delete instructions under here and invalidate the iterator.		// Do not delete instructions under here and invalidate the iterator.
// Walk the block forwards to enable simple iterative chains of transforms.		// Walk the block forwards to enable simple iterative chains of transforms.
// TODO: It could be more efficient to remove dead instructions		// TODO: It could be more efficient to remove dead instructions
// iteratively in this loop rather than waiting until the end.		// iteratively in this loop rather than waiting until the end.
for (Instruction &I : BB) {		for (Instruction &I : BB) {
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
continue;		continue;
Builder.SetInsertPoint(&I);		Builder.SetInsertPoint(&I);
MadeChange |= foldExtractExtract(I);		MadeChange |= foldExtractExtract(I);
MadeChange |= foldBitcastShuf(I);		MadeChange |= foldBitcastShuf(I);
MadeChange |= scalarizeBinopOrCmp(I);		MadeChange |= scalarizeBinopOrCmp(I);
		MadeChange |= foldExtractedCmps(I);
}		}
}		}

// We're done with transforms, so remove dead instructions.		// We're done with transforms, so remove dead instructions.
if (MadeChange)		if (MadeChange)
for (BasicBlock &BB : F)		for (BasicBlock &BB : F)
SimplifyInstructionsInBlock(&BB);		SimplifyInstructionsInBlock(&BB);

▲ Show 20 Lines • Show All 60 Lines • Show Last 20 Lines
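
As an illustration of the compare constant and the shift-shuffle mask that foldExtractedCmps builds, here is a stand-alone sketch. It assumes the layout visible in the CHECK lines of extract-cmp-binop.ll: every lane is undef except the two compared lanes of the constant, and except the one mask lane at the cheap index, which reads from the expensive index. The helper names and the use of `std::optional` to model undef are hypothetical and not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// std::nullopt models an undef lane. Hypothetical helpers, not LLVM APIs.
using Lanes = std::vector<std::optional<int>>;

// Models: CmpC = all undef; CmpC[Index0] = C0; CmpC[Index1] = C1;
Lanes buildCompareConstant(std::size_t NumElts, std::size_t Index0, int C0,
                           std::size_t Index1, int C1) {
  assert(Index0 < NumElts && Index1 < NumElts && Index0 != Index1);
  Lanes VecC(NumElts); // every lane starts as undef
  VecC[Index0] = C0;
  VecC[Index1] = C1;
  return VecC;
}

// Models the shift shuffle: the lane at CheapIndex reads the lane at
// ExpensiveIndex; all other mask lanes stay undef.
Lanes buildShiftMask(std::size_t NumElts, std::size_t ExpensiveIndex,
                     std::size_t CheapIndex) {
  assert(ExpensiveIndex < NumElts && CheapIndex < NumElts);
  Lanes Mask(NumElts);
  Mask[CheapIndex] = static_cast<int>(ExpensiveIndex);
  return Mask;
}
```

For the `icmp_xor_v4i32` test (indexes 3 and 1, constants 42 and -8, final extract at lane 1) these helpers yield a constant of `<undef, -8, undef, 42>` and a mask of `<undef, 3, undef, undef>`, matching the expected output of that test.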

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll

	Show First 20 Lines • Show All 288 Lines • ▼ Show 20 Lines
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> undef, double [[C:%.*]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> undef, double [[C:%.*]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[FNEG]], i32 1			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[FNEG]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[B]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[B]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[C]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[C]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = fsub <2 x double> [[TMP1]], [[TMP3]]			; CHECK-NEXT: [[TMP4:%.*]] = fsub <2 x double> [[TMP1]], [[TMP3]]
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> undef, double [[MUL]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> undef, double [[MUL]], i32 0
	; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x double> [[TMP5]], <2 x double> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x double> [[TMP5]], <2 x double> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP7:%.*]] = fdiv <2 x double> [[TMP4]], [[TMP6]]			; CHECK-NEXT: [[TMP7:%.*]] = fdiv <2 x double> [[TMP4]], [[TMP6]]
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[TMP7]], i32 0			; CHECK-NEXT: [[TMP8:%.*]] = fcmp olt <2 x double> [[TMP7]], <double 0x3EB0C6F7A0B5ED8D, double 0x3EB0C6F7A0B5ED8D>
	; CHECK-NEXT: [[CMP:%.*]] = fcmp olt double [[TMP8]], 0x3EB0C6F7A0B5ED8D			; CHECK-NEXT: [[SHIFT:%.*]] = shufflevector <2 x i1> [[TMP8]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x double> [[TMP7]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = and <2 x i1> [[TMP8]], [[SHIFT]]
	; CHECK-NEXT: [[CMP4:%.*]] = fcmp olt double [[TMP9]], 0x3EB0C6F7A0B5ED8D			; CHECK-NEXT: [[OR_COND:%.*]] = extractelement <2 x i1> [[TMP9]], i64 0
	; CHECK-NEXT: [[OR_COND:%.*]] = and i1 [[CMP]], [[CMP4]]
	; CHECK-NEXT: br i1 [[OR_COND]], label [[CLEANUP:%.*]], label [[LOR_LHS_FALSE:%.*]]			; CHECK-NEXT: br i1 [[OR_COND]], label [[CLEANUP:%.*]], label [[LOR_LHS_FALSE:%.*]]
	; CHECK: lor.lhs.false:			; CHECK: lor.lhs.false:
	; CHECK-NEXT: [[TMP10:%.*]] = fcmp ule <2 x double> [[TMP7]], <double 1.000000e+00, double 1.000000e+00>			; CHECK-NEXT: [[TMP10:%.*]] = fcmp ule <2 x double> [[TMP7]], <double 1.000000e+00, double 1.000000e+00>
	; CHECK-NEXT: [[SHIFT:%.*]] = shufflevector <2 x i1> [[TMP10]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>			; CHECK-NEXT: [[SHIFT2:%.*]] = shufflevector <2 x i1> [[TMP10]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
	; CHECK-NEXT: [[TMP11:%.*]] = or <2 x i1> [[TMP10]], [[SHIFT]]			; CHECK-NEXT: [[TMP11:%.*]] = or <2 x i1> [[TMP10]], [[SHIFT2]]
	; CHECK-NEXT: [[NOT_OR_COND1:%.*]] = extractelement <2 x i1> [[TMP11]], i32 0			; CHECK-NEXT: [[NOT_OR_COND1:%.*]] = extractelement <2 x i1> [[TMP11]], i32 0
	; CHECK-NEXT: ret i1 [[NOT_OR_COND1]]			; CHECK-NEXT: ret i1 [[NOT_OR_COND1]]
	; CHECK: cleanup:			; CHECK: cleanup:
	; CHECK-NEXT: ret i1 false			; CHECK-NEXT: ret i1 false
	;			;
	entry:			entry:
	%fneg = fneg double %b			%fneg = fneg double %b
	%add = fadd double %fneg, %c			%add = fadd double %fneg, %c
	Show All 31 Lines

llvm/test/Transforms/VectorCombine/X86/extract-cmp-binop.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=sse2 | FileCheck %s --check-prefixes=CHECK,SSE			; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=sse2 | FileCheck %s --check-prefixes=CHECK,SSE
	; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=avx2 | FileCheck %s --check-prefixes=CHECK,AVX			; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=avx2 | FileCheck %s --check-prefixes=CHECK,AVX

	define i1 @fcmp_and_v2f64(<2 x double> %a) {			define i1 @fcmp_and_v2f64(<2 x double> %a) {
	; CHECK-LABEL: @fcmp_and_v2f64(			; SSE-LABEL: @fcmp_and_v2f64(
	; CHECK-NEXT: [[E1:%.*]] = extractelement <2 x double> [[A:%.*]], i32 0			; SSE-NEXT: [[E1:%.*]] = extractelement <2 x double> [[A:%.*]], i32 0
	; CHECK-NEXT: [[E2:%.*]] = extractelement <2 x double> [[A]], i32 1			; SSE-NEXT: [[E2:%.*]] = extractelement <2 x double> [[A]], i32 1
	; CHECK-NEXT: [[CMP1:%.*]] = fcmp olt double [[E1]], 4.200000e+01			; SSE-NEXT: [[CMP1:%.*]] = fcmp olt double [[E1]], 4.200000e+01
	; CHECK-NEXT: [[CMP2:%.*]] = fcmp olt double [[E2]], -8.000000e+00			; SSE-NEXT: [[CMP2:%.*]] = fcmp olt double [[E2]], -8.000000e+00
	; CHECK-NEXT: [[R:%.*]] = and i1 [[CMP1]], [[CMP2]]			; SSE-NEXT: [[R:%.*]] = and i1 [[CMP1]], [[CMP2]]
	; CHECK-NEXT: ret i1 [[R]]			; SSE-NEXT: ret i1 [[R]]
				;
				; AVX-LABEL: @fcmp_and_v2f64(
				; AVX-NEXT: [[TMP1:%.*]] = fcmp olt <2 x double> [[A:%.*]], <double 4.200000e+01, double -8.000000e+00>
				; AVX-NEXT: [[SHIFT:%.*]] = shufflevector <2 x i1> [[TMP1]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
				; AVX-NEXT: [[TMP2:%.*]] = and <2 x i1> [[TMP1]], [[SHIFT]]
				; AVX-NEXT: [[R:%.*]] = extractelement <2 x i1> [[TMP2]], i64 0
				; AVX-NEXT: ret i1 [[R]]
	;			;
	%e1 = extractelement <2 x double> %a, i32 0			%e1 = extractelement <2 x double> %a, i32 0
	%e2 = extractelement <2 x double> %a, i32 1			%e2 = extractelement <2 x double> %a, i32 1
	%cmp1 = fcmp olt double %e1, 42.0			%cmp1 = fcmp olt double %e1, 42.0
	%cmp2 = fcmp olt double %e2, -8.0			%cmp2 = fcmp olt double %e2, -8.0
	%r = and i1 %cmp1, %cmp2			%r = and i1 %cmp1, %cmp2
	ret i1 %r			ret i1 %r
	}			}

	define i1 @fcmp_or_v4f64(<4 x double> %a) {			define i1 @fcmp_or_v4f64(<4 x double> %a) {
	; CHECK-LABEL: @fcmp_or_v4f64(			; SSE-LABEL: @fcmp_or_v4f64(
	; CHECK-NEXT: [[E1:%.*]] = extractelement <4 x double> [[A:%.*]], i32 0			; SSE-NEXT: [[E1:%.*]] = extractelement <4 x double> [[A:%.*]], i32 0
	; CHECK-NEXT: [[E2:%.*]] = extractelement <4 x double> [[A]], i64 2			; SSE-NEXT: [[E2:%.*]] = extractelement <4 x double> [[A]], i64 2
	; CHECK-NEXT: [[CMP1:%.*]] = fcmp olt double [[E1]], 4.200000e+01			; SSE-NEXT: [[CMP1:%.*]] = fcmp olt double [[E1]], 4.200000e+01
	; CHECK-NEXT: [[CMP2:%.*]] = fcmp olt double [[E2]], -8.000000e+00			; SSE-NEXT: [[CMP2:%.*]] = fcmp olt double [[E2]], -8.000000e+00
	; CHECK-NEXT: [[R:%.*]] = or i1 [[CMP1]], [[CMP2]]			; SSE-NEXT: [[R:%.*]] = or i1 [[CMP1]], [[CMP2]]
	; CHECK-NEXT: ret i1 [[R]]			; SSE-NEXT: ret i1 [[R]]
				;
				; AVX-LABEL: @fcmp_or_v4f64(
				; AVX-NEXT: [[TMP1:%.*]] = fcmp olt <4 x double> [[A:%.*]], <double 4.200000e+01, double undef, double -8.000000e+00, double undef>
				; AVX-NEXT: [[SHIFT:%.*]] = shufflevector <4 x i1> [[TMP1]], <4 x i1> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 undef>
				; AVX-NEXT: [[TMP2:%.*]] = or <4 x i1> [[TMP1]], [[SHIFT]]
				; AVX-NEXT: [[R:%.*]] = extractelement <4 x i1> [[TMP2]], i64 0
				; AVX-NEXT: ret i1 [[R]]
	;			;
	%e1 = extractelement <4 x double> %a, i32 0			%e1 = extractelement <4 x double> %a, i32 0
	%e2 = extractelement <4 x double> %a, i64 2			%e2 = extractelement <4 x double> %a, i64 2
	%cmp1 = fcmp olt double %e1, 42.0			%cmp1 = fcmp olt double %e1, 42.0
	%cmp2 = fcmp olt double %e2, -8.0			%cmp2 = fcmp olt double %e2, -8.0
	%r = or i1 %cmp1, %cmp2			%r = or i1 %cmp1, %cmp2
	ret i1 %r			ret i1 %r
	}			}

	define i1 @icmp_xor_v4i32(<4 x i32> %a) {			define i1 @icmp_xor_v4i32(<4 x i32> %a) {
	; CHECK-LABEL: @icmp_xor_v4i32(			; CHECK-LABEL: @icmp_xor_v4i32(
	; CHECK-NEXT: [[E1:%.*]] = extractelement <4 x i32> [[A:%.*]], i32 3			; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt <4 x i32> [[A:%.*]], <i32 undef, i32 -8, i32 undef, i32 42>
	; CHECK-NEXT: [[E2:%.*]] = extractelement <4 x i32> [[A]], i32 1			; CHECK-NEXT: [[SHIFT:%.*]] = shufflevector <4 x i1> [[TMP1]], <4 x i1> undef, <4 x i32> <i32 undef, i32 3, i32 undef, i32 undef>
	; CHECK-NEXT: [[CMP1:%.*]] = icmp sgt i32 [[E1]], 42			; CHECK-NEXT: [[TMP2:%.*]] = xor <4 x i1> [[TMP1]], [[SHIFT]]
	; CHECK-NEXT: [[CMP2:%.*]] = icmp sgt i32 [[E2]], -8			; CHECK-NEXT: [[R:%.*]] = extractelement <4 x i1> [[TMP2]], i64 1
	; CHECK-NEXT: [[R:%.*]] = xor i1 [[CMP1]], [[CMP2]]
	; CHECK-NEXT: ret i1 [[R]]			; CHECK-NEXT: ret i1 [[R]]
	;			;
	%e1 = extractelement <4 x i32> %a, i32 3			%e1 = extractelement <4 x i32> %a, i32 3
	%e2 = extractelement <4 x i32> %a, i32 1			%e2 = extractelement <4 x i32> %a, i32 1
	%cmp1 = icmp sgt i32 %e1, 42			%cmp1 = icmp sgt i32 %e1, 42
	%cmp2 = icmp sgt i32 %e2, -8			%cmp2 = icmp sgt i32 %e2, -8
	%r = xor i1 %cmp1, %cmp2			%r = xor i1 %cmp1, %cmp2
	ret i1 %r			ret i1 %r
	}			}

	; add is not canonical (should be xor), but that is ok.			; add is not canonical (should be xor), but that is ok.

	define i1 @icmp_add_v8i32(<8 x i32> %a) {			define i1 @icmp_add_v8i32(<8 x i32> %a) {
	; CHECK-LABEL: @icmp_add_v8i32(			; SSE-LABEL: @icmp_add_v8i32(
	; CHECK-NEXT: [[E1:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 7			; SSE-NEXT: [[E1:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 7
	; CHECK-NEXT: [[E2:%.*]] = extractelement <8 x i32> [[A]], i32 2			; SSE-NEXT: [[E2:%.*]] = extractelement <8 x i32> [[A]], i32 2
	; CHECK-NEXT: [[CMP1:%.*]] = icmp eq i32 [[E1]], 42			; SSE-NEXT: [[CMP1:%.*]] = icmp eq i32 [[E1]], 42
	; CHECK-NEXT: [[CMP2:%.*]] = icmp eq i32 [[E2]], -8			; SSE-NEXT: [[CMP2:%.*]] = icmp eq i32 [[E2]], -8
	; CHECK-NEXT: [[R:%.*]] = add i1 [[CMP1]], [[CMP2]]			; SSE-NEXT: [[R:%.*]] = add i1 [[CMP1]], [[CMP2]]
	; CHECK-NEXT: ret i1 [[R]]			; SSE-NEXT: ret i1 [[R]]
				;
				; AVX-LABEL: @icmp_add_v8i32(
				; AVX-NEXT: [[TMP1:%.*]] = icmp eq <8 x i32> [[A:%.*]], <i32 undef, i32 undef, i32 -8, i32 undef, i32 undef, i32 undef, i32 undef, i32 42>
				; AVX-NEXT: [[SHIFT:%.*]] = shufflevector <8 x i1> [[TMP1]], <8 x i1> undef, <8 x i32> <i32 undef, i32 undef, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				; AVX-NEXT: [[TMP2:%.*]] = add <8 x i1> [[TMP1]], [[SHIFT]]
				; AVX-NEXT: [[R:%.*]] = extractelement <8 x i1> [[TMP2]], i64 2
				; AVX-NEXT: ret i1 [[R]]
	;			;
	%e1 = extractelement <8 x i32> %a, i32 7			%e1 = extractelement <8 x i32> %a, i32 7
	%e2 = extractelement <8 x i32> %a, i32 2			%e2 = extractelement <8 x i32> %a, i32 2
	%cmp1 = icmp eq i32 %e1, 42			%cmp1 = icmp eq i32 %e1, 42
	%cmp2 = icmp eq i32 %e2, -8			%cmp2 = icmp eq i32 %e2, -8
	%r = add i1 %cmp1, %cmp2			%r = add i1 %cmp1, %cmp2
	ret i1 %r			ret i1 %r
	}			}

				; Negative test - this could CSE/simplify.

	define i1 @same_extract_index(<4 x i32> %a) {			define i1 @same_extract_index(<4 x i32> %a) {
	; CHECK-LABEL: @same_extract_index(			; CHECK-LABEL: @same_extract_index(
	; CHECK-NEXT: [[E1:%.*]] = extractelement <4 x i32> [[A:%.*]], i32 2			; CHECK-NEXT: [[E1:%.*]] = extractelement <4 x i32> [[A:%.*]], i32 2
	; CHECK-NEXT: [[E2:%.*]] = extractelement <4 x i32> [[A]], i32 2			; CHECK-NEXT: [[E2:%.*]] = extractelement <4 x i32> [[A]], i32 2
	; CHECK-NEXT: [[CMP1:%.*]] = icmp ugt i32 [[E1]], 42			; CHECK-NEXT: [[CMP1:%.*]] = icmp ugt i32 [[E1]], 42
	; CHECK-NEXT: [[CMP2:%.*]] = icmp ugt i32 [[E2]], -8			; CHECK-NEXT: [[CMP2:%.*]] = icmp ugt i32 [[E2]], -8
	; CHECK-NEXT: [[R:%.*]] = and i1 [[CMP1]], [[CMP2]]			; CHECK-NEXT: [[R:%.*]] = and i1 [[CMP1]], [[CMP2]]
	; CHECK-NEXT: ret i1 [[R]]			; CHECK-NEXT: ret i1 [[R]]
	;			;
	%e1 = extractelement <4 x i32> %a, i32 2			%e1 = extractelement <4 x i32> %a, i32 2
	%e2 = extractelement <4 x i32> %a, i32 2			%e2 = extractelement <4 x i32> %a, i32 2
	%cmp1 = icmp ugt i32 %e1, 42			%cmp1 = icmp ugt i32 %e1, 42
	%cmp2 = icmp ugt i32 %e2, -8			%cmp2 = icmp ugt i32 %e2, -8
	%r = and i1 %cmp1, %cmp2			%r = and i1 %cmp1, %cmp2
	ret i1 %r			ret i1 %r
	}			}

				; Negative test - need identical predicates.

	define i1 @different_preds(<4 x i32> %a) {			define i1 @different_preds(<4 x i32> %a) {
	; CHECK-LABEL: @different_preds(			; CHECK-LABEL: @different_preds(
	; CHECK-NEXT: [[E1:%.*]] = extractelement <4 x i32> [[A:%.*]], i32 1			; CHECK-NEXT: [[E1:%.*]] = extractelement <4 x i32> [[A:%.*]], i32 1
	; CHECK-NEXT: [[E2:%.*]] = extractelement <4 x i32> [[A]], i32 2			; CHECK-NEXT: [[E2:%.*]] = extractelement <4 x i32> [[A]], i32 2
	; CHECK-NEXT: [[CMP1:%.*]] = icmp sgt i32 [[E1]], 42			; CHECK-NEXT: [[CMP1:%.*]] = icmp sgt i32 [[E1]], 42
	; CHECK-NEXT: [[CMP2:%.*]] = icmp ugt i32 [[E2]], -8			; CHECK-NEXT: [[CMP2:%.*]] = icmp ugt i32 [[E2]], -8
	; CHECK-NEXT: [[R:%.*]] = and i1 [[CMP1]], [[CMP2]]			; CHECK-NEXT: [[R:%.*]] = and i1 [[CMP1]], [[CMP2]]
	; CHECK-NEXT: ret i1 [[R]]			; CHECK-NEXT: ret i1 [[R]]
	;			;
	%e1 = extractelement <4 x i32> %a, i32 1			%e1 = extractelement <4 x i32> %a, i32 1
	%e2 = extractelement <4 x i32> %a, i32 2			%e2 = extractelement <4 x i32> %a, i32 2
	%cmp1 = icmp sgt i32 %e1, 42			%cmp1 = icmp sgt i32 %e1, 42
	%cmp2 = icmp ugt i32 %e2, -8			%cmp2 = icmp ugt i32 %e2, -8
	%r = and i1 %cmp1, %cmp2			%r = and i1 %cmp1, %cmp2
	ret i1 %r			ret i1 %r
	}			}

				; Negative test - need 1 source vector.

	define i1 @different_source_vec(<4 x i32> %a, <4 x i32> %b) {			define i1 @different_source_vec(<4 x i32> %a, <4 x i32> %b) {
	; CHECK-LABEL: @different_source_vec(			; CHECK-LABEL: @different_source_vec(
	; CHECK-NEXT: [[E1:%.*]] = extractelement <4 x i32> [[A:%.*]], i32 1			; CHECK-NEXT: [[E1:%.*]] = extractelement <4 x i32> [[A:%.*]], i32 1
	; CHECK-NEXT: [[E2:%.*]] = extractelement <4 x i32> [[B:%.*]], i32 2			; CHECK-NEXT: [[E2:%.*]] = extractelement <4 x i32> [[B:%.*]], i32 2
	; CHECK-NEXT: [[CMP1:%.*]] = icmp sgt i32 [[E1]], 42			; CHECK-NEXT: [[CMP1:%.*]] = icmp sgt i32 [[E1]], 42
	; CHECK-NEXT: [[CMP2:%.*]] = icmp sgt i32 [[E2]], -8			; CHECK-NEXT: [[CMP2:%.*]] = icmp sgt i32 [[E2]], -8
	; CHECK-NEXT: [[R:%.*]] = and i1 [[CMP1]], [[CMP2]]			; CHECK-NEXT: [[R:%.*]] = and i1 [[CMP1]], [[CMP2]]
	; CHECK-NEXT: ret i1 [[R]]			; CHECK-NEXT: ret i1 [[R]]
	;			;
	%e1 = extractelement <4 x i32> %a, i32 1			%e1 = extractelement <4 x i32> %a, i32 1
	%e2 = extractelement <4 x i32> %b, i32 2			%e2 = extractelement <4 x i32> %b, i32 2
	%cmp1 = icmp sgt i32 %e1, 42			%cmp1 = icmp sgt i32 %e1, 42
	%cmp2 = icmp sgt i32 %e2, -8			%cmp2 = icmp sgt i32 %e2, -8
	%r = and i1 %cmp1, %cmp2			%r = and i1 %cmp1, %cmp2
	ret i1 %r			ret i1 %r
	}			}

				; Negative test - don't try this with scalable vectors.

	define i1 @scalable(<vscale x 4 x i32> %a) {			define i1 @scalable(<vscale x 4 x i32> %a) {
	; CHECK-LABEL: @scalable(			; CHECK-LABEL: @scalable(
	; CHECK-NEXT: [[E1:%.*]] = extractelement <vscale x 4 x i32> [[A:%.*]], i32 3			; CHECK-NEXT: [[E1:%.*]] = extractelement <vscale x 4 x i32> [[A:%.*]], i32 3
	; CHECK-NEXT: [[E2:%.*]] = extractelement <vscale x 4 x i32> [[A]], i32 1			; CHECK-NEXT: [[E2:%.*]] = extractelement <vscale x 4 x i32> [[A]], i32 1
	; CHECK-NEXT: [[CMP1:%.*]] = icmp sgt i32 [[E1]], 42			; CHECK-NEXT: [[CMP1:%.*]] = icmp sgt i32 [[E1]], 42
	; CHECK-NEXT: [[CMP2:%.*]] = icmp sgt i32 [[E2]], -8			; CHECK-NEXT: [[CMP2:%.*]] = icmp sgt i32 [[E2]], -8
	; CHECK-NEXT: [[R:%.*]] = xor i1 [[CMP1]], [[CMP2]]			; CHECK-NEXT: [[R:%.*]] = xor i1 [[CMP1]], [[CMP2]]
	; CHECK-NEXT: ret i1 [[R]]			; CHECK-NEXT: ret i1 [[R]]
	;			;
	%e1 = extractelement <vscale x 4 x i32> %a, i32 3			%e1 = extractelement <vscale x 4 x i32> %a, i32 3
	%e2 = extractelement <vscale x 4 x i32> %a, i32 1			%e2 = extractelement <vscale x 4 x i32> %a, i32 1
	%cmp1 = icmp sgt i32 %e1, 42			%cmp1 = icmp sgt i32 %e1, 42
	%cmp2 = icmp sgt i32 %e2, -8			%cmp2 = icmp sgt i32 %e2, -8
	%r = xor i1 %cmp1, %cmp2			%r = xor i1 %cmp1, %cmp2
	ret i1 %r			ret i1 %r
	}			}