
Details

Reviewers

spatel
RKSimon
dtemirbulatov
anton-afanasyev
sdesmalen
lebedev.ri

Commits

rG04ba80ca4dee: [Instcombiner]Improve emission of logical or/and reductions.

Summary

For logical or/and reductions we emit regular @llvm.vector.reduce.or/and.vxi1 intrinsic calls.
These intrinsics are inefficient for logical or/and reductions,
especially when the optimizer is able to emit short-circuit versions of
the scalar or/and instructions, which leaves the vector code less efficient than
the scalar version.
Instead, an or reduction for i1 can be represented as:

%val = bitcast <ReduxWidth x i1> to iReduxWidth
%res = cmp ne iReduxWidth %val, 0

An and reduction for i1 can be represented as:

%val = bitcast <ReduxWidth x i1> to iReduxWidth
%res = cmp eq iReduxWidth %val, 11111

This improves the performance of the vector code significantly and makes it
outperform the short-circuit scalar code.
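The equivalence claimed in the summary can be checked outside LLVM with a minimal C++ sketch. It models the bitcast by packing the lanes of an i1 vector into the low bits of an integer and then compares against 0 or the all-ones value. The helper names `packLanes`, `orReduce`, and `andReduce` are hypothetical and not part of LLVM, and the sketch assumes between 1 and 64 lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of `bitcast <N x i1> to iN`: lane i becomes bit i of the result.
// The bit order does not matter for the comparisons below.
// Assumes 1 <= N <= 64 (hypothetical helper, not an LLVM API).
std::uint64_t packLanes(const std::vector<bool> &lanes) {
  assert(!lanes.empty() && lanes.size() <= 64 && "lane count out of range");
  std::uint64_t bits = 0;
  for (std::size_t i = 0; i < lanes.size(); ++i)
    if (lanes[i])
      bits |= std::uint64_t{1} << i;
  return bits;
}

// Or reduction: at least one lane is set iff the packed value is non-zero.
bool orReduce(const std::vector<bool> &lanes) {
  return packLanes(lanes) != 0;
}

// And reduction: every lane is set iff the packed value equals the
// all-ones constant of width N (written as 11111 in the summary).
bool andReduce(const std::vector<bool> &lanes) {
  const std::size_t n = lanes.size();
  const std::uint64_t allOnes =
      n == 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << n) - 1;
  return packLanes(lanes) == allOnes;
}
```

The all-ones constant depends on the lane count, so a 3-lane vector compares against 0b111 rather than a power-of-2-sized mask.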

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ABataev created this revision. · Feb 24 2021, 11:44 AM

Herald added a subscriber: hiraditya. · Feb 24 2021, 11:44 AM

ABataev requested review of this revision. · Feb 24 2021, 11:44 AM

Herald added a project: Restricted Project. · Feb 24 2021, 11:44 AM

Summary improvements

ABataev added a parent revision: D57059: [SLP] Initial support for the vectorization of the non-power-of-2 vectors. · Feb 24 2021, 1:06 PM

Harbormaster completed remote builds in B90674: Diff 326187. · Feb 24 2021, 1:06 PM

Harbormaster completed remote builds in B90662: Diff 326167. · Feb 24 2021, 2:38 PM

Hi @ABataev, I saw this patch come by and left some comments, hope that's alright!

llvm/lib/Transforms/Utils/LoopUtils.cpp
1031	Can you add the condition that `&& isa<FixedVectorType>(Src)`? (same request for LoopVectorize.cpp and SLPVectorize.cpp) We're starting to make the LoopVectorizer vectorize for scalable VFs. This means we're currently fixing up cases like this where assumptions are made that are only valid for fixed-width vectors. For scalable vectors it might be possible to do the `<vscale x N x i1>` reduction as a compare on `<vscale x 1 x iN>`, but at least for SVE I know that we never want that.
1034–1036	nit: `cast<FixedVectorType>(Src->getType())->getNumElements()`
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6779–6794	Should this be part of BasicTTImpl::getArithmeticReductionCost? That way you don't have to implement it twice.
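The reason for the requested `isa<FixedVectorType>` guard can be illustrated with a small sketch that does not use LLVM at all. It assumes a toy type description (`ToyVectorType`, hypothetical) and a hypothetical helper `bitcastWidth`. The point is that the integer width iN for the bitcast is only known when the lane count is fixed at compile time.

```cpp
#include <cassert>
#include <optional>

// Toy stand-in for a vector type: `minLanes` lanes, multiplied by an
// unknown runtime factor (vscale) when `scalable` is true.
struct ToyVectorType {
  unsigned minLanes;
  bool scalable;
};

// Width of the integer that the i1 vector could be bitcast to, or nullopt
// when the total lane count is not known at compile time.
std::optional<unsigned> bitcastWidth(const ToyVectorType &ty) {
  if (ty.scalable)
    return std::nullopt; // <vscale x N x i1>: no single iN type to target
  assert(ty.minLanes > 0 && "vector must have at least one lane");
  return ty.minLanes;    // <N x i1> -> iN
}
```

A caller would apply the bitcast-and-compare form only when a width is returned and otherwise keep the regular reduction.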

sdesmalen added a reviewer: sdesmalen. · Feb 25 2021, 1:19 AM

Why can't we just do the costmodel change, and then canonicalize to the right variant (i.e. away from @llvm.vector.reduce.?.v?i1) in instcombine?

In D97406#2587071, @lebedev.ri wrote:

Why can't we just do the costmodel change, and then canonicalize to the right variant (i.e. away from @llvm.vector.reduce.?.v?i1) in instcombine?

+1 I'd prefer to avoid yet more special cases in the vectorizers.

Yeah, I also thought about moving this to instcombiner. Originally, this patch was more of a question about where it is better to implement this. I'll move the transformation to the instcombiner.

llvm/lib/Transforms/Utils/LoopUtils.cpp
1031	Sure, will do.

Address comments.

Can't all the vectorizer code be removed now? X86TTIImpl::getArithmeticReductionCost already has bool reduction costs - not sure if we need to improve the default costs ?

In D97406#2590716, @RKSimon wrote:

Can't all the vectorizer code be removed now? X86TTIImpl::getArithmeticReductionCost already has bool reduction costs - not sure if we need to improve the default costs ?

Looks like the bool reduction costs are too high because they still rely on some vector instructions. The cost of bitcast + cmp is much lower than even AVX512 bool reductions, for example. What we can do is check, for all targets, whether we have an or/and bool reduction and fall back to the BasicTTI cost, and teach BasicTTI that the cost of bool or/and reductions is actually just bitcast + cmp.
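The cost rule proposed here can be sketched as a plain C++ decision function. This is not the BasicTTI implementation: `ToyRecurKind`, `ToyCosts`, and `toyReductionCost` are hypothetical names, and the individual costs are supplied by the caller rather than taken from any real target.

```cpp
#include <cassert>

enum class ToyRecurKind { Add, And, Or };

// Per-target costs supplied by the caller; the sketch invents no numbers.
struct ToyCosts {
  unsigned bitcast;   // cost of bitcast <N x i1> to iN
  unsigned icmp;      // cost of the scalar compare on iN
  unsigned reduction; // cost of the generic vector reduction
};

// Returns bitcast + icmp for i1 and/or reductions, and the generic
// reduction cost for everything else.
unsigned toyReductionCost(ToyRecurKind kind, unsigned elementBits,
                          const ToyCosts &costs) {
  const bool isLogical =
      kind == ToyRecurKind::And || kind == ToyRecurKind::Or;
  if (isLogical && elementBits == 1)
    return costs.bitcast + costs.icmp;
  return costs.reduction;
}
```

Keeping the rule in one shared function is what the reviewers ask for later in the thread: the vectorizers then query a single cost hook instead of each special-casing i1 reductions.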

Harbormaster completed remote builds in B91073: Diff 326737. · Feb 26 2021, 11:01 AM

Rebase

Harbormaster completed remote builds in B91575: Diff 327456. · Mar 2 2021, 7:57 AM

llvm/lib/Transforms/Vectorize/ changes should not be there, the costmodel itself should be fixed.

In D97406#2597397, @lebedev.ri wrote:

llvm/lib/Transforms/Vectorize/ changes should not be there, the costmodel itself should be fixed.

So, do you suggest modifying getArithmeticReductionCost to return the correct cost for logical reductions?

In D97406#2597419, @ABataev wrote:

In D97406#2597397, @lebedev.ri wrote:

llvm/lib/Transforms/Vectorize/ changes should not be there, the costmodel itself should be fixed.

So, do you suggest modifying getArithmeticReductionCost to return the correct cost for logical reductions?

Something along those lines, yes.

Also, could you please split off the instcombine change? That is basically LG.

llvm/test/Transforms/InstCombine/vector-logical-reductions.ll
4–30 ↗	(On Diff #327456)	These tests are too complex, they should be just:

In D97406#2597430, @lebedev.ri wrote:

In D97406#2597419, @ABataev wrote:

In D97406#2597397, @lebedev.ri wrote:

llvm/lib/Transforms/Vectorize/ changes should not be there, the costmodel itself should be fixed.

So, do you suggest modifying getArithmeticReductionCost to return the correct cost for logical reductions?

Something along those lines, yes.

Also, could you please split off the instcombine change? That is basically LG.

Ok.

llvm/test/Transforms/InstCombine/vector-logical-reductions.ll
4–30 ↗	(On Diff #327456)	Will fix

Removed cost update

ABataev retitled this revision from [Vectorizers]Improve emission of logical or/and reductions. to [Instcombiner]Improve emission of logical or/and reductions. · Mar 2 2021, 8:59 AM

Harbormaster completed remote builds in B91590: Diff 327477. · Mar 2 2021, 9:00 AM

Hm, actually, are we sure this is universally true/better?
For aarch64 this seems to result in larger codegen: https://godbolt.org/z/hczeMo

In D97406#2597601, @lebedev.ri wrote:

Hm, actually, are we sure this is universally true/better?
For aarch64 this seems to result in larger codegen: https://godbolt.org/z/hczeMo

Is this the same as these?
https://bugs.llvm.org/show_bug.cgi?id=41636
https://bugs.llvm.org/show_bug.cgi?id=41635
https://bugs.llvm.org/show_bug.cgi?id=41634

In D97406#2597705, @RKSimon wrote:

In D97406#2597601, @lebedev.ri wrote:

Hm, actually, are we sure this is universally true/better?
For aarch64 this seems to result in larger codegen: https://godbolt.org/z/hczeMo

Is this the same as these?
https://bugs.llvm.org/show_bug.cgi?id=41636
https://bugs.llvm.org/show_bug.cgi?id=41635
https://bugs.llvm.org/show_bug.cgi?id=41634

Yes, looks so.

Okay, let's call this canonicalization.

This revision is now accepted and ready to land. · Mar 2 2021, 9:50 AM

This revision was landed with ongoing or failed builds. · Mar 4 2021, 8:01 AM

Closed by commit rG04ba80ca4dee: [Instcombiner]Improve emission of logical or/and reductions. (authored by ABataev).

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rG04ba80ca4dee: [Instcombiner]Improve emission of logical or/and reductions..

Diff 326167

llvm/lib/Transforms/Utils/LoopUtils.cpp

	Value *llvm::createSimpleTargetReduction(IRBuilderBase &Builder,			Value *llvm::createSimpleTargetReduction(IRBuilderBase &Builder,
	const TargetTransformInfo *TTI,			const TargetTransformInfo *TTI,
	Value *Src, RecurKind RdxKind,			Value *Src, RecurKind RdxKind,
	ArrayRef<Value *> RedOps) {			ArrayRef<Value *> RedOps) {
	TargetTransformInfo::ReductionFlags RdxFlags;			TargetTransformInfo::ReductionFlags RdxFlags;
	RdxFlags.IsMaxOp = RdxKind == RecurKind::SMax || RdxKind == RecurKind::UMax ||			RdxFlags.IsMaxOp = RdxKind == RecurKind::SMax || RdxKind == RecurKind::UMax ||
	RdxKind == RecurKind::FMax;			RdxKind == RecurKind::FMax;
	RdxFlags.IsSigned = RdxKind == RecurKind::SMax || RdxKind == RecurKind::SMin;			RdxFlags.IsSigned = RdxKind == RecurKind::SMax || RdxKind == RecurKind::SMin;
				// Special reductions for i1 or and and operations. No need to emit reductions
				// here, just x != <0, 0, .., 0> for reduction or and x == <1, 1, .., 1> for
				// reduction and.
	auto *SrcVecEltTy = cast<VectorType>(Src->getType())->getElementType();			auto *SrcVecEltTy = cast<VectorType>(Src->getType())->getElementType();
				if ((RdxKind == RecurKind::And || RdxKind == RecurKind::Or) &&
				sdesmalen: Can you add the condition that `&& isa<FixedVectorType>(Src)`? (same request for LoopVectorize.cpp and SLPVectorize.cpp) We're starting to make the LoopVectorizer vectorize for scalable VFs. This means we're currently fixing up cases like this where assumptions are made that are only valid for fixed-width vectors. For scalable vectors it might be possible to do the `<vscale x N x i1>` reduction as a compare on `<vscale x 1 x iN>`, but at least for SVE I know that we never want that.
				ABataev: Sure, will do.
				SrcVecEltTy == Builder.getInt1Ty()) {
				Value *Res = Builder.CreateBitCast(
				Src, Builder.getIntNTy(cast<VectorType>(Src->getType())
				->getElementCount()
				.getFixedValue()));
				sdesmalen: nit: `cast<FixedVectorType>(Src->getType())->getNumElements()`
				if (RdxKind == RecurKind::And) {
				Res = Builder.CreateICmpEQ(Res,
				ConstantInt::getAllOnesValue(Res->getType()));
				} else {
				assert(RdxKind == RecurKind::Or && "Expected or reduction.");
				Res = Builder.CreateIsNotNull(Res);
				}
				return Res;
				}

	switch (RdxKind) {			switch (RdxKind) {
	case RecurKind::Add:			case RecurKind::Add:
	return Builder.CreateAddReduce(Src);			return Builder.CreateAddReduce(Src);
	case RecurKind::Mul:			case RecurKind::Mul:
	return Builder.CreateMulReduce(Src);			return Builder.CreateMulReduce(Src);
	case RecurKind::And:			case RecurKind::And:
	return Builder.CreateAndReduce(Src);			return Builder.CreateAndReduce(Src);
	case RecurKind::Or:			case RecurKind::Or:

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


InstructionCost LoopVectorizationCostModel::getReductionPatternCost(
// the reduction on its own.		// the reduction on its own.
Instruction *LastChain = InLoopReductionImmediateChains[RetI];		Instruction *LastChain = InLoopReductionImmediateChains[RetI];
Instruction *ReductionPhi = LastChain;		Instruction *ReductionPhi = LastChain;
while (!isa<PHINode>(ReductionPhi))		while (!isa<PHINode>(ReductionPhi))
ReductionPhi = InLoopReductionImmediateChains[ReductionPhi];		ReductionPhi = InLoopReductionImmediateChains[ReductionPhi];

RecurrenceDescriptor RdxDesc =		RecurrenceDescriptor RdxDesc =
Legal->getReductionVars()[cast<PHINode>(ReductionPhi)];		Legal->getReductionVars()[cast<PHINode>(ReductionPhi)];
unsigned BaseCost = TTI.getArithmeticReductionCost(RdxDesc.getOpcode(),		unsigned BaseCost;
VectorTy, false, CostKind);		RecurKind RdxKind = RdxDesc.getRecurrenceKind();
		RdxDesc.getRecurrenceType();
		if ((RdxKind == RecurKind::Or || RdxKind == RecurKind::And) &&
		VectorTy->getElementType() ==
		IntegerType::getInt1Ty(VectorTy->getContext())) {
		// Or reduction for i1 is represented as:
		// %val = bitcast <ReduxWidth x i1> to iReduxWidth
		// %res = cmp ne iReduxWidth %val, 0
		// And reduction for i1 is represented as:
		// %val = bitcast <ReduxWidth x i1> to iReduxWidth
		// %res = cmp eq iReduxWidth %val, 11111
		Type *ValTy = IntegerType::get(VectorTy->getContext(),
		VectorTy->getElementCount().getFixedValue());
		BaseCost = TTI.getCastInstrCost(Instruction::BitCast, ValTy, VectorTy,
		TTI::CastContextHint::None,
		TTI::TCK_RecipThroughput) +
		TTI.getCmpSelInstrCost(Instruction::ICmp, ValTy,
		CmpInst::makeCmpResultType(ValTy));
		sdesmalen: Should this be part of BasicTTImpl::getArithmeticReductionCost? That way you don't have to implement it twice.
		} else {
		BaseCost =
		TTI.getArithmeticReductionCost(RdxDesc.getOpcode(), VectorTy,
		/*IsPairwiseForm=*/false, CostKind);
		}

// Get the operand that was not the reduction chain and match it to one of the		// Get the operand that was not the reduction chain and match it to one of the
// patterns, returning the better cost if it is found.		// patterns, returning the better cost if it is found.
Instruction *RedOp = RetI->getOperand(1) == LastChain		Instruction *RedOp = RetI->getOperand(1) == LastChain
? dyn_cast<Instruction>(RetI->getOperand(0))		? dyn_cast<Instruction>(RetI->getOperand(0))
: dyn_cast<Instruction>(RetI->getOperand(1));		: dyn_cast<Instruction>(RetI->getOperand(1));

VectorTy = VectorType::get(I->getOperand(0)->getType(), VectorTy);		VectorTy = VectorType::get(I->getOperand(0)->getType(), VectorTy);

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp


InstructionCost getReductionCost(TargetTransformInfo *TTI,
case RecurKind::Add:		case RecurKind::Add:
case RecurKind::Mul:		case RecurKind::Mul:
case RecurKind::Or:		case RecurKind::Or:
case RecurKind::And:		case RecurKind::And:
case RecurKind::Xor:		case RecurKind::Xor:
case RecurKind::FAdd:		case RecurKind::FAdd:
case RecurKind::FMul: {		case RecurKind::FMul: {
unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);		unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
		if ((RdxKind == RecurKind::Or || RdxKind == RecurKind::And) &&
		ScalarTy == IntegerType::getInt1Ty(FirstReducedVal->getContext())) {
		// Or reduction for i1 is represented as:
		// %val = bitcast <ReduxWidth x i1> to iReduxWidth
		// %res = cmp ne iReduxWidth %val, 0
		// And reduction for i1 is represented as:
		// %val = bitcast <ReduxWidth x i1> to iReduxWidth
		// %res = cmp eq iReduxWidth %val, 11111
		Type *ValTy =
		IntegerType::get(FirstReducedVal->getContext(), ReduxWidth);
		VectorCost = TTI->getCastInstrCost(Instruction::BitCast, ValTy,
		VectorTy, TTI::CastContextHint::None,
		TTI::TCK_RecipThroughput) +
		TTI->getCmpSelInstrCost(Instruction::ICmp, ValTy,
		CmpInst::makeCmpResultType(ValTy));
		} else {
VectorCost = TTI->getArithmeticReductionCost(RdxOpcode, VectorTy,		VectorCost = TTI->getArithmeticReductionCost(RdxOpcode, VectorTy,
/*IsPairwiseForm=*/false);		/*IsPairwiseForm=*/false);
		}
ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);		ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy);
break;		break;
}		}
case RecurKind::FMax:		case RecurKind::FMax:
case RecurKind::FMin: {		case RecurKind::FMin: {
auto *VecCondTy = cast<VectorType>(CmpInst::makeCmpResultType(VectorTy));		auto *VecCondTy = cast<VectorType>(CmpInst::makeCmpResultType(VectorTy));
VectorCost =		VectorCost =
TTI->getMinMaxReductionCost(VectorTy, VecCondTy,		TTI->getMinMaxReductionCost(VectorTy, VecCondTy,

llvm/test/Transforms/SLPVectorizer/X86/compare-reduce.ll

	; other candidates in the reduction because it does not have matching predicate			; other candidates in the reduction because it does not have matching predicate
	; and/or constant operand.			; and/or constant operand.

	define float @merge_anyof_v4f32_wrong_first(<4 x float> %x) {			define float @merge_anyof_v4f32_wrong_first(<4 x float> %x) {
	; CHECK-LABEL: @merge_anyof_v4f32_wrong_first(			; CHECK-LABEL: @merge_anyof_v4f32_wrong_first(
	; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[X:%.*]], i32 3			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[X:%.*]], i32 3
	; CHECK-NEXT: [[CMP3WRONG:%.*]] = fcmp olt float [[TMP1]], 4.200000e+01			; CHECK-NEXT: [[CMP3WRONG:%.*]] = fcmp olt float [[TMP1]], 4.200000e+01
	; CHECK-NEXT: [[TMP2:%.*]] = fcmp ogt <4 x float> [[X]], <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>			; CHECK-NEXT: [[TMP2:%.*]] = fcmp ogt <4 x float> [[X]], <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
	; CHECK-NEXT: [[TMP3:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP2]])			; CHECK-NEXT: [[TMP3:%.*]] = bitcast <4 x i1> [[TMP2]] to i4
	; CHECK-NEXT: [[TMP4:%.*]] = or i1 [[TMP3]], [[CMP3WRONG]]			; CHECK-NEXT: [[TMP4:%.*]] = icmp ne i4 [[TMP3]], 0
	; CHECK-NEXT: [[R:%.*]] = select i1 [[TMP4]], float -1.000000e+00, float 1.000000e+00			; CHECK-NEXT: [[TMP5:%.*]] = or i1 [[TMP4]], [[CMP3WRONG]]
				; CHECK-NEXT: [[R:%.*]] = select i1 [[TMP5]], float -1.000000e+00, float 1.000000e+00
	; CHECK-NEXT: ret float [[R]]			; CHECK-NEXT: ret float [[R]]
	;			;
	%x0 = extractelement <4 x float> %x, i32 0			%x0 = extractelement <4 x float> %x, i32 0
	%x1 = extractelement <4 x float> %x, i32 1			%x1 = extractelement <4 x float> %x, i32 1
	%x2 = extractelement <4 x float> %x, i32 2			%x2 = extractelement <4 x float> %x, i32 2
	%x3 = extractelement <4 x float> %x, i32 3			%x3 = extractelement <4 x float> %x, i32 3
	%cmp3wrong = fcmp olt float %x3, 42.0			%cmp3wrong = fcmp olt float %x3, 42.0
	%cmp0 = fcmp ogt float %x0, 1.0			%cmp0 = fcmp ogt float %x0, 1.0
	%cmp1 = fcmp ogt float %x1, 1.0			%cmp1 = fcmp ogt float %x1, 1.0
	%cmp2 = fcmp ogt float %x2, 1.0			%cmp2 = fcmp ogt float %x2, 1.0
	%cmp3 = fcmp ogt float %x3, 1.0			%cmp3 = fcmp ogt float %x3, 1.0
	%or03 = or i1 %cmp0, %cmp3wrong			%or03 = or i1 %cmp0, %cmp3wrong
	%or031 = or i1 %or03, %cmp1			%or031 = or i1 %or03, %cmp1
	%or0312 = or i1 %or031, %cmp2			%or0312 = or i1 %or031, %cmp2
	%or03123 = or i1 %or0312, %cmp3			%or03123 = or i1 %or0312, %cmp3
	%r = select i1 %or03123, float -1.0, float 1.0			%r = select i1 %or03123, float -1.0, float 1.0
	ret float %r			ret float %r
	}			}

	define float @merge_anyof_v4f32_wrong_last(<4 x float> %x) {			define float @merge_anyof_v4f32_wrong_last(<4 x float> %x) {
	; CHECK-LABEL: @merge_anyof_v4f32_wrong_last(			; CHECK-LABEL: @merge_anyof_v4f32_wrong_last(
	; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[X:%.*]], i32 3			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[X:%.*]], i32 3
	; CHECK-NEXT: [[CMP3WRONG:%.*]] = fcmp olt float [[TMP1]], 4.200000e+01			; CHECK-NEXT: [[CMP3WRONG:%.*]] = fcmp olt float [[TMP1]], 4.200000e+01
	; CHECK-NEXT: [[TMP2:%.*]] = fcmp ogt <4 x float> [[X]], <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>			; CHECK-NEXT: [[TMP2:%.*]] = fcmp ogt <4 x float> [[X]], <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
	; CHECK-NEXT: [[TMP3:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP2]])			; CHECK-NEXT: [[TMP3:%.*]] = bitcast <4 x i1> [[TMP2]] to i4
	; CHECK-NEXT: [[TMP4:%.*]] = or i1 [[TMP3]], [[CMP3WRONG]]			; CHECK-NEXT: [[TMP4:%.*]] = icmp ne i4 [[TMP3]], 0
	; CHECK-NEXT: [[R:%.*]] = select i1 [[TMP4]], float -1.000000e+00, float 1.000000e+00			; CHECK-NEXT: [[TMP5:%.*]] = or i1 [[TMP4]], [[CMP3WRONG]]
				; CHECK-NEXT: [[R:%.*]] = select i1 [[TMP5]], float -1.000000e+00, float 1.000000e+00
	; CHECK-NEXT: ret float [[R]]			; CHECK-NEXT: ret float [[R]]
	;			;
	%x0 = extractelement <4 x float> %x, i32 0			%x0 = extractelement <4 x float> %x, i32 0
	%x1 = extractelement <4 x float> %x, i32 1			%x1 = extractelement <4 x float> %x, i32 1
	%x2 = extractelement <4 x float> %x, i32 2			%x2 = extractelement <4 x float> %x, i32 2
	%x3 = extractelement <4 x float> %x, i32 3			%x3 = extractelement <4 x float> %x, i32 3
	%cmp3wrong = fcmp olt float %x3, 42.0			%cmp3wrong = fcmp olt float %x3, 42.0
	%cmp0 = fcmp ogt float %x0, 1.0			%cmp0 = fcmp ogt float %x0, 1.0
	%cmp1 = fcmp ogt float %x1, 1.0			%cmp1 = fcmp ogt float %x1, 1.0
	%cmp2 = fcmp ogt float %x2, 1.0			%cmp2 = fcmp ogt float %x2, 1.0
	%cmp3 = fcmp ogt float %x3, 1.0			%cmp3 = fcmp ogt float %x3, 1.0
	%or03 = or i1 %cmp0, %cmp3			%or03 = or i1 %cmp0, %cmp3
	%or031 = or i1 %or03, %cmp1			%or031 = or i1 %or03, %cmp1
	%or0312 = or i1 %or031, %cmp2			%or0312 = or i1 %or031, %cmp2
	%or03123 = or i1 %or0312, %cmp3wrong			%or03123 = or i1 %or0312, %cmp3wrong
	%r = select i1 %or03123, float -1.0, float 1.0			%r = select i1 %or03123, float -1.0, float 1.0
	ret float %r			ret float %r
	}			}

	define i32 @merge_anyof_v4i32_wrong_middle(<4 x i32> %x) {			define i32 @merge_anyof_v4i32_wrong_middle(<4 x i32> %x) {
	; CHECK-LABEL: @merge_anyof_v4i32_wrong_middle(			; CHECK-LABEL: @merge_anyof_v4i32_wrong_middle(
	; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x i32> [[X:%.*]], i32 3			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x i32> [[X:%.*]], i32 3
	; CHECK-NEXT: [[CMP3WRONG:%.*]] = icmp slt i32 [[TMP1]], 42			; CHECK-NEXT: [[CMP3WRONG:%.*]] = icmp slt i32 [[TMP1]], 42
	; CHECK-NEXT: [[TMP2:%.*]] = icmp sgt <4 x i32> [[X]], <i32 1, i32 1, i32 1, i32 1>			; CHECK-NEXT: [[TMP2:%.*]] = icmp sgt <4 x i32> [[X]], <i32 1, i32 1, i32 1, i32 1>
	; CHECK-NEXT: [[TMP3:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP2]])			; CHECK-NEXT: [[TMP3:%.*]] = bitcast <4 x i1> [[TMP2]] to i4
	; CHECK-NEXT: [[TMP4:%.*]] = or i1 [[TMP3]], [[CMP3WRONG]]			; CHECK-NEXT: [[TMP4:%.*]] = icmp ne i4 [[TMP3]], 0
	; CHECK-NEXT: [[R:%.*]] = select i1 [[TMP4]], i32 -1, i32 1			; CHECK-NEXT: [[TMP5:%.*]] = or i1 [[TMP4]], [[CMP3WRONG]]
				; CHECK-NEXT: [[R:%.*]] = select i1 [[TMP5]], i32 -1, i32 1
	; CHECK-NEXT: ret i32 [[R]]			; CHECK-NEXT: ret i32 [[R]]
	;			;
	%x0 = extractelement <4 x i32> %x, i32 0			%x0 = extractelement <4 x i32> %x, i32 0
	%x1 = extractelement <4 x i32> %x, i32 1			%x1 = extractelement <4 x i32> %x, i32 1
	%x2 = extractelement <4 x i32> %x, i32 2			%x2 = extractelement <4 x i32> %x, i32 2
	%x3 = extractelement <4 x i32> %x, i32 3			%x3 = extractelement <4 x i32> %x, i32 3
	%cmp3wrong = icmp slt i32 %x3, 42			%cmp3wrong = icmp slt i32 %x3, 42
	%cmp0 = icmp sgt i32 %x0, 1			%cmp0 = icmp sgt i32 %x0, 1
	Show All 12 Lines
	; ideal reduction groups all of the original 'sgt' ops together.			; ideal reduction groups all of the original 'sgt' ops together.

	define i32 @merge_anyof_v4i32_wrong_middle_better_rdx(<4 x i32> %x, <4 x i32> %y) {			define i32 @merge_anyof_v4i32_wrong_middle_better_rdx(<4 x i32> %x, <4 x i32> %y) {
	; CHECK-LABEL: @merge_anyof_v4i32_wrong_middle_better_rdx(			; CHECK-LABEL: @merge_anyof_v4i32_wrong_middle_better_rdx(
	; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x i32> [[Y:%.*]], i32 3			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x i32> [[Y:%.*]], i32 3
	; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[X:%.*]], i32 3			; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[X:%.*]], i32 3
	; CHECK-NEXT: [[CMP3WRONG:%.*]] = icmp slt i32 [[TMP2]], [[TMP1]]			; CHECK-NEXT: [[CMP3WRONG:%.*]] = icmp slt i32 [[TMP2]], [[TMP1]]
	; CHECK-NEXT: [[TMP3:%.*]] = icmp sgt <4 x i32> [[X]], [[Y]]			; CHECK-NEXT: [[TMP3:%.*]] = icmp sgt <4 x i32> [[X]], [[Y]]
	; CHECK-NEXT: [[TMP4:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP3]])			; CHECK-NEXT: [[TMP4:%.*]] = bitcast <4 x i1> [[TMP3]] to i4
	; CHECK-NEXT: [[TMP5:%.*]] = or i1 [[TMP4]], [[CMP3WRONG]]			; CHECK-NEXT: [[TMP5:%.*]] = icmp ne i4 [[TMP4]], 0
	; CHECK-NEXT: [[R:%.*]] = select i1 [[TMP5]], i32 -1, i32 1			; CHECK-NEXT: [[TMP6:%.*]] = or i1 [[TMP5]], [[CMP3WRONG]]
				; CHECK-NEXT: [[R:%.*]] = select i1 [[TMP6]], i32 -1, i32 1
	; CHECK-NEXT: ret i32 [[R]]			; CHECK-NEXT: ret i32 [[R]]
	;			;
	%x0 = extractelement <4 x i32> %x, i32 0			%x0 = extractelement <4 x i32> %x, i32 0
	%x1 = extractelement <4 x i32> %x, i32 1			%x1 = extractelement <4 x i32> %x, i32 1
	%x2 = extractelement <4 x i32> %x, i32 2			%x2 = extractelement <4 x i32> %x, i32 2
	%x3 = extractelement <4 x i32> %x, i32 3			%x3 = extractelement <4 x i32> %x, i32 3
	%y0 = extractelement <4 x i32> %y, i32 0			%y0 = extractelement <4 x i32> %y, i32 0
	%y1 = extractelement <4 x i32> %y, i32 1			%y1 = extractelement <4 x i32> %y, i32 1
	Show All 14 Lines
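What the updated CHECK lines of `merge_anyof_v4i32_wrong_middle_better_rdx` compute can be sketched in plain C++. This is only a model of the checked IR, not of the elided scalar test body: the function names `anyOfPacked` and `anyOfLanewise` and the `Vec4` alias are hypothetical, and a `std::uint8_t` stands in for the i4 value.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<std::int32_t, 4>;

// Mirrors the updated CHECK lines: compare all four lanes, treat the four
// i1 results as a 4-bit integer, and test it against zero.
std::int32_t anyOfPacked(const Vec4 &x, const Vec4 &y) {
  const bool cmp3wrong = x[3] < y[3]; // icmp slt, kept scalar
  std::uint8_t mask = 0;              // stands in for the i4 value
  for (unsigned i = 0; i < 4; ++i)
    if (x[i] > y[i])                  // icmp sgt <4 x i32>
      mask = static_cast<std::uint8_t>(mask | (1u << i));
  const bool any = mask != 0;         // icmp ne i4, 0
  return (any || cmp3wrong) ? -1 : 1; // or i1 + select
}

// Reference: the same result computed lane by lane, without packing.
std::int32_t anyOfLanewise(const Vec4 &x, const Vec4 &y) {
  bool any = x[3] < y[3];
  for (unsigned i = 0; i < 4; ++i)
    any = any || (x[i] > y[i]);
  return any ? -1 : 1;
}
```

Both functions agree on every input, which is the property the bitcast-and-compare form relies on when it replaces the `@llvm.vector.reduce.or.v4i1` call.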

This is an archive of the discontinued LLVM Phabricator instance.

[Instcombiner]Improve emission of logical or/and reductions.
ClosedPublic
