
[X86] PR34149 Suboptimal codegen for fast minnum and maxnum.
Needs ReviewPublic

Authored by jbhateja on Sep 8 2017, 2:37 AM.

Details

Summary

Perform DAG combining of FMINNUM/FMAXNUM based on fast-math flags, which are propagated from the IR Instruction to the SDNode.

Event Timeline

jbhateja created this revision.Sep 8 2017, 2:37 AM
jbhateja added a subscriber: llvm-commits.
spatel edited edge metadata.Sep 8 2017, 3:14 PM

The way you're transferring the flags from the IR to the node works for this particular case, but I'd prefer that we clean that up as a preliminary step for this patch if possible. The existing flags transfer code is limited because we started with SDNodeFlags only on binary operators. So there's a different blob of code to handle flags in SelectionDAGBuilder::visitBinary(). Can we unify that? Is it possible to handle this for all opcodes/nodes in SelectionDAGBuilder::visit()?

spatel added a comment.Sep 8 2017, 3:40 PM

I might be missing some context here. If we have fast/nnan on these calls, then can't we simplify this in IR to fcmp+select and not have to deal with this in the backend? The intrinsics only exist to make sure that NaN behavior in IR meets the higher level standards, so if we have nnan, then we don't need the intrinsic?
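To make the NaN point concrete, here is a minimal Python model (helper names are hypothetical, not LLVM APIs) of the two semantics under discussion: llvm.minnum follows libm fmin and ignores a lone NaN operand, while a plain ordered compare plus select can let the NaN through. Under nnan, NaN inputs are assumed absent, so the two forms coincide and the fcmp+select form becomes safe.

```python
import math

# Hypothetical helpers modelling the two semantics; not LLVM code.

def minnum(x, y):
    # llvm.minnum / libm fmin semantics: if exactly one operand is
    # NaN, the other (numeric) operand is returned.
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return min(x, y)

def cmp_select_olt(x, y):
    # Plain 'fcmp olt' + select: (x < y) ? x : y. An ordered compare
    # involving NaN is false, so a NaN in y leaks through the select.
    return x if x < y else y

nan = float("nan")
print(minnum(2.0, nan))          # 2.0 -- the NaN operand is ignored
print(cmp_select_olt(2.0, nan))  # nan -- compare is false, select picks y
print(minnum(1.0, 2.0), cmp_select_olt(1.0, 2.0))  # 1.0 1.0 -- agree off NaN
```

On non-NaN inputs the two functions always agree, which is exactly why nnan licenses the IR-level rewrite spatel is proposing.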

jbhateja updated this revision to Diff 114491.Sep 9 2017, 9:54 AM
  • Consolidating Instruction->SDNode Flags propagation in one class.

> I might be missing some context here. If we have fast/nnan on these calls, then can't we simplify this in IR to fcmp+select and not have to deal with this in the backend? The intrinsics only exist to make sure that NaN behavior in IR meets the higher level standards, so if we have nnan, then we don't need the intrinsic?

Intrinsic functions defer code generation/expansion to the backend; this gives the backend control over generating efficient code for the specific target. As of now, SelectionDAGBuilder boils an intrinsic call down to fminnum (or fminnan if the target supports it and nnan is true), which is then handled differently by different targets.


> Intrinsic functions defer code generation/expansion to the backend; this gives the backend control over generating efficient code for the specific target.

It's incorrect that intrinsics are passed unaltered to the backend for expansion/optimization. See the optimizations for both generic and target-specific intrinsics in InstCombiner::visitCallInst().

Again, I may be missing some context - who created this IR? Creating a 'call fast llvm.maxnum()' just doesn't make sense to me, so if we can fix that in IR, we should do that. The intrinsic inhibits the large number of potential optimizations for fcmp+select that we have in IR. No target should benefit from having extra NaN semantics requirements provided by the intrinsic that are then overridden by FMF.

Please split the FlagsAcquirer diff into a separate patch.


Your point that 'call fast @llvm.minnum' should not have been generated in the first place, and that fcmp+select should have been generated instead, is valid. But it is still syntactically and semantically valid IR that could be thrown at the backend.

Please consider the following case, which is being compiled for ARM (-mcpu=cortex-r52 -march=arm test.ll -mattr=fp-armv8):

define <4 x double> @CASEA(<4 x double> %x, <4 x double> %y) {
  %z = call fast <4 x double> @llvm.minnum.v4f64(<4 x double> %x, <4 x double> %y) readnone
  ret <4 x double> %z
}

define <4 x double> @CASEB(<4 x double> %x, <4 x double> %y) {
  %c = fcmp ule <4 x double> %x, %y
  %z = select <4 x i1> %c, <4 x double> %x, <4 x double> %y
  ret <4 x double> %z
}

The instruction selector does not generate vminnm for CASEB, whereas it does for CASEA. As of now, SelectionDAGBuilder generates an fminnum (or fminnan) SDNode for the llvm.minnum intrinsic, and different targets then lower fminnum differently.

So handling fast-math with the intrinsic here should be fine?

Otherwise, I can put the Acquirer in a separate patch.
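The CASEA/CASEB codegen gap has a semantic cause, sketched below in a minimal Python model (helper names are hypothetical, not LLVM APIs): 'fcmp ule' is an unordered-or-less-or-equal compare, so the select prefers %x whenever an operand is NaN, while vminnm implements IEEE-754 minNum and never prefers the NaN. Without nnan/'fast' flags, the instruction selector therefore cannot legally match CASEB to vminnm.

```python
import math

# Hypothetical helpers modelling the semantics; not LLVM's implementation.

def fcmp_ule(x, y):
    # LLVM 'fcmp ule': true if unordered (either operand is NaN)
    # or x <= y.
    return math.isnan(x) or math.isnan(y) or x <= y

def caseb_select(x, y):
    # The CASEB pattern: select(fcmp ule x, y), x, y.
    return x if fcmp_ule(x, y) else y

def vminnm(x, y):
    # ARM vminnm implements IEEE-754 minNum: a single NaN operand
    # loses to the numeric operand.
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return min(x, y)

nan = float("nan")
print(caseb_select(nan, 2.0))  # nan -- ule is true when unordered, so %x wins
print(vminnm(nan, 2.0))        # 2.0 -- minNum never prefers the NaN
print(caseb_select(1.0, 2.0), vminnm(1.0, 2.0))  # 1.0 1.0 -- agree off NaN
```

The two only diverge on NaN inputs, which is why a 'fast' (or nnan) flag on the fcmp is exactly what would let the backend use vminnm for CASEB.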


> Your point that 'call fast @llvm.minnum' should not have been generated in the first place, and that fcmp+select should have been generated instead, is valid. But it is still syntactically and semantically valid IR that could be thrown at the backend.

I don't see the point of this argument. There are an infinite number of valid IR patterns that we can send to the backend, but the main point of the IR optimizer is to limit that set, so we don't have to increase the complexity of the backend. That's what I'm requesting here: solve this in IR, so the backend never has to worry about it. I'll ask again: who is producing this IR or these nodes? If I'm missing some scenario in which this pattern can be created in the backend, then I agree that we will have to handle it there (but still not the target-specific way you've proposed). If not, let's not add unnecessary code.


I acknowledge there may be other bugs here, but I think the ARM backend is behaving correctly for the example as shown (cc'ing @efriedma for expertise).

If we want to create equivalent patterns for these two examples, then we must add 'fast' to the fcmp in the second case:

%c = fcmp fast ule <4 x double> %x, %y

If we do that, we see that the ARM backend is still behaving correctly and optimally to produce vminnm:

$ ./llc -o - minnum.ll -mcpu=cortex-r52 -march=arm -mattr=fp-armv8 | egrep 'CASE|minnm'
_CASEA:
vminnm.f64 d25, d22, d17
vminnm.f64 d24, d23, d16
vminnm.f64 d17, d21, d19
vminnm.f64 d16, d20, d18
_CASEB:
vminnm.f64 d25, d22, d17
vminnm.f64 d24, d23, d16
vminnm.f64 d17, d21, d19
vminnm.f64 d16, d20, d18

asbirlea edited edge metadata.Sep 12 2017, 2:39 PM

FWIW, we've seen a similar case of min/maxnum being generated in the IR and leading to suboptimal codegen.
The solution was to replace the generated IR with fcmp+select (coupled with D27846). That seems like the right approach to avoid adding complexity to the backend.

RKSimon resigned from this revision.Sep 29 2017, 3:18 AM