This is an archive of the discontinued LLVM Phabricator instance.

[x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475)
ClosedPublic

Authored by spatel on Dec 7 2015, 10:43 AM.

Details

Reviewers

qcolombet
scanon
andreadb
kbsmith1
zansari

Commits

rG271efcdf209f: [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475)
rL255700: [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475)

Summary

This patch implements the suggested codegen from PR24475:
https://llvm.org/bugs/show_bug.cgi?id=24475

but only for the fmaxf() case to start, so we can sort out any bugs before extending to fmin, f64, and vectors.

The fmax / maxnum definitions provide us flexibility for signed zeros, so I hope the only thing we have to worry about in this replacement sequence is NaN handling.
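The two concerns named above can be made concrete with a small scalar model. This is only a sketch: `maxss_model` is a hypothetical helper (not part of the patch) that assumes the x86 behavior of maxss, which returns its second source operand whenever the first is not strictly greater.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical scalar model of maxss(a, b): the second operand is returned
// unless a > b. Comparisons with NaN are false, and +0.0 > -0.0 is false,
// so NaNs and equal-valued zeros both fall through to b.
static float maxss_model(float a, float b) { return a > b ? a : b; }

// Returns true if the model shows the operand-order asymmetry that makes a
// bare maxss unsuitable as fmaxf: a NaN in the second slot leaks out.
static bool nan_handling_is_asymmetric() {
  const float nan = std::numeric_limits<float>::quiet_NaN();
  bool nan_first_ok = maxss_model(nan, 1.0f) == 1.0f;    // number survives
  bool nan_second_bad = std::isnan(maxss_model(1.0f, nan));  // NaN survives
  return nan_first_ok && nan_second_bad;
}

// Signed zeros: the result is whichever zero sits in the second slot, which
// fmax / maxnum are allowed to return.
static bool zero_follows_second_operand() {
  return std::signbit(maxss_model(0.0f, -0.0f)) &&
         !std::signbit(maxss_model(-0.0f, 0.0f));
}
```

Under this model the signed-zero result is merely order-dependent, while the NaN result is wrong for one operand order, which is why only the NaN case needs a fix-up.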

Note 1: I initially implemented this as lowerFMAXNUM(), but that exposes a problem: SelectionDAGBuilder::visitSelect() transforms compare/select instructions into FMAXNUM nodes if we declare FMAXNUM legal or custom. Perhaps this should be checking for NaN inputs or global unsafe-math before transforming? As it stands, this bypasses a big set of optimizations that the x86 backend already has in PerformSELECTCombine(). I don't know what the tradeoffs are for making a 'combine' rather than a 'lower'. If a 'lower' is preferred, we will need to fix that problem.

Note 2: The v2f32 test reveals another bug; the vector is extended to v4f32, so we have completely unnecessary operations happening on undef elements of the vector.

Event Timeline

spatel updated this revision to Diff 42078. Dec 7 2015, 10:43 AM

spatel retitled this revision to [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475).

spatel updated this object.

spatel added reviewers: scanon, qcolombet, jmolloy.

spatel added a subscriber: llvm-commits.

As this primarily involves the x86 backend, I don't think I'm an appropriate reviewer; Resigning.

In D15294#309246, @jmolloy wrote:

As this primarily involves the x86 backend, I don't think I'm an appropriate reviewer; Resigning.

Thanks, James. I thought you might have some feedback on the SelectionDAGBuilder question, although that is separate from the functionality of this patch as it stands.

Adding some more potential x86 reviewers.

Hi Sanjay,

Quick high-level question: wouldn't it be better to pull the intermediate value out of the fmax to reduce the dependence chain?

So, instead of:

cond = isnan(op0)
V = select (op1, op0, cond)
Result = FMAX(op1, V)

do this:

t = FMAX(op1, op0)
cond = isnan(op0)
Result = select (op1, t, cond)

This way, we go from isnan->select->fmax to fmax/isnan -> select.
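A quick way to convince oneself that the two orderings agree is a scalar model. The sketch below is not from the patch: `maxss_model`, `maxnum_select_first`, and `maxnum_max_first` are hypothetical names, and the model assumes FMAX returns its second operand whenever the first is not strictly greater (so a NaN in either slot yields the second operand).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical scalar stand-in for FMAX / maxss.
static float maxss_model(float a, float b) { return a > b ? a : b; }

// Original ordering: isnan -> select -> fmax (three dependent steps).
static float maxnum_select_first(float op0, float op1) {
  bool cond = std::isnan(op0);
  float v = cond ? op1 : op0;
  return maxss_model(op1, v);
}

// Suggested ordering: fmax and isnan are independent; the select joins them.
static float maxnum_max_first(float op0, float op1) {
  float t = maxss_model(op1, op0);
  bool cond = std::isnan(op0);
  return cond ? op1 : t;
}

// Treat any two NaNs as the same result when comparing.
static bool same_result(float a, float b) {
  return (std::isnan(a) && std::isnan(b)) || a == b;
}

// Compare both orderings over every pairing of NaN and a few numbers.
static bool orderings_agree() {
  const float nan = std::numeric_limits<float>::quiet_NaN();
  const float vals[] = {nan, -1.0f, 0.0f, 2.5f};
  for (float x : vals)
    for (float y : vals)
      if (!same_result(maxnum_select_first(x, y), maxnum_max_first(x, y)))
        return false;
  return true;
}
```

In the second form the max and the NaN test do not depend on each other, so only the final select waits on both.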

In D15294#310167, @zansari wrote:

Quick high-level question: wouldn't it be better to pull the intermediate value out of the fmax to reduce the dependence chain?

Yes, that would be better.

Because I'm SSE dyslexic, I altered the test program in https://llvm.org/bugs/show_bug.cgi?id=24475 to check, and this is what I came up with:

__m128 maxnum = _mm_max_ss(v2, v1);
__m128 isnan1 = _mm_cmpunord_ss(v1, v1);
maxnum = _mm_blendv_ps(maxnum, v2, isnan1);

Which compiles to (AT&T syntax - should invert the dyslexia, but I still can't get it right):

vmaxss        %xmm0, %xmm1, %xmm2         <--- if either input is NaN, xmm0 (v1) is returned
vcmpunordss   %xmm0, %xmm0, %xmm0
vblendvps     %xmm0, %xmm1, %xmm2, %xmm0  <--- if xmm0 (v1) is NaN, output xmm1 (v2); if not, output max or v1

I'll translate that to LLVM and update the patch. Thanks!
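For reference, `_mm_blendv_ps` needs SSE4.1. A sketch of the same max-first sequence restricted to SSE1 intrinsics is shown here, using the and / andnot / or masking idiom that the SSE CHECK lines in the test rely on. The function name `maxnum_sse1` and the `HAVE_SSE1` macro are assumptions made for illustration, and the scalar fallback exists only so the sketch builds on targets without SSE.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE1 1
#endif

// Hypothetical helper: fmaxf-style result for (op0, op1) using SSE1 only.
static float maxnum_sse1(float op0, float op1) {
#ifdef HAVE_SSE1
  __m128 v1 = _mm_set_ss(op0);
  __m128 v2 = _mm_set_ss(op1);
  // maxss returns its second operand (v1) if either input is NaN.
  __m128 maxnum = _mm_max_ss(v2, v1);
  // Lane 0 is all-ones when v1 is NaN, all-zeros otherwise.
  __m128 isnan1 = _mm_cmpunord_ss(v1, v1);
  // Mask-based blend: (isnan1 & v2) | (~isnan1 & maxnum).
  __m128 blended =
      _mm_or_ps(_mm_and_ps(isnan1, v2), _mm_andnot_ps(isnan1, maxnum));
  return _mm_cvtss_f32(blended);
#else
  // Scalar fallback with the same semantics for non-SSE targets.
  float t = op1 > op0 ? op1 : op0;
  return std::isnan(op0) ? op1 : t;
#endif
}
```

Only lane 0 is meaningful here; the upper lanes of the intermediate vectors are ignored by `_mm_cvtss_f32`.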

Patch updated:

Implement more efficient sequence suggested by Zia.
Fix checks in the test to match.
Add FIXME comments for other fmax tests that aren't handled yet.

I'd disable the expansion under minsize, at least. Otherwise, lgtm.

LGTM

This revision is now accepted and ready to land. Dec 15 2015, 12:40 PM

In D15294#311164, @zansari wrote:

I'd disable the expansion under minsize, at least. Otherwise, lgtm.

Thanks! I'll add that check and a test case and get this checked in.

Closed by commit rL255700: [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475) (authored by spatel). Dec 15 2015, 3:14 PM

This revision was automatically updated to reflect the committed changes.

pengfei mentioned this in D145634: [X86] Support llvm.{min,max}imum.f{16,32,64}. Mar 12 2023, 3:08 AM

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	59 lines
test/CodeGen/X86/fmaxnum.ll	252 lines

Diff 42078

lib/Target/X86/X86ISelLowering.cpp

setTargetDAGCombine(ISD::SRL);		setTargetDAGCombine(ISD::SRL);
setTargetDAGCombine(ISD::OR);		setTargetDAGCombine(ISD::OR);
setTargetDAGCombine(ISD::AND);		setTargetDAGCombine(ISD::AND);
setTargetDAGCombine(ISD::ADD);		setTargetDAGCombine(ISD::ADD);
setTargetDAGCombine(ISD::FADD);		setTargetDAGCombine(ISD::FADD);
setTargetDAGCombine(ISD::FSUB);		setTargetDAGCombine(ISD::FSUB);
setTargetDAGCombine(ISD::FNEG);		setTargetDAGCombine(ISD::FNEG);
setTargetDAGCombine(ISD::FMA);		setTargetDAGCombine(ISD::FMA);
		setTargetDAGCombine(ISD::FMAXNUM);
setTargetDAGCombine(ISD::SUB);		setTargetDAGCombine(ISD::SUB);
setTargetDAGCombine(ISD::LOAD);		setTargetDAGCombine(ISD::LOAD);
setTargetDAGCombine(ISD::MLOAD);		setTargetDAGCombine(ISD::MLOAD);
setTargetDAGCombine(ISD::STORE);		setTargetDAGCombine(ISD::STORE);
setTargetDAGCombine(ISD::MSTORE);		setTargetDAGCombine(ISD::MSTORE);
setTargetDAGCombine(ISD::TRUNCATE);		setTargetDAGCombine(ISD::TRUNCATE);
setTargetDAGCombine(ISD::ZERO_EXTEND);		setTargetDAGCombine(ISD::ZERO_EXTEND);
setTargetDAGCombine(ISD::ANY_EXTEND);		setTargetDAGCombine(ISD::ANY_EXTEND);
switch (N->getOpcode()) {
case X86ISD::FMIN: NewOp = X86ISD::FMINC; break;		case X86ISD::FMIN: NewOp = X86ISD::FMINC; break;
case X86ISD::FMAX: NewOp = X86ISD::FMAXC; break;		case X86ISD::FMAX: NewOp = X86ISD::FMAXC; break;
}		}

return DAG.getNode(NewOp, SDLoc(N), N->getValueType(0),		return DAG.getNode(NewOp, SDLoc(N), N->getValueType(0),
N->getOperand(0), N->getOperand(1));		N->getOperand(0), N->getOperand(1));
}		}

		static SDValue performFMaxNumCombine(SDNode *N, SelectionDAG &DAG,
		const X86Subtarget *Subtarget) {
		EVT VT = N->getValueType(0);

		// TODO: Check for global or instruction-level "nnan". In that case, we
		// should be able to lower to FMAX/FMIN alone.
		// TODO: If an operand is already known to be a NaN or not a NaN, this
		// should be an optional swap and FMAX/FMIN.
		// TODO: Allow f64, vectors, and fminnum.

		if (VT != MVT::f32 || !Subtarget->hasSSE1() || Subtarget->useSoftFloat())
		return SDValue();

		SDValue Op0 = N->getOperand(0);
		SDValue Op1 = N->getOperand(1);
		SDLoc DL(N);
		const TargetLowering &TLI = DAG.getTargetLoweringInfo();
		EVT SetCCType = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(),
		VT);

		// There are 4 possibilities involving NaN inputs, and these are the required
		// outputs:
		//                   Op1
		//               Num     NaN
		//            ----------------
		//       Num  |  Max  |  Op0 |
		// Op0        ----------------
		//       NaN  |  Op1  |  NaN |
		//            ----------------
		//
		// The SSE FP max/min instructions were not designed for this case, but rather
		// to implement:
		// max = op1 > op2 ? op1 : op2
		//
		// So they always return op2 if either input is a NaN. However, we can still
		// use those instructions for fmaxnum by selecting away a NaN input.
		//
		// 1. If the first operand is a NaN, calculate the max of the second operand
		// against itself, so return the second operand.
		// 2. If the second operand is a NaN, return the first operand (if it's not a
		// NaN too).
		// 3. If both operands are NaN, return the second operand's NaN value because
		// Op1 was selected.
		// 4. If neither operand is a NaN, calculate the max of the first operand and
		// the second operand.

		// Is the first operand a NaN?
		SDValue IsOp0Nan = DAG.getSetCC(DL, SetCCType , Op0, Op0, ISD::SETUO);

		// If the first operand is not a NaN, then pass it through. Otherwise, choose
		// the second operand.
		SDValue Op0NotNan = DAG.getNode(ISD::SELECT, DL, VT, IsOp0Nan, Op1, Op0);

		return DAG.getNode(X86ISD::FMAX, DL, VT, Op1, Op0NotNan);
		}


/// Do target-specific dag combines on X86ISD::FAND nodes.		/// Do target-specific dag combines on X86ISD::FAND nodes.
static SDValue PerformFANDCombine(SDNode *N, SelectionDAG &DAG,		static SDValue PerformFANDCombine(SDNode *N, SelectionDAG &DAG,
const X86Subtarget *Subtarget) {		const X86Subtarget *Subtarget) {
// FAND(0.0, x) -> 0.0		// FAND(0.0, x) -> 0.0
if (ConstantFPSDNode *C = dyn_cast<ConstantFPSDNode>(N->getOperand(0)))		if (ConstantFPSDNode *C = dyn_cast<ConstantFPSDNode>(N->getOperand(0)))
if (C->getValueAPF().isPosZero())		if (C->getValueAPF().isPosZero())
return N->getOperand(0);		return N->getOperand(0);

SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
case ISD::FADD: return PerformFADDCombine(N, DAG, Subtarget);		case ISD::FADD: return PerformFADDCombine(N, DAG, Subtarget);
case ISD::FSUB: return PerformFSUBCombine(N, DAG, Subtarget);		case ISD::FSUB: return PerformFSUBCombine(N, DAG, Subtarget);
case ISD::FNEG: return PerformFNEGCombine(N, DAG, Subtarget);		case ISD::FNEG: return PerformFNEGCombine(N, DAG, Subtarget);
case ISD::TRUNCATE: return PerformTRUNCATECombine(N, DAG, Subtarget);		case ISD::TRUNCATE: return PerformTRUNCATECombine(N, DAG, Subtarget);
case X86ISD::FXOR:		case X86ISD::FXOR:
case X86ISD::FOR: return PerformFORCombine(N, DAG, Subtarget);		case X86ISD::FOR: return PerformFORCombine(N, DAG, Subtarget);
case X86ISD::FMIN:		case X86ISD::FMIN:
case X86ISD::FMAX: return PerformFMinFMaxCombine(N, DAG);		case X86ISD::FMAX: return PerformFMinFMaxCombine(N, DAG);
		case ISD::FMAXNUM: return performFMaxNumCombine(N, DAG, Subtarget);
case X86ISD::FAND: return PerformFANDCombine(N, DAG, Subtarget);		case X86ISD::FAND: return PerformFANDCombine(N, DAG, Subtarget);
case X86ISD::FANDN: return PerformFANDNCombine(N, DAG, Subtarget);		case X86ISD::FANDN: return PerformFANDNCombine(N, DAG, Subtarget);
case X86ISD::BT: return PerformBTCombine(N, DAG, DCI);		case X86ISD::BT: return PerformBTCombine(N, DAG, DCI);
case X86ISD::VZEXT_MOVL: return PerformVZEXT_MOVLCombine(N, DAG);		case X86ISD::VZEXT_MOVL: return PerformVZEXT_MOVLCombine(N, DAG);
case ISD::ANY_EXTEND:		case ISD::ANY_EXTEND:
case ISD::ZERO_EXTEND: return PerformZExtCombine(N, DAG, DCI, Subtarget);		case ISD::ZERO_EXTEND: return PerformZExtCombine(N, DAG, DCI, Subtarget);
case ISD::SIGN_EXTEND: return PerformSExtCombine(N, DAG, DCI, Subtarget);		case ISD::SIGN_EXTEND: return PerformSExtCombine(N, DAG, DCI, Subtarget);
case ISD::SIGN_EXTEND_INREG:		case ISD::SIGN_EXTEND_INREG:

test/CodeGen/X86/fmaxnum.ll

	declare <2 x float> @llvm.maxnum.v2f32(<2 x float>, <2 x float>)			declare <2 x float> @llvm.maxnum.v2f32(<2 x float>, <2 x float>)
	declare <4 x float> @llvm.maxnum.v4f32(<4 x float>, <4 x float>)			declare <4 x float> @llvm.maxnum.v4f32(<4 x float>, <4 x float>)
	declare <2 x double> @llvm.maxnum.v2f64(<2 x double>, <2 x double>)			declare <2 x double> @llvm.maxnum.v2f64(<2 x double>, <2 x double>)
	declare <4 x double> @llvm.maxnum.v4f64(<4 x double>, <4 x double>)			declare <4 x double> @llvm.maxnum.v4f64(<4 x double>, <4 x double>)
	declare <8 x double> @llvm.maxnum.v8f64(<8 x double>, <8 x double>)			declare <8 x double> @llvm.maxnum.v8f64(<8 x double>, <8 x double>)


	; CHECK-LABEL: @test_fmaxf			; CHECK-LABEL: @test_fmaxf
	; CHECK: jmp fmaxf			; SSE: movaps %xmm0, %xmm2
				; SSE-NEXT: cmpunordss %xmm2, %xmm2
				; SSE-NEXT: movaps %xmm2, %xmm3
				; SSE-NEXT: andps %xmm1, %xmm3
				; SSE-NEXT: andnps %xmm0, %xmm2
				; SSE-NEXT: orps %xmm3, %xmm2
				; SSE-NEXT: maxss %xmm2, %xmm1
				; SSE-NEXT: movaps %xmm1, %xmm0
				; SSE-NEXT: retq
				;
				; AVX: vcmpunordss %xmm0, %xmm0, %xmm2
				; AVX-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vmaxss %xmm0, %xmm1, %xmm0
				; AVX-NEXT: retq
	define float @test_fmaxf(float %x, float %y) {			define float @test_fmaxf(float %x, float %y) {
	%z = call float @fmaxf(float %x, float %y) readnone			%z = call float @fmaxf(float %x, float %y) readnone
	ret float %z			ret float %z
	}			}

	; CHECK-LABEL: @test_fmax			; CHECK-LABEL: @test_fmax
	; CHECK: jmp fmax			; CHECK: jmp fmax
	define double @test_fmax(double %x, double %y) {			define double @test_fmax(double %x, double %y) {
	%z = call double @fmax(double %x, double %y) readnone			%z = call double @fmax(double %x, double %y) readnone
	ret double %z			ret double %z
	}			}

	; CHECK-LABEL: @test_fmaxl			; CHECK-LABEL: @test_fmaxl
	; CHECK: callq fmaxl			; CHECK: callq fmaxl
	define x86_fp80 @test_fmaxl(x86_fp80 %x, x86_fp80 %y) {			define x86_fp80 @test_fmaxl(x86_fp80 %x, x86_fp80 %y) {
	%z = call x86_fp80 @fmaxl(x86_fp80 %x, x86_fp80 %y) readnone			%z = call x86_fp80 @fmaxl(x86_fp80 %x, x86_fp80 %y) readnone
	ret x86_fp80 %z			ret x86_fp80 %z
	}			}

	; CHECK-LABEL: @test_intrinsic_fmaxf			; CHECK-LABEL: @test_intrinsic_fmaxf
	; CHECK: jmp fmaxf			; SSE: movaps %xmm0, %xmm2
				; SSE-NEXT: cmpunordss %xmm2, %xmm2
				; SSE-NEXT: movaps %xmm2, %xmm3
				; SSE-NEXT: andps %xmm1, %xmm3
				; SSE-NEXT: andnps %xmm0, %xmm2
				; SSE-NEXT: orps %xmm3, %xmm2
				; SSE-NEXT: maxss %xmm2, %xmm1
				; SSE-NEXT: movaps %xmm1, %xmm0
				; SSE-NEXT: retq
				;
				; AVX: vcmpunordss %xmm0, %xmm0, %xmm2
				; AVX-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vmaxss %xmm0, %xmm1, %xmm0
				; AVX-NEXT: retq
	define float @test_intrinsic_fmaxf(float %x, float %y) {			define float @test_intrinsic_fmaxf(float %x, float %y) {
	%z = call float @llvm.maxnum.f32(float %x, float %y) readnone			%z = call float @llvm.maxnum.f32(float %x, float %y) readnone
	ret float %z			ret float %z
	}			}

	; CHECK-LABEL: @test_intrinsic_fmax			; CHECK-LABEL: @test_intrinsic_fmax
	; CHECK: jmp fmax			; CHECK: jmp fmax
	define double @test_intrinsic_fmax(double %x, double %y) {			define double @test_intrinsic_fmax(double %x, double %y) {
	%z = call double @llvm.maxnum.f64(double %x, double %y) readnone			%z = call double @llvm.maxnum.f64(double %x, double %y) readnone
	ret double %z			ret double %z
	}			}

	; CHECK-LABEL: @test_intrinsic_fmaxl			; CHECK-LABEL: @test_intrinsic_fmaxl
	; CHECK: callq fmaxl			; CHECK: callq fmaxl
	define x86_fp80 @test_intrinsic_fmaxl(x86_fp80 %x, x86_fp80 %y) {			define x86_fp80 @test_intrinsic_fmaxl(x86_fp80 %x, x86_fp80 %y) {
	%z = call x86_fp80 @llvm.maxnum.f80(x86_fp80 %x, x86_fp80 %y) readnone			%z = call x86_fp80 @llvm.maxnum.f80(x86_fp80 %x, x86_fp80 %y) readnone
	ret x86_fp80 %z			ret x86_fp80 %z
	}			}

	; CHECK-LABEL: @test_intrinsic_fmax_v2f32			; CHECK-LABEL: @test_intrinsic_fmax_v2f32
	; SSE: movaps %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill			; SSE: movaps %xmm1, %xmm2
	; SSE-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill			; SSE-NEXT: shufps {{.*#+}} xmm2 = xmm2[3,1,2,3]
	; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,1,2,3]			; SSE-NEXT: movaps %xmm0, %xmm3
	; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[3,1,2,3]			; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[3,1,2,3]
	; SSE-NEXT: callq fmaxf			; SSE-NEXT: movaps %xmm3, %xmm4
	; SSE-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill			; SSE-NEXT: cmpunordss %xmm4, %xmm4
	; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload			; SSE-NEXT: movaps %xmm4, %xmm5
	; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1,2,3]			; SSE-NEXT: andps %xmm2, %xmm5
	; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload			; SSE-NEXT: andnps %xmm3, %xmm4
	; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1,2,3]			; SSE-NEXT: orps %xmm5, %xmm4
	; SSE-NEXT: callq fmaxf			; SSE-NEXT: maxss %xmm4, %xmm2
	; SSE-NEXT: unpcklps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload			; SSE-NEXT: movaps %xmm1, %xmm3
	; SSE: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill			; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[1,1,2,3]
	; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload			; SSE-NEXT: movaps %xmm0, %xmm4
	; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload			; SSE-NEXT: shufps {{.*#+}} xmm4 = xmm4[1,1,2,3]
	; SSE-NEXT: callq fmaxf			; SSE-NEXT: movaps %xmm4, %xmm5
	; SSE-NEXT: movaps %xmm0, (%rsp) # 16-byte Spill			; SSE-NEXT: cmpunordss %xmm5, %xmm5
	; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload			; SSE-NEXT: movaps %xmm5, %xmm6
	; SSE-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1,0]			; SSE-NEXT: andps %xmm3, %xmm6
	; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload			; SSE-NEXT: andnps %xmm4, %xmm5
				; SSE-NEXT: orps %xmm6, %xmm5
				; SSE-NEXT: maxss %xmm5, %xmm3
				; SSE-NEXT: unpcklps {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
				; SSE-NEXT: movaps %xmm0, %xmm4
				; SSE-NEXT: cmpunordss %xmm4, %xmm4
				; SSE-NEXT: movaps %xmm4, %xmm2
				; SSE-NEXT: andps %xmm1, %xmm2
				; SSE-NEXT: andnps %xmm0, %xmm4
				; SSE-NEXT: orps %xmm2, %xmm4
				; SSE-NEXT: movaps %xmm1, %xmm2
				; SSE-NEXT: maxss %xmm4, %xmm2
	; SSE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]			; SSE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]
	; SSE-NEXT: callq fmaxf			; SSE-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1,0]
	; SSE-NEXT: movaps (%rsp), %xmm1 # 16-byte Reload			; SSE-NEXT: movapd %xmm0, %xmm4
	; SSE-NEXT: unpcklps {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]			; SSE-NEXT: cmpunordss %xmm4, %xmm4
	; SSE-NEXT: unpcklps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload			; SSE-NEXT: movaps %xmm4, %xmm5
	; SSE: movaps %xmm1, %xmm0			; SSE-NEXT: andps %xmm1, %xmm5
	; SSE-NEXT: addq $72, %rsp			; SSE-NEXT: andnps %xmm0, %xmm4
				; SSE-NEXT: orps %xmm5, %xmm4
				; SSE-NEXT: maxss %xmm4, %xmm1
				; SSE-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
				; SSE-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
				; SSE-NEXT: movaps %xmm2, %xmm0
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX: vmovaps %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill			; AVX: vcmpunordss %xmm0, %xmm0, %xmm2
	; AVX-NEXT: vmovaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill			; AVX-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm2
	; AVX-NEXT: callq fmaxf			; AVX-NEXT: vmaxss %xmm2, %xmm1, %xmm2
	; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill			; AVX-NEXT: vmovshdup {{.*#+}} xmm3 = xmm1[1,1,3,3]
	; AVX-NEXT: vmovshdup {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload			; AVX-NEXT: vmovshdup {{.*#+}} xmm4 = xmm0[1,1,3,3]
	; AVX: vmovshdup {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload			; AVX-NEXT: vcmpunordss %xmm4, %xmm4, %xmm5
	; AVX: callq fmaxf			; AVX-NEXT: vblendvps %xmm5, %xmm3, %xmm4, %xmm4
	; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload			; AVX-NEXT: vmaxss %xmm4, %xmm3, %xmm3
	; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[2,3]			; AVX-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[2,3]
	; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill			; AVX-NEXT: vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
	; AVX-NEXT: vpermilpd $1, {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload			; AVX-NEXT: vpermilpd {{.*#+}} xmm4 = xmm0[1,0]
	; AVX: vpermilpd $1, {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload			; AVX-NEXT: vcmpunordss %xmm4, %xmm4, %xmm5
	; AVX: callq fmaxf			; AVX-NEXT: vblendvps %xmm5, %xmm3, %xmm4, %xmm4
	; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload			; AVX-NEXT: vmaxss %xmm4, %xmm3, %xmm3
	; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1],xmm0[0],xmm1[3]			; AVX-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0,1],xmm3[0],xmm2[3]
	; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill			; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[3,1,2,3]
	; AVX-NEXT: vpermilps $231, {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload			; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]
	; AVX: vpermilps $231, {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload			; AVX-NEXT: vcmpunordss %xmm0, %xmm0, %xmm3
	; AVX: callq fmaxf			; AVX-NEXT: vblendvps %xmm3, %xmm1, %xmm0, %xmm0
	; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload			; AVX-NEXT: vmaxss %xmm0, %xmm1, %xmm0
	; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]			; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm2[0,1,2],xmm0[0]
	; AVX-NEXT: addq $56, %rsp
	; AVX-NEXT: retq			; AVX-NEXT: retq
	define <2 x float> @test_intrinsic_fmax_v2f32(<2 x float> %x, <2 x float> %y) {			define <2 x float> @test_intrinsic_fmax_v2f32(<2 x float> %x, <2 x float> %y) {
	%z = call <2 x float> @llvm.maxnum.v2f32(<2 x float> %x, <2 x float> %y) readnone			%z = call <2 x float> @llvm.maxnum.v2f32(<2 x float> %x, <2 x float> %y) readnone
	ret <2 x float> %z			ret <2 x float> %z
	}			}

	; CHECK-LABEL: @test_intrinsic_fmax_v4f32			; CHECK-LABEL: @test_intrinsic_fmax_v4f32
	; SSE: movaps %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill			; SSE: movaps %xmm1, %xmm2
	; SSE-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill			; SSE-NEXT: shufps {{.*#+}} xmm2 = xmm2[3,1,2,3]
	; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,1,2,3]			; SSE-NEXT: movaps %xmm0, %xmm3
	; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[3,1,2,3]			; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[3,1,2,3]
	; SSE-NEXT: callq fmaxf			; SSE-NEXT: movaps %xmm3, %xmm4
	; SSE-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill			; SSE-NEXT: cmpunordss %xmm4, %xmm4
	; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload			; SSE-NEXT: movaps %xmm4, %xmm5
	; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1,2,3]			; SSE-NEXT: andps %xmm2, %xmm5
	; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload			; SSE-NEXT: andnps %xmm3, %xmm4
	; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1,2,3]			; SSE-NEXT: orps %xmm5, %xmm4
	; SSE-NEXT: callq fmaxf			; SSE-NEXT: maxss %xmm4, %xmm2
	; SSE-NEXT: unpcklps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload			; SSE-NEXT: movaps %xmm1, %xmm3
	; SSE: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill			; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[1,1,2,3]
	; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload			; SSE-NEXT: movaps %xmm0, %xmm4
	; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload			; SSE-NEXT: shufps {{.*#+}} xmm4 = xmm4[1,1,2,3]
	; SSE-NEXT: callq fmaxf			; SSE-NEXT: movaps %xmm4, %xmm5
	; SSE-NEXT: movaps %xmm0, (%rsp) # 16-byte Spill			; SSE-NEXT: cmpunordss %xmm5, %xmm5
	; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload			; SSE-NEXT: movaps %xmm5, %xmm6
	; SSE-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1,0]			; SSE-NEXT: andps %xmm3, %xmm6
	; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload			; SSE-NEXT: andnps %xmm4, %xmm5
				; SSE-NEXT: orps %xmm6, %xmm5
				; SSE-NEXT: maxss %xmm5, %xmm3
				; SSE-NEXT: unpcklps {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
				; SSE-NEXT: movaps %xmm0, %xmm4
				; SSE-NEXT: cmpunordss %xmm4, %xmm4
				; SSE-NEXT: movaps %xmm4, %xmm2
				; SSE-NEXT: andps %xmm1, %xmm2
				; SSE-NEXT: andnps %xmm0, %xmm4
				; SSE-NEXT: orps %xmm2, %xmm4
				; SSE-NEXT: movaps %xmm1, %xmm2
				; SSE-NEXT: maxss %xmm4, %xmm2
	; SSE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]			; SSE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]
	; SSE-NEXT: callq fmaxf			; SSE-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1,0]
	; SSE-NEXT: movaps (%rsp), %xmm1 # 16-byte Reload			; SSE-NEXT: movapd %xmm0, %xmm4
	; SSE-NEXT: unpcklps {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]			; SSE-NEXT: cmpunordss %xmm4, %xmm4
	; SSE-NEXT: unpcklps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload			; SSE-NEXT: movaps %xmm4, %xmm5
	; SSE: movaps %xmm1, %xmm0			; SSE-NEXT: andps %xmm1, %xmm5
	; SSE-NEXT: addq $72, %rsp			; SSE-NEXT: andnps %xmm0, %xmm4
				; SSE-NEXT: orps %xmm5, %xmm4
				; SSE-NEXT: maxss %xmm4, %xmm1
				; SSE-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
				; SSE-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
				; SSE-NEXT: movaps %xmm2, %xmm0
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX: vmovaps %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill			; AVX: vcmpunordss %xmm0, %xmm0, %xmm2
	; AVX-NEXT: vmovaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill			; AVX-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm2
	; AVX-NEXT: callq fmaxf			; AVX-NEXT: vmaxss %xmm2, %xmm1, %xmm2
	; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill			; AVX-NEXT: vmovshdup {{.*#+}} xmm3 = xmm1[1,1,3,3]
	; AVX-NEXT: vmovshdup {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload			; AVX-NEXT: vmovshdup {{.*#+}} xmm4 = xmm0[1,1,3,3]
	; AVX: vmovshdup {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload			; AVX-NEXT: vcmpunordss %xmm4, %xmm4, %xmm5
	; AVX: callq fmaxf			; AVX-NEXT: vblendvps %xmm5, %xmm3, %xmm4, %xmm4
	; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload			; AVX-NEXT: vmaxss %xmm4, %xmm3, %xmm3
	; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[2,3]			; AVX-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[2,3]
	; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill			; AVX-NEXT: vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
	; AVX-NEXT: vpermilpd $1, {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload			; AVX-NEXT: vpermilpd {{.*#+}} xmm4 = xmm0[1,0]
	; AVX: vpermilpd $1, {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload			; AVX-NEXT: vcmpunordss %xmm4, %xmm4, %xmm5
	; AVX: callq fmaxf			; AVX-NEXT: vblendvps %xmm5, %xmm3, %xmm4, %xmm4
	; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload			; AVX-NEXT: vmaxss %xmm4, %xmm3, %xmm3
	; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1],xmm0[0],xmm1[3]			; AVX-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0,1],xmm3[0],xmm2[3]
	; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill			; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[3,1,2,3]
	; AVX-NEXT: vpermilps $231, {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload			; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]
	; AVX: vpermilps $231, {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload			; AVX-NEXT: vcmpunordss %xmm0, %xmm0, %xmm3
	; AVX: callq fmaxf			; AVX-NEXT: vblendvps %xmm3, %xmm1, %xmm0, %xmm0
	; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload			; AVX-NEXT: vmaxss %xmm0, %xmm1, %xmm0
	; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]			; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm2[0,1,2],xmm0[0]
	; AVX-NEXT: addq $56, %rsp
	; AVX-NEXT: retq			; AVX-NEXT: retq
	define <4 x float> @test_intrinsic_fmax_v4f32(<4 x float> %x, <4 x float> %y) {			define <4 x float> @test_intrinsic_fmax_v4f32(<4 x float> %x, <4 x float> %y) {
	%z = call <4 x float> @llvm.maxnum.v4f32(<4 x float> %x, <4 x float> %y) readnone			%z = call <4 x float> @llvm.maxnum.v4f32(<4 x float> %x, <4 x float> %y) readnone
	ret <4 x float> %z			ret <4 x float> %z
	}			}

	; CHECK-LABEL: @test_intrinsic_fmax_v2f64			; CHECK-LABEL: @test_intrinsic_fmax_v2f64
	; CHECK: callq fmax			; CHECK: callq fmax