This is an archive of the discontinued LLVM Phabricator instance.

[x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475)
ClosedPublic

Authored by spatel on Dec 7 2015, 10:43 AM.

Details

Reviewers

qcolombet
scanon
andreadb
kbsmith1
zansari

Commits

rG271efcdf209f: [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475)
rL255700: [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475)

Summary

This patch implements the suggested codegen from PR24475:
https://llvm.org/bugs/show_bug.cgi?id=24475

but only for the fmaxf() case to start, so we can sort out any bugs before extending to fmin, f64, and vectors.

The fmax / maxnum definitions provide us flexibility for signed zeros, so I hope the only thing we have to worry about in this replacement sequence is NaN handling.
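As a reference point for the NaN and signed-zero behavior mentioned above, here is a minimal C++ sketch that exercises the standard library `std::fmax`, which follows the C `fmax` rules. The names `referenceMaxnum` and `quietNaN` are hypothetical helpers for illustration and are not part of the patch.

```cpp
#include <cmath>
#include <limits>

// Hypothetical wrapper (not part of the patch): the libm behavior that any
// inlined replacement sequence has to reproduce.
inline float referenceMaxnum(float a, float b) { return std::fmax(a, b); }

// A quiet NaN for exercising the NaN cases.
inline float quietNaN() { return std::numeric_limits<float>::quiet_NaN(); }
```

With exactly one NaN input the other operand is returned, and with two NaN inputs the result is NaN. For `+0.0` versus `-0.0` either zero is an acceptable result, which is the flexibility the summary relies on.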

Note 1: I initially implemented this as lowerFMAXNUM(), but that exposes a problem: SelectionDAGBuilder::visitSelect() transforms compare/select instructions into FMAXNUM nodes if we declare FMAXNUM legal or custom. Perhaps this should be checking for NaN inputs or global unsafe-math before transforming? As it stands, this bypasses a big set of optimizations that the x86 backend already has in PerformSELECTCombine(). I don't know what the tradeoffs are for making a 'combine' rather than a 'lower'. If a 'lower' is preferred, we will need to fix that problem.
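To illustrate why the NaN question in Note 1 matters, the sketch below compares a plain compare/select max with `std::fmax`. It is only a scalar model under the assumption of default (non-fast-math) floating point; the names `selectMax` and `agreesWithFmax` are hypothetical and are not taken from SelectionDAGBuilder.

```cpp
#include <cmath>

// Compare/select form of max, as source code such as `x > y ? x : y` expresses it.
inline float selectMax(float x, float y) { return x > y ? x : y; }

// True when selectMax and std::fmax agree on the given inputs
// (two NaN results count as agreement).
inline bool agreesWithFmax(float x, float y) {
  float s = selectMax(x, y);
  float m = std::fmax(x, y);
  return (std::isnan(s) && std::isnan(m)) || s == m;
}
```

The two forms agree for ordinary numbers, but when only the second operand is NaN the compare/select form yields NaN while `std::fmax` yields the numeric operand, so the two are not interchangeable unless NaN inputs are ruled out.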

Note 2: The v2f32 test reveals another bug; the vector is extended to v4f32, so we have completely unnecessary operations happening on undef elements of the vector.

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 42078. Dec 7 2015, 10:43 AM

spatel retitled this revision to [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475).

spatel updated this object.

spatel added reviewers: scanon, qcolombet, jmolloy.

spatel added a subscriber: llvm-commits.

As this primarily involves the x86 backend, I don't think I'm an appropriate reviewer; Resigning.

In D15294#309246, @jmolloy wrote:

As this primarily involves the x86 backend, I don't think I'm an appropriate reviewer; Resigning.

Thanks, James. I thought you might have some feedback on the SelectionDAGBuilder question, although that is separate from the functionality of this patch as it stands.

Adding some more potential x86 reviewers.

Hi Sanjay,

Quick high-level question: wouldn't it be better to pull the intermediate value out of the fmax to reduce the dependence chain?

So, instead of:

cond = isnan(op0)
V = select (op1, op0, cond)
Result = FMAX(op1, V)

do this:

t = FMAX(op1, op0)
cond = isnan(op0)
Result = select (op1, t, cond)

This way, we go from isnan->select->fmax to fmax/isnan -> select.
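The two orderings above can be checked against each other with a scalar model. This is a sketch only: `x86Max` is a hypothetical helper that models the max semantics discussed in this review (the second source operand is returned when either input is NaN), and the two function names are made up for illustration.

```cpp
#include <cmath>

// Hypothetical scalar model of the SSE max semantics: the comparison is false
// whenever an input is NaN, so the second source operand is returned.
inline float x86Max(float src1, float src2) {
  return src1 > src2 ? src1 : src2;
}

// First form: isnan -> select -> fmax (three dependent steps).
inline float maxnumSelectFirst(float op0, float op1) {
  bool cond = std::isnan(op0);
  float v = cond ? op1 : op0;
  return x86Max(op1, v);
}

// Second form: fmax and isnan are independent, then one select.
inline float maxnumMaxFirst(float op0, float op1) {
  float t = x86Max(op1, op0);
  bool cond = std::isnan(op0);
  return cond ? op1 : t;
}
```

In this model both forms give the same result for all four NaN/number combinations; the second form only shortens the dependence chain, because the max no longer has to wait for the select.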

In D15294#310167, @zansari wrote:

Quick high-level question: wouldn't it be better to pull the intermediate value out of the fmax to reduce the dependence chain?

Yes, that would be better.

Because I'm SSE dyslexic, I altered the test program in https://llvm.org/bugs/show_bug.cgi?id=24475 to check, and this is what I came up with:

__m128 maxnum = _mm_max_ss(v2, v1);
__m128 isnan1 = _mm_cmpunord_ss(v1, v1);
maxnum = _mm_blendv_ps(maxnum, v2, isnan1);

Which compiles to (AT&T syntax - should invert the dyslexia, but I still can't get it right):

vmaxss        %xmm0, %xmm1, %xmm2         <--- if either input is NaN, xmm0 (v1) is returned
vcmpunordss   %xmm0, %xmm0, %xmm0
vblendvps     %xmm0, %xmm1, %xmm2, %xmm0  <--- if xmm0 (v1) is NaN, output xmm1 (v2); if not, output max or v1

I'll translate that to LLVM and update the patch. Thanks!
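For readers without an SSE4.1 target at hand, the intrinsic sequence above can be modeled portably. The sketch below is an assumption-laden scalar model, not the intrinsics themselves: `maxSS` and `cmpUnordSS` are hypothetical stand-ins for the max and unordered-compare steps, and the blend is written as the mask-and-combine pattern (and / andnot / or) that also appears in the SSE checks of this revision's test file.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Bit-cast helpers (memcpy avoids undefined behavior from pointer punning).
static uint32_t bitsOf(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
static float floatOf(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// Hypothetical model of the max step: returns the second operand if either is NaN.
static float maxSS(float a, float b) { return a > b ? a : b; }

// Hypothetical model of the unordered compare: all-ones mask if either input is NaN.
static uint32_t cmpUnordSS(float a, float b) {
  return (std::isnan(a) || std::isnan(b)) ? 0xFFFFFFFFu : 0u;
}

// Model of the three-step sequence: max, NaN test on v1, then blend.
float fmaxfModel(float v1, float v2) {
  float maxnum = maxSS(v2, v1);
  uint32_t isnan1 = cmpUnordSS(v1, v1);
  // Blend: take v2 where the mask is set, otherwise keep the max.
  uint32_t blended = (isnan1 & bitsOf(v2)) | (~isnan1 & bitsOf(maxnum));
  return floatOf(blended);
}
```

Calling `fmaxfModel(v1, v2)` with one NaN argument returns the other argument, and with two NaN arguments returns the NaN held in `v2`, matching the annotations on the assembly above.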

Patch updated:

Implement more efficient sequence suggested by Zia.
Fix checks in the test to match.
Add FIXME comments for other fmax tests that aren't handled yet.

I'd disable the expansion under minsize, at least. Otherwise, lgtm.

LGTM

This revision is now accepted and ready to land. Dec 15 2015, 12:40 PM

In D15294#311164, @zansari wrote:

I'd disable the expansion under minsize, at least. Otherwise, lgtm.

Thanks! I'll add that check and a test case and get this checked in.

Closed by commit rL255700: [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475) (authored by spatel). Dec 15 2015, 3:14 PM

This revision was automatically updated to reflect the committed changes.

pengfei mentioned this in D145634: [X86] Support llvm.{min,max}imum.f{16,32,64}. Mar 12 2023, 3:08 AM

Revision Contents

Path                                             Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp    52 lines
llvm/trunk/test/CodeGen/X86/fmaxnum.ll           274 lines

Diff 42924

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[1,793 lines not shown]
   setTargetDAGCombine(ISD::SRL);
   setTargetDAGCombine(ISD::OR);
   setTargetDAGCombine(ISD::AND);
   setTargetDAGCombine(ISD::ADD);
   setTargetDAGCombine(ISD::FADD);
   setTargetDAGCombine(ISD::FSUB);
   setTargetDAGCombine(ISD::FNEG);
   setTargetDAGCombine(ISD::FMA);
+  setTargetDAGCombine(ISD::FMAXNUM);
   setTargetDAGCombine(ISD::SUB);
   setTargetDAGCombine(ISD::LOAD);
   setTargetDAGCombine(ISD::MLOAD);
   setTargetDAGCombine(ISD::STORE);
   setTargetDAGCombine(ISD::MSTORE);
   setTargetDAGCombine(ISD::TRUNCATE);
   setTargetDAGCombine(ISD::ZERO_EXTEND);
   setTargetDAGCombine(ISD::ANY_EXTEND);
[24,718 lines not shown; enclosing context: switch (N->getOpcode()) {]
     case X86ISD::FMIN: NewOp = X86ISD::FMINC; break;
     case X86ISD::FMAX: NewOp = X86ISD::FMAXC; break;
   }

   return DAG.getNode(NewOp, SDLoc(N), N->getValueType(0),
                      N->getOperand(0), N->getOperand(1));
 }

+static SDValue performFMaxNumCombine(SDNode *N, SelectionDAG &DAG,
+                                     const X86Subtarget *Subtarget) {
+  // This takes at least 3 instructions, so favor a library call when
+  // minimizing code size.
+  if (DAG.getMachineFunction().getFunction()->optForMinSize())
+    return SDValue();
+
+  EVT VT = N->getValueType(0);
+
+  // TODO: Check for global or instruction-level "nnan". In that case, we
+  //       should be able to lower to FMAX/FMIN alone.
+  // TODO: If an operand is already known to be a NaN or not a NaN, this
+  //       should be an optional swap and FMAX/FMIN.
+  // TODO: Allow f64, vectors, and fminnum.
+
+  if (VT != MVT::f32 || !Subtarget->hasSSE1() || Subtarget->useSoftFloat())
+    return SDValue();
+
+  SDValue Op0 = N->getOperand(0);
+  SDValue Op1 = N->getOperand(1);
+  SDLoc DL(N);
+  EVT SetCCType = DAG.getTargetLoweringInfo().getSetCCResultType(
+      DAG.getDataLayout(), *DAG.getContext(), VT);
+
+  // There are 4 possibilities involving NaN inputs, and these are the required
+  // outputs:
+  //                   Op1
+  //               Num     NaN
+  //            ----------------
+  //       Num  |  Max  |  Op0 |
+  // Op0        ----------------
+  //       NaN  |  Op1  |  NaN |
+  //            ----------------
+  //
+  // The SSE FP max/min instructions were not designed for this case, but rather
+  // to implement:
+  //   Max = Op1 > Op0 ? Op1 : Op0
+  //
+  // So they always return Op0 if either input is a NaN. However, we can still
+  // use those instructions for fmaxnum by selecting away a NaN input.
+
+  // If either operand is NaN, the 2nd source operand (Op0) is passed through.
+  SDValue Max = DAG.getNode(X86ISD::FMAX, DL, VT, Op1, Op0);
+  SDValue IsOp0Nan = DAG.getSetCC(DL, SetCCType , Op0, Op0, ISD::SETUO);
+
+  // If Op0 is a NaN, select Op1. Otherwise, select the max. If both operands
+  // are NaN, the NaN value of Op1 is the result.
+  return DAG.getNode(ISD::SELECT, DL, VT, IsOp0Nan, Op1, Max);
+}
+
 /// Do target-specific dag combines on X86ISD::FAND nodes.
 static SDValue PerformFANDCombine(SDNode *N, SelectionDAG &DAG,
                                   const X86Subtarget *Subtarget) {
   // FAND(0.0, x) -> 0.0
   if (ConstantFPSDNode *C = dyn_cast<ConstantFPSDNode>(N->getOperand(0)))
     if (C->getValueAPF().isPosZero())
       return N->getOperand(0);

[843 lines not shown; enclosing context: SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,]
   case ISD::FADD: return PerformFADDCombine(N, DAG, Subtarget);
   case ISD::FSUB: return PerformFSUBCombine(N, DAG, Subtarget);
   case ISD::FNEG: return PerformFNEGCombine(N, DAG, Subtarget);
   case ISD::TRUNCATE: return PerformTRUNCATECombine(N, DAG, Subtarget);
   case X86ISD::FXOR:
   case X86ISD::FOR: return PerformFORCombine(N, DAG, Subtarget);
   case X86ISD::FMIN:
   case X86ISD::FMAX: return PerformFMinFMaxCombine(N, DAG);
+  case ISD::FMAXNUM: return performFMaxNumCombine(N, DAG, Subtarget);
   case X86ISD::FAND: return PerformFANDCombine(N, DAG, Subtarget);
   case X86ISD::FANDN: return PerformFANDNCombine(N, DAG, Subtarget);
   case X86ISD::BT: return PerformBTCombine(N, DAG, DCI);
   case X86ISD::VZEXT_MOVL: return PerformVZEXT_MOVLCombine(N, DAG);
   case ISD::ANY_EXTEND:
   case ISD::ZERO_EXTEND: return PerformZExtCombine(N, DAG, DCI, Subtarget);
   case ISD::SIGN_EXTEND: return PerformSExtCombine(N, DAG, DCI, Subtarget);
   case ISD::SIGN_EXTEND_INREG:
[851 lines not shown]

llvm/trunk/test/CodeGen/X86/fmaxnum.ll

[10 lines not shown]
 declare <2 x float> @llvm.maxnum.v2f32(<2 x float>, <2 x float>)
 declare <4 x float> @llvm.maxnum.v4f32(<4 x float>, <4 x float>)
 declare <2 x double> @llvm.maxnum.v2f64(<2 x double>, <2 x double>)
 declare <4 x double> @llvm.maxnum.v4f64(<4 x double>, <4 x double>)
 declare <8 x double> @llvm.maxnum.v8f64(<8 x double>, <8 x double>)


 ; CHECK-LABEL: @test_fmaxf
-; CHECK: jmp fmaxf
+; SSE: movaps %xmm0, %xmm2
+; SSE-NEXT: cmpunordss %xmm2, %xmm2
+; SSE-NEXT: movaps %xmm2, %xmm3
+; SSE-NEXT: andps %xmm1, %xmm3
+; SSE-NEXT: maxss %xmm0, %xmm1
+; SSE-NEXT: andnps %xmm1, %xmm2
+; SSE-NEXT: orps %xmm3, %xmm2
+; SSE-NEXT: movaps %xmm2, %xmm0
+; SSE-NEXT: retq
+;
+; AVX: vmaxss %xmm0, %xmm1, %xmm2
+; AVX-NEXT: vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT: vblendvps %xmm0, %xmm1, %xmm2, %xmm0
+; AVX-NEXT: retq
 define float @test_fmaxf(float %x, float %y) {
   %z = call float @fmaxf(float %x, float %y) readnone
   ret float %z
 }

+; CHECK-LABEL: @test_fmaxf_minsize
+; CHECK: jmp fmaxf
+define float @test_fmaxf_minsize(float %x, float %y) minsize {
+  %z = call float @fmaxf(float %x, float %y) readnone
+  ret float %z
+}
+
+; FIXME: Doubles should be inlined similarly to floats.

 ; CHECK-LABEL: @test_fmax
 ; CHECK: jmp fmax
 define double @test_fmax(double %x, double %y) {
   %z = call double @fmax(double %x, double %y) readnone
   ret double %z
 }

 ; CHECK-LABEL: @test_fmaxl
 ; CHECK: callq fmaxl
 define x86_fp80 @test_fmaxl(x86_fp80 %x, x86_fp80 %y) {
   %z = call x86_fp80 @fmaxl(x86_fp80 %x, x86_fp80 %y) readnone
   ret x86_fp80 %z
 }

 ; CHECK-LABEL: @test_intrinsic_fmaxf
-; CHECK: jmp fmaxf
+; SSE: movaps %xmm0, %xmm2
+; SSE-NEXT: cmpunordss %xmm2, %xmm2
+; SSE-NEXT: movaps %xmm2, %xmm3
+; SSE-NEXT: andps %xmm1, %xmm3
+; SSE-NEXT: maxss %xmm0, %xmm1
+; SSE-NEXT: andnps %xmm1, %xmm2
+; SSE-NEXT: orps %xmm3, %xmm2
+; SSE-NEXT: movaps %xmm2, %xmm0
+; SSE-NEXT: retq
+;
+; AVX: vmaxss %xmm0, %xmm1, %xmm2
+; AVX-NEXT: vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT: vblendvps %xmm0, %xmm1, %xmm2, %xmm0
+; AVX-NEXT: retq
 define float @test_intrinsic_fmaxf(float %x, float %y) {
   %z = call float @llvm.maxnum.f32(float %x, float %y) readnone
   ret float %z
 }

+; FIXME: Doubles should be inlined similarly to floats.
+
 ; CHECK-LABEL: @test_intrinsic_fmax
 ; CHECK: jmp fmax
 define double @test_intrinsic_fmax(double %x, double %y) {
   %z = call double @llvm.maxnum.f64(double %x, double %y) readnone
   ret double %z
 }

 ; CHECK-LABEL: @test_intrinsic_fmaxl
 ; CHECK: callq fmaxl
 define x86_fp80 @test_intrinsic_fmaxl(x86_fp80 %x, x86_fp80 %y) {
   %z = call x86_fp80 @llvm.maxnum.f80(x86_fp80 %x, x86_fp80 %y) readnone
   ret x86_fp80 %z
 }

+; FIXME: This should not be doing 4 scalar ops on a 2 element vector.
+; FIXME: This should use vector ops (maxps / cmpps).
+
 ; CHECK-LABEL: @test_intrinsic_fmax_v2f32
-; SSE: movaps %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill
-; SSE-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill
-; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,1,2,3]
-; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[3,1,2,3]
-; SSE-NEXT: callq fmaxf
-; SSE-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill
-; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
-; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1,2,3]
-; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
-; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1,2,3]
-; SSE-NEXT: callq fmaxf
-; SSE-NEXT: unpcklps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload
-; SSE: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill
-; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
-; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
-; SSE-NEXT: callq fmaxf
-; SSE-NEXT: movaps %xmm0, (%rsp) # 16-byte Spill
-; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
-; SSE-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1,0]
-; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
+; SSE: movaps %xmm1, %xmm2
+; SSE-NEXT: shufps {{.*#+}} xmm2 = xmm2[3,1,2,3]
+; SSE-NEXT: movaps %xmm0, %xmm3
+; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[3,1,2,3]
+; SSE-NEXT: movaps %xmm3, %xmm4
+; SSE-NEXT: cmpunordss %xmm4, %xmm4
+; SSE-NEXT: movaps %xmm4, %xmm5
+; SSE-NEXT: andps %xmm2, %xmm5
+; SSE-NEXT: maxss %xmm3, %xmm2
+; SSE-NEXT: andnps %xmm2, %xmm4
+; SSE-NEXT: orps %xmm5, %xmm4
+; SSE-NEXT: movaps %xmm1, %xmm2
+; SSE-NEXT: shufps {{.*#+}} xmm2 = xmm2[1,1,2,3]
+; SSE-NEXT: movaps %xmm0, %xmm5
+; SSE-NEXT: shufps {{.*#+}} xmm5 = xmm5[1,1,2,3]
+; SSE-NEXT: movaps %xmm5, %xmm3
+; SSE-NEXT: cmpunordss %xmm3, %xmm3
+; SSE-NEXT: movaps %xmm3, %xmm6
+; SSE-NEXT: andps %xmm2, %xmm6
+; SSE-NEXT: maxss %xmm5, %xmm2
+; SSE-NEXT: andnps %xmm2, %xmm3
+; SSE-NEXT: orps %xmm6, %xmm3
+; SSE-NEXT: unpcklps {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1]
+; SSE-NEXT: movaps %xmm0, %xmm2
+; SSE-NEXT: cmpunordss %xmm2, %xmm2
+; SSE-NEXT: movaps %xmm2, %xmm4
+; SSE-NEXT: andps %xmm1, %xmm4
+; SSE-NEXT: movaps %xmm1, %xmm5
+; SSE-NEXT: maxss %xmm0, %xmm5
+; SSE-NEXT: andnps %xmm5, %xmm2
+; SSE-NEXT: orps %xmm4, %xmm2
 ; SSE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]
-; SSE-NEXT: callq fmaxf
-; SSE-NEXT: movaps (%rsp), %xmm1 # 16-byte Reload
-; SSE-NEXT: unpcklps {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
-; SSE-NEXT: unpcklps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
-; SSE: movaps %xmm1, %xmm0
-; SSE-NEXT: addq $72, %rsp
+; SSE-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1,0]
+; SSE-NEXT: movapd %xmm0, %xmm4
+; SSE-NEXT: cmpunordss %xmm4, %xmm4
+; SSE-NEXT: movaps %xmm4, %xmm5
+; SSE-NEXT: andps %xmm1, %xmm5
+; SSE-NEXT: maxss %xmm0, %xmm1
+; SSE-NEXT: andnps %xmm1, %xmm4
+; SSE-NEXT: orps %xmm5, %xmm4
+; SSE-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1]
+; SSE-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; SSE-NEXT: movaps %xmm2, %xmm0
 ; SSE-NEXT: retq
 ;
-; AVX: vmovaps %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill
-; AVX-NEXT: vmovaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill
-; AVX-NEXT: callq fmaxf
-; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill
-; AVX-NEXT: vmovshdup {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload
-; AVX: vmovshdup {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
-; AVX: callq fmaxf
-; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload
-; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[2,3]
-; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill
-; AVX-NEXT: vpermilpd $1, {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload
-; AVX: vpermilpd $1, {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
-; AVX: callq fmaxf
-; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload
-; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1],xmm0[0],xmm1[3]
-; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill
-; AVX-NEXT: vpermilps $231, {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload
-; AVX: vpermilps $231, {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
-; AVX: callq fmaxf
-; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload
-; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]
-; AVX-NEXT: addq $56, %rsp
+; AVX: vmaxss %xmm0, %xmm1, %xmm2
+; AVX-NEXT: vcmpunordss %xmm0, %xmm0, %xmm3
+; AVX-NEXT: vblendvps %xmm3, %xmm1, %xmm2, %xmm2
+; AVX-NEXT: vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX-NEXT: vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX-NEXT: vmaxss %xmm3, %xmm4, %xmm5
+; AVX-NEXT: vcmpunordss %xmm3, %xmm3, %xmm3
+; AVX-NEXT: vblendvps %xmm3, %xmm4, %xmm5, %xmm3
+; AVX-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[2,3]
+; AVX-NEXT: vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT: vpermilpd {{.*#+}} xmm4 = xmm1[1,0]
+; AVX-NEXT: vmaxss %xmm3, %xmm4, %xmm5
+; AVX-NEXT: vcmpunordss %xmm3, %xmm3, %xmm3
+; AVX-NEXT: vblendvps %xmm3, %xmm4, %xmm5, %xmm3
+; AVX-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0,1],xmm3[0],xmm2[3]
+; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]
+; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[3,1,2,3]
+; AVX-NEXT: vmaxss %xmm0, %xmm1, %xmm3
+; AVX-NEXT: vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT: vblendvps %xmm0, %xmm1, %xmm3, %xmm0
+; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm2[0,1,2],xmm0[0]
 ; AVX-NEXT: retq
 define <2 x float> @test_intrinsic_fmax_v2f32(<2 x float> %x, <2 x float> %y) {
   %z = call <2 x float> @llvm.maxnum.v2f32(<2 x float> %x, <2 x float> %y) readnone
   ret <2 x float> %z
 }

+; FIXME: This should use vector ops (maxps / cmpps).
+
 ; CHECK-LABEL: @test_intrinsic_fmax_v4f32
-; SSE: movaps %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill
-; SSE-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill
-; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,1,2,3]
-; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[3,1,2,3]
-; SSE-NEXT: callq fmaxf
-; SSE-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill
-; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
-; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1,2,3]
-; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
-; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1,2,3]
-; SSE-NEXT: callq fmaxf
-; SSE-NEXT: unpcklps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload
-; SSE: movaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill
-; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
-; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
-; SSE-NEXT: callq fmaxf
-; SSE-NEXT: movaps %xmm0, (%rsp) # 16-byte Spill
-; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
-; SSE-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1,0]
-; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
+; SSE: movaps %xmm1, %xmm2
+; SSE-NEXT: shufps {{.*#+}} xmm2 = xmm2[3,1,2,3]
+; SSE-NEXT: movaps %xmm0, %xmm3
+; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[3,1,2,3]
+; SSE-NEXT: movaps %xmm3, %xmm4
+; SSE-NEXT: cmpunordss %xmm4, %xmm4
+; SSE-NEXT: movaps %xmm4, %xmm5
+; SSE-NEXT: andps %xmm2, %xmm5
+; SSE-NEXT: maxss %xmm3, %xmm2
+; SSE-NEXT: andnps %xmm2, %xmm4
+; SSE-NEXT: orps %xmm5, %xmm4
+; SSE-NEXT: movaps %xmm1, %xmm2
+; SSE-NEXT: shufps {{.*#+}} xmm2 = xmm2[1,1,2,3]
+; SSE-NEXT: movaps %xmm0, %xmm5
+; SSE-NEXT: shufps {{.*#+}} xmm5 = xmm5[1,1,2,3]
+; SSE-NEXT: movaps %xmm5, %xmm3
+; SSE-NEXT: cmpunordss %xmm3, %xmm3
+; SSE-NEXT: movaps %xmm3, %xmm6
+; SSE-NEXT: andps %xmm2, %xmm6
+; SSE-NEXT: maxss %xmm5, %xmm2
+; SSE-NEXT: andnps %xmm2, %xmm3
+; SSE-NEXT: orps %xmm6, %xmm3
+; SSE-NEXT: unpcklps {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1]
+; SSE-NEXT: movaps %xmm0, %xmm2
+; SSE-NEXT: cmpunordss %xmm2, %xmm2
+; SSE-NEXT: movaps %xmm2, %xmm4
+; SSE-NEXT: andps %xmm1, %xmm4
+; SSE-NEXT: movaps %xmm1, %xmm5
+; SSE-NEXT: maxss %xmm0, %xmm5
+; SSE-NEXT: andnps %xmm5, %xmm2
+; SSE-NEXT: orps %xmm4, %xmm2
 ; SSE-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1,0]
-; SSE-NEXT: callq fmaxf
-; SSE-NEXT: movaps (%rsp), %xmm1 # 16-byte Reload
-; SSE-NEXT: unpcklps {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
-; SSE-NEXT: unpcklps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
-; SSE: movaps %xmm1, %xmm0
-; SSE-NEXT: addq $72, %rsp
+; SSE-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1,0]
+; SSE-NEXT: movapd %xmm0, %xmm4
+; SSE-NEXT: cmpunordss %xmm4, %xmm4
+; SSE-NEXT: movaps %xmm4, %xmm5
+; SSE-NEXT: andps %xmm1, %xmm5
+; SSE-NEXT: maxss %xmm0, %xmm1
+; SSE-NEXT: andnps %xmm1, %xmm4
+; SSE-NEXT: orps %xmm5, %xmm4
+; SSE-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1]
+; SSE-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; SSE-NEXT: movaps %xmm2, %xmm0
 ; SSE-NEXT: retq
 ;
-; AVX: vmovaps %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill
-; AVX-NEXT: vmovaps %xmm0, {{[0-9]+}}(%rsp) # 16-byte Spill
-; AVX-NEXT: callq fmaxf
-; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill
-; AVX-NEXT: vmovshdup {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload
-; AVX: vmovshdup {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
-; AVX: callq fmaxf
-; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload
-; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[2,3]
-; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill
-; AVX-NEXT: vpermilpd $1, {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload
-; AVX: vpermilpd $1, {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
-; AVX: callq fmaxf
-; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload
-; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1],xmm0[0],xmm1[3]
-; AVX-NEXT: vmovaps %xmm0, (%rsp) # 16-byte Spill
-; AVX-NEXT: vpermilps $231, {{[0-9]+}}(%rsp), %xmm0 # 16-byte Folded Reload
-; AVX: vpermilps $231, {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
-; AVX: callq fmaxf
-; AVX-NEXT: vmovaps (%rsp), %xmm1 # 16-byte Reload
-; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]
-; AVX-NEXT: addq $56, %rsp
+; AVX: vmaxss %xmm0, %xmm1, %xmm2
+; AVX-NEXT: vcmpunordss %xmm0, %xmm0, %xmm3
+; AVX-NEXT: vblendvps %xmm3, %xmm1, %xmm2, %xmm2
+; AVX-NEXT: vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX-NEXT: vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX-NEXT: vmaxss %xmm3, %xmm4, %xmm5
+; AVX-NEXT: vcmpunordss %xmm3, %xmm3, %xmm3
+; AVX-NEXT: vblendvps %xmm3, %xmm4, %xmm5, %xmm3
+; AVX-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[2,3]
+; AVX-NEXT: vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT: vpermilpd {{.*#+}} xmm4 = xmm1[1,0]
+; AVX-NEXT: vmaxss %xmm3, %xmm4, %xmm5
+; AVX-NEXT: vcmpunordss %xmm3, %xmm3, %xmm3
+; AVX-NEXT: vblendvps %xmm3, %xmm4, %xmm5, %xmm3
+; AVX-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0,1],xmm3[0],xmm2[3]
+; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]
+; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[3,1,2,3]
+; AVX-NEXT: vmaxss %xmm0, %xmm1, %xmm3
+; AVX-NEXT: vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT: vblendvps %xmm0, %xmm1, %xmm3, %xmm0
+; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm2[0,1,2],xmm0[0]
 ; AVX-NEXT: retq
 define <4 x float> @test_intrinsic_fmax_v4f32(<4 x float> %x, <4 x float> %y) {
   %z = call <4 x float> @llvm.maxnum.v4f32(<4 x float> %x, <4 x float> %y) readnone
   ret <4 x float> %z
 }

+; FIXME: Vector of doubles should be inlined similarly to vector of floats.
+
 ; CHECK-LABEL: @test_intrinsic_fmax_v2f64
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 define <2 x double> @test_intrinsic_fmax_v2f64(<2 x double> %x, <2 x double> %y) {
   %z = call <2 x double> @llvm.maxnum.v2f64(<2 x double> %x, <2 x double> %y) readnone
   ret <2 x double> %z
 }

+; FIXME: Vector of doubles should be inlined similarly to vector of floats.
+
 ; CHECK-LABEL: @test_intrinsic_fmax_v4f64
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 define <4 x double> @test_intrinsic_fmax_v4f64(<4 x double> %x, <4 x double> %y) {
   %z = call <4 x double> @llvm.maxnum.v4f64(<4 x double> %x, <4 x double> %y) readnone
   ret <4 x double> %z
 }

+; FIXME: Vector of doubles should be inlined similarly to vector of floats.
+
 ; CHECK-LABEL: @test_intrinsic_fmax_v8f64
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 ; CHECK: callq fmax
 define <8 x double> @test_intrinsic_fmax_v8f64(<8 x double> %x, <8 x double> %y) {
   %z = call <8 x double> @llvm.maxnum.v8f64(<8 x double> %x, <8 x double> %y) readnone
   ret <8 x double> %z
 }