This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Improve FMA support for interpolation patterns
Closed, Public

Authored by RKSimon on Sep 20 2015, 11:00 AM.


Details

Reviewers

spatel
delena
arsenm
hfinkel

Commits

rG4003ed2da300: [DAGCombiner] Improve FMA support for interpolation patterns
rL248210: [DAGCombiner] Improve FMA support for interpolation patterns

Summary

This patch adds support for combining patterns such as (FMUL(FADD(1.0, x), y)) and (FMUL(FSUB(x, 1.0), y)) to their FMA equivalents.
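As a rough illustration of the algebra behind these combines, the C++ sketch below compares three unfused expressions with FMA equivalents written using std::fma. The helper names are made up for this example and are not part of the patch. The inputs in self_check are small, exactly representable values, so both forms agree exactly here; in general the fused and unfused forms may round differently.

```cpp
#include <cassert>
#include <cmath>

// Unfused source patterns (hypothetical helper names, for illustration only).
double mul_add_x_one_y(double x, double y) { return (x + 1.0) * y; }
double mul_sub_x_one_y(double x, double y) { return (x - 1.0) * y; }
double mul_sub_one_x_y(double x, double y) { return (1.0 - x) * y; }

// FMA equivalents of the patterns above.
double fma_add_x_one_y(double x, double y) { return std::fma(x, y, y); }   // x*y + y
double fma_sub_x_one_y(double x, double y) { return std::fma(x, y, -y); }  // x*y - y
double fma_sub_one_x_y(double x, double y) { return std::fma(-x, y, y); }  // -x*y + y

// Checks the identities on inputs where every intermediate value is exact.
void self_check() {
  const double x = 3.0, y = 0.5;
  assert(mul_add_x_one_y(x, y) == fma_add_x_one_y(x, y));
  assert(mul_sub_x_one_y(x, y) == fma_sub_x_one_y(x, y));
  assert(mul_sub_one_x_y(x, y) == fma_sub_one_x_y(x, y));
}
```

Each fused form replaces an add or subtract followed by a multiply with a single multiply-add, at the cost of a negation in some variants.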

This is useful in particular for linear interpolation cases such as (FADD(FMUL(x, t), FMUL(y, FSUB(1.0, t))))
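To make the interpolation case concrete, here is a minimal C++ sketch of the same computation written once in the unfused form and once as two chained multiply-adds. The function names lerp_reference and lerp_fused are assumptions for this example only; the sketch shows the intended shape of the result, not the compiler's actual output.

```cpp
#include <cassert>
#include <cmath>

// Unfused form: x*t + y*(1 - t), matching
// FADD(FMUL(x, t), FMUL(y, FSUB(1.0, t))).
float lerp_reference(float x, float y, float t) {
  return x * t + y * (1.0f - t);
}

// Two chained multiply-adds:
//   inner  = fma(-t, y, y)     which equals y*(1 - t)
//   result = fma(x, t, inner)  which equals x*t + y*(1 - t)
float lerp_fused(float x, float y, float t) {
  float inner = std::fma(-t, y, y);
  return std::fma(x, t, inner);
}
```

For example, lerp_fused(2.0f, 4.0f, 0.5f) evaluates inner as 2.0f and the result as 3.0f, the same value lerp_reference gives for these exactly representable inputs.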

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 35206. Sep 20 2015, 11:00 AM

RKSimon retitled this revision to [DAGCombiner] Improve FMA support for interpolation patterns.

RKSimon updated this object.

RKSimon added reviewers: hfinkel, arsenm, spatel, delena.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: llvm-commits.

This mostly LGTM.

There aren't any tests stressing the FMAD path. AMDGPU seems to be the only target still using it, and the one test change is in the expansion of an intrinsic which should be removed. If you can add some FMAD tests, that would be good; otherwise I can try to do it after you commit.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7937 ↗	(On Diff #35206)	A better name would be AllowFusion or something like that
7944–7946 ↗	(On Diff #35206)	I think the AllowFusion/UnsafeFPMath check should be first
7953 ↗	(On Diff #35206)	Usually the int is omitted
7954 ↗	(On Diff #35206)	It seems wrong to use this in the FMAD case, although AMDGPU happens to not care because enableAggressiveFMAFusion always reports true and it seems to be what is used already.
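For readers unfamiliar with the FMAD/FMA distinction raised in these comments: FMAD rounds the intermediate product before the add, while FMA rounds only once. The C++ sketch below illustrates the difference numerically. It is an assumption-laden example, not code from the patch: it uses IEEE-754 double rather than the f32 used on AMDGPU, the helper names are hypothetical, and it relies on std::fma being correctly rounded.

```cpp
#include <cassert>
#include <cmath>

// Multiply-add with intermediate rounding (FMAD-like behaviour).
// The volatile store forces the product to be rounded to double and keeps
// the compiler from contracting the expression into a fused operation.
double mad_rounded(double a, double b, double c) {
  volatile double product = a * b;
  return product + c;
}

// Multiply-add without intermediate rounding (FMA-like behaviour).
double mad_fused(double a, double b, double c) {
  return std::fma(a, b, c);
}

// With a = 1 + 2^-30, the exact square is 1 + 2^-29 + 2^-60. The 2^-60 term
// does not fit in a double next to 1.0, so the rounded product loses it.
double demo_a() { return 1.0 + std::ldexp(1.0, -30); }
double demo_c() { return -(1.0 + std::ldexp(1.0, -29)); }
```

With these inputs, mad_rounded(demo_a(), demo_a(), demo_c()) yields 0.0, whereas mad_fused keeps the low-order term and yields 2^-60.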

There aren't any tests stressing the FMAD path. AMDGPU seems to be the only target still using it, and the one test change is in the expansion of an intrinsic which should be removed. If you can add some FMAD tests, that would be good; otherwise I can try to do it after you commit.

Thanks Matt, I can add some FMAD tests for v_mad_f32 - is that the only instruction I should be testing for?

Most of your comments about the preamble are just as relevant for the other FMA pattern combines (visitFADDForFMACombine, visitFSUBForFMACombine); given that I copied+pasted most of it from them should they be updated as well?

In D13003#249540, @RKSimon wrote:

There aren't any tests stressing the FMAD path. AMDGPU seems to be the only target still using it, and the one test change is in the expansion of an intrinsic which should be removed. If you can add some FMAD tests, that would be good; otherwise I can try to do it after you commit.

Thanks Matt, I can add some FMAD tests for v_mad_f32 - is that the only instruction I should be testing for?

Yes. The fneg should be folded in as a source modifier, which looks something like v_mad_f32 v0, v1, v2, -v3. Sometimes v_mac_f32 is used, although that shouldn't happen in these cases.

Most of your comments about the preamble are just as relevant for the other FMA pattern combines (visitFADDForFMACombine, visitFSUBForFMACombine); given that I copied+pasted most of it from them should they be updated as well?

Yes, probably

Updated all FMA combine helpers based on Matt's feedback.

Added AMDGPU FMA/FMAD tests

LGTM, although I think you should split the renames in the other parts into a separate patch

This revision is now accepted and ready to land. Sep 21 2015, 9:45 AM

RKSimon mentioned this in rL248206: [DAGCombiner] Tidy up FMA combine helpers. NFCI. Sep 21 2015, 1:16 PM

Closed by commit rL248210: [DAGCombiner] Improve FMA support for interpolation patterns (authored by RKSimon). Sep 21 2015, 1:34 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	89 lines
llvm/trunk/test/CodeGen/AMDGPU/fma-combine.ll	200 lines
llvm/trunk/test/CodeGen/AMDGPU/llvm.amdgpu.lrp.ll	2 lines
llvm/trunk/test/CodeGen/X86/fma_patterns.ll	305 lines

Diff 35302

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp


[315 lines not shown; context: private:]
SDValue visitMSTORE(SDNode *N);		SDValue visitMSTORE(SDNode *N);
SDValue visitMGATHER(SDNode *N);		SDValue visitMGATHER(SDNode *N);
SDValue visitMSCATTER(SDNode *N);		SDValue visitMSCATTER(SDNode *N);
SDValue visitFP_TO_FP16(SDNode *N);		SDValue visitFP_TO_FP16(SDNode *N);
SDValue visitFP16_TO_FP(SDNode *N);		SDValue visitFP16_TO_FP(SDNode *N);

SDValue visitFADDForFMACombine(SDNode *N);		SDValue visitFADDForFMACombine(SDNode *N);
SDValue visitFSUBForFMACombine(SDNode *N);		SDValue visitFSUBForFMACombine(SDNode *N);
		SDValue visitFMULForFMACombine(SDNode *N);

SDValue XformToShuffleWithZero(SDNode *N);		SDValue XformToShuffleWithZero(SDNode *N);
SDValue ReassociateOps(unsigned Opc, SDLoc DL, SDValue LHS, SDValue RHS);		SDValue ReassociateOps(unsigned Opc, SDLoc DL, SDValue LHS, SDValue RHS);

SDValue visitShiftByConstant(SDNode *N, ConstantSDNode *Amt);		SDValue visitShiftByConstant(SDNode *N, ConstantSDNode *Amt);

bool SimplifySelectOps(SDNode *SELECT, SDValue LHS, SDValue RHS);		bool SimplifySelectOps(SDNode *SELECT, SDValue LHS, SDValue RHS);
SDValue SimplifyBinOpWithSameOpcodeHands(SDNode *N);		SDValue SimplifyBinOpWithSameOpcodeHands(SDNode *N);
[7,583 lines not shown; context: if (AllowFusion && LookThroughFPExt) {]
}		}
}		}
}		}
}		}

return SDValue();		return SDValue();
}		}

		/// Try to perform FMA combining on a given FMUL node.
		SDValue DAGCombiner::visitFMULForFMACombine(SDNode *N) {
		SDValue N0 = N->getOperand(0);
		SDValue N1 = N->getOperand(1);
		EVT VT = N->getValueType(0);
		SDLoc SL(N);

		assert(N->getOpcode() == ISD::FMUL && "Expected FMUL Operation");

		const TargetOptions &Options = DAG.getTarget().Options;
		bool AllowFusion =
		(Options.AllowFPOpFusion == FPOpFusion::Fast || Options.UnsafeFPMath);

		// Floating-point multiply-add with intermediate rounding.
		bool HasFMAD = (LegalOperations && TLI.isOperationLegal(ISD::FMAD, VT));

		// Floating-point multiply-add without intermediate rounding.
		bool HasFMA =
		AllowFusion && TLI.isFMAFasterThanFMulAndFAdd(VT) &&
		(!LegalOperations || TLI.isOperationLegalOrCustom(ISD::FMA, VT));

		// No valid opcode, do not combine.
		if (!HasFMAD && !HasFMA)
		return SDValue();

		// Always prefer FMAD to FMA for precision.
		unsigned PreferredFusedOpcode = HasFMAD ? ISD::FMAD : ISD::FMA;
		bool Aggressive = TLI.enableAggressiveFMAFusion(VT);

		// fold (fmul (fadd x, +1.0), y) -> (fma x, y, y)
		// fold (fmul (fadd x, -1.0), y) -> (fma x, y, (fneg y))
		auto FuseFADD = [&](SDValue X, SDValue Y) {
		if (X.getOpcode() == ISD::FADD && (Aggressive || X->hasOneUse())) {
		auto XC1 = isConstOrConstSplatFP(X.getOperand(1));
		if (XC1 && XC1->isExactlyValue(+1.0))
		return DAG.getNode(PreferredFusedOpcode, SL, VT, X.getOperand(0), Y, Y);
		if (XC1 && XC1->isExactlyValue(-1.0))
		return DAG.getNode(PreferredFusedOpcode, SL, VT, X.getOperand(0), Y,
		DAG.getNode(ISD::FNEG, SL, VT, Y));
		}
		return SDValue();
		};

		if (SDValue FMA = FuseFADD(N0, N1))
		return FMA;
		if (SDValue FMA = FuseFADD(N1, N0))
		return FMA;

		// fold (fmul (fsub +1.0, x), y) -> (fma (fneg x), y, y)
		// fold (fmul (fsub -1.0, x), y) -> (fma (fneg x), y, (fneg y))
		// fold (fmul (fsub x, +1.0), y) -> (fma x, y, (fneg y))
		// fold (fmul (fsub x, -1.0), y) -> (fma x, y, y)
		auto FuseFSUB = [&](SDValue X, SDValue Y) {
		if (X.getOpcode() == ISD::FSUB && (Aggressive || X->hasOneUse())) {
		auto XC0 = isConstOrConstSplatFP(X.getOperand(0));
		if (XC0 && XC0->isExactlyValue(+1.0))
		return DAG.getNode(PreferredFusedOpcode, SL, VT,
		DAG.getNode(ISD::FNEG, SL, VT, X.getOperand(1)), Y,
		Y);
		if (XC0 && XC0->isExactlyValue(-1.0))
		return DAG.getNode(PreferredFusedOpcode, SL, VT,
		DAG.getNode(ISD::FNEG, SL, VT, X.getOperand(1)), Y,
		DAG.getNode(ISD::FNEG, SL, VT, Y));

		auto XC1 = isConstOrConstSplatFP(X.getOperand(1));
		if (XC1 && XC1->isExactlyValue(+1.0))
		return DAG.getNode(PreferredFusedOpcode, SL, VT, X.getOperand(0), Y,
		DAG.getNode(ISD::FNEG, SL, VT, Y));
		if (XC1 && XC1->isExactlyValue(-1.0))
		return DAG.getNode(PreferredFusedOpcode, SL, VT, X.getOperand(0), Y, Y);
		}
		return SDValue();
		};

		if (SDValue FMA = FuseFSUB(N0, N1))
		return FMA;
		if (SDValue FMA = FuseFSUB(N1, N0))
		return FMA;

		return SDValue();
		}

SDValue DAGCombiner::visitFADD(SDNode *N) {		SDValue DAGCombiner::visitFADD(SDNode *N) {
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);		ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);
ConstantFPSDNode *N1CFP = dyn_cast<ConstantFPSDNode>(N1);		ConstantFPSDNode *N1CFP = dyn_cast<ConstantFPSDNode>(N1);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
SDLoc DL(N);		SDLoc DL(N);
const TargetOptions &Options = DAG.getTarget().Options;		const TargetOptions &Options = DAG.getTarget().Options;
[291 lines not shown; context: if (char RHSNeg = isNegatibleForFree(N1, LegalOperations, TLI, &Options)) {]
if (LHSNeg == 2 || RHSNeg == 2)		if (LHSNeg == 2 || RHSNeg == 2)
return DAG.getNode(ISD::FMUL, DL, VT,		return DAG.getNode(ISD::FMUL, DL, VT,
GetNegatedExpression(N0, DAG, LegalOperations),		GetNegatedExpression(N0, DAG, LegalOperations),
GetNegatedExpression(N1, DAG, LegalOperations),		GetNegatedExpression(N1, DAG, LegalOperations),
Flags);		Flags);
}		}
}		}

		// FMUL -> FMA combines:
		if (SDValue Fused = visitFMULForFMACombine(N)) {
		AddToWorklist(Fused.getNode());
		return Fused;
		}

return SDValue();		return SDValue();
}		}

SDValue DAGCombiner::visitFMA(SDNode *N) {		SDValue DAGCombiner::visitFMA(SDNode *N) {
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
SDValue N2 = N->getOperand(2);		SDValue N2 = N->getOperand(2);
ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);		ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);
[6,216 lines not shown]

llvm/trunk/test/CodeGen/AMDGPU/fma-combine.ll

[358 lines not shown; context: define void @aggressive_combine_to_fma_fsub_1_f64(double addrspace(1)* noalias %out, double addrspace(1)* noalias %in) #1 {]
%tmp0 = fmul double %u, %v		%tmp0 = fmul double %u, %v
%tmp1 = call double @llvm.fma.f64(double %y, double %z, double %tmp0) #0		%tmp1 = call double @llvm.fma.f64(double %y, double %z, double %tmp0) #0
%tmp2 = fsub double %x, %tmp1		%tmp2 = fsub double %x, %tmp1

store double %tmp2, double addrspace(1)* %gep.out		store double %tmp2, double addrspace(1)* %gep.out
ret void		ret void
}		}

		;
		; Patterns (+ fneg variants): mul(add(1.0,x),y), mul(sub(1.0,x),y), mul(sub(x,1.0),y)
		;

		; FUNC-LABEL: {{^}}test_f32_mul_add_x_one_y:
		; SI: v_mac_f32_e32 [[VY:v[0-9]]], [[VY:v[0-9]]], [[VX:v[0-9]]]
		define void @test_f32_mul_add_x_one_y(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%a = fadd float %x, 1.0
		%m = fmul float %a, %y
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_y_add_x_one:
		; SI: v_mac_f32_e32 [[VY:v[0-9]]], [[VY:v[0-9]]], [[VX:v[0-9]]]
		define void @test_f32_mul_y_add_x_one(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%a = fadd float %x, 1.0
		%m = fmul float %y, %a
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_add_x_negone_y:
		; SI: v_mad_f32 [[VX:v[0-9]]], [[VX]], [[VY:v[0-9]]], -[[VY]]
		define void @test_f32_mul_add_x_negone_y(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%a = fadd float %x, -1.0
		%m = fmul float %a, %y
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_y_add_x_negone:
		; SI: v_mad_f32 [[VX:v[0-9]]], [[VX]], [[VY:v[0-9]]], -[[VY]]
		define void @test_f32_mul_y_add_x_negone(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%a = fadd float %x, -1.0
		%m = fmul float %y, %a
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_sub_one_x_y:
		; SI: v_mad_f32 [[VX:v[0-9]]], -[[VX]], [[VY:v[0-9]]], [[VY]]
		define void @test_f32_mul_sub_one_x_y(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%s = fsub float 1.0, %x
		%m = fmul float %s, %y
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_y_sub_one_x:
		; SI: v_mad_f32 [[VX:v[0-9]]], -[[VX]], [[VY:v[0-9]]], [[VY]]
		define void @test_f32_mul_y_sub_one_x(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%s = fsub float 1.0, %x
		%m = fmul float %y, %s
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_sub_negone_x_y:
		; SI: v_mad_f32 [[VX:v[0-9]]], -[[VX]], [[VY:v[0-9]]], -[[VY]]
		define void @test_f32_mul_sub_negone_x_y(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%s = fsub float -1.0, %x
		%m = fmul float %s, %y
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_y_sub_negone_x:
		; SI: v_mad_f32 [[VX:v[0-9]]], -[[VX]], [[VY:v[0-9]]], -[[VY]]
		define void @test_f32_mul_y_sub_negone_x(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%s = fsub float -1.0, %x
		%m = fmul float %y, %s
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_sub_x_one_y:
		; SI: v_mad_f32 [[VX:v[0-9]]], [[VX]], [[VY:v[0-9]]], -[[VY]]
		define void @test_f32_mul_sub_x_one_y(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%s = fsub float %x, 1.0
		%m = fmul float %s, %y
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_y_sub_x_one:
		; SI: v_mad_f32 [[VX:v[0-9]]], [[VX]], [[VY:v[0-9]]], -[[VY]]
		define void @test_f32_mul_y_sub_x_one(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%s = fsub float %x, 1.0
		%m = fmul float %y, %s
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_sub_x_negone_y:
		; SI: v_mac_f32_e32 [[VY:v[0-9]]], [[VY]], [[VX:v[0-9]]]
		define void @test_f32_mul_sub_x_negone_y(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%s = fsub float %x, -1.0
		%m = fmul float %s, %y
		store float %m, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f32_mul_y_sub_x_negone:
		; SI: v_mac_f32_e32 [[VY:v[0-9]]], [[VY]], [[VX:v[0-9]]]
		define void @test_f32_mul_y_sub_x_negone(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%s = fsub float %x, -1.0
		%m = fmul float %y, %s
		store float %m, float addrspace(1)* %out
		ret void
		}

		;
		; Interpolation Patterns: add(mul(x,t),mul(sub(1.0,t),y))
		;

		; FUNC-LABEL: {{^}}test_f32_interp:
		; SI: v_mad_f32 [[VR:v[0-9]]], -[[VT:v[0-9]]], [[VY:v[0-9]]], [[VY]]
		; SI: v_mac_f32_e32 [[VR]], [[VT]], [[VX:v[0-9]]]
		define void @test_f32_interp(float addrspace(1)* %out,
		float addrspace(1)* %in1,
		float addrspace(1)* %in2,
		float addrspace(1)* %in3) {
		%x = load float, float addrspace(1)* %in1
		%y = load float, float addrspace(1)* %in2
		%t = load float, float addrspace(1)* %in3
		%t1 = fsub float 1.0, %t
		%tx = fmul float %x, %t
		%ty = fmul float %y, %t1
		%r = fadd float %tx, %ty
		store float %r, float addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}test_f64_interp:
		; SI: v_fma_f64 [[VR:v\[[0-9]+:[0-9]+\]]], -[[VT:v\[[0-9]+:[0-9]+\]]], [[VY:v\[[0-9]+:[0-9]+\]]], [[VY]]
		; SI: v_fma_f64 [[VR:v\[[0-9]+:[0-9]+\]]], [[VX:v\[[0-9]+:[0-9]+\]]], [[VT]], [[VR]]
		define void @test_f64_interp(double addrspace(1)* %out,
		double addrspace(1)* %in1,
		double addrspace(1)* %in2,
		double addrspace(1)* %in3) {
		%x = load double, double addrspace(1)* %in1
		%y = load double, double addrspace(1)* %in2
		%t = load double, double addrspace(1)* %in3
		%t1 = fsub double 1.0, %t
		%tx = fmul double %x, %t
		%ty = fmul double %y, %t1
		%r = fadd double %tx, %ty
		store double %r, double addrspace(1)* %out
		ret void
		}

attributes #0 = { nounwind readnone }		attributes #0 = { nounwind readnone }
attributes #1 = { nounwind }		attributes #1 = { nounwind }

llvm/trunk/test/CodeGen/AMDGPU/llvm.amdgpu.lrp.ll

	; RUN: llc -march=amdgcn -mcpu=SI -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s			; RUN: llc -march=amdgcn -mcpu=SI -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s

	declare float @llvm.AMDGPU.lrp(float, float, float) nounwind readnone			declare float @llvm.AMDGPU.lrp(float, float, float) nounwind readnone

	; FUNC-LABEL: {{^}}test_lrp:			; FUNC-LABEL: {{^}}test_lrp:
	; SI: v_sub_f32			; SI: v_mad_f32
	; SI: v_mac_f32_e32			; SI: v_mac_f32_e32
	define void @test_lrp(float addrspace(1)* %out, float %src0, float %src1, float %src2) nounwind {			define void @test_lrp(float addrspace(1)* %out, float %src0, float %src1, float %src2) nounwind {
	%mad = call float @llvm.AMDGPU.lrp(float %src0, float %src1, float %src2) nounwind readnone			%mad = call float @llvm.AMDGPU.lrp(float %src0, float %src1, float %src2) nounwind readnone
	store float %mad, float addrspace(1)* %out, align 4			store float %mad, float addrspace(1)* %out, align 4
	ret void			ret void
	}			}

llvm/trunk/test/CodeGen/X86/fma_patterns.ll

	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma -fp-contract=fast | FileCheck %s			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma -fp-contract=fast | FileCheck %s
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma4,+fma -fp-contract=fast | FileCheck %s			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma4,+fma -fp-contract=fast | FileCheck %s
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma4 -fp-contract=fast | FileCheck %s --check-prefix=CHECK_FMA4			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma4 -fp-contract=fast | FileCheck %s --check-prefix=CHECK_FMA4

				;
				; Patterns (+ fneg variants): add(mul(x,y),z), sub(mul(x,y),z)
				;

	define <4 x float> @test_x86_fmadd_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) {			define <4 x float> @test_x86_fmadd_ps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2) {
	; CHECK-LABEL: test_x86_fmadd_ps:			; CHECK-LABEL: test_x86_fmadd_ps:
	; CHECK: # BB#0:			; CHECK: # BB#0:
	; CHECK-NEXT: vfmadd213ps %xmm2, %xmm1, %xmm0			; CHECK-NEXT: vfmadd213ps %xmm2, %xmm1, %xmm0
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	;			;
	; CHECK_FMA4-LABEL: test_x86_fmadd_ps:			; CHECK_FMA4-LABEL: test_x86_fmadd_ps:
	; CHECK_FMA4: # BB#0:			; CHECK_FMA4: # BB#0:
	[246 lines not shown]
	; CHECK_FMA4-NEXT: vfmsubps %xmm1, (%rdi), %xmm0, %xmm0			; CHECK_FMA4-NEXT: vfmsubps %xmm1, (%rdi), %xmm0, %xmm0
	; CHECK_FMA4-NEXT: retq			; CHECK_FMA4-NEXT: retq
	%x = load <4 x float>, <4 x float>* %a0			%x = load <4 x float>, <4 x float>* %a0
	%y = fmul <4 x float> %x, %a1			%y = fmul <4 x float> %x, %a1
	%res = fsub <4 x float> %y, %a2			%res = fsub <4 x float> %y, %a2
	ret <4 x float> %res			ret <4 x float> %res
	}			}

				;
				; Patterns (+ fneg variants): mul(add(1.0,x),y), mul(sub(1.0,x),y), mul(sub(x,1.0),y)
				;

				define <4 x float> @test_v4f32_mul_add_x_one_y(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_add_x_one_y:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfmadd213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_add_x_one_y:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfmaddps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%a = fadd <4 x float> %x, <float 1.0, float 1.0, float 1.0, float 1.0>
				%m = fmul <4 x float> %a, %y
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_y_add_x_one(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_y_add_x_one:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfmadd213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_y_add_x_one:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfmaddps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%a = fadd <4 x float> %x, <float 1.0, float 1.0, float 1.0, float 1.0>
				%m = fmul <4 x float> %y, %a
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_add_x_negone_y(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_add_x_negone_y:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfmsub213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_add_x_negone_y:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfmsubps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%a = fadd <4 x float> %x, <float -1.0, float -1.0, float -1.0, float -1.0>
				%m = fmul <4 x float> %a, %y
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_y_add_x_negone(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_y_add_x_negone:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfmsub213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_y_add_x_negone:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfmsubps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%a = fadd <4 x float> %x, <float -1.0, float -1.0, float -1.0, float -1.0>
				%m = fmul <4 x float> %y, %a
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_sub_one_x_y(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_sub_one_x_y:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmadd213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_sub_one_x_y:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmaddps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%s = fsub <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %x
				%m = fmul <4 x float> %s, %y
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_y_sub_one_x(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_y_sub_one_x:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmadd213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_y_sub_one_x:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmaddps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%s = fsub <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %x
				%m = fmul <4 x float> %y, %s
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_sub_negone_x_y(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_sub_negone_x_y:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmsub213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_sub_negone_x_y:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmsubps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%s = fsub <4 x float> <float -1.0, float -1.0, float -1.0, float -1.0>, %x
				%m = fmul <4 x float> %s, %y
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_y_sub_negone_x(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_y_sub_negone_x:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmsub213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_y_sub_negone_x:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmsubps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%s = fsub <4 x float> <float -1.0, float -1.0, float -1.0, float -1.0>, %x
				%m = fmul <4 x float> %y, %s
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_sub_x_one_y(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_sub_x_one_y:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfmsub213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_sub_x_one_y:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfmsubps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%s = fsub <4 x float> %x, <float 1.0, float 1.0, float 1.0, float 1.0>
				%m = fmul <4 x float> %s, %y
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_y_sub_x_one(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_y_sub_x_one:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfmsub213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_y_sub_x_one:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfmsubps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%s = fsub <4 x float> %x, <float 1.0, float 1.0, float 1.0, float 1.0>
				%m = fmul <4 x float> %y, %s
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_sub_x_negone_y(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_sub_x_negone_y:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfmadd213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_sub_x_negone_y:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfmaddps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%s = fsub <4 x float> %x, <float -1.0, float -1.0, float -1.0, float -1.0>
				%m = fmul <4 x float> %s, %y
				ret <4 x float> %m
				}

				define <4 x float> @test_v4f32_mul_y_sub_x_negone(<4 x float> %x, <4 x float> %y) {
				; CHECK-LABEL: test_v4f32_mul_y_sub_x_negone:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfmadd213ps %xmm1, %xmm1, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_mul_y_sub_x_negone:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfmaddps %xmm1, %xmm1, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%s = fsub <4 x float> %x, <float -1.0, float -1.0, float -1.0, float -1.0>
				%m = fmul <4 x float> %y, %s
				ret <4 x float> %m
				}

				;
				; Interpolation Patterns: add(mul(x,t),mul(sub(1.0,t),y))
				;

				define float @test_f32_interp(float %x, float %y, float %t) {
				; CHECK-LABEL: test_f32_interp:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmadd213ss %xmm1, %xmm2, %xmm1
				; CHECK-NEXT: vfmadd213ss %xmm1, %xmm2, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_f32_interp:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmaddss %xmm1, %xmm1, %xmm2, %xmm1
				; CHECK_FMA4-NEXT: vfmaddss %xmm1, %xmm2, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%t1 = fsub float 1.0, %t
				%tx = fmul float %x, %t
				%ty = fmul float %y, %t1
				%r = fadd float %tx, %ty
				ret float %r
				}

				define <4 x float> @test_v4f32_interp(<4 x float> %x, <4 x float> %y, <4 x float> %t) {
				; CHECK-LABEL: test_v4f32_interp:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmadd213ps %xmm1, %xmm2, %xmm1
				; CHECK-NEXT: vfmadd213ps %xmm1, %xmm2, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f32_interp:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmaddps %xmm1, %xmm1, %xmm2, %xmm1
				; CHECK_FMA4-NEXT: vfmaddps %xmm1, %xmm2, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%t1 = fsub <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %t
				%tx = fmul <4 x float> %x, %t
				%ty = fmul <4 x float> %y, %t1
				%r = fadd <4 x float> %tx, %ty
				ret <4 x float> %r
				}

				define <8 x float> @test_v8f32_interp(<8 x float> %x, <8 x float> %y, <8 x float> %t) {
				; CHECK-LABEL: test_v8f32_interp:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmadd213ps %ymm1, %ymm2, %ymm1
				; CHECK-NEXT: vfmadd213ps %ymm1, %ymm2, %ymm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v8f32_interp:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmaddps %ymm1, %ymm1, %ymm2, %ymm1
				; CHECK_FMA4-NEXT: vfmaddps %ymm1, %ymm2, %ymm0, %ymm0
				; CHECK_FMA4-NEXT: retq
				%t1 = fsub <8 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %t
				%tx = fmul <8 x float> %x, %t
				%ty = fmul <8 x float> %y, %t1
				%r = fadd <8 x float> %tx, %ty
				ret <8 x float> %r
				}

				define double @test_f64_interp(double %x, double %y, double %t) {
				; CHECK-LABEL: test_f64_interp:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmadd213sd %xmm1, %xmm2, %xmm1
				; CHECK-NEXT: vfmadd213sd %xmm1, %xmm2, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_f64_interp:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmaddsd %xmm1, %xmm1, %xmm2, %xmm1
				; CHECK_FMA4-NEXT: vfmaddsd %xmm1, %xmm2, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%t1 = fsub double 1.0, %t
				%tx = fmul double %x, %t
				%ty = fmul double %y, %t1
				%r = fadd double %tx, %ty
				ret double %r
				}

				define <2 x double> @test_v2f64_interp(<2 x double> %x, <2 x double> %y, <2 x double> %t) {
				; CHECK-LABEL: test_v2f64_interp:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmadd213pd %xmm1, %xmm2, %xmm1
				; CHECK-NEXT: vfmadd213pd %xmm1, %xmm2, %xmm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v2f64_interp:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmaddpd %xmm1, %xmm1, %xmm2, %xmm1
				; CHECK_FMA4-NEXT: vfmaddpd %xmm1, %xmm2, %xmm0, %xmm0
				; CHECK_FMA4-NEXT: retq
				%t1 = fsub <2 x double> <double 1.0, double 1.0>, %t
				%tx = fmul <2 x double> %x, %t
				%ty = fmul <2 x double> %y, %t1
				%r = fadd <2 x double> %tx, %ty
				ret <2 x double> %r
				}

				define <4 x double> @test_v4f64_interp(<4 x double> %x, <4 x double> %y, <4 x double> %t) {
				; CHECK-LABEL: test_v4f64_interp:
				; CHECK: # BB#0:
				; CHECK-NEXT: vfnmadd213pd %ymm1, %ymm2, %ymm1
				; CHECK-NEXT: vfmadd213pd %ymm1, %ymm2, %ymm0
				; CHECK-NEXT: retq
				;
				; CHECK_FMA4-LABEL: test_v4f64_interp:
				; CHECK_FMA4: # BB#0:
				; CHECK_FMA4-NEXT: vfnmaddpd %ymm1, %ymm1, %ymm2, %ymm1
				; CHECK_FMA4-NEXT: vfmaddpd %ymm1, %ymm2, %ymm0, %ymm0
				; CHECK_FMA4-NEXT: retq
				%t1 = fsub <4 x double> <double 1.0, double 1.0, double 1.0, double 1.0>, %t
				%tx = fmul <4 x double> %x, %t
				%ty = fmul <4 x double> %y, %t1
				%r = fadd <4 x double> %tx, %ty
				ret <4 x double> %r
				}