This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] allow more folding of fadd + fmul into fma
ClosedPublic

Authored by spatel on May 29 2020, 7:41 AM.

Details

Summary

If fmul and fadd are separated by an fma, we can fold them together to save an instruction:
fadd(fma(A, B, fmul(C, D)), N1) --> fma(A, B, fma(C, D, N1))
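
For concreteness, a minimal sketch of a source expression that produces this DAG shape (the function and names are illustrative, not from the patch): with -ffp-contract=fast, A*B + C*D is first contracted to fma(A, B, fmul(C, D)), and the trailing add is the fadd above.

float f(float A, float B, float C, float D, float N1) {
  // before the fold: fmul + fma + fadd (3 instructions)
  // after the fold:  fma(A, B, fma(C, D, N1)) (2 instructions)
  return (A * B + C * D) + N1;
}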

The fold implemented here is actually a specialization - we should be able to peek through >1 fma to find this pattern. That's another patch if we want to try that enhancement though.

This transform was guarded by the TLI hook for enableAggressiveFMAFusion(), so it was done for some in-tree targets like PowerPC, but not AArch64 or x86. The hook is protecting against forming a potentially more expensive computation when fma takes longer to execute than a single fadd. That hook may be needed for other transforms, but in this case, we are replacing fmul+fadd with fma, and the fma should never take longer than the 2 individual instructions.
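
A rough sketch of the relaxed gating in visitFADDForFMACombine() (the TLI/TargetOptions names are real LLVM APIs, but the structure is simplified, not the literal diff):

bool AllowFusionGlobally = Options.AllowFPOpFusion == FPOpFusion::Fast;
bool CanFuse = AllowFusionGlobally || N->getFlags().hasAllowContract();
// Previously this particular fold also required
// TLI.enableAggressiveFMAFusion(VT) to return true; now CanFuse alone
// is enough to form fma(A, B, fma(C, D, N1)).
if (CanFuse) {
  // ... match fadd(fma(A, B, fmul(C, D)), N1) and emit the nested fma ...
}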

'contract' FMF is all we need to allow this transform. That flag corresponds to -ffp-contract=fast in Clang, so we are allowed to form fma ops freely across expressions.

Diff Detail

Event Timeline

spatel created this revision. May 29 2020, 7:41 AM

Wouldn't it be better to choose between what you have here fmadd(a,b,fma(c,d,n)) and a*b + fmadd(c,d,n) for targets that perform worse with FMA chains?

Not sure if I'm understanding the question. Is there a target or a code pattern with a known disadvantage for the 2 fma variant?
Note that the entire set of transforms in visitFADDForFMACombine() is gated on legality checks for FMA(D) nodes, so we are assuming that these ops are supported if we reach this point in the code. There's also an existing target bailout with the generateFMAsInMachineCombiner() hook.

> Not sure if I'm understanding the question. Is there a target or a code pattern with a known disadvantage for the 2 fma variant?

I mostly agree that what you're proposing is an improvement over what's currently there, but my question is: if you're going to change it, is it really better to do what you're suggesting compared to creating instructions that can be executed in parallel? There are certainly performance differences between fma(a,b,fma(c,d,n)) and a*b + fma(c,d,n) depending on the target.
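
For concreteness, the two shapes being compared, as plain C++ (illustrative only; actual codegen depends on the target and contraction settings; std::fma is from <cmath>):

#include <cmath>
// Serial chain: the outer fma must wait for the inner one. 2 instructions.
float chained(float a, float b, float c, float d, float n) {
  return std::fma(a, b, std::fma(c, d, n));
}
// The fmul can execute in parallel with the fma, at the cost of an extra
// instruction (fmul + fma + fadd = 3).
float split(float a, float b, float c, float d, float n) {
  return a * b + std::fma(c, d, n);
}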

Ah - I missed that you are rearranging the operands to make the fmul independent from the fma. I agree that could be better depending on the target and surrounding code.

But we need some detailed target model info to make that transform *over* this one. We have that info in MachineCombiner, and I think the scenario you are suggesting is at least partly why the TLI hook for that already exists. It's also possible to expand this back out to separate fmul/fadd in that pass.

Given the constraints in SDAG, we should choose the (fma(fma)) variant by default (assuming as we do here that the target has fma instructions). For example on x86, our best perf heuristic at this stage of compilation on any recent Intel or AMD core is number of uops. The option with separate fmul and fadd always has more uops, so it would be backwards to choose that sequence here and then try to undo that later.

If this is the wrong choice for some other target, they can still opt out, so I think this is a safe option. Alternatively, we could make the enableAggressiveFMAFusion() hook more nuanced by changing the bool return to an enum'd level of aggression, and then deciding which of the current transforms under here require a higher level of fma perf.

Broadwell might be an interesting X86 target here. MUL and ADD both have 3 cycle latency and FMA is 5 cycle latency. Haswell is ADD 3, MUL/FMA 5. Everything is uniform on SKL at 4 cycles.
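
Worked critical-path arithmetic for those Broadwell numbers (a sketch, assuming all inputs are ready at cycle 0):

constexpr int FMUL = 3, FADD = 3, FMA = 5;  // Broadwell latencies from above
constexpr int Chained = FMA + FMA;  // fma(a,b,fma(c,d,n)): 10 cycles
constexpr int Split = FMA + FADD;   // a*b + fma(c,d,n): the fmul hides under
                                    // the fma, then the fadd: 8 cycles
static_assert(Chained == 10 && Split == 8, "BDW latency favors the split shape");
// On Skylake (uniform 4-cycle units) both shapes come to 8 cycles.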

> Given the constraints in SDAG, we should choose the (fma(fma)) variant by default (assuming as we do here that the target has fma instructions). For example on x86, our best perf heuristic at this stage of compilation on any recent Intel or AMD core is number of uops. The option with separate fmul and fadd always has more uops, so it would be backwards to choose that sequence here and then try to undo that later.

I agree SDAG is not ideal -- I explored doing this earlier in opt and also later using MIs, but both places have their own problems. It's a surprisingly cumbersome optimization if you are concerned about multiple targets which have different sets of FMA "flavours".

> If this is the wrong choice for some other target, they can still opt out, so I think this is a safe option. Alternatively, we could make the enableAggressiveFMAFusion() hook more nuanced by changing the bool return to an enum'd level of aggression, and then deciding which of the current transforms under here require a higher level of fma perf.

Yes, that would be a nice feature to explore at some point.

> I agree SDAG is not ideal -- I explored doing this earlier in opt and also later using MIs, but both places have their own problems. It's a surprisingly cumbersome optimization if you are concerned about multiple targets which have different sets of FMA "flavours".

This discussion reminded me of the examples here:
https://reviews.llvm.org/D18751#402906
(and that's where the MachineCombiner hook was added)

So we really can't win all cases - even on the same target - without seeing the entire loop. I still view this patch as an instruction/uop win, so it's the right default choice.

> Broadwell might be an interesting X86 target here. MUL and ADD both have 3 cycle latency and FMA is 5 cycle latency. Haswell is ADD 3, MUL/FMA 5. Everything is uniform on SKL at 4 cycles.

Thanks - I overlooked Broadwell. From what I see in Agner's tables, Broadwell is then tied with Ryzen for worst relative FMA implementation (3/3/5 for single-precision).
Do you think this is worth trying as-is for x86, or do we need to work harder to undo FMA first? @RKSimon - any thoughts about AMD CPUs?

On ARM CPUs, there's a special forwarding path to reduce the latency of chains of fma instructions, so this seems fine there even if the latency is relevant.

spatel added a comment. Jun 1 2020, 5:05 AM

> On ARM CPUs, there's a special forwarding path to reduce the latency of chains of fma instructions, so this seems fine there even if the latency is relevant.

Thanks for that info.

Just to reiterate: the transform that this patch is making can't hurt latency unless FMA has a cost that is greater than the sum of individual FMUL + FADD. If that slow of an FMA exists somewhere, I'd be interested in learning about the design. :)
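
A sanity check of that claim against Broadwell's worst-case latencies from the discussion above (sketch only):

constexpr int FMUL = 3, FADD = 3, FMA = 5;
// before: fadd(fma(A, B, fmul(C, D)), N1) -- path: fmul -> fma -> fadd
constexpr int Before = FMUL + FMA + FADD;  // 11 cycles
// after:  fma(A, B, fma(C, D, N1))        -- path: fma -> fma
constexpr int After = FMA + FMA;           // 10 cycles
static_assert(After <= Before, "holds whenever FMA <= FMUL + FADD");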

lebedev.ri accepted this revision. Jun 1 2020, 6:01 AM
lebedev.ri added a subscriber: lebedev.ri.

Seems reasonable to me, but please wait for someone else.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11830

Comma missing.

I understand that it is preexisting style here, but I find that naming the operands this way hurts readability more than it helps.
I'd think

// fold (fadd (fma N00, N01, (fmul N020, N021)), N1) -> (fma N00, N01, (fma N020, N021, N1))

is more readable given the code.

11831

We only care if the root node N allows producing FMA; we don't care about the leaves?

11843

Comma missing.

// fold (fadd N0, (fma N10, N11, (fmul N120, N121))) -> (fma N10, N11, (fma N120, N121, N0))

This revision is now accepted and ready to land. Jun 1 2020, 6:01 AM
spatel marked an inline comment as done. Jun 1 2020, 6:42 AM
spatel added subscribers: shchenz, nemanjai.
spatel added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11831

Yes, this is the usual FMF constraint: if the final/root value has the necessary flags, we assume that intermediate calculations leading to that value can be relaxed using those flags as well. And that's carried over in the propagation of FMF when creating the new nodes as well. (cc @nemanjai @shchenz @steven.zhang who may be looking at related changes).

I think we have seen some cases where that might be too liberal of an interpretation of FMF, but as you noticed, I didn't change any of the code here (including typos) because I only wanted to adjust the aggressive TLI constraint in this patch.

In practice, all of the FP nodes will have the same flags unless we have mixed FP-optimized compilation with LTO + inlining or something similar to that.
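
For readers unfamiliar with the code, a rough sketch of how the root's flags are carried over when the new nodes are built (buildNestedFMA is a hypothetical helper, not in the patch; the getNode overload with SDNodeFlags is the real SelectionDAG API):

SDValue buildNestedFMA(SelectionDAG &DAG, const SDLoc &SL, EVT VT, SDNode *N,
                       SDValue A, SDValue B, SDValue C, SDValue D, SDValue N1) {
  SDNodeFlags Flags = N->getFlags(); // flags of the root fadd
  SDValue InnerFMA = DAG.getNode(ISD::FMA, SL, VT, C, D, N1, Flags);
  return DAG.getNode(ISD::FMA, SL, VT, A, B, InnerFMA, Flags);
}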

spatel updated this revision to Diff 268940. Jun 5 2020, 2:20 PM

Patch updated:
Rebased on top of clean-up of the existing code.

spatel added a comment. Jun 7 2020, 5:26 AM

If there are no other concerns/objections, can someone provide a second LGTM/accept for this patch?

Here's an llvm-mca demo to simulate throughput/latency:
https://godbolt.org/z/pWAxcW

RKSimon accepted this revision. Jun 8 2020, 12:13 PM

LGTM - btw if we do end up with targets that struggle with this, it should be possible to tweak isFMAFasterThanFMulAndFAdd/enableAggressiveFMAFusion to help us account for it.
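
For example, a hypothetical per-target opt-out (sketch only; the TLI hook exists with this signature, but MyTargetLowering and the policy shown are illustrative):

bool MyTargetLowering::enableAggressiveFMAFusion(EVT VT) const {
  // Opt out of fma-chain-forming folds for scalar types where fma
  // latency is a concern, but keep them for vectors.
  return VT.isVector();
}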

This revision was automatically updated to reflect the committed changes.

Sorry for the late question, but I don't understand why this kind of folding is not considered reassociation. I thought reassociation was not allowed even when -ffp-contract=fast.

General reassociation of FP ops is not allowed without "reassoc" fast-math-flags or the legacy global setting - see isContractableFMUL().

But this transform is a special-case. We are only pulling the trailing addition in with an existing multiply. That's the kind of transform -ffp-contract=fast ("contract" FMF) is intended to enable. So with that setting, the compiler can choose whether we end up with fma(A, B, fma(C, D, E)) or fma(C, D, fma(A, B, E)). AFAIK, LLVM matches gcc behavior here.

I think "contract" alone is also enough to split an fma back into fmul + fadd as was suggested earlier in the review, but I'm not sure if there's precedence for that.

> We are only pulling the trailing addition in with an existing multiply.

The problem here is that it's the "wrong" multiply: you have, essentially (A*B+D*E)+F, and you're turning it into A*B+(D*E+F). I don't think contraction is supposed to cover that.

That is exactly my concern; I was under the impression that contraction only allowed fusing (A*B+C) and nothing else.

spatel added a reviewer: Restricted Project. Jun 19 2020, 4:49 AM

> That is exactly my concern; I was under the impression that contraction only allowed fusing (A*B+C) and nothing else.

Here's a playground to check various compilers/orderings:
https://godbolt.org/z/3nV8Mx

So I was wrong - gcc trunk is not forming 2 fmas on "ab + cd + e". Neither is icc (if I specified the optimization options correctly).
AFAIK, there's no formal or even loose specification for "-ffp-contract=fast", so we probably want to follow gcc's lead here? I.e., this transform should require "reassoc"; "contract" is not enough.

Note that this patch did not actually change the transform behavior - it only enabled the existing transform on more targets. So if we require looser FP settings to do the transform, it may regress targets like PowerPC.

Proposal to require stronger FMF on this transform:
D82499