
[DAGCombiner] allow more folding of fadd + fmul into fma
ClosedPublic

Authored by spatel on May 29 2020, 7:41 AM.

Details

Summary

If fmul and fadd are separated by an fma, we can fold them together to save an instruction:
fadd (fma A, B, (fmul C, D)), N1 --> fma(A, B, fma(C, D, N1))

The fold implemented here is actually a specialization - we should be able to peek through >1 fma to find this pattern. That's another patch if we want to try that enhancement though.

This transform was guarded by the TLI hook for enableAggressiveFMAFusion(), so it was done for some in-tree targets like PowerPC, but not AArch64 or x86. The hook is protecting against forming a potentially more expensive computation when fma takes longer to execute than a single fadd. That hook may be needed for other transforms, but in this case, we are replacing fmul+fadd with fma, and the fma should never take longer than the 2 individual instructions.

'contract' FMF is all we need to allow this transform. That flag corresponds to -ffp-contract=fast in Clang, so we are allowed to form fma ops freely across expressions.


Event Timeline

spatel created this revision. May 29 2020, 7:41 AM

Wouldn't it be better to choose between what you have here fmadd(a,b,fma(c,d,n)) and a*b + fmadd(c,d,n) for targets that perform worse with FMA chains?

> Wouldn't it be better to choose between what you have here fmadd(a,b,fma(c,d,n)) and a*b + fmadd(c,d,n) for targets that perform worse with FMA chains?

Not sure if I'm understanding the question. Is there a target or a code pattern with a known disadvantage for the 2 fma variant?
Note that the entire set of transforms in visitFADDForFMACombine() is gated on legality checks for FMA(D) nodes, so we are assuming that these ops are supported if we reach this point in the code. There's also an existing target bailout with the generateFMAsInMachineCombiner() hook.

Not sure if I'm understanding the question. Is there a target or a code pattern with a known disadvantage for the 2 fma variant?

I mostly agree that what you're proposing is an improvement over what's currently there, but my question is: if you're going to change it, is it really better to do what you're suggesting compared to creating instructions that can be executed in parallel? There are certainly performance differences between fma(a,b,fma(c,d,n)) and a*b + fma(c,d,n) depending on the target.

>> Not sure if I'm understanding the question. Is there a target or a code pattern with a known disadvantage for the 2 fma variant?

> I mostly agree that what you're proposing is an improvement over what's currently there, but my question is: if you're going to change it, is it really better to do what you're suggesting compared to creating instructions that can be executed in parallel? There are certainly performance differences between fma(a,b,fma(c,d,n)) and a*b + fma(c,d,n) depending on the target.

Ah - I missed that you are rearranging the operands to make the fmul independent from the fma. I agree that could be better depending on the target and surrounding code.

But we need some detailed target model info to make that transform *over* this one. We have that info in MachineCombiner, and I think the scenario you are suggesting is at least partly why the TLI hook for that already exists. It's also possible to expand this back out to separate fmul/fadd in that pass.

Given the constraints in SDAG, we should choose the (fma(fma)) variant by default (assuming as we do here that the target has fma instructions). For example on x86, our best perf heuristic at this stage of compilation on any recent Intel or AMD core is number of uops. The option with separate fmul and fadd always has more uops, so it would be backwards to choose that sequence here and then try to undo that later.

If this is the wrong choice for some other target, they can still opt out, so I think this is a safe option. Alternatively, we could make the enableAggressiveFMAFusion() hook more nuanced by changing the bool return to an enum'd level of aggression, and then deciding which of the current transforms under here require a higher level of fma perf.

> Given the constraints in SDAG, we should choose the (fma(fma)) variant by default (assuming as we do here that the target has fma instructions). For example on x86, our best perf heuristic at this stage of compilation on any recent Intel or AMD core is number of uops. The option with separate fmul and fadd always has more uops, so it would be backwards to choose that sequence here and then try to undo that later.

Broadwell might be an interesting X86 target here. MUL and ADD both have 3 cycle latency and FMA is 5 cycle latency. Haswell is ADD 3, MUL/FMA 5. Everything is uniform on SKL at 4 cycles.

> Given the constraints in SDAG, we should choose the (fma(fma)) variant by default (assuming as we do here that the target has fma instructions). For example on x86, our best perf heuristic at this stage of compilation on any recent Intel or AMD core is number of uops. The option with separate fmul and fadd always has more uops, so it would be backwards to choose that sequence here and then try to undo that later.

I agree SDAG is not ideal -- I explored doing this earlier in opt and also later using MIs, but both places have their own problems. It's a surprisingly cumbersome optimization if you are concerned about multiple targets which have different sets of FMA "flavours".

> If this is the wrong choice for some other target, they can still opt out, so I think this is a safe option. Alternatively, we could make the enableAggressiveFMAFusion() hook more nuanced by changing the bool return to an enum'd level of aggression, and then deciding which of the current transforms under here require a higher level of fma perf.

Yes, that would be a nice feature to explore at some point.

> I agree SDAG is not ideal -- I explored doing this earlier in opt and also later using MIs, but both places have their own problems. It's a surprisingly cumbersome optimization if you are concerned about multiple targets which have different sets of FMA "flavours".

This discussion reminded me of the examples here:
https://reviews.llvm.org/D18751#402906
(and that's where the MachineCombiner hook was added)

So we really can't win all cases - even on the same target - without seeing the entire loop. I still view this patch as an instruction/uop win, so it's the right default choice.

> Broadwell might be an interesting X86 target here. MUL and ADD both have 3 cycle latency and FMA is 5 cycle latency. Haswell is ADD 3, MUL/FMA 5. Everything is uniform on SKL at 4 cycles.

Thanks - I overlooked Broadwell. From what I see in Agner's tables, Broadwell is then tied with Ryzen for worst relative FMA implementation (3/3/5 for single-precision).
Do you think this is worth trying as-is for x86, or do we need to work harder to undo FMA first? @RKSimon - any thoughts about AMD CPUs?

On ARM CPUs, there's a special forwarding path to reduce the latency of chains of fma instructions, so this seems fine there even if the latency is relevant.

spatel added a comment. Jun 1 2020, 5:05 AM

> On ARM CPUs, there's a special forwarding path to reduce the latency of chains of fma instructions, so this seems fine there even if the latency is relevant.

Thanks for that info.

Just to reiterate: the transform that this patch is making can't hurt latency unless FMA has a cost that is greater than the sum of individual FMUL + FADD. If that slow of an FMA exists somewhere, I'd be interested in learning about the design. :)

lebedev.ri accepted this revision. Jun 1 2020, 6:01 AM
lebedev.ri added a subscriber: lebedev.ri.

Seems reasonable to me, but please wait for someone else.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11830

Comma missing.

I understand that it is preexisting style here, but I find that naming the operands this way hurts readability more than it helps.
I'd think

// fold (fadd (fma N00, N01, (fmul N020, N021)), N1) -> (fma N00, N01, (fma N020, N021, N1))

is more readable given the code.

11831

We only care whether the root node N allows producing FMA; we don't care about the leaves?

11843

Comma missing.

// fold (fadd N0, (fma N10, N11, (fmul N120, N121)) -> (fma N10, N11, (fma N120, N121, N0))
This revision is now accepted and ready to land. Jun 1 2020, 6:01 AM
spatel marked an inline comment as done. Jun 1 2020, 6:42 AM
spatel added subscribers: shchenz, nemanjai.
spatel added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11831

Yes, this is the usual FMF constraint: if the final/root value has the necessary flags, we assume that intermediate calculations leading to that value can be relaxed using those flags as well. And that's carried over in the propagation of FMF when creating the new nodes as well. (cc @nemanjai @shchenz @steven.zhang who may be looking at related changes).

I think we have seen some cases where that might be too liberal of an interpretation of FMF, but as you noticed, I didn't change any of the code here (including typos) because I only wanted to adjust the aggressive TLI constraint in this patch.

In practice, all of the FP nodes will have the same flags unless we have mixed FP-optimized compilation with LTO + inlining or something similar to that.

spatel updated this revision to Diff 268940. Jun 5 2020, 2:20 PM

Patch updated:
Rebased on top of clean-up of the existing code.

spatel added a comment. Jun 7 2020, 5:26 AM

If there are no other concerns/objections, can someone provide a second LGTM/accept for this patch?

Here's an llvm-mca demo to simulate throughput/latency:
https://godbolt.org/z/pWAxcW

RKSimon accepted this revision. Jun 8 2020, 12:13 PM

LGTM - btw if we do end up with targets that struggle with this, it should be possible to tweak isFMAFasterThanFMulAndFAdd/enableAggressiveFMAFusion to help us to account for it.

This revision was automatically updated to reflect the committed changes.

Sorry for the late question, but I don't understand why this kind of folding is not considered reassociation. I thought reassociation was not allowed even when -ffp-contract=fast.

> Sorry for the late question, but I don't understand why this kind of folding is not considered reassociation. I thought reassociation was not allowed even when -ffp-contract=fast.

General reassociation of FP ops is not allowed without "reassoc" fast-math-flags or the legacy global setting - see isContractableFMUL().

But this transform is a special-case. We are only pulling the trailing addition in with an existing multiply. That's the kind of transform -ffp-contract=fast ("contract" FMF) is intended to enable. So with that setting, the compiler can choose whether we end up with fma(A, B, fma(C, D, E)) or fma(C, D, fma(A, B, E)). AFAIK, LLVM matches gcc behavior here.

I think "contract" alone is also enough to split an fma back into fmul + fadd, as was suggested earlier in the review, but I'm not sure if there's precedent for that.

> We are only pulling the trailing addition in with an existing multiply.

The problem here is that it's the "wrong" multiply: you have, essentially (A*B+D*E)+F, and you're turning it into A*B+(D*E+F). I don't think contraction is supposed to cover that.

>> We are only pulling the trailing addition in with an existing multiply.

> The problem here is that it's the "wrong" multiply: you have, essentially (A*B+D*E)+F, and you're turning it into A*B+(D*E+F). I don't think contraction is supposed to cover that.

That is exactly my concern; I was under the impression that contraction only allowed fusing (A*B+C) and nothing else.

spatel added a reviewer: Restricted Project. Jun 19 2020, 4:49 AM

>>> We are only pulling the trailing addition in with an existing multiply.

>> The problem here is that it's the "wrong" multiply: you have, essentially (A*B+D*E)+F, and you're turning it into A*B+(D*E+F). I don't think contraction is supposed to cover that.

> That is exactly my concern; I was under the impression that contraction only allowed fusing (A*B+C) and nothing else.

Here's a playground to check various compilers/orderings:
https://godbolt.org/z/3nV8Mx

So I was wrong - gcc trunk is not forming 2 fmas on "a*b + c*d + e". Neither is icc (if I specified the optimization options correctly).
AFAIK, there's no formal or even loose specification for "-ffp-contract=fast", so we probably want to follow gcc's lead here? I.e., this transform should require "reassoc"; "contract" is not enough.

Note that this patch did not actually change the transform behavior - it only enabled the existing transform on more targets. So if we require looser FP settings to do the transform, it may regress targets like PowerPC.

Proposal to require stronger FMF on this transform:
D82499