This is an archive of the discontinued LLVM Phabricator instance.

[ISel] Enable generating more fma instructions.
ClosedPublic

Authored by tsymalla on Aug 29 2022, 2:02 AM.

Details

Summary

This patch changes a FADD / FMUL => FMA ISel pattern implemented
in D80801 so that it peeks through more than one FMA.

Diff Detail

Event Timeline

tsymalla created this revision.Aug 29 2022, 2:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2022, 2:02 AM
tsymalla requested review of this revision.Aug 29 2022, 2:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2022, 2:02 AM
foad added inline comments.Aug 30 2022, 4:12 AM
llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll
178

Please precommit the new tests and rebase, so the patch shows the codegen diff.

foad added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14178

This seems to be a more limited case of your new reassociation, so can we remove this code now?

14219

I don't understand the logic here. What do copies and extracts have to do with reassociating fmuls and fadds?

foad added a comment.Aug 30 2022, 4:23 AM

See the comment on D80801: "The fold implemented here is actually a specialization - we should be able to peek through >1 fma to find this pattern. That's another patch if we want to try that enhancement though." That's what you are implementing here.

tsymalla marked an inline comment as done.Aug 30 2022, 5:31 AM
tsymalla updated this revision to Diff 459025.Sep 9 2022, 5:59 AM
tsymalla marked an inline comment as done.

This changes a FADD / FMUL => FMA ISel pattern implemented
in D80801 so that it peeks through more than one FMA. This also
changes the order of the operands, which can help with eliminating
a final COPY.

tsymalla edited the summary of this revision. (Show Details)Sep 9 2022, 5:59 AM

Did not update the lit tests for other targets yet (CodeGen/AArch64/fadd-combines.ll, CodeGen/PowerPC/fma-assoc.ll, CodeGen/PowerPC/machine-combiner.ll, CodeGen/X86/fma_patterns.ll) because I wanted to get some opinion on the change in the operand order first.

arsenm added a subscriber: arsenm.Sep 9 2022, 7:08 AM
arsenm added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14208

Why use UpdateNodeOperands here instead of just constructing the new node normally?

foad added inline comments.Sep 9 2022, 8:00 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14181

Remove "on"?

14182–14184

Looks like clang-format has mangled this comment!

14186–14211

Can you explain this? It is not at all obvious. Are you sure it's not just random? Perhaps it helps your case, but could harm other cases?

tsymalla marked 2 inline comments as done.Sep 14 2022, 8:19 AM
tsymalla added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14186–14211

When keeping the order

fadd (fma A, B, (fmul C, D)), E --> fma A, B, (fma C, D, E),

The compiler will assign the result of the inner FMA to a new (virtual) register which will be used by the outer FMA.
The outer FMA will write its output to the same virtual register which will, in the simple example (fmac_sequence_simple) cause a COPY to the first available register at the end.
Finally, the register allocator will try recoloring the registers.
This output register (in this case, %6) will be marked as recolorizable and assigned to the first available register. As the outer FMA is using %0 and % 1 (a and b), it can only assign to %2, which is in this case v2.
During virtual register rewriting, the compiler will now try to eliminate identity copies. Even if the final copy is regarded as candidate for deletion, it cannot do so because the COPY is used by the final SI_RETURN. So, this generates a superfluous V_MOV at the end.

By changing the operand order, we get following changes:
The multiplicand operand order of the innermost and the outermost FMA is swapped. So, innermost uses %0 and %1 and outermost uses %2 and %3. So, the compiler is able to assign %6 (output of outermost FMA) to $vgpr0 because the register is not used inside the outermost FMA (only in the innermost FMA).
By this, the final COPY can be eliminated because it's essentially a identity copy, removing the final V_MOV.

Changing the instruction order in the DAG basically frees up the desired output register for the register allocator.

I cannot assume this causes harm for other cases, but from fast-math point of view it should not cause any issues. Looking at the (currently failing) test cases, I don't see any actual issues, but please correct me if I'm wrong.

tsymalla added inline comments.Sep 16 2022, 12:33 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14208

This is updating the innermost FMA node which already exists, I am just replacing the last operand of it.
The newly created node is the FMA node in NewFMA:

Example steps:

fadd (fma A, B, (fmul C, D)), E

1. NewFMA = fma_o (C, D, fma_i (A, B, fmul(C, D)))
2. fma_i = fma (A, B, E)
=> NewFMA = fma (C, D, fma (A, B, E))

More complex case:

fadd (fma_i0 A, B, (fma (C, D, (fmul (E, F))))), G:
1. TmpFMA = fma_i = fma (C, D, (fmul (E, F))))
2. NewFMA = fma (E, F, fma_i0) = fma (E, F, fma (A, B, fma (C, D, (fmul (E, F))))) (construct outermost FMA)
3. TmpFMA.UpdateNodeOperands: fma_i = fma (C, D, (fmul (E, F)) => fma (C, D, G) => fma (E, F, fma (A, B, fma (C, D, (fmul (E, F))))) = fma (E, F, fma (A, B, fma (C, D, G))) (replace innermost FMA operand with addition operator of the initial FADD)

I can create a new node for the innermost FMA, but isn't UpdateNodeOperands being used for such cases?

tsymalla updated this revision to Diff 460670.Sep 16 2022, 1:35 AM

Addressed comments, added a few comments and updated X86, AArch64, PowerPC
tests.y

foad added inline comments.Sep 16 2022, 2:55 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14186–14211

https://reviews.llvm.org/differential/diff/460688/ adds a couple of variations of PPC test cases. All I have done is swapped the operands of the "fadd %F, %G" instruction. If you apply your patch on top of this you will see that the codegen gets worse for these cases, in exactly the same way that it got better for the existing tests. So I do not agree that changing the order of the fmas is a good thing in general.

Because of this, I would prefer to keep the existing order of the fmas. That makes the combine simpler because you only have to mutate the fmul to an fma, and remove the fadd. You don't have to change the existing fmas at all.

llvm/test/CodeGen/PowerPC/machine-combiner.ll
1 ↗(On Diff #460670)

Please do this as a separate patch and then rebase this patch on it.

tsymalla added inline comments.Sep 16 2022, 8:48 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14186–14211

Agreed. Thanks for constructing the tests.
However, in this example, the number of instructions (fma_innermost test) will not decrease due to the addition of the v_mov instead of constructing the return value in-place.
I don't think this will negatively affect real-life examples (but I'll check), but for the sake of code quality, we could try getting rid of these moves probably at some other place.

llvm/test/CodeGen/PowerPC/machine-combiner.ll
1 ↗(On Diff #460670)

These will be removed again in the next diff.

tsymalla updated this revision to Diff 460981.Sep 17 2022, 2:17 AM

Once more, change the algorithm to keep the operands in order
Use node morphing instead of updating the node operands.

foad added a comment.Sep 20 2022, 5:58 AM

I think the patch looks good now, but I am a little confused by the test changes: one of them now uses fmac instead of mac, and the other uses mad instead of fma.

Also could you add another version of fmac_sequence_innermost_fmul to show that it still works if you swap the operands of the outermost fadd?

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14197

You don't need to test TmpFMA here. Optionally, this loop could be rewritten as:

if (E) {
  do {
    ...
  } while (isFusedOp(TmpFMA));
}

Are you ready to include the changes on other targets?

foad added a comment.Sep 20 2022, 9:24 AM

I am a little confused by the test changes: one of them now uses fmac instead of mac, and the other uses mad instead of fma.

This is not your fault and need not block the patch. The choice of fma vs fmad depends on whether the combiner is running pre- or post-legalization, and it is a bit random because the pre-legalization combiner does not always re-run combines on new nodes that it creates, so some stuff is (wrongly) left for the post-legalizer combiner to clean up.

tsymalla edited the summary of this revision. (Show Details)Sep 21 2022, 2:00 AM
tsymalla updated this revision to Diff 461816.Sep 21 2022, 2:02 AM

Remove TmpFMA check, add additional test.

foad accepted this revision.Sep 21 2022, 2:07 AM

LGTM, thanks!

This revision is now accepted and ready to land.Sep 21 2022, 2:07 AM
This revision was landed with ongoing or failed builds.Sep 21 2022, 3:03 AM
This revision was automatically updated to reflect the committed changes.

LGTM, thanks!

Thanks for reviewing!

foad added a subscriber: deadalnix.Sep 21 2022, 5:35 AM

I am a little confused by the test changes: one of them now uses fmac instead of mac, and the other uses mad instead of fma.

This is not your fault and need not block the patch. The choice of fma vs fmad depends on whether the combiner is running pre- or post-legalization, and it is a bit random because the pre-legalization combiner does not always re-run combines on new nodes that it creates, so some stuff is (wrongly) left for the post-legalizer combiner to clean up.

@deadalnix @RKSimon I debugged this to a case where running the pre-legalization combiner twice in a row would do extra combines that a single pass missed, and D127115 did not help. Is this to be expected? Is it worth debugging more to work out exactly why some combines were missed the first time?

Note that there's a gray area for fast-math-flags with transforms like this: we generally just check FMF on the final value in the sequence to determine if the fold is allowed.
I haven't seen many examples of mixed FMF in practice, but front-ends are becoming more flexible about that via pragma or other decorations, so I added a couple of AArch64 tests to demonstrate current behavior:
ddd27a3d3934

If this patch is reverted for some reason, those tests will need to be updated.