This patch changes a FADD / FMUL => FMA ISel pattern implemented
in D80801 so that it peeks through more than one FMA.
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll
Line 178: Please precommit the new tests and rebase, so the patch shows the codegen diff.
See the comment on D80801: "The fold implemented here is actually a specialization - we should be able to peek through >1 fma to find this pattern. That's another patch if we want to try that enhancement though." That's what you are implementing here.
This changes a FADD / FMUL => FMA ISel pattern implemented
in D80801 so that it peeks through more than one FMA. This also
changes the order of the operands, which can help with eliminating
a final COPY.
I did not update the lit tests for other targets yet (CodeGen/AArch64/fadd-combines.ll, CodeGen/PowerPC/fma-assoc.ll, CodeGen/PowerPC/machine-combiner.ll, CodeGen/X86/fma_patterns.ll) because I wanted to get some opinions on the change in operand order first.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Line 14208: Why use UpdateNodeOperands here instead of just constructing the new node normally?
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Line 14181: Remove "on"?
Lines 14182–14184: Looks like clang-format has mangled this comment!
Lines 14186–14211: Can you explain this? It is not at all obvious. Are you sure it's not just random? Perhaps it helps your case, but could it harm other cases?
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Lines 14186–14211: When keeping the order fadd (fma A, B, (fmul C, D)), E --> fma A, B, (fma C, D, E), the compiler assigns the result of the inner FMA to a new virtual register, which is then used by the outer FMA. Changing the operand order changes the instruction order in the DAG, which essentially frees up the desired output register for the register allocator. I cannot rule out that this harms other cases, but from a fast-math point of view it should not cause any issues. Looking at the (currently failing) test cases, I don't see any actual problems, but please correct me if I'm wrong.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Line 14208: This is updating the innermost FMA node, which already exists; I am just replacing its last operand. Example steps for fadd (fma A, B, (fmul C, D)), E:
1. NewFMA = fma_o (C, D, fma_i (A, B, fmul (C, D)))
2. fma_i = fma (A, B, E) => NewFMA = fma (C, D, fma (A, B, E))
More complex case, fadd (fma_i0 A, B, (fma (C, D, (fmul (E, F))))), G:
1. TmpFMA = fma_i = fma (C, D, (fmul (E, F)))
2. NewFMA = fma (E, F, fma_i0) = fma (E, F, fma (A, B, fma (C, D, (fmul (E, F))))) (construct the outermost FMA)
3. TmpFMA.UpdateNodeOperands: fma_i = fma (C, D, (fmul (E, F))) => fma (C, D, G), so NewFMA = fma (E, F, fma (A, B, fma (C, D, G))) (replace the innermost FMA's operand with the addition operand of the initial FADD)
I can create a new node for the innermost FMA, but isn't UpdateNodeOperands meant for exactly such cases?
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Lines 14186–14211: https://reviews.llvm.org/differential/diff/460688/ adds a couple of variations of PPC test cases. All I have done is swap the operands of the "fadd %F, %G" instruction. If you apply your patch on top of this, you will see that the codegen gets worse for these cases, in exactly the same way that it got better for the existing tests. So I do not agree that changing the order of the FMAs is a good thing in general.
Because of this, I would prefer to keep the existing order of the FMAs. That also makes the combine simpler, because you only have to mutate the fmul into an fma and remove the fadd; you don't have to change the existing FMAs at all.
llvm/test/CodeGen/PowerPC/machine-combiner.ll
Line 1 (On Diff #460670): Please do this as a separate patch and then rebase this patch on it.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Lines 14186–14211: Agreed. Thanks for constructing the tests.
llvm/test/CodeGen/PowerPC/machine-combiner.ll
Line 1 (On Diff #460670): These will be removed again in the next diff.
Once more, change the algorithm to keep the operands in order.
Use node morphing instead of updating the node operands.
I think the patch looks good now, but I am a little confused by the test changes: one of them now uses fmac instead of mac, and the other uses mad instead of fma.
Also, could you add another version of fmac_sequence_innermost_fmul to show that it still works if you swap the operands of the outermost fadd?
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Line 14197: You don't need to test TmpFMA here. Optionally, this loop could be rewritten as:
if (E) {
  do {
    ...
  } while (isFusedOp(TmpFMA));
}
This is not your fault and need not block the patch. The choice of fma vs fmad depends on whether the combiner is running pre- or post-legalization, and it is a bit random because the pre-legalization combiner does not always re-run combines on new nodes that it creates, so some stuff is (wrongly) left for the post-legalizer combiner to clean up.
@deadalnix @RKSimon I debugged this to a case where running the pre-legalization combiner twice in a row would do extra combines that a single pass missed, and D127115 did not help. Is this to be expected? Is it worth debugging more to work out exactly why some combines were missed the first time?
Note that there's a gray area for fast-math-flags with transforms like this: we generally just check FMF on the final value in the sequence to determine if the fold is allowed.
I haven't seen many examples of mixed FMF in practice, but front-ends are becoming more flexible about that via pragma or other decorations, so I added a couple of AArch64 tests to demonstrate current behavior:
ddd27a3d3934
If this patch is reverted for some reason, those tests will need to be updated.
This seems to be a more limited case of your new reassociation, so can we remove this code now?