The patch addresses a single transform of selection DAG optimization, namely a few special cases of constant folding, extending the guard for isFast on the SDNodeFlags passed to getNode for fadd, fsub, fmul, fdiv and frem.
Diff Detail
Event Timeline
What is the test actually testing? I don't think there should be any difference from this in the floor lowering, and I don't really want a separate test file duplicating all of the lowering testing
I can simplify this somewhat, I will make a new test that hits the patterns for folding only for all ops of concern, then we will have a way to test with unsafe and without, using the new guard to ensure that the CHECK result is the same for both.
In the way of explanation of the floor test, I simply commented out unsafe and used only Flags.isFast() go guard the optimization, the original floor test broke. Matt, would you like me to include the overload for the floor test with the isFast path in the original file or is a specific functional lit test sufficient?
Added fast math flag capture from visitBinary so that getNode can make decisions in cases where SelectionDAGBuilder::visit has not yet been called. Also updated the test for just these cases and dropped the floor test.
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp | ||
---|---|---|
2772–2775 ↗ | (On Diff #147183) | I don't think we want repeat this here (otherwise, we need to set the flags like this for all FPMO). This reminds me of a comment that @andreadb made (probably >3 years ago; I'm not finding it now) regarding DAGBuilder/DAGCombiner. IIRC, because of the way the DAG operates, we may need to repeat folds in the builder to ensure consistency. So I think the right fix is to duplicate the simplifications, so the unnecessary node is not created in the 1st place. |
test/CodeGen/X86/fp-fold.ll | ||
5–6 ↗ | (On Diff #147183) | This isn't quite true. In all cases, we do not need full 'fast' to enable these transforms. All of the transforms should show the minimal FMF requirement in instsimplify at this point, so that's what we should emulate here. Eg, x * 1.0 = x is true without any FMF. |
lib/CodeGen/SelectionDAG/SelectionDAG.cpp | ||
---|---|---|
4440–4462 | This code structure really doesn't make sense as-is: we putting independent cases together even though they share no common code. |
Cleaned up constant folding to follow flag rules used in other locations to guard optimization and reordered code for context. The tests also reflect the min flag usage and duality with unsafe.
lib/CodeGen/SelectionDAG/SelectionDAG.cpp | ||
---|---|---|
4442–4444 | This doesn't match instsimplify: // fadd X, 0 ==> X, when we know X is not -0 // fsub X, +0 ==> X (no flags needed) But make sure (add tests) for -0.0 too, so we get that right: // fadd X, -0 ==> X (no flags needed) // fsub X, -0 ==> X, when we know X is not -0 | |
4443–4445 | Again, these are not the correct predicates: // fmul nnan nsz X, 0 ==> 0 | |
test/CodeGen/X86/fp-fold.ll | ||
26–35 ↗ | (On Diff #147380) | Make sure I understand this - we need to change the code in DAGCombiner to fold this? Why is this case different than the others? It's probably better if you check in the baseline tests first, so we're only getting the functional diffs from this patch. |
These changes reflect what is in Transforms/InstSimplify/fast-math.ll
I have a baseline version of this test ready to roll, will check that in first.
As for the fmul_zero case, the call producing the getNode is from visitBinary, which cannot provide flags as they are created and passed in without any representation from the current Instruction (which does have the flags). I believe we need to add some new coverage in the DAGCombiner so that the case above is covered so we do not get to this scenario as the STRICT case is unoptimized. For now the test marks the short fall.
test/CodeGen/X86/fp-fold.ll | ||
---|---|---|
32–36 ↗ | (On Diff #147434) | This is wrong when x is -0.0: |
So something to keep in mind, because D46562 was split, part of its code landing here, we are missing flag based visit functionality, where unsafe behavior differs, so for this patch there will be some holes. Its the reason fmul_zero diverges and when you see the next upload, why fsub_zero_2 will also diverge in the tests. However, D46562 will attempt to deal with most of the issues that remain via sub flag usage in the visit functions.
Special cases for fsub, we will need mirrored functionality in visit for fsub via flags to collect on matching optimizations in unsafe. The same is true for other operations.
test/CodeGen/X86/fp-fold.ll | ||
---|---|---|
32–36 ↗ | (On Diff #147434) | A related question is that unsafe does do that optimization, is it incorrectly doing so? |
lib/CodeGen/SelectionDAG/SelectionDAG.cpp | ||
---|---|---|
4441–4445 | Still not quite right. | |
test/CodeGen/X86/fp-fold.ll | ||
32–36 ↗ | (On Diff #147434) | That goes back to how we interpret "-enable-unsafe-fp-math". My reading of the text: cl::desc("Enable optimizations that may decrease FP precision") says that yes, we have a bug (ignoring signed-zero is not covered by that parameter). But as we noted in one of the earlier reviews, clang only sets this when you pile together a bunch of independent FP loosening: if (!MathErrno && AssociativeMath && ReciprocalMath && !SignedZeros && !TrappingMath) CmdArgs.push_back("-menable-unsafe-fp-math"); ...so the bug is probably not easily visible for anyone besides compiler hackers? |
There are 4 cases each for fadd/fsub:
- 0.0
- -0.0
- nsz 0.0
- nsz -0.0
I added those tests in rL332780. Can you rebase with that?
But now I'm not sure why we're bothering with these folds in getNode(). We have to repeat them in DAGCombiner anyway, right? Isn't this the equivalent of doing simplifications in IRBuilder? Are we doing this because it's a big perf win (seems unlikely)?
I know we'll probably just kill this whole chunk of code in a subsequent patch, but the logic should be complete/correct. I think we want this for fadd/fsub:
case ISD::FADD: // fadd X, -0.0 ==> X (no flags needed) // fadd X, 0.0 ==> X, when we know X is not -0 // FIXME: Unsafe math doesn't imply no-signed-zeros. if (N2CFP && N2CFP->isZero()) if (N2CFP->isNegative() || getTarget().Options.UnsafeFPMath || Flags.hasNoSignedZeros()) return N1; break; case ISD::FSUB: // fsub X, 0.0 ==> X (no flags needed) // fsub X, -0.0 ==> X, when we know X is not -0 // FIXME: Unsafe math doesn't imply no-signed-zeros. if (N2CFP && N2CFP->isZero()) if (!N2CFP->isNegative() || getTarget().Options.UnsafeFPMath || Flags.hasNoSignedZeros()) return N1; break;
lib/CodeGen/SelectionDAG/SelectionDAG.cpp | ||
---|---|---|
4440 | Not particularly a comment on this patch, but I was wondering what's the reasoning for doing this in getNode? I've never liked how getNode attempts to optimize nodes on creation, especially since a lot of node visitors in DAGCombiner attempt the same folds (and kind of have to since a ReplaceNodeResults may trigger the same situation). Have you found an explanation? |
lib/CodeGen/SelectionDAG/SelectionDAG.cpp | ||
---|---|---|
4440 | I hinted at the only explanation I could think of: it's being done as a compile-time optimization. I doubt that's a meaningful win, so I wouldn't mind skipping this step entirely - just fix the folds in DAGCombiner and ignore this. |
Given the last round of comments, I am going to spend some time on the DAG combiner work in D46562 to see if I can remove the need for these changes. More later...
Not particularly a comment on this patch, but I was wondering what's the reasoning for doing this in getNode? I've never liked how getNode attempts to optimize nodes on creation, especially since a lot of node visitors in DAGCombiner attempt the same folds (and kind of have to since a ReplaceNodeResults may trigger the same situation). Have you found an explanation?