Honoring no signed zeroes is also available as a user control through clang separately regardless of fastmath or UnsafeFPMath context, DAG guards should reflect this context.
Details
Diff Detail
Event Timeline
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
12059 | In this case, UnsafeFPMath holds reassociation context. |
Code changes look good. See inline to for some comments about the tests.
test/CodeGen/AMDGPU/enable-no-signed-zeros-fp-math.ll | ||
---|---|---|
1 | Here and other test files: is it possible to remove "-enable-unsafe-fp-math" and still get the same test results? If we can tighten up the constraints, that would help move us away from the global requirements. Another option would be to specify the function-level attribute only on the tests that still require more than 'nsz' to produce the expected test results. | |
test/CodeGen/AMDGPU/fneg-combines.ll | ||
222–227 | I don't know how to interpret this test diff: regression, improvement, or does this test no longer accomplish its original intent? |
test/CodeGen/AMDGPU/enable-no-signed-zeros-fp-math.ll | ||
---|---|---|
1 | Still another option (and moves us closer still to the goal of IR/node-level flags only): can we remove the global settings entirely, add FMF to the IR in these tests, and maintain their intent? |
test/CodeGen/AMDGPU/enable-no-signed-zeros-fp-math.ll | ||
---|---|---|
1 | We need the new flag or attribute to keep the results the same. I will look over the tests and see where (hopefully all) we can use attributes in place of the flags. | |
test/CodeGen/AMDGPU/fneg-combines.ll | ||
222–227 | Yes, now we no longer optimize this case, should I just remove the gcn-nsz-dag context for fneg_fadd_0? |
test/CodeGen/AMDGPU/fneg-combines.ll | ||
---|---|---|
222–227 | This is clearly a regression. |
test/CodeGen/AMDGPU/fneg-combines.ll | ||
---|---|---|
222–227 | I will debug this case, it looks like there's an additional dependency with the new code shape wrt fadd. |
The new code is actually better by 1 instruction, we just never completed the full match on the test. In the old path we had -enable-no-signed-zeros-fp-math on but no way to reach it for the zero fold of the fadd via llc as the flags are all user controlled. This should not be a regression.
test/CodeGen/AMDGPU/fneg-combines.ll | ||
---|---|---|
222–227 | with the change of // N0 + -0.0 --> N0 (also allowed with +0.0 and fast-math) ConstantFPSDNode *N1C = isConstOrConstSplatFP(N1, true); if (N1C && N1C->isZero()) if (N1C->isNegative() || Options.NoSignedZerosFPMath || Flags.hasNoSignedZeros()) return N0; we get this DAG: SelectionDAG has 21 nodes: t0: ch = EntryToken t2: f32,ch = CopyFromReg t0, Register:f32 %0 t27: f32 = fneg t28 t13: i1 = setcc t27, t2, setuge:ch t15: f32 = select t13, t2, t28 t17: i1 = setcc t15, ConstantFP:f32<0.000000e+00>, setule:ch t19: f32 = select t17, ConstantFP:f32<0.000000e+00>, ConstantFP:f32<nan> t21: ch,glue = CopyToReg t0, Register:f32 $vgpr0, t19 t4: f32,ch = CopyFromReg t0, Register:f32 %1 t7: f32 = fdiv ConstantFP:f32<1.000000e+00>, t4 t28: f32 = fmul nnan arcp contract reassoc t7, ConstantFP:f32<-0.000000e+00> t22: ch = RETURN_TO_EPILOG t21, Register:f32 $vgpr0, t21:1 for which we produce this assembler: fneg_fadd_0: ; @fneg_fadd_0 v_rcp_f32_e32 v0, s1 v_mov_b32_e32 v1, s0 v_mov_b32_e32 v2, 0x7fc00000 v_mul_f32_e32 v0, 0x80000000, v0 v_cmp_nlt_f32_e64 vcc, -v0, s0 v_cndmask_b32_e32 v0, v0, v1, vcc v_cmp_nlt_f32_e32 vcc, 0, v0 v_cndmask_b32_e64 v0, v2, 0, vcc with the code the way it is currently posed(unmodified): // N0 + -0.0 --> N0 (also allowed with +0.0 and fast-math) ConstantFPSDNode *N1C = isConstOrConstSplatFP(N1, true); if (N1C && N1C->isZero()) if (N1C->isNegative() || Options.UnsafeFPMath || Flags.hasNoSignedZeros()) return N0; we get this DAG: SelectionDAG has 21 nodes: t0: ch = EntryToken t2: f32,ch = CopyFromReg t0, Register:f32 %0 t4: f32,ch = CopyFromReg t0, Register:f32 %1 t7: f32 = fdiv ConstantFP:f32<1.000000e+00>, t4 t9: f32 = fmul t7, ConstantFP:f32<0.000000e+00> t11: f32 = fadd t9, ConstantFP:f32<0.000000e+00> t13: i1 = setcc t11, t2, setuge:ch t14: f32 = fneg t11 t15: f32 = select t13, t2, t14 t17: i1 = setcc t15, ConstantFP:f32<0.000000e+00>, setule:ch t19: f32 = select t17, ConstantFP:f32<0.000000e+00>, ConstantFP:f32<nan> t21: ch,glue = CopyToReg t0, Register:f32 $vgpr0, t19 t22: ch = RETURN_TO_EPILOG t21, Register:f32 $vgpr0, t21:1 for which we fold an fused multiply add, and produce this assembler: fneg_fadd_0: ; @fneg_fadd_0 v_rcp_f32_e32 v0, s1 v_bfrev_b32_e32 v1, 1 v_mov_b32_e32 v2, s0 v_mac_f32_e32 v1, v0, v1 v_cmp_nlt_f32_e64 vcc, -v1, s0 v_cndmask_b32_e32 v0, v1, v2, vcc v_mov_b32_e32 v1, 0x7fc00000 v_cmp_nlt_f32_e32 vcc, 0, v0 v_cndmask_b32_e64 v0, v1, 0, vcc |
test/CodeGen/AMDGPU/fneg-combines.ll | ||
---|---|---|
222–227 | I am trying to understand what does the existing ISA do, and it is: v_rcp_f32_e32 v0, s1 v0 = 1 / s1 v_bfrev_b32_e32 v1, 1 v1 = 0x8000000 = -0.0 v_mov_b32_e32 v2, s0 v2 = s0 v_mac_f32_e32 v1, v0, v1 v1 = v1 * v0 + v1 = v0 * -0.0 - 0.0 = 0 Instead of that fancy mac instruction that seems to be all now folded into v_mul_f32_e32 v0, 0x80000000, v0 v0 = v0 * -0.0 I.e. it is hardly practically performance relevant code. The comment above tells it used to assert, so I guess this is just a regression test. I have no objection for this test change. |
Note: test/CodeGen/AMDGPU/fneg-combines.ll needs rearchitecting, so i left it in options flag form, test/CodeGen/PowerPC/fmf-propagation.ll has portions that can be removed once we stop using the options flags and so i am leaving it in its current form with mods until that happens. test/CodeGen/X86/fp-fast.ll uses a subset of fmf that equate to the options flag that were used (let me know if you just want to generalize to fast or smaller subset), all the others use either fast or context relevant fmf and have been converted to not use options flags. Have a look and see what needs editing, currently this all passes testing.
test/CodeGen/AArch64/fadd-combines.ll | ||
---|---|---|
150–151 | This comment is incorrect now. MachineCombiner was relying on the function attribute? | |
test/CodeGen/PowerPC/fma-mutate.ll | ||
6 | Selectively using 2 different labels for the same RUN line confused me. That seems unnecessary because the function name and IR makes it clear what the difference in output is expected to be. | |
13–16 | Would it be better to auto-generate the complete output for these tests using utils/update_llc_test_checks.py? | |
test/CodeGen/PowerPC/qpx-recipest.ll | ||
1 | Same comment as above: | |
test/CodeGen/PowerPC/recipest.ll | ||
1 | Same comment as above: | |
test/CodeGen/X86/dagcombine-unsafe-math.ll | ||
64 | If 'contract' is required, that is unnecessary? Mark with a 'FIXME' note? | |
test/CodeGen/X86/fp-fast.ll | ||
8–9 | Same as earlier comment: | |
test/CodeGen/X86/fp-fold.ll | ||
1 | Same comment as earlier: Definitely use the auto-generation script when possible for x86 tests. |
added a TODO comment for machine combines in test/CodeGen/AArch64/fadd-combines.ll,
test/CodeGen/PowerPC/fma-mutate.ll, test/CodeGen/PowerPC/qpx-recipest.ll, test/CodeGen/PowerPC/recipest.ll, test/CodeGen/X86/fp-fold.ll now have just standard CHECK lines. The fmf contract flag was
removed from test/CodeGen/X86/dagcombine-unsafe-math.ll and test/CodeGen/X86/fp-fast.ll.
LGTM - thanks for the test file updates. See inline comment about the AArch64 test.
test/CodeGen/AArch64/fadd-combines.ll | ||
---|---|---|
152–154 | That comment seems wrong from the start. This form has better throughput than 3 chained adds. Ideally, this would be FMA? The '17' variable names in the check lines are wrong too: |
In this case, UnsafeFPMath holds reassociation context.