This is an archive of the discontinued LLVM Phabricator instance.

Migrate some more fadd and fsub cases away from UnsafeFPMath control to utilize NoSignedZerosFPMath options control
ClosedPublic

Authored by mcberg2017 on Jul 23 2019, 3:56 PM.

Details

Summary

Honoring no signed zeroes is also available as a user control through clang separately regardless of fastmath or UnsafeFPMath context, DAG guards should reflect this context.

Diff Detail

Event Timeline

mcberg2017 created this revision.Jul 23 2019, 3:56 PM
mcberg2017 marked an inline comment as done.Jul 23 2019, 3:59 PM
mcberg2017 added inline comments.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12059

In this case, UnsafeFPMath holds reassociation context.

updated with one case...

Code changes look good. See inline to for some comments about the tests.

test/CodeGen/AMDGPU/enable-no-signed-zeros-fp-math.ll
3

Here and other test files: is it possible to remove "-enable-unsafe-fp-math" and still get the same test results?

If we can tighten up the constraints, that would help move us away from the global requirements.

Another option would be to specify the function-level attribute only on the tests that still require more than 'nsz' to produce the expected test results.

test/CodeGen/AMDGPU/fneg-combines.ll
222–226 ↗(On Diff #211588)

I don't know how to interpret this test diff: regression, improvement, or does this test no longer accomplish its original intent?

spatel added inline comments.Jul 27 2019, 6:04 AM
test/CodeGen/AMDGPU/enable-no-signed-zeros-fp-math.ll
3

Still another option (and moves us closer still to the goal of IR/node-level flags only): can we remove the global settings entirely, add FMF to the IR in these tests, and maintain their intent?

mcberg2017 marked 2 inline comments as done.Jul 29 2019, 9:47 AM
mcberg2017 added inline comments.
test/CodeGen/AMDGPU/enable-no-signed-zeros-fp-math.ll
3

We need the new flag or attribute to keep the results the same. I will look over the tests and see where (hopefully all) we can use attributes in place of the flags.

test/CodeGen/AMDGPU/fneg-combines.ll
222–226 ↗(On Diff #211588)

Yes, now we no longer optimize this case, should I just remove the gcn-nsz-dag context for fneg_fadd_0?

spatel added inline comments.
test/CodeGen/AMDGPU/fneg-combines.ll
222–226 ↗(On Diff #211588)

Someone with AMDGPU knowledge should answer that.
cc'ing @arsenm @nhaehnle @foad @rampitec

rampitec added inline comments.Jul 29 2019, 10:12 AM
test/CodeGen/AMDGPU/fneg-combines.ll
222–226 ↗(On Diff #211588)

This is clearly a regression.

mcberg2017 marked an inline comment as done.Jul 29 2019, 10:31 AM
mcberg2017 added inline comments.
test/CodeGen/AMDGPU/fneg-combines.ll
222–226 ↗(On Diff #211588)

I will debug this case, it looks like there's an additional dependency with the new code shape wrt fadd.

mcberg2017 marked an inline comment as done.Jul 29 2019, 12:26 PM

The new code is actually better by 1 instruction, we just never completed the full match on the test. In the old path we had -enable-no-signed-zeros-fp-math on but no way to reach it for the zero fold of the fadd via llc as the flags are all user controlled. This should not be a regression.

test/CodeGen/AMDGPU/fneg-combines.ll
222–226 ↗(On Diff #211588)

with the change of

// N0 + -0.0 --> N0 (also allowed with +0.0 and fast-math)
ConstantFPSDNode *N1C = isConstOrConstSplatFP(N1, true);
if (N1C && N1C->isZero())
  if (N1C->isNegative() || Options.NoSignedZerosFPMath || Flags.hasNoSignedZeros())
    return N0;

we get this DAG:

SelectionDAG has 21 nodes:

t0: ch = EntryToken
t2: f32,ch = CopyFromReg t0, Register:f32 %0
          t27: f32 = fneg t28
        t13: i1 = setcc t27, t2, setuge:ch
      t15: f32 = select t13, t2, t28
    t17: i1 = setcc t15, ConstantFP:f32<0.000000e+00>, setule:ch
  t19: f32 = select t17, ConstantFP:f32<0.000000e+00>, ConstantFP:f32<nan>
t21: ch,glue = CopyToReg t0, Register:f32 $vgpr0, t19
    t4: f32,ch = CopyFromReg t0, Register:f32 %1
  t7: f32 = fdiv ConstantFP:f32<1.000000e+00>, t4
t28: f32 = fmul nnan arcp contract reassoc t7, ConstantFP:f32<-0.000000e+00>
t22: ch = RETURN_TO_EPILOG t21, Register:f32 $vgpr0, t21:1

for which we produce this assembler:

fneg_fadd_0: ; @fneg_fadd_0
; %bb.0: ; %.entry

v_rcp_f32_e32 v0, s1
v_mov_b32_e32 v1, s0
v_mov_b32_e32 v2, 0x7fc00000
v_mul_f32_e32 v0, 0x80000000, v0
v_cmp_nlt_f32_e64 vcc, -v0, s0
v_cndmask_b32_e32 v0, v0, v1, vcc
v_cmp_nlt_f32_e32 vcc, 0, v0
v_cndmask_b32_e64 v0, v2, 0, vcc

with the code the way it is currently posed(unmodified):

// N0 + -0.0 --> N0 (also allowed with +0.0 and fast-math)
ConstantFPSDNode *N1C = isConstOrConstSplatFP(N1, true);
if (N1C && N1C->isZero())
  if (N1C->isNegative() || Options.UnsafeFPMath || Flags.hasNoSignedZeros())
    return N0;

we get this DAG:

SelectionDAG has 21 nodes:

t0: ch = EntryToken
t2: f32,ch = CopyFromReg t0, Register:f32 %0
      t4: f32,ch = CopyFromReg t0, Register:f32 %1
    t7: f32 = fdiv ConstantFP:f32<1.000000e+00>, t4
  t9: f32 = fmul t7, ConstantFP:f32<0.000000e+00>
t11: f32 = fadd t9, ConstantFP:f32<0.000000e+00>
        t13: i1 = setcc t11, t2, setuge:ch
        t14: f32 = fneg t11
      t15: f32 = select t13, t2, t14
    t17: i1 = setcc t15, ConstantFP:f32<0.000000e+00>, setule:ch
  t19: f32 = select t17, ConstantFP:f32<0.000000e+00>, ConstantFP:f32<nan>
t21: ch,glue = CopyToReg t0, Register:f32 $vgpr0, t19
t22: ch = RETURN_TO_EPILOG t21, Register:f32 $vgpr0, t21:1

for which we fold an fused multiply add, and produce this assembler:

fneg_fadd_0: ; @fneg_fadd_0
; %bb.0: ; %.entry

v_rcp_f32_e32 v0, s1
v_bfrev_b32_e32 v1, 1
v_mov_b32_e32 v2, s0
v_mac_f32_e32 v1, v0, v1
v_cmp_nlt_f32_e64 vcc, -v1, s0
v_cndmask_b32_e32 v0, v1, v2, vcc
v_mov_b32_e32 v1, 0x7fc00000
v_cmp_nlt_f32_e32 vcc, 0, v0
v_cndmask_b32_e64 v0, v1, 0, vcc
rampitec added inline comments.Jul 29 2019, 12:51 PM
test/CodeGen/AMDGPU/fneg-combines.ll
222–226 ↗(On Diff #211588)

I am trying to understand what does the existing ISA do, and it is:

v_rcp_f32_e32 v0, s1                     v0 = 1 / s1
v_bfrev_b32_e32 v1, 1                    v1 = 0x8000000 = -0.0
v_mov_b32_e32 v2, s0                     v2 = s0
v_mac_f32_e32 v1, v0, v1                 v1 = v1 * v0 + v1 = v0 * -0.0 - 0.0 = 0

Instead of that fancy mac instruction that seems to be all now folded into

v_mul_f32_e32 v0, 0x80000000, v0         v0 = v0 * -0.0

I.e. it is hardly practically performance relevant code. The comment above tells it used to assert, so I guess this is just a regression test.
Given no signed zeroes this is as good as just v0 = 0, but that's a different matter.

I have no objection for this test change.

Note: test/CodeGen/AMDGPU/fneg-combines.ll needs rearchitecting, so i left it in options flag form, test/CodeGen/PowerPC/fmf-propagation.ll has portions that can be removed once we stop using the options flags and so i am leaving it in its current form with mods until that happens. test/CodeGen/X86/fp-fast.ll uses a subset of fmf that equate to the options flag that were used (let me know if you just want to generalize to fast or smaller subset), all the others use either fast or context relevant fmf and have been converted to not use options flags. Have a look and see what needs editing, currently this all passes testing.

spatel added inline comments.Jul 31 2019, 3:44 AM
test/CodeGen/AArch64/fadd-combines.ll
150–151

This comment is incorrect now. MachineCombiner was relying on the function attribute?

test/CodeGen/PowerPC/fma-mutate.ll
6

Selectively using 2 different labels for the same RUN line confused me. That seems unnecessary because the function name and IR makes it clear what the difference in output is expected to be.

14–17

Would it be better to auto-generate the complete output for these tests using utils/update_llc_test_checks.py?

test/CodeGen/PowerPC/qpx-recipest.ll
2

Same comment as above:
Selectively using 2 different labels for the same RUN line confused me. That seems unnecessary because the function name and IR makes it clear what the difference in output is expected to be.

test/CodeGen/PowerPC/recipest.ll
2

Same comment as above:
Selectively using 2 different labels for the same RUN line confused me. That seems unnecessary because the function name and IR makes it clear what the difference in output is expected to be.

test/CodeGen/X86/dagcombine-unsafe-math.ll
65

If 'contract' is required, that is unnecessary? Mark with a 'FIXME' note?

test/CodeGen/X86/fp-fast.ll
10–11

Same as earlier comment:
If 'contract' is required, that is unnecessary? Mark with a 'FIXME' note?

test/CodeGen/X86/fp-fold.ll
2–3

Same comment as earlier:
Selectively using 2 different labels for the same RUN line confused me. That seems unnecessary because the function name and IR makes it clear what the difference in output is expected to be.

Definitely use the auto-generation script when possible for x86 tests.

added a TODO comment for machine combines in test/CodeGen/AArch64/fadd-combines.ll,
test/CodeGen/PowerPC/fma-mutate.ll, test/CodeGen/PowerPC/qpx-recipest.ll, test/CodeGen/PowerPC/recipest.ll, test/CodeGen/X86/fp-fold.ll now have just standard CHECK lines. The fmf contract flag was
removed from test/CodeGen/X86/dagcombine-unsafe-math.ll and test/CodeGen/X86/fp-fast.ll.

spatel accepted this revision.Jul 31 2019, 1:22 PM

LGTM - thanks for the test file updates. See inline comment about the AArch64 test.

test/CodeGen/AArch64/fadd-combines.ll
152–154

That comment seems wrong from the start. This form has better throughput than 3 chained adds. Ideally, this would be FMA?
2.0 * x + 101.0

The '17' variable names in the check lines are wrong too:
1109917696 --> 0x42280000 --> 42.0

This revision is now accepted and ready to land.Jul 31 2019, 1:22 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2019, 3:01 PM