This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] fcanonicalize elimination optimization
ClosedPublic

Authored by rampitec on Jul 10 2017, 11:49 AM.

Details

Summary

We are using multiplication by 1.0 to flush denormals and quiet sNaNs.
That is possible to omit this multiplication if source of the
fcanonicalize instruction is known to be flushed/quieted, i.e.
if it comes from another instruction known to do the normalization
and we are using IEEE mode to quiet sNaNs.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Jul 10 2017, 11:49 AM
arsenm added inline comments.Jul 10 2017, 12:10 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4638 ↗(On Diff #105896)

fcanonicalize should have the idempotent optimization applied so there shouldn't be a need to handle it

4662 ↗(On Diff #105896)

We don't handle this

4665–4668 ↗(On Diff #105896)

The output is never flushed anywhere?

4669–4670 ↗(On Diff #105896)

We don't handle these

4671–4672 ↗(On Diff #105896)

I don't think this is true, but should have a named check in the subtarget

4678–4680 ↗(On Diff #105896)

We constant fold this already so this shouldn't be necessary

rampitec added inline comments.Jul 10 2017, 12:21 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4638 ↗(On Diff #105896)

It is not:

v_mul_f32_e32 v2, 1.0, v2
v_mul_f32_e32 v2, 1.0, v2
4662 ↗(On Diff #105896)

We do not handle this now, but may handle later. The idea is still the same as with FSIN and FCOS.

4665–4668 ↗(On Diff #105896)

GFX9 flushes.

4669–4670 ↗(On Diff #105896)

This again only now. I had a use of it in HSAIL and may actually port it.

4671–4672 ↗(On Diff #105896)

I would rather think about denorm support flag in TD for every single instruction wrt subtarget. Why add just a single one?

4678–4680 ↗(On Diff #105896)

This is necessary. We may constant fold it to remove fcanonicalize on a constant, but consider:

canonicalize(fmax(x, 0.0f));

Without above code it would not be handled because we have to look though arguments of min/max/fneg/fabs. I have a test for it actually.

In particular it is needed to fold an extremely common case like max(max, ...).

rampitec added inline comments.Jul 10 2017, 12:40 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4638 ↗(On Diff #105896)

And I do not think there are idempotent SDNodes.

arsenm added inline comments.Jul 10 2017, 12:53 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4638 ↗(On Diff #105896)

I don't think it's done now, but fcanonicalize (fcanonicalize x) should fold to just one canonicalize

4665–4668 ↗(On Diff #105896)

That is broken. If that is the case we probably shouldn't be using the regular minnum/maxnum intrinsics without denormals, and then it's an optimization to fold canonicalize (min/max) to the weird target behavior.

4671–4672 ↗(On Diff #105896)

We don't need a full fledged subtarget feature, just put the generation check in a function with name/description rather than adding more random looking generation checks

rampitec added inline comments.Jul 10 2017, 1:10 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4638 ↗(On Diff #105896)

...and that is what I am doing here.

4665–4668 ↗(On Diff #105896)

The library is common for different targets.

4671–4672 ↗(On Diff #105896)

I'm not sure I follow. Could you please describe a name of such check?

rampitec updated this revision to Diff 105910.Jul 10 2017, 1:23 PM

Rebase to master.
Updated test for commuted instruction after latest changes.

arsenm added inline comments.Jul 10 2017, 1:32 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4665–4668 ↗(On Diff #105896)

We need to fix that then. If it's not returning exactly one of the inputs it's not implementing the IEEE min/max

4671–4672 ↗(On Diff #105896)

hasAddr64() or hasMed3_16()

rampitec added inline comments.Jul 10 2017, 1:34 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4638 ↗(On Diff #105896)

In fact even if a general folding is implemented in DAGCombiner, I believe it shall stay here as well. It is the same idea as with constants, something else which also normalizes can be combined with it.

rampitec added inline comments.Jul 10 2017, 1:36 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4671–4672 ↗(On Diff #105896)

hasNormalizingMinMax()? It returns us to the initial point.

rampitec added inline comments.Jul 10 2017, 6:32 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4665–4668 ↗(On Diff #105896)

In fact IEEE754-2008 (as described for maxnum/minnum llvm ir) tells to return canonicalized numbers, so it is the opposite: GFX9 is finally IEEE compliant.

rampitec updated this revision to Diff 106062.Jul 11 2017, 10:34 AM

GFX9 handles denorms in min/max instructions, although its behavior in respect to signaling NaN handling is not identical if they are quieted or not before the instruction. That is still possible to perform GFX9 specific part of min/max handling given non-nans flag.
However, given the controversy and that GFX9 part of the change is non-essential I have removed specialization here, replacing it with todo. It can be done separately when needed.

rampitec marked 12 inline comments as done.Jul 11 2017, 10:58 AM
rampitec added inline comments.
lib/Target/AMDGPU/SIISelLowering.cpp
4665–4668 ↗(On Diff #105896)

This part is split-off the patch.

arsenm added inline comments.Jul 11 2017, 11:23 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4636 ↗(On Diff #106062)

Since FMAD always flushes I don't think it's OK to handle it

4696–4699 ↗(On Diff #106062)

This is really a check for whether SNaNs are enabled. We have the isKnownNeverSNan check for this already. I think we need to fix having separate FPExceptions and enableIEEEBit subtarget checks

4671–4672 ↗(On Diff #105896)

The SC name was SupportsMinMaxDenormModes. Either way works

test/CodeGen/AMDGPU/fcanonicalize-elimination.ll
2 ↗(On Diff #106062)

This needs a run line for gfx9 with denormals disabled too

rampitec marked 4 inline comments as done.Jul 11 2017, 1:22 PM
rampitec added inline comments.
lib/Target/AMDGPU/SIISelLowering.cpp
4636 ↗(On Diff #106062)

FMAD - Perform a * b + c, while getting the same result as the separately rounded operations.
It is essentially v_mac_f32 which always flushes, but if denorms are disabled it is lowered as fma.
So handling it here is correct, and there is a test test_fold_canonicalize_fmuladd_value_f32 for it.

4671–4672 ↗(On Diff #105896)

This code part is dropped for now.

rampitec updated this revision to Diff 106086.Jul 11 2017, 1:24 PM
rampitec marked an inline comment as done.

Switched to use of isKnownNeverSNan() and added test for it.
Added test run line for GFX9 without denorms.

arsenm accepted this revision.Jul 12 2017, 1:41 PM

LGTM. Can you file a bug for the broken minnum/maxnum denormal handling

lib/Target/AMDGPU/SIISelLowering.cpp
4706 ↗(On Diff #106086)

This would read more accurately as IsIEEEMode || isKnownNeverSNan.

This revision is now accepted and ready to land.Jul 12 2017, 1:41 PM
rampitec updated this revision to Diff 106298.Jul 12 2017, 1:47 PM
rampitec marked an inline comment as done.

Reorganized IEEE/knowNeverSNan condition per review.

This revision was automatically updated to reflect the committed changes.

LGTM. Can you file a bug for the broken minnum/maxnum denormal handling

Bug filed.