Page MenuHomePhabricator

[DAGCombiner] do not fold (fmul (fadd X, 1), Y) -> (fmad X, Y, Y) by default
ClosedPublic

Authored by nhaehnle on Nov 14 2016, 3:34 AM.

Details

Summary

When X = 0 and Y = inf, the original code produces inf, but the transformed
code produces nan. So this transform (and its relatives) should only be
used when the no-infs-fp-math flag is explicitly enabled.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98578

Diff Detail

Repository
rL LLVM

Event Timeline

nhaehnle updated this revision to Diff 77787.Nov 14 2016, 3:34 AM
nhaehnle retitled this revision from to [DAGCombiner] do not fold (fmul (fadd X, 1), Y) -> (fmad X, Y, Y) by default.
nhaehnle updated this object.
nhaehnle added a subscriber: llvm-commits.
spatel edited edge metadata.Nov 14 2016, 8:21 AM

Please correct me if I'm not understanding, but I think we have to answer: what is the expected behavior if someone uses -fassociative-math and -fno-finite-math-only together?

[I'm purposely using the clang flags here because the implementation just gets muddier/broken as we proceed down to the DAG between IR FMF/SDNodeFlags vs. target options vs. function-level attributes. Never mind that the clang flags may not work as intended ( https://llvm.org/bugs/show_bug.cgi?id=27372 )]

So what does it mean if the user told us to do reassociation transforms but also told us that we should respect INF/NAN?

As this patch is written, the respect for INF/NAN should override the ability to reassociate?

The optimization makes use of the distributive law, not the associative law. As such I don't think there should be an immediate connection to a flag named -fassociative-math. But maybe that just means that the flag is badly named :-)

Personally, I don't care about Clang flags since I'm worried about our OpenGL frontend which uses LLVM directly, but it seems to me that there are meaningful associativity transforms that can be done even when infs are possible. Are (a + b) + c --> a + (b + c) and (a * b) * c --> a * (b * c) problematic if any of the variables are inf or nan? I don't think so, but perhaps I'm missing something.

The optimization makes use of the distributive law, not the associative law. As such I don't think there should be an immediate connection to a flag named -fassociative-math. But maybe that just means that the flag is badly named :-)

Personally, I don't care about Clang flags since I'm worried about our OpenGL frontend which uses LLVM directly, but it seems to me that there are meaningful associativity transforms that can be done even when infs are possible. Are (a + b) + c --> a + (b + c) and (a * b) * c --> a * (b * c) problematic if any of the variables are inf or nan? I don't think so, but perhaps I'm missing something.

Good points. At the very least, we should rename this function (visitFMULForFMACombine) and/or add code comments to make it clear that we're only dealing with the case of a '+/-1.0' FADD/FSUB operand. I think that should be done ahead of this patch as an NFC commit.

So let's use the codegen definitions since there's no hope of sorting out the connection to the higher-level definitions in this patch. :)

  1. FPOpFusion::Fast - Enable fusion of FP ops wherever it's profitable.
  1. UnsafeFPMath - This flag is enabled when the -enable-unsafe-fp-math flag is specified on the command line. When this flag is off (the default), the code generator is not allowed to produce results that are "less precise" than IEEE allows...UnsafeFPMath implies LessPreciseFPMAD.
  1. LessPreciseFPMADOption - This flag is enabled when the -enable-fp-mad is specified on the command line. When this flag is off (the default), the code generator is not allowed to generate mad (multiply add) if the result is "less precise" than doing those operations individually.
  1. NoInfsFPMath - This flag is enabled when the -enable-no-infs-fp-math flag is specified on the command line. When this flag is off (the default), the code generator is not allowed to assume the FP arithmetic arguments and results are never +-Infs.

Does NoInfsFPMath override FPOpFusionFast? Or do we need another enum value/flag to answer that question? Are there other transforms that need to be aware of this interaction?

I mentioned this patch in the context of a larger llvm-dev discussion about how to specify relaxed FP:
http://lists.llvm.org/pipermail/llvm-dev/2016-November/107140.html

So let's use the codegen definitions since there's no hope of sorting out the connection to the higher-level definitions in this patch. :)

  1. FPOpFusion::Fast - Enable fusion of FP ops wherever it's profitable.
  2. UnsafeFPMath - This flag is enabled when the -enable-unsafe-fp-math flag is specified on the command line. When this flag is off (the default), the code generator is not allowed to produce results that are "less precise" than IEEE allows...UnsafeFPMath implies LessPreciseFPMAD.
  3. LessPreciseFPMADOption - This flag is enabled when the -enable-fp-mad is specified on the command line. When this flag is off (the default), the code generator is not allowed to generate mad (multiply add) if the result is "less precise" than doing those operations individually.
  4. NoInfsFPMath - This flag is enabled when the -enable-no-infs-fp-math flag is specified on the command line. When this flag is off (the default), the code generator is not allowed to assume the FP arithmetic arguments and results are never +-Infs.

    Does NoInfsFPMath override FPOpFusionFast? Or do we need another enum value/flag to answer that question? Are there other transforms that need to be aware of this interaction?

No, I think they're somewhat orthogonal and supplement each other in this particular case.

As I understand it, FPOpFusion::Fast says: enable fusion wherever it's profitable *even if rounding is slightly changed* (in practice, the intermediate rounding step is skipped) -- this should be documented.

Generally, whenever something comes with the tradeoff of less or slightly changed precision, you don't expect massive changes in behaviour with inf/nans.

And there are plenty of practical optimizations that FPOpFusion::Fast enables (mul + add --> fma), just this particular case here should require the user to say both "I'm fine with slightly different precision" and "I'm fine with massively different behavior in the face of +/-Inf".

I started an llvm-dev thread specifically for this case:
http://lists.llvm.org/pipermail/llvm-dev/2016-November/107276.html

  1. The consensus is that we cannot do this transform: x * (y + 1) --> x * y + x ...using just FPOpFusion::Fast, so we're confident that the current codegen is buggy.
  1. I think everyone agrees that we can do the transform if we have both TargetOptions::UnsafeFPMath and TargetOptions::NoInfsFPMath.
  1. There's an unanswered question of whether we can do this with FPOpFusion::Fast with TargetOptions::NoInfsFPMath. That depends on whether x * y + x is always more precise than the original x * (y + 1).
  1. There's an unanswered question of whether we can do this with FPOpFusion::Fast with TargetOptions::NoInfsFPMath. That depends on whether x * y + x is always more precise than the original x * (y + 1).

Here's an example with binary floating point numbers with two bits of mantissa.

x = 1.01
y = 111

x * (y + 1) = 1.01 * 1000 = 1010 (this is the exact result; no rounding occurs at any step)

x * y + x = 1000.11 + 1.01 =r 1000 + 1.01 = 1001.01 =r 1000 (with rounding towards zero)

The example relies on rounding towards zero at least in the second step, but it does seem to disqualify the transform at least for FMAD (i.e. with an intermediate rounding step). I suspect FMA (without intermediate rounding) is fine, but I have no proof.

nhaehnle updated this revision to Diff 79758.Nov 30 2016, 8:52 AM
nhaehnle edited edge metadata.

Rebased on D27260.

Tentatively disable the FMAD variant unless unsafe-math is enabled. I
haven't yet checked/updated the tests, will follow up if the change in
general is agreed on.

Rebased on D27260.

Tentatively disable the FMAD variant unless unsafe-math is enabled. I
haven't yet checked/updated the tests, will follow up if the change in
general is agreed on.

The logic looks correct to me, but I'd change the code to make it (hopefully) clearer:

// Floating-point multiply-add without intermediate rounding.
bool UseFMA = (Options.UnsafeFPMath || Options.AllowFPOpFusion == FPOpFusion::Fast) && 
                                  TLI.isFMAFasterThanFMulAndFAdd(VT) &&
                                  (!LegalOperations || TLI.isOperationLegalOrCustom(ISD::FMA, VT));

// Floating-point multiply-add with intermediate rounding. This can result
// in a less precise result due to the changed rounding order.
bool UseFMAD = Options.UnsafeFPMath && 
                                     (LegalOperations && TLI.isOperationLegal(ISD::FMAD, VT));

Depending on the formatting, maybe that's even harder to read though...

If we can't prove that the FMA (without intermediate rounding) version is safe, then we should conservatively disable its FPOpFusion::Fast option for now.

nhaehnle updated this revision to Diff 79910.Dec 1 2016, 7:30 AM
nhaehnle edited edge metadata.

Rearrange the logic. It looks quite readable to me this way, and
clang-format-diff agrees with the formatting.

Thinking about the FMA case again, isn't it actually obvious? At least today
I'm quite convinced by the following argument:

The mathematically exact result of x * (y + 1) is equal to that of `x * y +
x`. FMA produces the best rounding of this mathematically exact result. So
whatever happens to the rounding in (fmul x (fadd y 1.0)), the FMA variant
can only be more accurate.

Not sure why I didn't think of that before...

Tests are all passing with the changes from this patch, except one
unfortunate code quality regression in AMDGPU that I think should be
discussed separately.

spatel accepted this revision.Dec 1 2016, 8:28 AM
spatel edited edge metadata.
spatel added subscribers: delena, craig.topper.

Rearrange the logic. It looks quite readable to me this way, and
clang-format-diff agrees with the formatting.

Thinking about the FMA case again, isn't it actually obvious? At least today
I'm quite convinced by the following argument:

The mathematically exact result of x * (y + 1) is equal to that of `x * y +
x`. FMA produces the best rounding of this mathematically exact result. So
whatever happens to the rounding in (fmul x (fadd y 1.0)), the FMA variant
can only be more accurate.

That sounds good to me. Is that quoted from a different thread?

As a follow-up patch, we'll want to add RUN lines to the x86 test files to add coverage back for the hoped-for FMA codegen (cc'ing @craig.topper and @delena because I think they may have added those tests).

LGTM.

This revision is now accepted and ready to land.Dec 1 2016, 8:28 AM
This revision was automatically updated to reflect the committed changes.