This is an archive of the discontinued LLVM Phabricator instance.

[Clang][Driver] Fix -ffast-math/-ffp-contract interaction
Needs ReviewPublic

Authored by wristow on Jan 13 2020, 7:19 PM.

Details

Summary

Fused Multiply Add (FMA) was not always being disabled when the switch -ffp-contract=off was used. More specifically, FMA is enabled when -ffp-contract=fast is used, and it also is enabled implicitly with -ffast-math. The combination:

-ffast-math -ffp-contract=off

is intended to leave most of fast-math enabled (for example, leave reassociation, reciprocal transformations, etc.) enabled, but disable the use of FMA. However, FMA was incorrectly left enabled with the above switch combination. This commit fixes this, allowing users to enable most of the fast-math optimizations, while disabling the FMA feature.

Diff Detail

Event Timeline

wristow created this revision.Jan 13 2020, 7:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 13 2020, 7:19 PM
hfinkel added inline comments.
clang/lib/Driver/ToolChains/Clang.cpp
2792

I think that you just need

&& !FPContract.equals("off")

or

&& !(FPContract.equals("off") || FPContract.equals("on"))

of which I think the latter. fp-contract=no is also not a fast-math-compatible setting in that regard, right?

2823

So now we pass it twice, because we also pass it here?

2845

You want the check here, I think, and not below so that if parts of fast-math are disabled __FAST_MATH__ is not set. That seems to be what the comment/code currently do, although what does GCC do in this regard?

wristow marked 3 inline comments as done.Jan 14 2020, 2:38 PM

Thanks for the quick feedback @hfinkel

clang/lib/Driver/ToolChains/Clang.cpp
2792

Eliminating the FPContractDisabled variable, and instead checking that the value isn't "off" (or maybe isn't ("off" or "on")), sounds good. I'm not 100% sure of whether just "off" or both "off" and "on" ought to be checked. More on that in the discussion about __FAST_MATH__ in one of your other comments.

2823

Whoops. This one was here originally, so I'll leave it here, and delete the one I needlessly added, above.

2845

I had originally included the check with the set of conditions of line 2763, but then the warn_drv_overriding_flag_option warning (below) was missed in the case of -ffp-model=fast -ffp-contract=off. So I think the check does need to remain inside the above conditional.

Regarding the __FAST_MATH__ aspect and whether ffp-contract=on also ought to suppress the -ffast-math emission (and what GCC does), this is somewhat fuzzy for me. After getting your comments, I'm leaning toward including both the "off" and "on" in the check (rather than only disabling it in the case of "off", as the current patch does). But I want to at least note my observations here, to make the current situation with both Clang and GCC clear, in case others want to chime in.

Somewhat surprisingly to me, GCC's behavior wrt __FAST_MATH__ seems buggy in general, so it's not a good base to compare against. For example, the following set of switches result in the __FAST_MATH__ macro being defined (although I expected they would have disabled it):

-ffast-math -fno-associative-math -fno-reciprocal-math -frounding-math

(I've just tested GCC 7.4.0.)

I could have sworn that in the past that for GCC I had seen this sort of disabling of part of the "fast-math set" result in the disabling of the def of __FAST_MATH__. As you'd expect with Clang (from the comment in our code code), the above switches do suppress the definition of __FAST_MATH__.

That said, GCC does suppress the __FAST_MATH__ def in some situations. E.g., both GCC and Clang suppress it with the following switches:

-ffast-math -fmath-errno
-ffast-math -ftrapping-math
-ffast-math -fsigned-zeros

In any case, for the -ffp-contract switch, none of the settings suppress the def of __FAST_MATH__ with GCC. That is, all of the following result in __FAST_MATH__ being defined:

-ffast-math -ffp-contract=fast
-ffast-math -ffp-contract=on
-ffast-math -ffp-contract=off

I think that's buggy (but hard to know if it's -ffp-contract specific, or part of the general buggyness of GCC and __FAST_MATH__). Clang (prior to the change described here) also defines __FAST_MATH__ in all of these -ffp-contract situations. But I think that's just part of the bug I'm trying to address here.

Bottom line: The intended semantics are somewhat fuzzy to me. I'd say:

-ffast-math -ffp-contract=fast      // should leave __FAST_MATH__ defined
-ffast-math -ffp-contract=off       // should not define __FAST_MATH__
-ffast-math -ffp-contract=on        // unsure ????

For that last one ("unsure ???"), in the current patch I was leaving __FAST_MATH__ defined. But I'm now leaning toward also suppressing the __FAST_MATH__ def in that case. Unless anyone disagrees, I'll include that change when I update the patch.

wristow updated this revision to Diff 238143.Jan 14 2020, 4:52 PM
wristow retitled this revision from ix -ffast-math/-ffp-contract interaction to Fix -ffast-math/-ffp-contract interaction.

Addressed comments from @hfinkel .

wristow marked 2 inline comments as done.Jan 14 2020, 4:54 PM
wristow marked an inline comment as done.Jan 14 2020, 4:58 PM
spatel added inline comments.Jan 15 2020, 11:41 AM
llvm/test/CodeGen/PowerPC/fmf-propagation.ll
201–203 ↗(On Diff #238143)

I was ok with the reasoning up to here, but this example lost me.
Why does 'contract' alone allow splitting an fma?
Is 'contract' relevant on anything besides fadd (with an fmul operand)?

wristow marked an inline comment as done.Jan 15 2020, 5:32 PM
wristow added inline comments.
llvm/test/CodeGen/PowerPC/fmf-propagation.ll
201–203 ↗(On Diff #238143)

I have to admit I'm handwaving here. I don't really understand why 'contract' alone allows this simpliciation:
fma(X, 7.0, X * 42.0) --> X * 49.0

to happen. But either that's correct, or it's a separate bug (and I waved my hands about explaining more detail).

The short story is that even without my proposed change, having only 'contract' on the FMA intrinsic results in this being simplified to X * 49 (and also having only 'reassoc' did). The original comment:

; This is the minimum FMF needed for this transform - the FMA allows reassociation.

is incomplete/misleading. A more complete comment (relevant before my change) would have been:

; This is the minimum FMF needed for this transform - either 'reassoc' or 'contract' applied to the FMA intrinsic allows reassociation.

And with my change, the result is that 'reassoc' is no longer relevant. But since I don't really understand why it should be simplified with only contract, I should put a TODO comment in. That is, to me it seems like both 'contract' and 'reassoc' should be needed for this simplification to happen (whereas previously it worked with either of them individually). So maybe change this comment to:

; This is the minimum FMF needed for this transform - applying 'contract' to the FMA intrinsic allows reassociation.
; TODO: It seems 'contract' and 'reassoc' combined should be needed for this transform.  Why does it work with only 'contract'?
mcberg2017 added a comment.EditedJan 15 2020, 5:44 PM

We crossed that bridge internally at Apple a while ago, meaning I have some code debt for cleaning up open source for fma formation that uses contract and reassoc differently than we do today, both together and separately, case by case.

We crossed that bridge internally at Apple a while ago, meaning I have some code debt for cleaning up open source for fma formation that uses contract and reassoc differently than we do today, both together and separately, case by case.

Ah, great. I don't have a preference about the order of patches (ie, is it better to make this change first or do the backend cleanup first), so whatever works better.

But it would be better to have all of the baseline tests in place, so we know where we stand currently. So I'd prefer to have the PPC and x86 tests pre-committed to master (change test comments as needed to match current/future reality).

We could also decouple the clang part of this patch from the backend change IIUC. But do we need another change (or at least tests) to show how the driver difference translates in the clang front-end to LLVM FMF and/or function attributes?

clang/test/Driver/fast-math.c
183–184

Do we need another pair of tests for -ffp-contract=on?

I should do the other DAG combiner fma changes after this is wrapped up.

This all sounds good to me.

So to make sure we're all on the same page, my understanding is that the plan forward is:

  1. Make the Clang change first (including adding another pair of tests for -ffp-contract=on).
  2. Update the LLVM tests illustrating the current baseline state.
  3. Make the LLVM change shown here, and update the tests with the fixes.
  4. Bring in the DAG combiner work that Michael has done internally at Apple.

Points 1, 2, and 3 are essentially the points in the patch posted here, so I'll do that. And of course Michael will then take on point 4.

Is that the plan? If yes, I'll transition this item to just be the Clang pieces, and I'll spin off a new one to do the LLVM portion of what is posted here.

Sanjay, regarding:

But it would be better to have all of the baseline tests in place, so we know where we stand currently.

I'm interpreting that as applying to the LLVM tests. That is, the Clang change, along with the updated tests, can be in one commit, rather than first updating the Clang driver tests with the current state. If you'd prefer two commits on the Clang side as well, let me know.

This all sounds good to me.

So to make sure we're all on the same page, my understanding is that the plan forward is:

  1. Make the Clang change first (including adding another pair of tests for -ffp-contract=on).
  2. Update the LLVM tests illustrating the current baseline state.
  3. Make the LLVM change shown here, and update the tests with the fixes.
  4. Bring in the DAG combiner work that Michael has done internally at Apple.

Points 1, 2, and 3 are essentially the points in the patch posted here, so I'll do that. And of course Michael will then take on point 4.

Is that the plan? If yes, I'll transition this item to just be the Clang pieces, and I'll spin off a new one to do the LLVM portion of what is posted here.

SGTM

Sanjay, regarding:

But it would be better to have all of the baseline tests in place, so we know where we stand currently.

I'm interpreting that as applying to the LLVM tests. That is, the Clang change, along with the updated tests, can be in one commit, rather than first updating the Clang driver tests with the current state. If you'd prefer two commits on the Clang side as well, let me know.

One commit for the clang changes should be ok; it's a very small diff. But I'm still not sure if the driver change induces frontend diffs that we should make visible via tests.

One commit for the clang changes should be ok; it's a very small diff. But I'm still not sure if the driver change induces frontend diffs that we should make visible via tests.

The only thing I can think of is that it changes whether/when __FAST_MATH__ is defined. But that'll be indirectly tested, via the updated tests in "Driver/fast-math.c", which will verify that "-ffast-math" is passed appropriately (and the __FAST_MATH__ dependency on "-ffast-math" is already tested in "Preprocessor/predefined-macros.c").

In addition to adding the second pair of driver tests you suggested for -ffp-contract=on, I'll also add another pair for -ffp-contract=fast to verify that "-ffast-math" is passed in that case (no change in behavior for that pair, but just confirming it continues to work correctly).

wristow updated this revision to Diff 238595.Jan 16 2020, 1:19 PM
wristow retitled this revision from Fix -ffast-math/-ffp-contract interaction to [Clang][Driver] Fix -ffast-math/-ffp-contract interaction.

Changing this to address only the Clang driver aspect, as discussed in the comments. Will spin off a separate patch for the LLVM side.

spatel added a comment.EditedJan 17 2020, 5:56 AM

This follows the reasoning from the earlier discussion, but after re-reading the gcc comment in particular, I'm wondering whether this is what we really want to do...
If FAST_MATH is purely here for compatibility with gcc, then should we mimic gcc behavior in setting that macro even if we think it's buggy?
Ie, when we translate these settings to LLVM's FMF, we can still override the -ffast-math flag by checking the -ffp-contract flag (if I'm seeing it correctly, the existing code will pass that alongside -ffast-math when contract is set to on/off).

clang/lib/Driver/ToolChains/Clang.cpp
2788

This comment doesn't match the code any more. Should be "If -ffp-contract=off/on, then..."

clang/test/Driver/fast-math.c
184

typo: conteact

Adding potential FP reviewers for question of gcc- (potentially-buggy-) compatibility.

wristow updated this revision to Diff 238933.EditedJan 17 2020, 5:26 PM

Updated patch to correct a comment and fix a typo.

Regarding the point from @spatel :

This follows the reasoning from the earlier discussion, but after re-reading the gcc comment in particular, I'm wondering whether this is what we really want to do...
If __FAST_MATH__ is purely here for compatibility with gcc, then should we mimic gcc behavior in setting that macro even if we think it's buggy?

Thinking about it, yes, I think we should mimic GCC. FTR, I've tried some older GCCs via Compiler Explorer, and found them to be consistent in that removing some aspects of -ffast-math disables the generation of __FAST_MATH__ and removing other aspects of it does not. So my previous recollection ("I could have sworn that in the past that for GCC I had seen this sort of disabling of part of the "fast-math set" result in the disabling of the def of __FAST_MATH__.") is wrong. Bottom line, maybe it's not a bug in GCC, and instead it's the desired GCC behavior (just that I cannot find specific documentation of precisely which settings disable the generation of it).

However, I'd call that a bigger question than this -ffp-contract aspect of __FAST_MATH__, and a separate bug, that can be fixed separately.

Ie, when we translate these settings to LLVM's FMF, we can still override the -ffast-math flag by checking the -ffp-contract flag (if I'm seeing it correctly, the existing code will pass that alongside -ffast-math when contract is set to on/off).

How to handle this seems like an implementation question. The code here is just deciding whether or not we intend to pass "-ffast-math" to cc1 (it isn't directly defining __FAST_MATH__). If we do pass it to cc1, then "clang/lib/Frontend/InitPreprocessor.cpp" will pre-define __FAST_MATH__:

if (LangOpts.FastMath)
  Builder.defineMacro("__FAST_MATH__");

It seems to me that a straightforward way to deal with this question is to make the above test more elaborate (that is, if LangOpts.FastMath is set, OR whatever the appropriate subset of fast-math switches are set). If we do that, we don't need to deal with the "umbrella" aspect of "-ffast-math", which gets messy.

Which is to say the approach here can stay the same as it currently is (-ffp-contract=off/on suppressing the passing of "-ffast-math" to cc1). Although the comment about __FAST_MATH__ here in "Clang.cpp" should be changed, if we take this approach.

How to handle this seems like an implementation question. The code here is just deciding whether or not we intend to pass "-ffast-math" to cc1 (it isn't directly defining __FAST_MATH__). If we do pass it to cc1, then "clang/lib/Frontend/InitPreprocessor.cpp" will pre-define __FAST_MATH__:

if (LangOpts.FastMath)
  Builder.defineMacro("__FAST_MATH__");

It seems to me that a straightforward way to deal with this question is to make the above test more elaborate (that is, if LangOpts.FastMath is set, OR whatever the appropriate subset of fast-math switches are set). If we do that, we don't need to deal with the "umbrella" aspect of "-ffast-math", which gets messy.

Which is to say the approach here can stay the same as it currently is (-ffp-contract=off/on suppressing the passing of "-ffast-math" to cc1). Although the comment about __FAST_MATH__ here in "Clang.cpp" should be changed, if we take this approach.

That sounds fine to me, so could update that comment/this patch.

Let's give this a couple of days after updating to see if anyone else has feedback though. I'm never exactly sure what our responsibility is with respect to gcc compatibility.

I've had a quick look at GCC, and it seems there's a couple of different issues.

First of all, -ffast-math is a "collective" flag that has no separate meaning except setting a bunch of other flags. Currently, these flags are: -fno-math-errno, -funsafe-math-optimizations (which is itself a collective flag enabling -fno-signed-zeros, -fno-trapping-math, -fassociative-math, and -freciprocal-math), -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans, -fcx-limited-range, and -fexcess-precision=fast.

When deciding whether to define __FAST_MATH__, GCC will not specifically look at whether or not the -ffast-math option itself was given, but whether the status of those other flags matches what would have been set by -ffast-math. However, it does not include all of the flags; currently only -fmath-errno, -funsafe-math-optimizations, -fsigned-zeros, -ftrapping-math, -ffinite-math-only, and -fexcess-precision are checked, while -fassociative-math, -freciprocal-math, -frounding-math, -fsignaling-nans, and -fcx-limited-range are ignored.

I'm not sure whether this is deliberate (but it seems weird) or just a bug. I can ask the GCC developers ...

A separate question is the interaction of -ffast-math with -ffp-contract=. Currently, there is no such interaction whatsoever in GCC: -ffast-math does not imply any particular -ffp-contract= setting, and vice versa the -ffp-contract= setting is not considered at all when defining __FAST_MATH__. This seems at least internally consistent.

I'm not sure whether this is deliberate (but it seems weird) or just a bug. I can ask the GCC developers ...

Please do. If there's a rationale, we should know.

A separate question is the interaction of -ffast-math with -ffp-contract=. Currently, there is no such interaction whatsoever in GCC: -ffast-math does not imply any particular -ffp-contract= setting, and vice versa the -ffp-contract= setting is not considered at all when defining __FAST_MATH__. This seems at least internally consistent.

That's interesting, and as you said, internally consistent behavior in GCC. I think it would be a fine idea for us to do the same thing.

Looking into this point, I see that (ignoring fastmath for the moment) our default -ffp-contract behavior is different than GCC's. GCC enables FMA by default when optimization is high enough ('-O2') without any special switch needed. For example, taking an architecture that supports FMA (Haswell), GCC has the following behavior:

float test(float a, float b, float c)
{
  // FMA is enabled by default for GCC (on Haswell), so done at -O2:
  //    gcc -S -O2 -march=haswell test.c  # FMA happens
  //    $ gcc -S -march=haswell test.c ; egrep 'mul|add' test.s
  //            vmulss  -12(%rbp), %xmm0, %xmm0
  //            vaddss  -4(%rbp), %xmm0, %xmm0
  //    $ gcc -S -O2 -march=haswell test.c ; egrep 'mul|add' test.s
  //            vfmadd231ss     %xmm2, %xmm1, %xmm0
  //    $
  return a + (b * c);
}

As we'd expect, GCC does disable FMA with -ffp-contract=off (this is irrespective of whether -ffast-math was specified). Loosely, GCC's behavior can be summarized very simply on this point as:
Suppress FMA when -ffp-contract=off.
(As an aside, GCC's behavior with -ffp-contract=on is non-intuitive to me. It relates to the FP_CONTRACT pragma, which as far as I can see is ignored by GCC.)

In contrast, we do not enable FMA by default (via general optimization, such as '-O2'). For example:

$ clang -S -O2 -march=haswell test.c ; egrep 'mul|add' test.s
        vmulss  %xmm2, %xmm1, %xmm1
        vaddss  %xmm0, %xmm1, %xmm0
        .addrsig
$

I think that whether we want to continue doing that (or instead, enable it at '-O2', like GCC does), is a separate issue. I can see arguments either way.

We do enable FMA with -ffp-contract=fast, as desired (and also with -ffp-contract=on). And we do "leave it disabled" with -ffp-contract=off (as expected).

Now, bringing fastmath back into the discussion, we do enable FMA with -fffast-math. If we decide to continue leaving it disabled by default, then enabling it with -ffast-math seems perfectly sensible. (If we decide to enable it by default when optimization is high enough, like GCC, then turning on -ffast-math should not disable it of course.)

The problem I want to address here is that if the compiler is a mode where FMA is enabled (whether that's at '-O2' "by default", or whether it's because the user turned on -ffast-math), then appending -ffp-contract=off should disable FMA. I think this patch (along with the small change to "DAGCombiner.cpp", in an earlier version of this patch) is a reasonable approach to solving that. I'd say this patch/review/discussion has raised two additional questions:

  1. Under what conditions should __FAST_MATH__ be defined?
  2. Should we enable FMA "by default" at (for example) '-O2'?

I think these additional questions are best addressed separately. My 2 cents are that for (1), mimicking GCC's behavior seems reasonable (although that's assuming we don't find out that GCC's __FAST_MATH__ behavior is a bug). And for (2), I don't have a strong preference.

  1. Should we enable FMA "by default" at (for example) '-O2'?

We recently introduced a new option "-ffp-model=[precise|fast|strict], which is supposed to serve as an umbrella option for the most common combination of options. I think our default behavior should be equivalent to -ffp-model=precise, which enables contraction. It currently seems to enable -ffp-contract=fast in the precise model (https://godbolt.org/z/c6w8mJ). I'm not sure that's what it should be doing, but whatever the precise model does should be our default.

  1. Should we enable FMA "by default" at (for example) '-O2'?

We recently introduced a new option "-ffp-model=[precise|fast|strict], which is supposed to serve as an umbrella option for the most common combination of options. I think our default behavior should be equivalent to -ffp-model=precise, which enables contraction. It currently seems to enable -ffp-contract=fast in the precise model (https://godbolt.org/z/c6w8mJ). I'm not sure that's what it should be doing, but whatever the precise model does should be our default.

That makes sense to me. So -ffp-model/-ffp-contract interaction should be examined, and possibly adjusted. If an adjustment is needed, I think it makes sense to handle that interaction separately.

I'm going to update this patch to adjust the comment that @spatel and I discussed earlier, so it no longer implies that __FAST_MATH__ is defined only when "all of" -ffast-math is enabled.

wristow updated this revision to Diff 240305.Jan 24 2020, 3:18 PM

Update a comment to remove the discussion about enabling the __FAST_MATH__ preprocessor macro. The handling of the setting of __FAST_MATH__ is done in "clang/lib/Frontend/InitPreprocessor.cpp", and once we decide on the set of conditions that enable that definition, we can add an appropriate comment there.

clang/lib/Driver/ToolChains/Clang.cpp
2792

I think this would read better as "... && !FPContract.equals("off") && !FPContract.equals("on")" The '||' in the middle of all these '&&'s combined with the extra parens from the function call trips me up.

2846–2847

As above, I'd prefer "!off && !on".

2854

This is a bit of an oddity in our handling. The FPContract local variable starts out initialized to an empty string. So, what we're saying here is that if you've used all the individual controls to set the rest of the fast math flags, we'll turn on fp-contract=fast also. That seems wrong. If you use "-ffast-math", FPContract will have been set to "fast" so that's not applicable here.

I believe this means

-fno-honor-infinities -fno-honor-nans -fno-math-errno -fassociative-math -freciprocal-math -fno-signed-zeros -fno-trapping-math -fno-rounding-math

will imply "-ffp-contract=fast". Even worse:

-ffp-contract=off -fno-fast-math -fno-honor-infinities -fno-honor-nans -fno-math-errno -fassociative-math -freciprocal-math -fno-signed-zeros -fno-trapping-math -fno-rounding-math

will imply "-ffp-contract=fast" because "-fno-fast-math" resets FPContract to empty.

I think we should initialize FPContract to whatever we want to be the default (on?) and remove the special case for empty here. Also, -fno-fast-math should either not change FPContract at all or set it to "off". Probably the latter since we're saying -ffast-math includes -ffp-contract=on.

clang/test/Driver/fast-math.c
196

What about "-ffp-contract=off -ffast-math"? The way the code is written that will override the -ffp-contract option. That's probably what we want, though it might be nice to have a warning.

wristow marked 4 inline comments as done.Jan 24 2020, 7:18 PM
wristow added inline comments.
clang/lib/Driver/ToolChains/Clang.cpp
2792

Sounds good. Will do.

2846–2847

As above, will do.

2854

This is a bit of an oddity in our handling.

Yes it is!

This is certainly getting more complicated than I had originally expected. I feel there isn't a clear spec on what we want in terms of whether FMA should be enabled "automatically" at (for example) '-O2', and/or whether it should be enabled by -ffast-math. I'm very willing to make whatever change is needed here, to meet whatever spec we all ultimately agree on.

So I think this patch should be put on hold until we decide on these wider aspects.

One minor note about:

... Probably the latter since we're saying -ffast-math includes -ffp-contract=on.

I think that is intended to say "-ffast-math includes -ffp-contract=fast". The semantics of -ffp-contract=on are another part that seems unclear (or at least, more complicated, since it then gets into the FP_CONTRACT pragma).

clang/test/Driver/fast-math.c
196

Yes, currently -fffast-math will override that earlier -ffp-contract=off settings. It's unclear to me whether we're ultimately intending for that to be the behavior (because GCC doesn't do that, as @uweigand noted). I guess this is another reason to hold off for a bit, until we figure out the wider spec.

About:

This is a bit of an oddity in our handling.

Yes it is!

This is certainly getting more complicated than I had originally expected. I feel there isn't a clear spec on what we want in terms of whether FMA should be enabled "automatically" at (for example) '-O2', and/or whether it should be enabled by -ffast-math. I'm very willing to make whatever change is needed here, to meet whatever spec we all ultimately agree on.

So I think this patch should be put on hold until we decide on these wider aspects.

Thinking about this a bit more, one thing that I believe there is a clear spec on, is that when -ffp-contract=off is "the last switch related to FMA", then FMA ought to be disabled. So for example, -ffast-math -ffp-contract=off should result in FMA being disabled, irrrspective of how we decide on the "wider aspects" of the spec mentioned above. (FTR, this inability to enable FastMath in general, but explicitly disable FMA, was a problem one of our customers reported, which is what initially motivated this patch.)

Note that the points about the "oddity" of "-ffp-contract=fast" being implied by the two long sets of switches in this comment are unrelated to the change proposed here. That is, that oddity is the current behavior of Clang.

So one approach we could take is to first fix the problem of when -ffp-contract=off is the "the last switch related to FMA", it isn't disabling FMA in some cases. And then as a followup, address the oddities that have been noted here. If we decide to take this two-step approach, I'm happy to take on the followup work (once we decide what the spec should be).

wristow updated this revision to Diff 240689.Jan 27 2020, 2:50 PM

Used the clearer '!off && !on' (rather than '!(off || on)') in the checks.

Added more tests that note the current situation that -ffast-math enables FMA. overriding an earlier switch that had disabled it (included a "TODO" comment in these tests, since it isn't clear if that behavior is what we will ultimately stick with).

wristow marked 4 inline comments as done.Jan 27 2020, 2:54 PM
wristow added inline comments.
clang/test/Driver/fast-math.c
196

Added more tests that illustrate this behavior of "-ffp-contract=off/on -ffast-math" overriding and enabling FMA. (I haven't added a warning. Holding off until we decide on other questions discussed in this review.)

I'm not sure whether this is deliberate (but it seems weird) or just a bug. I can ask the GCC developers ...

Please do. If there's a rationale, we should know.

Sorry for the delay ... I've raised that question on the GCC list, and the opinion seems to be that the current behavior is indeed a bug -- however there's still discussion ongoing on what the proper fix ought to be. I'll report back once we've agreed on a solution on the GCC side ...

Revisiting this patch. I think that before fixing the -ffp-contract=off problem I originally raised here, there are two questions that have come up in this discussion that we should first resolve. Specifically:

(1) Do we want to enable FMA transformations by default (at appropriate optimization levels), like GCC does? That is, should FMA be done for targets that support it with a command as simple as the following?

clang -O2 -c test.c

FTR, currently we don't do FMA with simply -O2. We do enable FMA if -ffast-math is added, or (of course) if FMA is explicitly enabled via -ffp-contract=fast. Also note that with GCC, FMA is completely orthogonal to -ffast-math. For example:

gcc -O2 -ffp-contract=off -ffast-math test.c  # '-ffast-math' does not turn FMA "back on"

(2) Under what conditions do we want to predefine the __FAST_MATH__ macro? That is, should we continue with our current policy (essentially, define it when "all the fast-math-flags are enabled"), or do we want to do precisely what GCC does (define it when some particular subset of the fast-math-flags are enabled)?

My vote for (1) is yes. To add to this, I would make the -ffp-contract setting independent of -ffast-math, as GCC does. But to be explicit, enabling FMA at -O2 is a change in behavior that may be unexpected for some users -- so I can imagine some users objecting to this change.

My vote for (2) is to keep our current behavior. This approach seems more sensible to me than the GCC behavior. And we're not boxing ourselves into anything by keeping the current behavior (that is, we can just as easily change this in the future if we find the GCC behavior is better for some reason). Among other things, this means that whether we define __FAST_MATH__ will continue to not be impacted by the -ffp-contract setting -- that makes sense to me, especially if we make FMA orthogonal to -ffast-math, as discussed in (1).

One related point. Our documentation (UsersManual.rst) for -ffp-model=precise says:

Disables optimizations that are not value-safe on floating-point data, although FP contraction (FMA) is enabled (`-ffp-contract=fast`). This is the default behavior.

So this explicitly says that FMA is enabled by default (consistent with GCC in this regard). However, that doesn't match our implementation, as we have FMA disabled by default, as noted in this discussion. Concretely, the following two commands are not equivalent:

clang -c -O2 test.c                         # FMA is disabled
clang -c -O2 -ffp-model=precise test.c      # FMA is enabled

If we decide to enable FMA by default (that is, if we agree on "yes" for (1)), then the current -ffp-model=precise description will become correct. But if we decide to continue to disable FMA by default, then we'll need to change the above description (possibly add another value to -ffp-model=<value> that will really be the default).

spatel added a subscriber: mibintc.Mar 12 2020, 9:50 AM

Revisiting this patch. I think that before fixing the -ffp-contract=off problem I originally raised here, there are two questions that have come up in this discussion that we should first resolve. Specifically:

(1) Do we want to enable FMA transformations by default (at appropriate optimization levels), like GCC does? That is, should FMA be done for targets that support it with a command as simple as the following?

clang -O2 -c test.c

This has been discussed/tried a few times including very recently. I'm not sure where we stand currently, but here's some background:
https://reviews.llvm.org/rL282259
D74436 - cc @mibintc

And (sorry for the various flavors of links to the dev-lists, but that's how it showed up in the search results):
http://lists.llvm.org/pipermail/llvm-dev/2017-March/111129.html
http://lists.llvm.org/pipermail/cfe-dev/2020-February/064645.html
https://groups.google.com/forum/#!msg/llvm-dev/nBiCD5KvYW0/yfjrVzR7DAAJ
https://groups.google.com/forum/#!topic/llvm-dev/Aev2jQYepKQ
http://clang-developers.42468.n3.nabble.com/RFC-FP-contract-on-td4056277.html

Revisiting this patch. I think that before fixing the -ffp-contract=off problem I originally raised here, there are two questions that have come up in this discussion that we should first resolve. Specifically:

(1) Do we want to enable FMA transformations by default (at appropriate optimization levels), like GCC does? That is, should FMA be done for targets that support it with a command as simple as the following?

clang -O2 -c test.c

This has been discussed/tried a few times including very recently. I'm not sure where we stand currently, but here's some background:
https://reviews.llvm.org/rL282259
D74436 - cc @mibintc

And (sorry for the various flavors of links to the dev-lists, but that's how it showed up in the search results):
http://lists.llvm.org/pipermail/llvm-dev/2017-March/111129.html
http://lists.llvm.org/pipermail/cfe-dev/2020-February/064645.html
https://groups.google.com/forum/#!msg/llvm-dev/nBiCD5KvYW0/yfjrVzR7DAAJ
https://groups.google.com/forum/#!topic/llvm-dev/Aev2jQYepKQ
http://clang-developers.42468.n3.nabble.com/RFC-FP-contract-on-td4056277.html

A few weeks ago there was a discussion on cfe-dev and llvm-dev initiated by @andrew.w.kaylor and while the response wasn't unanimous, there was an overwhelming sentiment (in my opinion) that it is correct and desireable to enable ffp-contract=on by default even at -O0. I had submitted a patch to do that. However the patch caused performance regressions in about 20 LNT tests and the patch was reverted, the regression was seen on multiple architecture. I can send provide a few more details about the performance problems, I don't have a solution to the LNT regression.

This patch, which hasn't been committed, contains modifications to the UserManual with many details concerning 'floating point semantic modes" and the relation to floating point command line options. This is from a discussion that @andrew.w.kaylor initiated on the discussion lists. https://reviews.llvm.org/D74436

Revisiting this patch. I think that before fixing the -ffp-contract=off problem I originally raised here, there are two questions that have come up in this discussion that we should first resolve. Specifically:

(1) Do we want to enable FMA transformations by default (at appropriate optimization levels), like GCC does? That is, should FMA be done for targets that support it with a command as simple as the following?

clang -O2 -c test.c

This has been discussed/tried a few times including very recently. I'm not sure where we stand currently, but here's some background:
https://reviews.llvm.org/rL282259
D74436 - cc @mibintc

And (sorry for the various flavors of links to the dev-lists, but that's how it showed up in the search results):
http://lists.llvm.org/pipermail/llvm-dev/2017-March/111129.html
http://lists.llvm.org/pipermail/cfe-dev/2020-February/064645.html
https://groups.google.com/forum/#!msg/llvm-dev/nBiCD5KvYW0/yfjrVzR7DAAJ
https://groups.google.com/forum/#!topic/llvm-dev/Aev2jQYepKQ
http://clang-developers.42468.n3.nabble.com/RFC-FP-contract-on-td4056277.html

A few weeks ago there was a discussion on cfe-dev and llvm-dev initiated by @andrew.w.kaylor and while the response wasn't unanimous, there was an overwhelming sentiment (in my opinion) that it is correct and desireable to enable ffp-contract=on by default even at -O0. I had submitted a patch to do that. However the patch caused performance regressions in about 20 LNT tests and the patch was reverted, the regression was seen on multiple architecture. I can send provide a few more details about the performance problems, I don't have a solution to the LNT regression.

Thanks for the update! I saw your questions about digging into the LNT regressions, but I don't have the answers (ping that message in case someone out there missed it?).
Let's break this into pieces and see if we can make progress apart from that:

  1. Where is the list of LNT tests that showed problems?
  2. Were the regressions on x86 bots? If so, many developers can likely repro those locally.
  3. For regressions on other targets, we can guess at the problems by comparing asm (or recruit someone with the target hardware?).

@David.Bolvansky told me “https://lnt.llvm.org/ -> Test suite nts -> watch for

LNT-Broadwell-AVX2-O3clang_PRODx86_64:1364”

The fails were seen on aarch too and a couple other arch. AFAIK the old results are no longer availble. i scraped the list of fails, pasting here:
home/ssglocal/clang-cmake-x86_64-sde-avx512-linux/clang-cmake-x86_64-sde-avx512-linux/test/sandbox/build/report.simple.csv
Importing 'report.json'
Import succeeded.

  • Tested: 2560 tests --

FAIL: MultiSource/Applications/oggenc/oggenc.execution_time (513 of 2560)
FAIL: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.execution_time (514 of 2560)
FAIL: MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.execution_time (515 of 2560)
FAIL: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.execution_time (516 of 2560)
FAIL: SingleSource/Benchmarks/Linpack/linpack-pc.execution_time (517 of 2560)
FAIL: SingleSource/Benchmarks/Misc-C++/Large/sphereflake.execution_time (518 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.execution_time (519 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.execution_time (520 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.execution_time (521 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm.execution_time (522 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/atax/atax.execution_time (523 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/bicg/bicg.execution_time (524 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemver/gemver.execution_time (525 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gesummv/gesummv.execution_time (526 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.execution_time (527 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trisolv/trisolv.execution_time (528 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm.execution_time (529 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/solvers/gramschmidt/gramschmidt.execution_time (530 of 2560)
FAIL: SingleSource/Benchmarks/Polybench/stencils/adi/adi.execution_time (531 of 2560)
FAIL: SingleSource/UnitTests/Vector/SSE/sse_expandfft.execution_time (532 of 2560)
FAIL: SingleSource/UnitTests/Vector/SSE/sse_stepfft.execution_time (533 of 2560)

I may be wrong, but i suspect those failures aren't actually due to the fact
that we pessimize optimizations with this change, but that the whole execution
just fails. Can you try running test-suite locally? Do tests themselves actually pass,
ignoring the question of their performance?

I may be wrong, but i suspect those failures aren't actually due to the fact
that we pessimize optimizations with this change, but that the whole execution
just fails. Can you try running test-suite locally? Do tests themselves actually pass,
ignoring the question of their performance?

I find the LNT output very hard to decipher, but I thought that all of the failures on the Broadwell (x86) LNT bot were just performance regressions. There were many perf improvements and also many regressions. I investigated the top regression and found that the loop unroller made a different decision when the llvm.fmuladd intrinsic was used than it did for separate mul and add operations. The central loop of the test was unrolled eight times rather than four. Broadwell gets less benefit from FMA than either Haswell or Skylake, so any other factors that might drop performance are less likely to be mitigated by having fused these operations. In a more general sense, I don't see any reason apart from secondary changes in compiler behavior like this that allowing FMA would cause performance to drop.

At least one other target had execution failures caused by Melanie's change, but I understood it to be simply exposing a latent problem in the target-specific code.

Herald added a project: Restricted Project. · View Herald TranscriptAug 18 2023, 6:35 AM