This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Generate constrained fp intrinsics depending on FPOptions
Needs Review · Public

Authored by sepavloff on Aug 12 2019, 9:25 AM.

Details

Summary

If the value of FPOptions is modified, for example by using the pragma
'clang fp', create calls to constrained fp intrinsics with metadata
arguments corresponding to the selected rounding mode and exception
behavior.
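
For illustration, a rough sketch (value names hypothetical; the exception-behavior argument depends on the selected FPOptions) of the IR this change is meant to emit for a division performed under the pragma, instead of a plain fdiv:

; Source:
;   #pragma clang fp rounding(downward)
;   x = y / z;
;
; The metadata arguments record the rounding mode and exception behavior
; in effect at this point.
%div = call double @llvm.experimental.constrained.fdiv.f64(
           double %y, double %z,
           metadata !"round.downward", metadata !"fpexcept.strict")

declare double @llvm.experimental.constrained.fdiv.f64(double, double, metadata, metadata)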

Event Timeline

sepavloff created this revision. Aug 12 2019, 9:25 AM
Herald added a project: Restricted Project. Aug 12 2019, 9:25 AM
kpn added a comment. Aug 12 2019, 9:35 AM

Does this work for anything that uses TreeTransform, like C++ templates?

Also, if any constrained intrinsics are used in a function then the entire function needs to be constrained. Is this handled anywhere?

sepavloff updated this revision to Diff 214857. Aug 13 2019, 9:31 AM

Added tests for 'pragma clang fp' in template instantiations

In D66092#1625380, @kpn wrote:

Does this work for anything that uses TreeTransform, like C++ templates?

Added such tests.

Also, if any constrained intrinsics are used in a function then the entire function needs to be constrained. Is this handled anywhere?

If we decided to make the entire function constrained, it should be done somewhere in IR transformations, because inlining may mix function bodies with different fp options.

In D66092#1625380, @kpn wrote:

Also, if any constrained intrinsics are used in a function then the entire function needs to be constrained. Is this handled anywhere?

If we decided to make the entire function constrained, it should be done somewhere in IR transformations, because inlining may mix function bodies with different fp options.

Kevin is right. We have decided that if constrained intrinsics are used anywhere in a function they must be used throughout the function. Otherwise, there would be nothing to prevent the non-constrained FP operations from migrating across constrained operations and the handling could get botched. The "relaxed" arguments ("round.tonearest" and "fpexcept.ignore") should be used where the default settings would apply. The front end should also be setting the "strictfp" attribute on calls within a constrained scope and, I think, functions that contain constrained intrinsics.

We will need to teach the inliner to enforce this rule if it isn't already doing so, but if things aren't correct coming out of the front end an incorrect optimization could already happen before we get to the inliner. We always rely on the front end producing IR with fully correct semantics.
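
A hand-written sketch of what that rule implies (function and value names hypothetical, and assuming the 'strictfp' attribute mentioned above; this is not output of the current patch): constrained intrinsics are used throughout the function, with the relaxed arguments where the default settings apply:

define double @f(double %y, double %z, double %b, double %c) strictfp {
  ; inside the constrained scope: the selected rounding mode
  %x = call double @llvm.experimental.constrained.fdiv.f64(
           double %y, double %z,
           metadata !"round.downward", metadata !"fpexcept.strict")
  ; under the default settings: still constrained, but with the
  ; "relaxed" arguments
  %a = call double @llvm.experimental.constrained.fdiv.f64(
           double %b, double %c,
           metadata !"round.tonearest", metadata !"fpexcept.ignore")
  %r = call double @llvm.experimental.constrained.fadd.f64(
           double %x, double %a,
           metadata !"round.tonearest", metadata !"fpexcept.ignore")
  ret double %r
}

declare double @llvm.experimental.constrained.fdiv.f64(double, double, metadata, metadata)
declare double @llvm.experimental.constrained.fadd.f64(double, double, metadata, metadata)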

Replacement of floating point operations with constrained intrinsics seems more an optimization helper than a semantic requirement. IR in which constrained operations are mixed with unconstrained ones is still valid in the sense of the IR specification. Tools that use IR for something other than code generation may not need such a replacement. If the replacement is made by a separate pass, such a tool can turn it off, but if it is part of clang codegen, there is no simple solution; the tool must be reworked.

Another issue is non-standard rounding. It can be represented only by constrained intrinsics. The rounding itself does not require restrictions on code motion, so a mixture of constrained and unconstrained operations is OK. Replacing all operations with constrained intrinsics would give poorly optimized code, because the compiler does not optimize them. It would be a bad thing if a user adds the pragma to execute a statement with a specific rounding mode and loses optimization.

Using a dedicated pass to shape fp operations seems a flexible solution. It would allow implementing things like #pragma STDC FENV_ROUND without teaching all passes to work with constrained intrinsics.

Replacement of floating point operations with constrained intrinsics seems more an optimization helper than a semantic requirement. IR in which constrained operations are mixed with unconstrained ones is still valid in the sense of the IR specification.

The thing that makes the IR semantically incomplete is that there is nothing there to prevent incorrect code motion of the non-constrained operations. Consider this case:

if (someCondition) {
  #pragma clang fp rounding(downward)
  fesetround(FE_DOWNWARD);
  x = y/z;
  fesetround(FE_TONEAREST);
}
a = b/c;

If you generate a regular fdiv instruction for the 'a = b/c;' statement, there is nothing that would prevent it from being hoisted above the call to fesetround() and so it might be rounded incorrectly.
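
In IR terms the hazard looks roughly like this (a hypothetical rendering; the @fesetround argument values are target-specific):

; inside the 'if' block: the constrained division is assumed to access
; the FP environment, so it cannot be reordered with the fesetround calls
%c1 = call i32 @fesetround(i32 1024)   ; FE_DOWNWARD (target-specific value)
%x  = call double @llvm.experimental.constrained.fdiv.f64(
          double %y, double %z,
          metadata !"round.downward", metadata !"fpexcept.strict")
%c2 = call i32 @fesetround(i32 0)      ; FE_TONEAREST

; after the block: a plain fdiv has no side effects the optimizer knows
; about, so nothing prevents hoisting it above the @fesetround calls
%a  = fdiv double %b, %c

declare i32 @fesetround(i32)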

Another issue is non-standard rounding. It can be represented only by constrained intrinsics. The rounding itself does not require restrictions on code motion, so a mixture of constrained and unconstrained operations is OK. Replacing all operations with constrained intrinsics would give poorly optimized code, because the compiler does not optimize them. It would be a bad thing if a user adds the pragma to execute a statement with a specific rounding mode and loses optimization.

I agree that loss of optimization would be a bad thing, but I think it's unavoidable. By using non-default rounding modes the user is implicitly accepting some loss of optimization. This may be more than they would have expected, but I can't see any way around it.

The thing that makes the IR semantically incomplete is that there is nothing there to prevent incorrect code motion of the non-constrained operations. Consider this case:

if (someCondition) {
  #pragma clang fp rounding(downward)
  fesetround(FE_DOWNWARD);
  x = y/z;
  fesetround(FE_TONEAREST);
}
a = b/c;

If you generate a regular fdiv instruction for the 'a = b/c;' statement, there is nothing that would prevent it from being hoisted above the call to fesetround() and so it might be rounded incorrectly.

This is a good example, as it demonstrates the intended usage of the pragma: there is a big program in which some small pieces must be executed in a special way. Some notes:

  • The user expects that a small change confined to a selected block is local: it does not affect the code outside the block. The specification of the pragma ensures exactly that. If the change affects the entire function (and possibly other functions that call it), it feels like something is wrong.
  • The pragma usage here is different from the intended one. The purpose of #pragma clang fp rounding is to model the C2x #pragma STDC FENV_ROUND (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2347.pdf, 7.6.2). Such a pragma sets the rounding mode at the beginning of the block and restores the previous state at the end, so the code should look like:
if (someCondition) {
  #pragma clang fp rounding(downward)
  x = y/z;
}
a = b/c;
  • What is the issue with moving a = b/c? If it moves ahead of the if statement, that seems OK, because the rounding mode is the same at that point. It cannot be moved inside the block (where the rounding mode is different) because that would break the semantics. Consider another example:
for (i = …) {
  #pragma clang fp rounding(downward)
  a[i] = x/y;
}

If x and y are loop invariants, x/y could be hoisted out of the loop. However, at the IR level it would be moved as a constrained intrinsic, so the semantics would be preserved (see the IR sketch after this list).
The issue arises only when an expression is moved inside a block where a specific rounding mode is in effect. Something like this:

z = x*y;
for (i = …) {
  #pragma clang fp rounding(downward)
  a[i] += z;
}

Suppose that for some reason z = x*y is sunk into the loop. In such cases the node that comes from outside the block must be transformed.

  • There must be more than one way to prevent undesirable moves. For instance, a fence node could be extended so that it prevents moving floating point operations across it; such fences could be used to delimit a region where a specific floating point environment is in effect.
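
For the first loop example above (hoisting x/y), a hypothetical before/after sketch at the IR level; the metadata travels with the call, so the selected rounding is preserved:

; before: the loop-invariant division sits in the loop body
loop:
  %q = call double @llvm.experimental.constrained.fdiv.f64(
           double %x, double %y,
           metadata !"round.downward", metadata !"fpexcept.ignore")
  ; ... a[i] = %q; increment i; branch back to %loop ...

; after hoisting: the whole call moves to the preheader and still
; carries !"round.downward", so the semantics are preserved
preheader:
  %q = call double @llvm.experimental.constrained.fdiv.f64(
           double %x, double %y,
           metadata !"round.downward", metadata !"fpexcept.ignore")
  br label %loop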

Another issue is non-standard rounding. It can be represented only by constrained intrinsics. The rounding itself does not require restrictions on code motion, so a mixture of constrained and unconstrained operations is OK. Replacing all operations with constrained intrinsics would give poorly optimized code, because the compiler does not optimize them. It would be a bad thing if a user adds the pragma to execute a statement with a specific rounding mode and loses optimization.

I agree that loss of optimization would be a bad thing, but I think it's unavoidable. By using non-default rounding modes the user is implicitly accepting some loss of optimization. This may be more than they would have expected, but I can't see any way around it.

Nowadays there are many architectures designed for machine learning tasks. They usually operate on short data types (half, bfloat16, etc.), in which precision is relatively low. Rounding control in this case is much more important than on big cores. Kernel writers do fancy things, using appropriate rounding modes for different pieces of code to gain accuracy. Such processors may encode the rounding mode in their instructions, so the cost of using a specific rounding mode is zero. Loss of performance in this use case is not excusable.

In any case, the impact on performance must be minimized.

  • What is the issue with moving a = b/c? If it moves ahead of the if statement, that seems OK, because the rounding mode is the same at that point. It cannot be moved inside the block (where the rounding mode is different) because that would break the semantics.

It may be that the optimizer can prove that 'someCondition' is always true, in which case it will eliminate the if statement, and then there is nothing to prevent the operation from migrating between the calls that change the rounding mode.

This is my main point -- "call i32 @fesetround" does not act as a barrier to an fdiv instruction (for example), but it does act as a barrier to a constrained FP intrinsic. It is not acceptable, for performance reasons in the general case, to have calls act as barriers to unconstrained FP operations. Therefore, to keep everything semantically correct, it is necessary to use constrained intrinsics in any function where the floating point environment may be changed.

I agree that impact on performance must be minimized, but this is necessary for correctness.
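
To restate the contrast as a fragment (hypothetical; "round.dynamic" says the rounding mode is not known statically): a plain fdiv here would be free to move across the @fesetround call, while the constrained form is pinned:

define double @g(double %b, double %c) strictfp {
  %c1 = call i32 @fesetround(i32 0)
  ; with a plain 'fdiv double %b, %c' nothing models a dependence on the
  ; call above; the constrained intrinsic does carry that dependence
  %a = call double @llvm.experimental.constrained.fdiv.f64(
           double %b, double %c,
           metadata !"round.dynamic", metadata !"fpexcept.strict")
  ret double %a
}

declare i32 @fesetround(i32)
declare double @llvm.experimental.constrained.fdiv.f64(double, double, metadata, metadata)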

It took some digging, but I finally found the e-mail thread where we initially agreed that we can't mix constrained FP intrinsics and non-constrained FP operations within a function. Here it is: http://lists.llvm.org/pipermail/cfe-dev/2017-August/055325.html