Page MenuHomePhabricator

Experimental pass to convert all floating point operations to the equivalent constrained intrinsics
Needs ReviewPublic

Authored by andrew.w.kaylor on Feb 9 2018, 2:02 PM.

Details

Summary

I'm not sure this should actually be committed. I'm mostly putting it here as a way to make it available to anyone who is interested.

If anyone would like to see it added to the code base, feel free to make comments and I'll clean it up accordingly.

Right now, it only supports the legacy pass manager. Otherwise, I think it works OK.

Diff Detail

Event Timeline

andrew.w.kaylor created this revision.Feb 9 2018, 2:02 PM
kpn added a subscriber: kpn.Feb 20 2018, 9:32 AM
kpn added a comment.Nov 1 2018, 10:26 AM

Isn't this pass needed since we don't have a way to prevent code movement between blocks using the pragma FENV_ACCESS ON and blocks that aren't? I suggest having it check for use of constrained intrinsics before making any changes.

And if it was a function pass we could rerun it after certain optimizations like, say, the inliner.

Thoughts? Anyone?

I don't think this will work. The ConstantFolder could fold away traps early in the IRBuilder. We also wouldn't catch the FNeg/FSub case either.

I think Clang would need to explicitly generate the constrained intrinsics.

Here's a quick example. You can see in the IR that the ConstantFolder picks up the FDiv as undefined:

#include <stdio.h>

int main() {
  float x = 5.0;
  float y = 0.0;
  float z = x/y;
  return z;
}

and

; Function Attrs: norecurse nounwind readnone uwtable
define dso_local i32 @main() local_unnamed_addr #0 !dbg !6 {
entry:
  ret i32 undef, !dbg !8
}
kpn added a comment.Nov 1 2018, 10:48 AM

Oh, of course FNeg will need to be emitted by the front end. And couldn't we turn off constant folding in the IRBuilder when necessary?

Plus, we still have to deal with inlining being done llvm-side. We don't know in clang what llvm will inline. And we don't want an inlined function's instructions moved in between use of constrained intrinsics, for example.

In D43142#1284190, @kpn wrote:

And couldn't we turn off constant folding in the IRBuilder when necessary?

I don't think that would work in all cases. E.g. a binary operator with two constant operands.

int main() {
  float z = 5.0/0.0;
  return z;
}

Well, not without a considerable effort to rework the IRBuilder at least. I'm fairly sure we can't produce an Instruction with two Constant operands right now.

Plus, we still have to deal with inlining being done llvm-side. We don't know in clang what llvm will inline. And we don't want an inlined function's instructions moved in between use of constrained intrinsics, for example.

That's undefined behavior. A translation unit that checks or modifies the FP Environment must set the FENV_ACCESS pragma to ON. When it is set to ON, only constrained intrinsics should be generated from Clang. If it is unset or set to OFF, any FP Environment access causes undefined behavior.

It's unclear to me how LTO (or other cross file inlining) would work here. I haven't given it much though until now. My knee-jerk reaction is that we shouldn't be inlining from a FENV_ACCESS=OFF to FENV_ACCESS=ON location.

kpn added a comment.Nov 1 2018, 11:45 AM
In D43142#1284190, @kpn wrote:

And couldn't we turn off constant folding in the IRBuilder when necessary?

I don't think that would work in all cases. E.g. a binary operator with two constant operands.

int main() {
  float z = 5.0/0.0;
  return z;
}

Well, not without a considerable effort to rework the IRBuilder at least. I'm fairly sure we can't produce an Instruction with two Constant operands right now.

Then don't we have to deal with this sometime? If #pragma FENV_ACCESS ON is set for the main() example above then is constant folding allowed?

In D43142#1284296, @kpn wrote:

Then don't we have to deal with this sometime? If #pragma FENV_ACCESS ON is set for the main() example above then is constant folding allowed?

We do not want to constant fold when FENV_ACCESS=ON. We want the hardware instruction to execute and set the flags at runtime.

It's not really a problem though, since the constrained intrinsics won't constant fold.

In D43142#1284190, @kpn wrote:

...

It's unclear to me how LTO (or other cross file inlining) would work here. I haven't given it much though until now. My knee-jerk reaction is that we shouldn't be inlining from a FENV_ACCESS=OFF to FENV_ACCESS=ON location.

I'm pretty sure that we decided that we couldn't (because once you inline the regular FP ops, there's no way to restrict their movement relative to the constrained intrinsics at the IR level).

I don't think this will work. The ConstantFolder could fold away traps early in the IRBuilder. We also wouldn't catch the FNeg/FSub case either.

I think Clang would need to explicitly generate the constrained intrinsics.

To be clear, this is something that we need to address even if it seems possible to do it another way. A core part of our project development practice is that we never leave the IR, in between passes, in a semantically invalid or inconsistent state. We always have a single IR semantic interpretation and phase ordering concerns matter only for performance, and effectiveness, not correctness. If we do something else, that should be an explicit decision (that should be discussed on the dev list).

In D43142#1284190, @kpn wrote:

...

It's unclear to me how LTO (or other cross file inlining) would work here. I haven't given it much though until now. My knee-jerk reaction is that we shouldn't be inlining from a FENV_ACCESS=OFF to FENV_ACCESS=ON location.

I'm pretty sure that we decided that we couldn't (because once you inline the regular FP ops, there's no way to restrict their movement relative to the constrained intrinsics at the IR level).

I thought what we said was that if constrained FP intrinsics were used anywhere in a function then they had to be used everywhere in the function. So if you inline a function with FENV_ACCESS=ON into a function with FENV_ACCESS=OFF then the FP operations in the destination function need to be converted to constrained intrinsics, and if you inline a function with FENV_ACCESS=OFF into a function with FENV_ACCESS=ON then the FP operations in the function being inlined need to be converted. This conversion will have a detrimental effect on optimizations, so it would be nice to factor that into the inling cost, but at this point I'm mostly concerned about generating correct code.

arsenm added a subscriber: arsenm.Nov 1 2018, 6:36 PM
arsenm added inline comments.
lib/Transforms/IPO/ForceConstrainedFP.cpp
59

Why isn't this a function pass?

83–84

I would expect we would have opcode->intrinsic and intrinsic->opcode functions in utilities, and then this pass wouldn't need to be aware of the current set of constrained intrinsics

lib/Transforms/IPO/PassManagerBuilder.cpp
407–410

I would drop this part

I thought what we said was that if constrained FP intrinsics were used anywhere in a function then they had to be used everywhere in the function. So if you inline a function with FENV_ACCESS=ON into a function with FENV_ACCESS=OFF then the FP operations in the destination function need to be converted to constrained intrinsics, and if you inline a function with FENV_ACCESS=OFF into a function with FENV_ACCESS=ON then the FP operations in the function being inlined need to be converted. This conversion will have a detrimental effect on optimizations, so it would be nice to factor that into the inling cost, but at this point I'm mostly concerned about generating correct code.

I'm realizing that FENV_ACCESS is poorly designed.

Let's take this small example:

#include <fenv.h>
#include <stdio.h>

void bar();

#pragma STDC FENV_ACCESS ON
int main() {
  feenableexcept(FE_DIVBYZERO);
  bar();
  float x = 1.0/0.0;
  printf("main x=%lf\n", x);
}

#pragma STDC FENV_ACCESS OFF
void bar() {
  float x = 1.0/0.0;
  printf("bar x=%lf\n", x);
}

The FDiv in bar() would access the FP environment since main() enabled exceptions. But, since bar() is preceded with FENV_ACCESS=OFF, the Standard says that accessing the FP environment in bar() is undefined behavior. Since it's undefined behavior, the compiler can optimize as it wishes, including inlining bar() without modifying the FDiv.

It would really be the user's responsibility to make bar() safe. E.g.:

#pragma STDC FENV_ACCESS OFF
void bar() {
  fedisableexcept(FE_DIVBYZERO);
  float x = 1.0/0.0;
  printf("bar x=%lf\n", x);
  feenableexcept(FE_DIVBYZERO);
}

In that case, we would be free to inline bar() assuming that the feenableexcept/fedisableexcept functions act as barriers. However, explicitly making each call safe seems like an unreasonable burden on the user. Imagine the mental gymnastics necessary for a large call tree.

I suppose the compiler could automatically manage this for the user, by saving/restoring the FP environment in the prolog/epilog of FENV_ACCESS=OFF functions. But, there would be a non-trivial cost for that. A priori reasoning tells me that it's not worth the trouble.

Can anyone poke holes in this?

kpn added a comment.Nov 2 2018, 9:27 AM

I thought what we said was that if constrained FP intrinsics were used anywhere in a function then they had to be used everywhere in the function. So if you inline a function with FENV_ACCESS=ON into a function with FENV_ACCESS=OFF then the FP operations in the destination function need to be converted to constrained intrinsics, and if you inline a function with FENV_ACCESS=OFF into a function with FENV_ACCESS=ON then the FP operations in the function being inlined need to be converted. This conversion will have a detrimental effect on optimizations, so it would be nice to factor that into the inling cost, but at this point I'm mostly concerned about generating correct code.

I'm realizing that FENV_ACCESS is poorly designed.

Let's take this small example:

#include <fenv.h>
#include <stdio.h>

void bar();

#pragma STDC FENV_ACCESS ON
int main() {
  feenableexcept(FE_DIVBYZERO);
  bar();
  float x = 1.0/0.0;
  printf("main x=%lf\n", x);
}

#pragma STDC FENV_ACCESS OFF
void bar() {
  float x = 1.0/0.0;
  printf("bar x=%lf\n", x);
}

The FDiv in bar() would access the FP environment since main() enabled exceptions. But, since bar() is preceded with FENV_ACCESS=OFF, the Standard says that accessing the FP environment in bar() is undefined behavior. Since it's undefined behavior, the compiler can optimize as it wishes, including inlining bar() without modifying the FDiv.

If all optimizations including constant folding, or at least optimizations on floating point, are delayed until after inlining then there's no problem. And the optimizations applied to an inlined functions don't have to be the same as the optimizations on the standalone function. The copy of bar() inlined into main() can inherit the FENV_ACCESS ON of main() despite the standalone version of bar() having it OFF.

kpn added a comment.Nov 2 2018, 9:34 AM
In D43142#1284190, @kpn wrote:

...

It's unclear to me how LTO (or other cross file inlining) would work here. I haven't given it much though until now. My knee-jerk reaction is that we shouldn't be inlining from a FENV_ACCESS=OFF to FENV_ACCESS=ON location.

I'm pretty sure that we decided that we couldn't (because once you inline the regular FP ops, there's no way to restrict their movement relative to the constrained intrinsics at the IR level).

I thought what we said was that if constrained FP intrinsics were used anywhere in a function then they had to be used everywhere in the function. So if you inline a function with FENV_ACCESS=ON into a function with FENV_ACCESS=OFF then the FP operations in the destination function need to be converted to constrained intrinsics, and if you inline a function with FENV_ACCESS=OFF into a function with FENV_ACCESS=ON then the FP operations in the function being inlined need to be converted. This conversion will have a detrimental effect on optimizations, so it would be nice to factor that into the inling cost, but at this point I'm mostly concerned about generating correct code.

This. My suggestion to add this pass came from this context right here as described by Andy.

The inlining case inserting regular FP ops into a function is no different from a case where FENV_ACCESS is ON in one block inside a function but OFF in the next block inside that same function. I don't think disabling inlining makes sense because of that.

In D43142#1285565, @kpn wrote:

If all optimizations including constant folding, or at least optimizations on floating point, are delayed until after inlining then there's no problem.

I'll add that this is a ton of work. A binary Instruction can't currently have two Constant operands. So, ConstantFolding is baked into the Instruction implementation right now. If I'm mistaken, someone please correct me.

I'm not an expert on inlining, but I imagine there are challenges moving it to 1st in the pass order too. I could see it being difficult to analyze cost on an unoptimized, non-canonical IR.

There are probably other optimizations, like InstCombine, that we're not considering also.

And the optimizations applied to an inlined functions don't have to be the same as the optimizations on the standalone function. The copy of bar() inlined into main() can inherit the >FENV_ACCESS ON of main() despite the standalone version of bar() having it OFF.

At its core, this code has undefined behavior. So, your suggestion is one possible solution.

If my reasoning in my last comment is correct, we don't have to worry about inlining FENV_ACCESS=OFF functions. Since it's undefined behavior, we're free to inline that function with no modifications. It's the user's responsibility to make sure that a FENV_ACCESS=OFF function isn't called by a FENV_ACCESS=ON function without explicitly modifying the FP environment.

Well, at least that's my current understanding of the problem. Does anyone see problems with my initial assumptions?

kpn added a comment.Nov 2 2018, 10:47 AM
In D43142#1285565, @kpn wrote:

If all optimizations including constant folding, or at least optimizations on floating point, are delayed until after inlining then there's no problem.

I'll add that this is a ton of work. A binary Instruction can't currently have two Constant operands. So, ConstantFolding is baked into the Instruction implementation right now. If I'm mistaken, someone please correct me.

Surely this can be worked around with assigning to compiler temporaries?

Dealing with statics may be an issue. This was mentioned to me at lunch by our front end guy.

I'm not an expert on inlining, but I imagine there are challenges moving it to 1st in the pass order too. I could see it being difficult to analyze cost on an unoptimized, non-canonical IR.

There are probably other optimizations, like InstCombine, that we're not considering also.

OK, that sounds like something that may get dealt with when we have time to polish things into an ideal state. Maybe never.

If my reasoning in my last comment is correct, we don't have to worry about inlining FENV_ACCESS=OFF functions. Since it's undefined behavior, we're free to inline that function with no modifications. It's the user's responsibility to make sure that a FENV_ACCESS=OFF function isn't called by a FENV_ACCESS=ON function without explicitly modifying the FP environment.

Hmmm. A clang option to turn on FENV_ACCESS for entire compilations would work around this issue. So the behavior you are describing here would work for me.

In D43142#1285692, @kpn wrote:

Hmmm. A clang option to turn on FENV_ACCESS for entire compilations would work around this issue. So the behavior you are describing here would work for me.

This is interesting. I was thinking the same thing (and maybe even changing the Standard to match).

I'm struggling to find a good use case for dynamically changing FENV_ACCESS in a translation unit. My current belief is that if a user wants to run with traps, they would want trap-safe code everywhere. If that was the case, the user would still be able to toggle signals with the feenableexcept() and friends functions (similar for rounding mode too).

I'm realizing that FENV_ACCESS is poorly designed.

Yes, this is true. It's difficult to understand and difficult to implement. Personally, I'm not as worried about precisely supporting FENV_ACCESS as I am supporting the code behavior that it is intended to enable. The FENV_ACCESS pragma is useful as a reference, but for me this is really about being able to control the optimizer and allow a way to restrict what it does with FP operations. The optimizer has no business knowing anything about pragmas in the source code and what any given language standard says about them. It just needs to be internally consistent and follow the rules we set for the IR language.

But getting back to FENV_ACCESS (because I would like to see us provide a correct implementation of the standard) as it applies to your example, I believe the standard says that the compiler should assume the default environment for any region marked as FENV_ACCESS OFF. The behavior is only undefined if the FP environment doesn't match the defaults when that code is executed. This puts the burden of making sure the environment is correct for transitions between regions with different access settings. That's a bit of a nightmare for the programmer if he decides to mix FENV_ACCESS modes across a program, but it's fairly simple for the compiler. In your original example, we are free to constant fold the division in bar and not worry about the fact that the FP exception won't occur.

Furthermore, if we inline bar into main, we can use "round.tonearest" as the rounding mode argument and "fpexcept.ignore" as the exception behavior argument, so we could still constant fold the division if we hadn't done so previously for whatever reason. This is because the contents of bar are still conceptually in the FENV_ACCESS OFF scope as the user originally specified. The fact that we have chosen to inline the function doesn't change the semantics. The only thing we need to worry about is making sure that FP operations don't migrate across calls or other FP operations that would break these semantics. Using the intrinsics after inlining takes care of that.

Let's take this small example:

#include <fenv.h>
#include <stdio.h>

void bar();

#pragma STDC FENV_ACCESS ON
int main() {
  feenableexcept(FE_DIVBYZERO);
  bar();
  float x = 1.0/0.0;
  printf("main x=%lf\n", x);
}

#pragma STDC FENV_ACCESS OFF
void bar() {
  float x = 1.0/0.0;
  printf("bar x=%lf\n", x);
}

The FDiv in bar() would access the FP environment since main() enabled exceptions. But, since bar() is preceded with FENV_ACCESS=OFF, the Standard says that accessing the FP environment in bar() is undefined behavior. Since it's undefined behavior, the compiler can optimize as it wishes, including inlining bar() without modifying the FDiv.

It would really be the user's responsibility to make bar() safe. E.g.:

#pragma STDC FENV_ACCESS OFF
void bar() {
  fedisableexcept(FE_DIVBYZERO);
  float x = 1.0/0.0;
  printf("bar x=%lf\n", x);
  feenableexcept(FE_DIVBYZERO);
}

In that case, we would be free to inline bar() assuming that the feenableexcept/fedisableexcept functions act as barriers. However, explicitly making each call safe seems like an unreasonable burden on the user. Imagine the mental gymnastics necessary for a large call tree.

I suppose the compiler could automatically manage this for the user, by saving/restoring the FP environment in the prolog/epilog of FENV_ACCESS=OFF functions. But, there would be a non-trivial cost for that. A priori reasoning tells me that it's not worth the trouble.

Can anyone poke holes in this?

In D43142#1285565, @kpn wrote:

If all optimizations including constant folding, or at least optimizations on floating point, are delayed until after inlining then there's no problem.

I'll add that this is a ton of work. A binary Instruction can't currently have two Constant operands. So, ConstantFolding is baked into the Instruction implementation right now. If I'm mistaken, someone please correct me.

I'm not an expert on inlining, but I imagine there are challenges moving it to 1st in the pass order too. I could see it being difficult to analyze cost on an unoptimized, non-canonical IR.

We really can't defer optimizations that occur before inlining. The legacy pass manager is specifically structured so that the function simplification passes and the inlining pass are layered such that the function simplification passes are run on a function before we decide whether or not to inline it and then run again on the calling functions so that if we did inline the code can be simplified again in its new context before we decide whether or not to inline that function. I'm not sure how the new pass manager manages this dance, but I'm certain that it must have the same effective ordering.

cameron.mcinally added a comment.EditedNov 2 2018, 11:48 AM

The fact that we have chosen to inline the function doesn't change the semantics.

I agree with this. Inlined or a call, it should be the same semantics.

The only thing we need to worry about is making sure that FP operations don't migrate across calls or other FP operations that would break these semantics. Using the intrinsics after inlining takes care of that.

I don't agree with this. This is changing the semantics if we choose to inline a function by converting some operations into constrained intrinsics. In other words, an executable will have different behavior if we choose to inline or not. That's not ok.

We're in undefined behavior territory if the user doesn't explicitly reset the FP Env to the default when moving to a FENV_ACCESS=OFF function. If we treat the feenableexcept() and friends as barriers (which we must anyway), then we can freely inline (or call) a function as we wish with no change to semantics.

The only thing we need to worry about is making sure that FP operations don't migrate across calls or other FP operations that would break these semantics. Using the intrinsics after inlining takes care of that.

I don't agree with this. This is changing the semantics if we choose to inline a function by converting some operations into constrained intrinsics. In other words, an executable will have different behavior if we choose to inline or not. That's not ok.

Well, yes and no. In a region defined as FENV_ACCESS off the user has granted the compiler explicit permission to ignore FP status bits and exception behavior. Depending on how we optimize these regions the FP status bits may come out differently and exceptions may or may not be raised, but the user has promised not to unmask exceptions or look at status bits in that region. So if that part of the behavior of the code changes depending on whether or not we inline that's OK because that's what the user signed up for. In that sense the behavior hasn't changed any more than it does between -O0 and -O3.

Now maybe someone might have code that calls an FENV_ACCESS OFF function from an FENV_ACCESS ON function and then looks at FP status bits after the call returns, but the standard explicitly says that when controls transfers from a scope with FENV_ACCESS off to a scope with FENV_ACCESS on the state of the FP status flags is undefined.

So, yes, technically the behavior may change, but that's OK.

kpn added a comment.Nov 2 2018, 12:20 PM

The only thing we need to worry about is making sure that FP operations don't migrate across calls or other FP operations that would break these semantics. Using the intrinsics after inlining takes care of that.

I don't agree with this. This is changing the semantics if we choose to inline a function by converting some operations into constrained intrinsics. In other words, an executable will have different behavior if we choose to inline or not. That's not ok.

Well, yes and no. In a region defined as FENV_ACCESS off the user has granted the compiler explicit permission to ignore FP status bits and exception behavior. Depending on how we optimize these regions the FP status bits may come out differently and exceptions may or may not be raised, but the user has promised not to unmask exceptions or look at status bits in that region. So if that part of the behavior of the code changes depending on whether or not we inline that's OK because that's what the user signed up for. In that sense the behavior hasn't changed any more than it does between -O0 and -O3.

More specifically, in an all FENV_ACCESS=OFF world we _already_ have different results that come from the compiler choosing to inline or not. So different results from inlining a FENV_ACCESS=OFF function into a FENV_ACCESS=ON function is just more of the same.

To the point about barriers, converting to use the constrained intrinsics functions as a barrier whether or not the =OFF function calls fpenableexcept() and friends.

Now maybe someone might have code that calls an FENV_ACCESS OFF function from an FENV_ACCESS ON function and then looks at FP status bits after the call returns, but the standard explicitly says that when controls transfers from a scope with FENV_ACCESS off to a scope with FENV_ACCESS on the state of the FP status flags is undefined.

So, yes, technically the behavior may change, but that's OK.

Agreed.

Well, yes and no. In a region defined as FENV_ACCESS off the user has granted the compiler explicit permission to ignore FP status bits and exception behavior. Depending on how we optimize these regions the FP status bits may come out differently and exceptions may or may not be raised, but the user has promised not to unmask exceptions or look at status bits in that region. So if that part of the behavior of the code changes depending on whether or not we inline that's OK because that's what the user signed up for. In that sense the behavior hasn't changed any more than it does between -O0 and -O3.

That's well put. I agree with that. I was probably overreaching in my last comment.

Now maybe someone might have code that calls an FENV_ACCESS OFF function from an FENV_ACCESS ON function and then looks at FP status bits after the call returns, but the standard explicitly says that when controls transfers from a scope with FENV_ACCESS off to a scope with FENV_ACCESS on the state of the FP status flags is undefined.

So, yes, technically the behavior may change, but that's OK.

I think we both agree that it's undefined behavior. If that's correct, then it doesn't really matter what we do after we hit it. So any solution is acceptable...

I think we both agree that it's undefined behavior. If that's correct, then it doesn't really matter what we do after we hit it. So any solution is acceptable...

We still need to control code motion. For instance, if the code checks the status bits before calling the FENV_ACCESS OFF function we can't have FP operations that are inlined hoisted above the status bit check.

We still need to control code motion. For instance, if the code checks the status bits before calling the FENV_ACCESS OFF function we can't have FP operations that are inlined hoisted above the status bit check.

How are you envisioning the status bits are checked?

If through fegetenv() and friends, those should probably be barriers.