Page MenuHomePhabricator

Teach the IRBuilder about constrained fadd and friends
Needs ReviewPublic

Authored by kpn on Oct 11 2018, 11:59 AM.

Details

Summary

The IRBuilder has calls to create floating point instructions like fadd. It does not have calls to create constrained versions of them. This patch adds support for constrained creation of fadd, fsub, fmul, fdiv, and frem.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
In D53157#1291964, @kpn wrote:

I don't expect anyone would need to switch between constrained and regular floating point at an instruction level of granularity.

The Standard allows for this (to some degree). E.g. FENV_ACCESS can be toggled between nested compound statements.

Also, some AVX512 hardware instructions have explicit SAE and RM operands.

kpn added a comment.Nov 8 2018, 11:45 AM
In D53157#1291964, @kpn wrote:

I don't expect anyone would need to switch between constrained and regular floating point at an instruction level of granularity.

The Standard allows for this (to some degree). E.g. FENV_ACCESS can be toggled between nested compound statements.

Yes, but I'm not worried about what sort of code is input to a compiler. I'm worried about what is right for the compiler itself. Also, since we're in LLVM here, we have to think about all languages that use LLVM as a backend.

In D53157#1291964, @kpn wrote:

I don't expect anyone would need to switch between constrained and regular floating point at an instruction level of granularity.

The Standard allows for this (to some degree). E.g. FENV_ACCESS can be toggled between nested compound statements.

Also, some AVX512 hardware instructions have explicit SAE and RM operands.

Just because FENV_ACCESS can be toggled on that granularity doesn't mean we have to represent it that way. We've previously agreed (I think) that if FENV_ACCESS is enabled anywhere in a function we will want to use the constrained intrinsics for all FP operations in the function, not just the ones in the scope where it was specified. I think this works because FENV_ACCESS ON needs to be respected (i.e. we need to assume that the user may change the FP environment) but FENV_ACCESS OFF doesn't need to be respected (i.e. we are not required to assume that the user will not change the environment). For instance, the spec does give us liberty to assume the default rounding mode in FENV_ACCESS OFF regions, but if we make no assumptions about the rounding mode that will be conservatively correct.

As for instructions that take an explicit rounding mode argument, that's a separate issue (and one that is relevant for multiple architectures). The constrained intrinsics do not enforce a rounding mode (i.e. they are descriptive rather than prescriptive), but if we have concrete rounding mode arguments for any given instruction we could use that to encode the rounding mode in the instruction. I'm not sure how user code would express this apart from use of intrinsics.

Fair enough. Just for conversation's sake, I was envisioning the FE support a little differently...

  1. Add a command line option, say -ffpe_safe=[true|false], that toggles FPEnv-safe code generation for a whole file. A -ffpe_safe=true would be equivalent to FENV_ACCESS=ON at the beginning of a translation unit and we would capture it in some FE variable. That command line option can be overridden by the FENV_ACCESS pragma.
  1. If the FENV_ACCESS pragma is seen, carry a nop/metadata/something construct to toggle the FE variable during IR generation.
  1. Then when generating IR, choose between constrained/unconstrained IR depending on the state of the FE variable.

#1 may seem superfluous, but there are certain benchmarks (e.g. SPEC CPU) that do not allow for code modifications. So, in order to run those with FPEnv-safe code, we'd need a non-invasive way to toggle FPEnv-safe code generation.

I'll also add that my interpretation of FENV_ACCESS in the C Standard [7.6.1] implies that we have to track the FENV_ACCESS pragma. That is, it's not okay to ignore FENV_ACCESS=OFF. When we transfer from FENV_ACCESS=OFF code to FENV_ACCESS=ON code, the rounding mode needs to be reset to the default setting. But, I'm open to hearing other interpretations. My copy of the Standard is old, so please correct me if this changed...

When execution passes from a part of the program translated with FENV_ACCESS ‘‘off’’ to a part translated with FENV_ACCESS ‘‘on’’, the state of the floating-point status flags is unspecified and the floating-point control modes have their default settings.

Craig Topper also raised some concerns with me (in person, his desk is right by mine) about the potential effect this might have on code size. He tells me that IRBuilder calls are intended to always be inlined and if we grow the implementation of these functions too much it could lead to noticeable bloat. It still seems to me like it might be worthwhile for the simplification it would allow in the front end, but I'm not really a front end guy so I definitely agree that we should get some input from front end people about what they want.

Craig's right about not wanting to bloat the inlinable functions in IRBuilder, but this is something that we can measure. In addition, we might be able to move the "slow path" (which create the constrained intrinsics) to the .cpp file (by manually outlining to a different function).

Just because FENV_ACCESS can be toggled on that granularity doesn't mean we have to represent it that way. We've previously agreed (I think) that if FENV_ACCESS is enabled anywhere in a function we will want to use the constrained intrinsics for all FP operations in the function, not just the ones in the scope where it was specified.

Yes, this is also my understanding. We can't soundly mix the two in the same function because we can't prevent the code motion within the function.

Craig Topper also raised some concerns with me (in person, his desk is right by mine) about the potential effect this might have on code size. He tells me that IRBuilder calls are intended to always be inlined and if we grow the implementation of these functions too much it could lead to noticeable bloat. It still seems to me like it might be worthwhile for the simplification it would allow in the front end, but I'm not really a front end guy so I definitely agree that we should get some input from front end people about what they want.

Craig's right about not wanting to bloat the inlinable functions in IRBuilder, but this is something that we can measure. In addition, we might be able to move the "slow path" (which create the constrained intrinsics) to the .cpp file (by manually outlining to a different function).

Also, to be clear, adding a mode that automatically adds these when using the existing IRBuilder functions seems worth investigating. It seems like that would greatly simply the FE code.

Fair enough. Just for conversation's sake, I was envisioning the FE support a little differently...

  1. Add a command line option, say -ffpe_safe=[true|false], that toggles FPEnv-safe code generation for a whole file. A -ffpe_safe=true would be equivalent to FENV_ACCESS=ON at the beginning of a translation unit and we would capture it in some FE variable. That command line option can be overridden by the FENV_ACCESS pragma.

To bikeshed, I'd prefer we don't have an underscore in the name of the command-line flag.

  1. If the FENV_ACCESS pragma is seen, carry a nop/metadata/something construct to toggle the FE variable during IR generation.
  2. Then when generating IR, choose between constrained/unconstrained IR depending on the state of the FE variable.

    #1 may seem superfluous, but there are certain benchmarks (e.g. SPEC CPU) that do not allow for code modifications. So, in order to run those with FPEnv-safe code, we'd need a non-invasive way to toggle FPEnv-safe code generation.

    I'll also add that my interpretation of FENV_ACCESS in the C Standard [7.6.1] implies that we have to track the FENV_ACCESS pragma. That is, it's not okay to ignore FENV_ACCESS=OFF. When we transfer from FENV_ACCESS=OFF code to FENV_ACCESS=ON code, the rounding mode needs to be reset to the default setting. But, I'm open to hearing other interpretations. My copy of the Standard is old, so please correct me if this changed...

    ` When execution passes from a part of the program translated with FENV_ACCESS ‘‘off’’ to a part translated with FENV_ACCESS ‘‘on’’, the state of the floating-point status flags is unspecified and the floating-point control modes have their default settings. `

The rounding mode does need to be reset to its default setting when passing from FENV_ACCESS "on" to FENV_ACCESS "off", but that seems to be the user's responsibility. Are you saying that the implementation should reset it on that transition?

Just because FENV_ACCESS can be toggled on that granularity doesn't mean we have to represent it that way. We've previously agreed (I think) that if FENV_ACCESS is enabled anywhere in a function we will want to use the constrained intrinsics for all FP operations in the function, not just the ones in the scope where it was specified.

Yes, this is also my understanding. We can't soundly mix the two in the same function because we can't prevent the code motion within the function.

Ugh, I don't know. The C Standard's language is so vague.

The rounding mode does need to be reset to its default setting when passing from FENV_ACCESS "on" to FENV_ACCESS "off", but that seems to be the user's responsibility. Are you saying that the implementation should reset it on that transition?

Yes, that's how I was interpreting the Standard (today). The implementation should reset the control modes. The verbiage is murky at best though.

We touched on this in D43142 and I do realize that my opinion has flip-flopped since then. I previously believed that reseting the control modes was up to the user, but now I'm not so sure. I suppose that either way, as long as a fesetround(default_mode_constant) is seen with a FENV_ACCESS=OFF, we could use that as a barrier to prevent the problematic code motion.

Thinking aloud, maybe we should be working on redefining FENV_ACCESS in the C Standard? It's pretty clear that this section could use some work.

All that said, my understanding of $7.6.1 in the Standard is cloudy at best. If I'm the only one that feels this way, I'll drop my objections...

Just because FENV_ACCESS can be toggled on that granularity doesn't mean we have to represent it that way. We've previously agreed (I think) that if FENV_ACCESS is enabled anywhere in a function we will want to use the constrained intrinsics for all FP operations in the function, not just the ones in the scope where it was specified.

Yes, this is also my understanding. We can't soundly mix the two in the same function because we can't prevent the code motion within the function.

Ugh, I don't know. The C Standard's language is so vague.

To be clear, I mean here that *we* can't mix the two soundly in the same function (where we generate some operations using the constrained instrinsics and some not) because of constraints imposed by our design. The standard may indeed allow for more precision. We'll need to be conservatively correct.

The rounding mode does need to be reset to its default setting when passing from FENV_ACCESS "on" to FENV_ACCESS "off", but that seems to be the user's responsibility. Are you saying that the implementation should reset it on that transition?

Yes, that's how I was interpreting the Standard (today). The implementation should reset the control modes. The verbiage is murky at best though.

I'll also point out that whether or not we insert a rounding-mode reset when the pragma changes state is orthogonal to whether we stop emitting constrained intrinsics at that point.

We touched on this in D43142 and I do realize that my opinion has flip-flopped since then. I previously believed that reseting the control modes was up to the user, but now I'm not so sure. I suppose that either way, as long as a fesetround(default_mode_constant) is seen with a FENV_ACCESS=OFF, we could use that as a barrier to prevent the problematic code motion.

Thinking aloud, maybe we should be working on redefining FENV_ACCESS in the C Standard? It's pretty clear that this section could use some work.

All that said, my understanding of $7.6.1 in the Standard is cloudy at best. If I'm the only one that feels this way, I'll drop my objections...

I don't read the standard that way, but the standard could certainly be more clear.

A couple of comments on the previous discussion:

  1. Instead of defining a new command line option, I'd prefer to use the existing options -frounding-math and -ftrapping-math to set the default behavior of math operations w.r.t. rounding modes and exception status. (For compatibility with GCC if nothing else.)
  2. I also read the C standard to imply that it is a requirement of user code to reset the status flag to default before switching back to FENV_ACCESS OFF. The fundamental characterization of the pragma says "The FENV_ACCESS pragma provides a means to inform the implementation when a program might access the floating-point environment to test floating-point status flags or run under non-default floating-point control modes." There is no mention anywhere that using the pragma, on its own, will ever change those control modes. The last sentence about "... the floating-point control modes have their default setting", while indeed a bit ambiguous, is still consistent with an interpretation that it is the responsibility of user code to ensure that state, there is no explicit statement that the implementation will do so.
  3. I agree that we need to be careful about intermixing "normal" floating-point operations with strict ones. However, I'm still not convinced that the pragma itself must be the scheduling barrier. It seems to me that the compiler already knows where FP control flags are ever modified directly (this can only happen with intrinsics or the like), so the main issue is whether function calls need to be considered. This is where the pragma comes in: in my mind, the primary difference between FENV_ACCESS ON and FENV_ACCESS OFF regions is that where the pragma is ON, function calls need to be considered (unless otherwise known for sure) to access FP control flags, while where the pragma is OFF, function calls can be considered to never touch FP control flags. So the real scheduling barrier would be any function call within a FENV_ACCESS ON region. Those would have to be marked by the front-end in the IR, presumably using a function attribute. The common LLVM optimizers would then need to respect that scheduling barrier (here is where we likely still have an open issue, there doesn't appear to be any way to express that at the IR level for regular floating-point operations ...), and likewise the back-ends (but that looks straightforward: a back-end typically will model FP status as residing in a register or in a pseudo-memory slot, and those can simply be considered used/clobbered by function calls marked as within FENV_ACCESS ON regions).

A couple of comments on the previous discussion:

  1. Instead of defining a new command line option, I'd prefer to use the existing options -frounding-math and -ftrapping-math to set the default behavior of math operations w.r.t. rounding modes and exception status. (For compatibility with GCC if nothing else.)

I agree that it's preferable to re-use these existing options if possible. I have some concerns that -ftrapping-math has a partial implementation in place that doesn't seem to be well aligned with the way fast-math flags are handled, so it might require some work to have that working as expected without breaking existing users. In general though these seem like they should do what we need.

Regarding GCC compatibility, I notice that GCC defaults to trapping math being enabled and I don't think that's what we want with clang. It also seems to imply something more than I think we need for constrained handling. For example, the GCC documentation says that -fno-trapping-math "can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions" so it sounds like maybe it also implies (for GCC) something like LLVM's "afn" fast math flag.

So if we are going to use these options, I think we need to have a discussion about whether or not it's OK to diverge from GCC's interpretation of them.

  1. I also read the C standard to imply that it is a requirement of user code to reset the status flag to default before switching back to FENV_ACCESS OFF. The fundamental characterization of the pragma says "The FENV_ACCESS pragma provides a means to inform the implementation when a program might access the floating-point environment to test floating-point status flags or run under non-default floating-point control modes." There is no mention anywhere that using the pragma, on its own, will ever change those control modes. The last sentence about "... the floating-point control modes have their default setting", while indeed a bit ambiguous, is still consistent with an interpretation that it is the responsibility of user code to ensure that state, there is no explicit statement that the implementation will do so.

I definitely agree with this interpretation of the standard. My understanding is that behavior is undefined if the user has not left the FP environment in the default state when transitioning to an FENV_ACCESS OFF region.

  1. I agree that we need to be careful about intermixing "normal" floating-point operations with strict ones. However, I'm still not convinced that the pragma itself must be the scheduling barrier. It seems to me that the compiler already knows where FP control flags are ever modified directly (this can only happen with intrinsics or the like), so the main issue is whether function calls need to be considered. This is where the pragma comes in: in my mind, the primary difference between FENV_ACCESS ON and FENV_ACCESS OFF regions is that where the pragma is ON, function calls need to be considered (unless otherwise known for sure) to access FP control flags, while where the pragma is OFF, function calls can be considered to never touch FP control flags. So the real scheduling barrier would be any function call within a FENV_ACCESS ON region. Those would have to be marked by the front-end in the IR, presumably using a function attribute. The common LLVM optimizers would then need to respect that scheduling barrier (here is where we likely still have an open issue, there doesn't appear to be any way to express that at the IR level for regular floating-point operations ...), and likewise the back-ends (but that looks straightforward: a back-end typically will model FP status as residing in a register or in a pseudo-memory slot, and those can simply be considered used/clobbered by function calls marked as within FENV_ACCESS ON regions).

I'm a bit confused by this. The constrained intrinsics will cause all calls to act as barriers to motion of the FP operations represented by the intrinsics (at least before instruction selection). So I'm not clear what you are saying is needed here.

A couple of comments on the previous discussion:

  1. Instead of defining a new command line option, I'd prefer to use the existing options -frounding-math and -ftrapping-math to set the default behavior of math operations w.r.t. rounding modes and exception status. (For compatibility with GCC if nothing else.)

I like this proposal.

  1. I also read the C standard to imply that it is a requirement of user code to reset the status flag to default before switching back to FENV_ACCESS OFF. The fundamental characterization of the pragma says "The FENV_ACCESS pragma provides a means to inform the implementation when a program might access the floating-point environment to test floating-point status flags or run under non-default floating-point control modes." There is no mention anywhere that using the pragma, on its own, will ever change those control modes. The last sentence about "... the floating-point control modes have their default setting", while indeed a bit ambiguous, is still consistent with an interpretation that it is the responsibility of user code to ensure that state, there is no explicit statement that the implementation will do so.

That's a fair interpretation. Andy mentioned:

I definitely agree with this interpretation of the standard. My understanding is that behavior is undefined if the user has not left the FP environment in the default state when transitioning to an FENV_ACCESS OFF region.

If we all agree upon that, then we simply have to treat the functions that modify the FPEnv, e.g. fesetexcept(...), as barriers. That way it does not matter if a FENV_ACCESS=OFF function is translated with constrained intrinsics or not, since nothing can be scheduled around these barriers.

  1. I agree that we need to be careful about intermixing "normal" floating-point operations with strict ones. However, I'm still not convinced that the pragma itself must be the scheduling barrier. It seems to me that the compiler already knows where FP control flags are ever modified directly (this can only happen with intrinsics or the like), so the main issue is whether function calls need to be considered. This is where the pragma comes in: in my mind, the primary difference between FENV_ACCESS ON and FENV_ACCESS OFF regions is that where the pragma is ON, function calls need to be considered (unless otherwise known for sure) to access FP control flags, while where the pragma is OFF, function calls can be considered to never touch FP control flags. So the real scheduling barrier would be any function call within a FENV_ACCESS ON region. Those would have to be marked by the front-end in the IR, presumably using a function attribute. The common LLVM optimizers would then need to respect that scheduling barrier (here is where we likely still have an open issue, there doesn't appear to be any way to express that at the IR level for regular floating-point operations ...), and likewise the back-ends (but that looks straightforward: a back-end typically will model FP status as residing in a register or in a pseudo-memory slot, and those can simply be considered used/clobbered by function calls marked as within FENV_ACCESS ON regions).

I'm not sure if I fully understand this, but it seems to be an acceptable solution to the problem.

As mentioned above, couldn't we make the helper functions that read/write the FPEnv the barriers? That seems like a simpler solution.

kpn added a comment.Nov 20 2018, 7:57 AM

If we all agree upon that, then we simply have to treat the functions that modify the FPEnv, e.g. fesetexcept(...), as barriers. That way it does not matter if a FENV_ACCESS=OFF function is translated with constrained intrinsics or not, since nothing can be scheduled around these barriers.

I thought we couldn't do barriers. No barriers means no way to prevent code motion and mixing of constrained with non-constrained FP. That was the reason for having all FP in a function be constrained if any of it was.

OK, let me try to expand on my point 3 above, which appears to have confused everybody :-)

First, let's distinguish two separate requirements: those on floating-point operations that explicitly run in regions with non-default control modes, and those on floating-point operations that run outside such regions. (Note that those regions are by definition a subset of the regions marked as FENV_ACCESS ON, but not necessarily coincide with them.)

For the first class of operations, there are various requirements to constrain optimization. All of those are handled in the current design by simply representing those operations via constrained intrinsics. This can be done simply by always using constrained intrinsics whenever we are inside any FENV_ACCESS ON region. (My comments above were not intended to refer to those at all.)

Now, for the second class of operations, most of these requirements do not apply. However, there is one critical requirement that still does apply: such operations may never be moved to inside a region with non-default control modes. (Where it makes a difference, the operation can in principle be even moved inside a FENV_ACCESS ON region, as long as the control modes actually have not yet changed from the default modes.)

As @andrew.w.kaylor mentioned above, to fulfil this requirement if suffices to prevent moving any such floating-point operation across any statement that modifies the control modes (or inspects the exception flags). There is only a small number of statements that do so directly (those would all be inline asm or platform-specific builtins), but the action can be hidden behind a function call as well. This applies in particular to the related C library calls with well-known names (which the compiler could detect), but in addition any random function call might indirectly contain such a statement (which the compiler will not know, in general). Since LLVM today will freely move normal floating-point operations across normal function calls, this is a problem that needs to be addressed.

Now, one way to do that would be to ensure that any floating-point operation outside FENV_ACCESS ON regions is also implemented via constrained intrinsics, as long as there is any FENV_ACCESS ON present in the function at all. As @cameron.mcinally mentioned above, that would prevent reordering (across any function call, even) and therefore solve the problem. However, I believe there were concerns whether we can reliably ensure that property in the presence of optimizations like inlining, in particular if LTO is also used. Even if this can be resolved, I believe there were additional concerns that this might overly constrain optimization and lead to suboptimal performance.

This finally gets to the intent of my point 3 above, where I was trying to see if there is any way we can do better, that is correctly handle floating-point operations outside FENV_ACCESS ON regions without having to turn them all into constrained intrinsics. One way might be to add some new code motion barrier in the IR optimizers that would prevent movement of floating-point operations across the location of any FENV_ACCESS pragma. But -- and this is I think what @kpn referered to above -- those locations don't really correspond to anything in the IR that could serve as such a barrier. My observation above was simply that if we want to go that way, it in fact isn't really necessary to put those barriers at exactly these places. Instead, we can put the barriers at those places that will actually modify (or inspect) the control words -- which are either special operations we can explicitly recognize, or general function calls -- but only those calls where the call site was originally inside a FENV_ACCESS ON region (and was marked by the front-end as such). This has the advantage that a function call site is a natural place to act as barrier -- in fact it already acts as such e.g. for global memory accesses. (It would also be easy to implement as barrier on the MI / back-end level.) In addition, this approach would not lead to any performance penalty outside of FENV_ACCESS ON regions at all.

But given that there is still infrastructure missing in the IR optimizers, I also think that at least in the first implementation, we probably should go with the original approach and just use constrained intrinsics everywhere in the function, and possibly add some function attribute that prevent any cross-inlining of functions built with constrained intrinsics with functions built with regular floating-point operations.

uweigand added a comment.EditedNov 20 2018, 9:27 AM

I agree that it's preferable to re-use these existing options if possible. I have some concerns that -ftrapping-math has a partial implementation in place that doesn't seem to be well aligned with the way fast-math flags are handled, so it might require some work to have that working as expected without breaking existing users. In general though these seem like they should do what we need.

So as far as I can see, the current implementation of -f(no-)trapping-math in LLVM is pretty much a stub. The driver passes the option on to CC1, and in addition it disables -menable-unsafe-fp-math and -massociative-math if -ftrapping-math is in effect (the latter may be a minor difference to GCC, but should be conservatively compatible). The compiler itself does nothing except mark all functions with the no-trapping-math attribute if -fno-trapping-math is in effect. LLVM itself completely ignores this attribute except on ARM, where it is used to set the ABI_FP_exceptions extended attribute. (But since LLVM doesn't really implement the flag in actual codegen, setting the attribute to indicate the property seems a bit odd ...)

Regarding GCC compatibility, I notice that GCC defaults to trapping math being enabled and I don't think that's what we want with clang. It also seems to imply something more than I think we need for constrained handling. For example, the GCC documentation says that -fno-trapping-math "can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions" so it sounds like maybe it also implies (for GCC) something like LLVM's "afn" fast math flag.

So if we are going to use these options, I think we need to have a discussion about whether or not it's OK to diverge from GCC's interpretation of them.

The GCC documentation says:

-fno-trapping-math

Compile code assuming that floating-point operations cannot generate user-visible traps. These traps include division by zero, overflow, underflow, inexact result and invalid operation. This option requires that -fno-signaling-nans be in effect. Setting this option may allow faster code if one relies on ``non-stop'' IEEE arithmetic, for example.

This option should never be turned on by any -O option since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

The default is -ftrapping-math.

As far as I can tell, this does not imply any additional differences except that -fno-trapping-math may rearrange floating-point operations to eliminate or introduce traps or changes to exception flags. It is those changes themselves (not anything else) that can result in incorrect output for programs that depend on exact IEEE semantics, and therefore this option must not be enabled by default at any optimization level, as the second paragraph above states.

Note that in GCC, -fno-trapping-math is implied by -funsafe-math-optimizations, which in turn is implied by -ffast-math. Those options all are documented using the same "incorrect output for programs that depend on exact IEEE semantics" clause: -funsafe-math-optimization is the catch-all for all such effects, but it is separated into -fno-signed-zeros, -fno-trapping-math, -fassociative-math, and -freciprocal-math that cover the distinct aspects of just how IEEE semantics might get violated.

We can of course still have a separate discussion on what the default should be. But even if we choose a different default than GCC (as is already the case e.g. for the -ffp-contract option), I think it would be preferable for the two explicit options to behave in a compatible way.

In D53157#1304275, @kpn wrote:

If we all agree upon that, then we simply have to treat the functions that modify the FPEnv, e.g. fesetexcept(...), as barriers. That way it does not matter if a FENV_ACCESS=OFF function is translated with constrained intrinsics or not, since nothing can be scheduled around these barriers.

I thought we couldn't do barriers. No barriers means no way to prevent code motion and mixing of constrained with non-constrained FP. That was the reason for having all FP in a function be constrained if any of it was.

I'd like to hear more about this. The fesetexcept(...) function and friends change the FPEnv state, which can change the semantics for the instructions that follow. They definitely have to act as a barrier.

In D53157#1304275, @kpn wrote:

If we all agree upon that, then we simply have to treat the functions that modify the FPEnv, e.g. fesetexcept(...), as barriers. That way it does not matter if a FENV_ACCESS=OFF function is translated with constrained intrinsics or not, since nothing can be scheduled around these barriers.

I thought we couldn't do barriers. No barriers means no way to prevent code motion and mixing of constrained with non-constrained FP. That was the reason for having all FP in a function be constrained if any of it was.

I'd like to hear more about this. The fesetexcept(...) function and friends change the FPEnv state, which can change the semantics for the instructions that follow. They definitely have to act as a barrier.

There's no sense in which we can have a code-motion barrier within a function that acts on the regular FP instructions. We can have a barrier for the constrained intrinsics. This is why we need, in this design, for any function that uses FENV_ACCESS=ON for any part of it, to always use the constrained instrinsics.

OK, let me try to expand on my point 3 above, which appears to have confused everybody :-)

Ah, that makes more sense.

The problem I was missing is when a FENV_ACCESS=OFF operation, like a FDIV, is hoisted into a FENV_ACCESS=ON region. I see it now, but still think that forcing FENV_ACCESS=OFF operations into constrained intrinsics is a big hammer. If there is a way to add barriers around function calls in a FENV_ACCESS=ON region, that would be better.

Stepping way back, the *best* solution is to have the FPEnv implementation shut down unsafe optimizations on an individual basis. Perhaps we should be tagging functions that contain FENV_ACCESS=ON regions. That way unsafe optimization passes, e.g. hoisting, can query the tag to see if they should skip these functions.

But given that there is still infrastructure missing in the IR optimizers, I also think that at least in the first implementation, we probably should go with the original approach and just use constrained intrinsics everywhere in the function, and possibly add some function attribute that prevent any cross-inlining of functions built with constrained intrinsics with functions built with regular floating-point operations.

Yes, I'll stop moaning about this. This is perfectly suitable for a first implementation.

I'd like to hear more about this. The fesetexcept(...) function and friends change the FPEnv state, which can change the semantics for the instructions that follow. They definitely have to act as a barrier.

There's no sense in which we can have a code-motion barrier within a function that acts on the regular FP instructions. We can have a barrier for the constrained intrinsics. This is why we need, in this design, for any function that uses FENV_ACCESS=ON for any part of it, to always use the constrained instrinsics.

Thanks. Ulrich's description helped me see the problem.

The problem I was missing is when a FENV_ACCESS=OFF operation, like a FDIV, is hoisted into a FENV_ACCESS=ON region. I see it now, but still think that forcing FENV_ACCESS=OFF operations into constrained intrinsics is a big hammer. If there is a way to add barriers around function calls in a FENV_ACCESS=ON region, that would be better.

It may be a big hammer, but I don't think we'll need to use it terribly often (that is, I don't expect the mixing of FENV_ACCESS regions within a function to be common -- I could be wrong). In any event, we've been clear that choosing to enable FP environment access will mean sacrificing performance in the near term. Once we've had time to teach all the relevant optimizations how to handle the constrained intrinsics the hammer will start to feel much smaller.

Stepping way back, the *best* solution is to have the FPEnv implementation shut down unsafe optimizations on an individual basis. Perhaps we should be tagging functions that contain FENV_ACCESS=ON regions. That way unsafe optimization passes, e.g. hoisting, can query the tag to see if they should skip these functions.

One of the reasons we went with the approach we did is that it provides conservatively correct results by default. Optimizations don't need to be taught to leave the intrinsics alone. They do that as a default behavior when they don't recognize the intrinsic. If we had required optimizations to opt out as needed, we'd have problems chasing down all the places where that needed to happen. So we chose the opposite challenge of using the big hammer for initial implementation and then having to go looking for the places that needed to be taught how to interpret the intrinsics for cases when the optimization can still be performed.

I think at this point we're all on the same page in this regard. I just wanted to make sure we also understand why we're on that page. I still believe it was the correct choice.

I think at this point we're all on the same page in this regard. I just wanted to make sure we also understand why we're on that page. I still believe it was the correct choice.

Yup, we're on the same page.

To explain where I'm coming from, our in-house compiler has tracked down *many* of the trap-unsafe transformations in LLVM and can toggle them as desired. The execution penalty for running trap-safe code is (much) less than 5% with that compiler. My goal for this project is to make Clang competitive with that compiler.

kpn added a comment.Nov 21 2018, 5:42 AM

But given that there is still infrastructure missing in the IR optimizers, I also think that at least in the first implementation, we probably should go with the original approach and just use constrained intrinsics everywhere in the function, and possibly add some function attribute that prevent any cross-inlining of functions built with constrained intrinsics with functions built with regular floating-point operations.

Subtle. This last sentence seems to imply that cross-inlining should be allowed when there are no regular floating point operations in the function to be inlined. This makes sense due to, for example, the common use of tiny functions just to retrieve a value. Do I interpret your statement correctly?

In D53157#1305233, @kpn wrote:

But given that there is still infrastructure missing in the IR optimizers, I also think that at least in the first implementation, we probably should go with the original approach and just use constrained intrinsics everywhere in the function, and possibly add some function attribute that prevent any cross-inlining of functions built with constrained intrinsics with functions built with regular floating-point operations.

Subtle. This last sentence seems to imply that cross-inlining should be allowed when there are no regular floating point operations in the function to be inlined. This makes sense due to, for example, the common use of tiny functions just to retrieve a value. Do I interpret your statement correctly?

Sure, that would be an optimization. Another potential optimization would be to allow inlining, but have the inliner automatically replace regular floating-point operations with constrained intrinsics in the target routine ... But all of that is probably "stage2" after an initial implementation that just disables inlining.

Digressing a bit, but has anyone given thought to how this implementation will play out with libraries? When running with traps enabled, libraries must be compiled for trap-safety. E.g. optimizations on a lib's code could introduce a NaN or cause a trap that does not exist in the code itself.

I'm wondering if the speed penalty for trap-safe code will be a problem with the general Clang population that does not care about traps...

Digressing a bit, but has anyone given thought to how this implementation will play out with libraries? When running with traps enabled, libraries must be compiled for trap-safety. E.g. optimizations on a lib's code could introduce a NaN or cause a trap that does not exist in the code itself.

In my reading of the standard text, there is no special provision for library code. This means that in general, calling any library function while running with nondefault trap settings is undefined behavior, unless the library was itself compiled with FENV_ACCESS ON. There does not even appear to be any requirement for the C standard library functions to have been compiled with FENV_ACCESS ON, as far as I can see ...

kpn updated this revision to Diff 176432.Dec 3 2018, 10:51 AM

I've changed the patch so that calls to CreateFAdd() et al will give you constrained intrinsics if they are enabled. This required adding functions to enable/disable constrained-as-default plus calls to deal with the rounding mode and exception handling for contrained intrinsics.

I've left the independent create constrained functions to make it dead easy to move them out of the header and avoid inlining bloat.

I'm not dealing with any fast math flags or metadata currently. I'm not sure they make sense in a constrained context, or at least I haven't seen anyone argue they must be in place for an initial implementation.

In my reading of the standard text, there is no special provision for library code. This means that in general, calling any library function while running with nondefault trap settings is undefined behavior, unless the library was itself compiled with FENV_ACCESS ON. There does not even appear to be any requirement for the C standard library functions to have been compiled with FENV_ACCESS ON, as far as I can see ...

I'll take this conversation to llvm-dev in the near future. I don't want to detract from Kevin's work.

Apologies for highjacking this Diff, Kevin.

include/llvm/IR/IRBuilder.h
1323

I think you can forgo the else { in these functions since the if branch returns immediately.

kpn updated this revision to Diff 176464.Dec 3 2018, 1:13 PM
kpn marked an inline comment as done.

Address review comment: Shrink the diff by eliminating unneeded use of else.

kpn added a comment.Feb 5 2019, 9:57 AM

I emailed llvm-dev and cfe-dev on January 16th. The only responses I got back were of the don't care variety. For now it seems the constrained intrinsics will only be used by clang.

@rsmith, does the direction of this patch seem reasonable for clang?

Once Richard comments is there anything else needing us to delay here?

kpn added a reviewer: rsmith.May 10 2019, 10:18 AM
kpn added a subscriber: kbarton.May 10 2019, 12:33 PM

I think this looks straightforward, as long as people agree to have a separate CreateConstrained* version of these functions. I'm not qualified to weigh in on that as I don't know Clang at all and can't comment about the tradeoffs (although I think they have been well articulated in the discussion). From what I can see, that is the only thing blocking this from getting approved (unless there is something else I missed while reading the discussion).

The only other comment I have is there was some very good description about the intention here from @uweigand, @cameron.mcinally and @andrew.w.kaylor and @kpn. I think it would be good if that discussion is extracted from this review and put somewhere (perhaps the language ref) to indicate precisely what the semantics are we are trying to achieve with FENV_ACCESS ON/OFF.

include/llvm/IR/IRBuilder.h
228

This is a minor quibble, but the method is setIsConstrainedFP, while the member is IsFPConstrained.
I'm not sure if that is intentionally different, or an oversight.

kpn marked an inline comment as done.May 15 2019, 1:12 PM

I'm waiting for a signoff from at least one front-end guy. I hope the mode setting that allows us to keep front ends from having to touch every use of the IRBuilder is something that works for clang. But I haven't heard anything from @rsmith or any other front-end person.

include/llvm/IR/IRBuilder.h
228

Yeah, that's an oversight. Fixed.

include/llvm/IR/IRBuilder.h
234

I think it would be better to add some enumerated type to describe the exception semantic and rounding modes. The MDNode is an implementation detail and an awkward one for the front end to have to deal with.

erichkeane added inline comments.
include/llvm/IR/IRBuilder.h
111

Instead of adding these here, make these inline initializers on lines 100-102.

228

IS this fixed?

1329

Why set the last 2 to nullptr when you have defaults for these?

1340

The last 2 parameters are never actually used except in the test. Are these really important to have if they are never used in source?

1438

All these CreateConstrainedXXX should be distilled down to a single function that takes the intrinsic as a parameter.

I've posted a small change to this patch here, https://reviews.llvm.org/D62730 and there's a patch to add clang fp options here, https://reviews.llvm.org/D62731

include/llvm/IR/IRBuilder.h
234

I posted a patch showing the rounding and exception information being passed as enumeration not MDNode. I've uploaded the patch here, https://reviews.llvm.org/D62730

kpn updated this revision to Diff 202436.Fri, May 31, 7:54 AM
kpn marked 4 inline comments as done.

Address review comments.

I don't know which patch will proceed; but they weren't added already, this seems to be missing documentation changes.

kpn added a comment.Fri, May 31, 8:04 AM

I wanted to get the API straight before working on documentation.

include/llvm/IR/IRBuilder.h
228

In my working copy, yes.

234

It's a good idea. D62730 beat me to the punch.

1329

Eh, no really good reason.

1340

They're part of the intrinsics. It seems like there should be a way to emit them as part of the instruction when creating the instruction.

1438

That moves us further down the road towards having modes. I haven't seen a front-end guy sign off on it yet, but it seems like a good idea.

mibintc added inline comments.Fri, May 31, 8:17 AM
include/llvm/IR/IRBuilder.h
234

BTW I created FPState.h because if I put the enumeration types in IRBuilder it dragged in too many include files in clang and required other random source changes for disambigutation. Anyway I don't need to comandeer your patch -- I don't want it to die from neglect. I can close mine?

kpn added a comment.Fri, May 31, 9:52 AM

Oh, this ticket is not going to die from neglect. It is true that D43515 is a higher priority, but I need this IRBuilder work done as well. My department head wanted it done by the end of _last_ year. It's not going to die.

How about I merge your changes into this ticket and we continue work over here? There is the issue of the documentation that lebedev.ri asked you to write. Can I talk you into putting that together and sending it to me <kevin.neal@sas.com>? I'll work on the documentation that I was asked to write. Between the two of us we should be in pretty good shape. Does that work for you?

I am still waiting for feedback from an actual consumer of the IRBuilder who will be using this new functionality. If someone clang-side could chime in on this ticket I'd very much appreciate it.

In D53157#1525311, @kpn wrote:

Oh, this ticket is not going to die from neglect. It is true that D43515 is a higher priority, but I need this IRBuilder work done as well. My department head wanted it done by the end of _last_ year. It's not going to die.

How about I merge your changes into this ticket and we continue work over here? There is the issue of the documentation that lebedev.ri asked you to write. Can I talk you into putting that together and sending it to me <kevin.neal@sas.com>? I'll work on the documentation that I was asked to write. Between the two of us we should be in pretty good shape. Does that work for you?

Yes I'll do that. Thanks.

I am still waiting for feedback from an actual consumer of the IRBuilder who will be using this new functionality. If someone clang-side could chime in on this ticket I'd very much appreciate it.

I wrote a clang patch that works with your IRBuilder modifications. I also checked with the Intel fortran team and they think this interface will be workable for them too. From Intel perspective it's +1.

kpn updated this revision to Diff 202987.Tue, Jun 4, 11:04 AM

Address the rest of the review comments, hopefully.

I've lifted from mibintc's D62730 the enums for the exception behavior and rounding modes. The rest of that set of changes is better left to a separate ticket after all.

Since I now have those requested enums I've changed the relevant public functions to take them instead of an MDNode*.

There is no standalone or comprehensive IRBuilder documentation that I could find. I hope the references in comments back to the LangRef are good enough. I used the exact language from it to make it easier to reference said doc.

simoll added a subscriber: simoll.Mon, Jun 17, 5:29 AM

This works for me, I redid my patches adding fp-model options to work with this newest changeset, (but I haven't yet uploaded them.)

rjmccall added inline comments.Mon, Jun 17, 9:50 AM
include/llvm/IR/IRBuilder.h
113

Should these have "FP" in the name somewhere? And are they really IRBuilder-specific concepts, as opposed to something that should be declared as part of the interface for working with these intrinsics?

Also, I believe we can use explicit underlying types now in LLVM; it'd be nice if we didn't make IRBuilder unnecessarily large.

228

Maybe this should be more explicit about what exactly it does? Specifically, it changes the behavior of the CreateF<Op> methods so that they use constrained intrinsics instead of the standard instructions.

231

This seems unnecessary.

1115

Huh? You build an MDNode that wraps an MDString and then immediately extract the MDString from it and drop the MDNode?

I think you should just have a function somewhere that returns the correct metadata string for a ConstrainedRoundingKind, and then the code here is much more obvious. Maybe this can be wherever you declare the enums. You can also have a function that goes the other way and returns an Optional<ConstrainedRoundingKind>.

Function name should include FP.

Same points apply to the next function.

kpn marked an inline comment as done.Mon, Jun 17, 10:10 AM
kpn added inline comments.
include/llvm/IR/IRBuilder.h
113

Would it be better to use the RoundingMode and ExceptionBehavior enums in the ConstrainedFPIntrinsic class? These enums->strings here get turned back into those IntrinsicInst.h enums eventually anyway. But that means pulling in yet more headers in IRBuilder.h.

I admit I'm not sure what you mean with your second paragraph. Is that a way of saying that, for example, the relevant IntrinsicInst.h enums should be used instead?

rjmccall added inline comments.Mon, Jun 17, 11:00 AM
include/llvm/IR/IRBuilder.h
113

I think it's IRBuilder.h's fate to pull in the majority of the headers in IR, but even if we want to avoid that, duplicating the enums seems like a step too far. If there's too much code in the header declaring that intrinsic, you should extract a smaller header that just declares the enums and their string conversions.

The second paragraph is asking for these enums to be specifically constrained to uint8_t so that we aren't wasting a ton of memory everywhere we store them. IRBuilder doesn't have strong size constraints, but it'd still be nice if all these accumulated features didn't make it hundreds of bytes larger than it needs to be.

kpn updated this revision to Diff 205151.Mon, Jun 17, 12:44 PM
kpn marked 3 inline comments as done.

Address review comments.

Run the changes through clang-format.

include/llvm/IR/IRBuilder.h
1115

Well, when you put it that way, your proposed change does result in much simpler code.

The resulting code isn't that much more than the switch. I hope it is now easy enough to read. These functions are private to the class and just exist to be helpers. I don't see a need to put the switch table into a different function, but I'm willing to be convinced.

rjmccall added inline comments.Mon, Jun 17, 12:55 PM
include/llvm/IR/IRBuilder.h
1115

I feel like having the enum->string mapping available somewhere independent of IRBuilder will probably be useful to someone, and it nicely separates concerns in any case . You could just make it a static method on ConstrainedFPIntrinsic along with the reverse operation.

kpn marked an inline comment as done.Mon, Jun 17, 1:06 PM
kpn added inline comments.
include/llvm/IR/IRBuilder.h
1115

Ok, will do.

kpn updated this revision to Diff 205371.Tue, Jun 18, 9:06 AM

Add static methods to convert between a StringRef and the enums for RoundingMode or ExceptionBehavior.

rjmccall added inline comments.Tue, Jun 18, 10:28 AM
include/llvm/IR/IntrinsicInst.h
235

Is it okay that ebUnspecified and ebInvalid overlap here?

245

You should document the behavior of the StrTo... functions when the string is unexpected and the ...ToStr functions when the enum is invalid/unspecified.

kpn marked an inline comment as done.Tue, Jun 18, 10:54 AM
kpn added inline comments.
include/llvm/IR/IntrinsicInst.h
235

I can think of a couple of alternatives. If they don't overlap then we have to go back and sweep the source to make sure that ebUnspecified is always handled in all cases that currently handle ebInvalid. And in the future nobody is allowed to check in source that doesn't handle both.

Or, we could drop ebUnspecified, but then ebInvalid would have a valid meaning to the IRBuilder interface. That looks like a bug even if it works properly.

Generally, adding eb/rmUnspecified but having them overlap the invalid cases seems to me to be the best option.

rjmccall added inline comments.Tue, Jun 18, 12:15 PM
include/llvm/IR/IntrinsicInst.h
235

What is the purpose of ebInvalid? Is this a valid argument to the intrinsic, or is it there so that getExceptionBehavior() has something to return if it doesn't recognize the argument? For the latter, maybe it would be better to just assert that the argument was one of the recognized possibilities; we don't generally expect LLVM to silently propagate ill-formed IR. Or if it's valid to not provide exception behavior to the intrinsic, maybe that's the same as ebUnspecified, or maybe the accessor should return Optional<ExceptionBehavior>.

Andy, can I get you to chime in?

kpn added a comment.Thu, Jun 20, 8:45 AM

I found out that some compilers, including SAS/C, will warn about enum overlaps like the one here under discussion. So I now believe John is correct and that eliminating that overlap is the right thing to do. Over the weekend I'll be thinking about whether or not to try and find an alternative, but I'll be out of cell phone range until late Sunday evening.