This is an archive of the discontinued LLVM Phabricator instance.

Add intrinsics for constrained floating point operations
ClosedPublic

Authored by andrew.w.kaylor on Nov 22 2016, 5:32 PM.

Details

Summary

This adds intrinsics that can be used to constrain optimizations that assume the default rounding mode and ignore FP exceptions.

I am starting with a simple lowering that translates the intrinsics directly to the corresponding target-independent FP operations when the selection DAG is built. I think this is necessary because the existing target-specific pattern matching for FP operations is fairly complex and I think attempting to duplicate all of that infrastructure is likely to result in errors and will certainly result in more work each time new wrinkles are introduced in the future.

I intend to model the implicit uses and defs of FP control and status registers in a future revision, which should be sufficient to prevent unwanted optimizations at the MachineInstr level. There is also a potential for incorrect code motion in during instruction selection. My tentative plan to handle that is to introduce pseudo-instruction nodes that wrap the inputs and outputs of the FP operations created based on the new intrinsics and effectively model the implicit FP control and status register behavior and use a chain. These nodes would then be eliminated during instruction selection.

I believe that this patch is sufficient as presented here and can be committed without any solution to the potential problems with code motion at the ISel and MachineInstr levels.

Diff Detail

Repository
rL LLVM

Event Timeline

andrew.w.kaylor retitled this revision from to Add intrinsics for constrained floating point operations.
andrew.w.kaylor updated this object.
andrew.w.kaylor set the repository for this revision to rL LLVM.
andrew.w.kaylor added a subscriber: llvm-commits.
hfinkel edited edge metadata.Nov 23 2016, 6:31 AM

I intend to model the implicit uses and defs of FP control and status registers in a future revision, which should be sufficient to prevent unwanted optimizations at the MachineInstr level. There is also a potential for incorrect code motion in during instruction selection. My tentative plan to handle that is to introduce pseudo-instruction nodes that wrap the inputs and outputs of the FP operations created based on the new intrinsics and effectively model the implicit FP control and status register behavior and use a chain. These nodes would then be eliminated during instruction selection.

Based on our conversation at the dev meeting, here's how I thought this would work:

  1. Introduce target-independent chain-carrying nodes to represent these operations. For argument's sake, STRICT_FADD, etc.
  2. Since nothing in the SDAG knows about what these nodes do, there's no problem with optimizations doing bad things.
  3. You'd do something minimal in SelectionDAGISel::DoInstructionSelection() around the call to:

    Select(Node);

so that it would become:

bool IsStrictFPOp = isStrictFPOp(Node);
if (IsStrictFPOp)
  mutateStrictFPToFP(Node); // STRICT_FADD -> FADD, etc.

Select(Node);

if (IsStrictFPOp && !TLI->addStrictFPRegDeps(Node))
  report_fatal_error("Could not add strict FP reg deps");

and then you'd be done. Obviously this is somewhat hand wavy, but if there are complexities I'm overlooking, I'd like to understand them.

Based on our conversation at the dev meeting, here's how I thought this would work:

  1. Introduce target-independent chain-carrying nodes to represent these operations. For argument's sake, STRICT_FADD, etc.
  2. Since nothing in the SDAG knows about what these nodes do, there's no problem with optimizations doing bad things.
  3. You'd do something minimal in SelectionDAGISel::DoInstructionSelection() around the call to:

    Select(Node);

so that it would become:

bool IsStrictFPOp = isStrictFPOp(Node);
if (IsStrictFPOp)
  mutateStrictFPToFP(Node); // STRICT_FADD -> FADD, etc.

Select(Node);

if (IsStrictFPOp && !TLI->addStrictFPRegDeps(Node))
  report_fatal_error("Could not add strict FP reg deps");

and then you'd be done. Obviously this is somewhat hand wavy, but if there are complexities I'm overlooking, I'd like to understand them.

Yes, that's exactly what I wanted to do, but I couldn't figure out how to do it. This is my first time exploring in ISel and it's been a bit intimidating. The way you describe it sounds very simple. I guess I was just reluctant to do that because there are currently no target-independent transformations happening there. I tried putting something in SelectionDAGISel::SelectCommonCode() but it seemed like that was too late for the kind of double transformation I needed (STRICT_FADD->FADD->target-specific instruction). I'm always hesitant to believe that my new feature is special and really needs something that has never been needed before. Usually that's a signal that I'm misunderstanding the design. So I tried to do something that seemed a more natural fit with the existing code.

I'll go back and see what I can do with it following the model you describe here and see what it looks like.

arsenm added a subscriber: arsenm.Nov 23 2016, 11:13 AM

As I mentioned at the meeting I think these need a way to control whether denormals are flushed or not

As I mentioned at the meeting I think these need a way to control whether denormals are flushed or not

You mean that there should be a way to control this on a per-operation basis, or there should be some way to represent that the user might be changing some thread state that controls how this is done?

As I mentioned at the meeting I think these need a way to control whether denormals are flushed or not

You mean that there should be a way to control this on a per-operation basis, or there should be some way to represent that the user might be changing some thread state that controls how this is done?

I was going to ask that same question.

For the purposes of these intrinsics, we only need enough information to control how optimization passes will handle these instructions. I suppose a transformation like constant folding could theoretically have something to flush to zero. Does it make sense to treat that as a set of new rounding mode, like LLVM_ROUND_TONEAREST_FTZ, etc.?

As I mentioned at the meeting I think these need a way to control whether denormals are flushed or not

You mean that there should be a way to control this on a per-operation basis, or there should be some way to represent that the user might be changing some thread state that controls how this is done?

I was going to ask that same question.

For the purposes of these intrinsics, we only need enough information to control how optimization passes will handle these instructions. I suppose a transformation like constant folding could theoretically have something to flush to zero. Does it make sense to treat that as a set of new rounding mode, like LLVM_ROUND_TONEAREST_FTZ, etc.?

Yes, I think it makes sense to treat it the same as a rounding mode. For lowering some of these operations, we would need to change the denormal mode for some subsequence of the lowering. Some of the OpenCL library builtin functions also need to do this, and need to be able to control this via the intrinsic.

Is flush-to-zero currently handled with function attributes?

Also, how would you feel about deferring flush-to-zero support to a later patch?

Is flush-to-zero currently handled with function attributes?

Also, how would you feel about deferring flush-to-zero support to a later patch?

It's not handled in general llvm. We currently have a subtarget feature for f32/f64 denormal support in the default rounding mode

It's not handled in general llvm. We currently have a subtarget feature for f32/f64 denormal support in the default rounding mode

It looks like several sub-targets have some kind of denormal support, but I didn't look closely enough to see what each of them is doing. That's kind of why I would prefer to defer the issue...because I haven't thought about it enough to know that I'd be implementing it correctly.

So to get back to Hal's question, is this needed at a per-instruction level? I understand that you want to select instructions differently based on this setting, but is attaching it to the instruction just a convenient way to get the information to the ISel or can this change across different scopes?

It's not handled in general llvm. We currently have a subtarget feature for f32/f64 denormal support in the default rounding mode

It looks like several sub-targets have some kind of denormal support, but I didn't look closely enough to see what each of them is doing. That's kind of why I would prefer to defer the issue...because I haven't thought about it enough to know that I'd be implementing it correctly.

So to get back to Hal's question, is this needed at a per-instruction level? I understand that you want to select instructions differently based on this setting, but is attaching it to the instruction just a convenient way to get the information to the ISel or can this change across different scopes?

It's the same as the rounding mode and be per-instruction

andrew.w.kaylor edited edge metadata.

I re-wrote the ISel code to introduce a pseudo-instruction for the strict variants of FP operations, which is mutated directly to a normal FP node just before instruction selection. I removed the SDNodeFlag extensions I had added in my original patch because they aren't needed in this implementation. Some variant of that code will likely need to be re-introduced at some point, particularly to handle FTZ rounding modes.

I did not introduce the code to add a target-specific implicit use of the FP-environment register. I do intend to add this in a later patch, but this code works well enough without it that I believe it can be deferred.

At some point it will be necessary to teach the LegalizeDAG code how to handle the strict instructions in order to legalize strict FP nodes with vector operands. I intend to do that in a later patch also.

I did not introduce the code to add a target-specific implicit use of the FP-environment register. I do intend to add this in a later patch, but this code works well enough without it that I believe it can be deferred.

I'm fine with doing this, but there should be FIXME comments in the code about this and/or bug reports filed (preferably both) so that we've not tempted to remove 'experimental' from the name until we do this part of it.

Regarding denormals, LLVM has support for the following modes (include/llvm/Target/TargetOptions.h):

namespace FPDenormal {
  enum DenormalMode {
    IEEE,           // IEEE 754 denormal numbers
    PreserveSign,   // the sign of a flushed-to-zero number is preserved in
                    // the sign of 0
    PositiveZero    // denormals are flushed to positive zero
  };
}

and so I recommend just following the same pattern here as with the rounding mode with strings like "denormals.dynamic", "denormals.ieee", "denormals.preserve_sign", and "denormals.positive_zero".

docs/LangRef.rst
12046

To engage in some bikeshedding, I really don't like this naming convention which reminds me of macro names. Not that we've been extremely consistent about the standardized metadata strings we already use, but this would be yet another choice; we should avoid that.

For controlling LLVM optimization passes, we use strings like this !"llvm.loop.vectorize.enable". We also use strings like !"function_entry_count" and !"branch_weights" for profiling info. We also have strings like !"ProfileFormat" and !"TotalCount" in other places. Of these choices, I prefer the one used by our basic profiling information (i.e. !"branch_weights"), and so I'd prefer these be named:

"round_dynamic"
"round_downward"
...

and similar for the fpexcept strings. We might also borrow the dot-based hierarchy scheme from our loop optimization metadata and use:

"round.dynamic"
"round.downward"
 ...

and similar for the fpexcept strings. I think that I like this slightly better than just using the underscore separators.

I specifically dislike having "llvm" in the name here, as it makes it seem as though the meaning of these things is somehow LLVM-specific. It is not (they come from IEEE if nothing else). Having "llvm" in the optimization metadata makes a bit more sense because those refer to specific LLVM optimization passes.

12083

"not be unmasked" is a double negative. How about just saying will be masked? Or will not be enabled?

Few stylistic comments.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5773

No need for a SmallVector. Just create a 3 entry array and assign each element directly.

lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
955

This doesn't need to be a SmallVector. You can just use

SDValue Ops[2] = { Node->getOperand(1), Node->getOperand(2) };

lib/IR/IntrinsicInst.cpp
107

Don't use 'else' after an 'if' containing a 'return' per coding standards.

lib/IR/IntrinsicInst.cpp
107

I can make that change. What I wanted to convey here is that this is effectively a switch statement with no uncovered cases. I suppose since there is no way to get the compiler to warn if that ever changes there is no particular benefit in doing it this way.

craig.topper added inline comments.Dec 5 2016, 11:43 AM
lib/IR/IntrinsicInst.cpp
108

You could maybe use a StringSwitch.

lib/IR/IntrinsicInst.cpp
108

That would be perfect, but can I put an llvm_unreachable() in a StringSwitch?

craig.topper added inline comments.Dec 5 2016, 12:12 PM
lib/IR/IntrinsicInst.cpp
108

I think StringSwitch will assert if it doesn't find a match and there is no default specified.

andrew.w.kaylor added inline comments.Dec 5 2016, 2:41 PM
lib/IR/IntrinsicInst.cpp
108

Yes, I saw that in the code. I wanted something that was informative in non-assert mode too. I guess the obvious solution to that is to add something like an UnreachableDefault() method to StringSwitch.

I don't think llvm_unreachable is guaranteed to do anything in non-asserts build. From the description in the header:

/ Marks that the current location is not supposed to be reachable.
/ In !NDEBUG builds, prints the message and location info to stderr.
/ In NDEBUG builds, becomes an optimizer hint that the current location
/ is not supposed to be reachable. On compilers that don't support
/ such hints, prints a reduced message instead.
/
/ Use this instead of assert(0). It conveys intent more clearly and
/ allows compilers to omit some unnecessary code.
#ifndef NDEBUG
#define llvm_unreachable(msg) \

::llvm::llvm_unreachable_internal(msg, __FILE__, __LINE__)

#elif defined(LLVM_BUILTIN_UNREACHABLE)
#define llvm_unreachable(msg) LLVM_BUILTIN_UNREACHABLE
#else
#define llvm_unreachable(msg) ::llvm::llvm_unreachable_internal()
#endif

mehdi_amini added inline comments.Dec 5 2016, 2:49 PM
lib/IR/IntrinsicInst.cpp
108

"Informative" as in "triggers a runtime failure"? Unreachable isn't the right solution, it is UB in release mode AFAIK.

I see. I still think it's worth adding UnreachableDefault() instead of just a comment explaining that other values will assert/crash. I'm checking the legal values in the verfifier, so this shouldn't be an issue.

mehdi_amini edited edge metadata.Dec 5 2016, 2:59 PM

Having the llvm_unreachable right after the StringSwitch should achieve the same thing.
Otherwise a possible more generic extension to StringSwitch would be to accept lambdas:

StringSwitch(...)
   .Case("Value1", [] { return computeValue(); })
   .Default([] -> T { report_fatal_error("Unexpected Value XXX in YYYY"); }

Having the llvm_unreachable right after the StringSwitch should achieve the same thing.

I'm not sure I understand what you're suggesting here. If I do this:

return StringSwitch<RoundingMode>(RoundingArg)
  .Case("round.dynamic",    rmDynamic)
  .Case("round.tonearest",  rmToNearest)
  .Case("round.downward",   rmDownward)
  .Case("round.upward",     rmUpward)
  .Case("round.towardzero", rmTowardZero);
llvm_unreachable("Unexpected rounding mode argument in FP intrinsic!");

the llvm_unreachable statement is purely unreachable because the implicit default from StringSwitch will assert or dereference a null pointer. The llvm_unreachable in this case effectively becomes a comment. So what I'd like to do is this:

return StringSwitch<RoundingMode>(RoundingArg)
  .Case("round.dynamic",    rmDynamic)
  .Case("round.tonearest",  rmToNearest)
  .Case("round.downward",   rmDownward)
  .Case("round.upward",     rmUpward)
  .Case("round.towardzero", rmTowardZero)
  .UnreachableDefault("Unexpected rounding mode argument in FP intrinsic!");

which at least in !NDEBUG builds will produce a useful message. This would be fairly trivial to implement.

Having the llvm_unreachable right after the StringSwitch should achieve the same thing.

I'm not sure I understand what you're suggesting here. If I do this:

return StringSwitch<RoundingMode>(RoundingArg)
  .Case("round.dynamic",    rmDynamic)
  .Case("round.tonearest",  rmToNearest)
  .Case("round.downward",   rmDownward)
  .Case("round.upward",     rmUpward)
  .Case("round.towardzero", rmTowardZero);
llvm_unreachable("Unexpected rounding mode argument in FP intrinsic!");

the llvm_unreachable statement is purely unreachable because the implicit default from StringSwitch will assert or dereference a null pointer. The llvm_unreachable in this case effectively becomes a comment.

Not really: in an optimized build it means that the default case is unreachable. The assertion does not exist there, and the optimizer can drop the nullptr dereference.

So what I'd like to do is this:

return StringSwitch<RoundingMode>(RoundingArg)
  .Case("round.dynamic",    rmDynamic)
  .Case("round.tonearest",  rmToNearest)
  .Case("round.downward",   rmDownward)
  .Case("round.upward",     rmUpward)
  .Case("round.towardzero", rmTowardZero)
  .UnreachableDefault("Unexpected rounding mode argument in FP intrinsic!");

which at least in !NDEBUG builds will produce a useful message. This would be fairly trivial to implement.

Right the only difference is that you get a nicer runtime failure in assert mode.

arsenm added inline comments.Dec 5 2016, 5:01 PM
docs/LangRef.rst
12126

Typo: lllvm (seems repeated a number of times)

arsenm added inline comments.Dec 5 2016, 5:04 PM
include/llvm/IR/Intrinsics.td
478

Also fma, minnum, maxnum, sqrt

andrew.w.kaylor added inline comments.Dec 5 2016, 5:19 PM
include/llvm/IR/Intrinsics.td
478

That's a bit of a can of worms, isn't it? I mean there are a whole bunch of FP intrinsics, and I can imagine it making sense to have constrained versions of any of them.

Having the llvm_unreachable right after the StringSwitch should achieve the same thing.

I'm not sure I understand what you're suggesting here. If I do this:

return StringSwitch<RoundingMode>(RoundingArg)
  .Case("round.dynamic",    rmDynamic)
  .Case("round.tonearest",  rmToNearest)
  .Case("round.downward",   rmDownward)
  .Case("round.upward",     rmUpward)
  .Case("round.towardzero", rmTowardZero);
llvm_unreachable("Unexpected rounding mode argument in FP intrinsic!");

the llvm_unreachable statement is purely unreachable because the implicit default from StringSwitch will assert or dereference a null pointer. The llvm_unreachable in this case effectively becomes a comment.

Not really: in an optimized build it means that the default case is unreachable. The assertion does not exist there, and the optimizer can drop the nullptr dereference.

But the 'return' will always return, won't it? I don't see how the optimizer could associate the llvm_unreachable statement with the 'return StringSwitch(...)' to be able to do anything with it.

I'm in the process of preparing an updated revision that implements UnreachableDefault as I suggested, but I'm not overly attached to that approach. If you feel strongly that something else would be better I can change it.

andrew.w.kaylor edited edge metadata.

Fixed typos and style issues.
Added FIXME comments as requested.

I have opened two Bugzilla reports.

One to add FP environment register modeling:
https://llvm.org/bugs/show_bug.cgi?id=31284

One to address denormal handling:
https://llvm.org/bugs/show_bug.cgi?id=31285

I'd like to talk more about the denormal handling apart from this review. I'm not certain that rolling it into these intrinsics is the correct solution. It might be, and I'd be happy to do that if we agree that's the best approach. It seems to me as though it has certain similarities to the fast math flags as well, so I'd like to talk about that relationship a bit.

arsenm added a comment.Dec 5 2016, 5:42 PM

Fixed typos and style issues.
Added FIXME comments as requested.

I have opened two Bugzilla reports.

One to add FP environment register modeling:
https://llvm.org/bugs/show_bug.cgi?id=31284

One to address denormal handling:
https://llvm.org/bugs/show_bug.cgi?id=31285

I'd like to talk more about the denormal handling apart from this review. I'm not certain that rolling it into these intrinsics is the correct solution. It might be, and I'd be happy to do that if we agree that's the best approach. It seems to me as though it has certain similarities to the fast math flags as well, so I'd like to talk about that relationship a bit.

Correct handling of ftz is required for correctness, so it isn't appropriate to be an optimization hint placed with the fast math flags

Correct handling of ftz is required for correctness, so it isn't appropriate to be an optimization hint placed with the fast math flags

But what's correct? The reason I think it's similar to the fast math flags is that if nothing is specified then optimizations can make no transformations based on FTZ, and if you want to be able to make a transformation that assumes a certain mode then you have to find the attribute set.

Basically, the way I've been viewing the difference between the intrinsics I'm introducing and the fast math flags is that the fast math flags are permissive (behavior is assumed to be restricted unless the attributes are present) whereas the intrinsics are restrictive (behavior is assumed to be permitted unless the constraining intrinisc is used). I'm not sure the denormal handling falls cleanly into either of these categories, but it seems to me more like the permissive approach is appropriate.

FWIW, rounding controls are needed for llvm.fma.*, llvm.fmuladd.*, and llvm.sqrt.*

Also, I don't see any reason to cover frem, but I also don't understand how frem ever got to be an instruction instead of a standard C library intrinsic.

FWIW, rounding controls are needed for llvm.fma.*, llvm.fmuladd.*, and llvm.sqrt.*

I agree these are needed. I've added a comment in the .td file and intend to add variations of those intrinsics in a later revision.

Also, I don't see any reason to cover frem, but I also don't understand how frem ever got to be an instruction instead of a standard C library intrinsic.

I'm not sure it actually happens, but I would think that frem is potentially subject to the same sorts of constant folding that can be done with fdiv.

DavidKreitzer edited edge metadata.Jan 4 2017, 8:52 AM

This all looks pretty good to me, Andy. Regarding

Also, I don't see any reason to cover frem, but I also don't understand how frem ever got to be an instruction instead of a standard C library intrinsic.

I'm not sure it actually happens, but I would think that frem is potentially subject to the same sorts of constant folding that can be done with fdiv.

When frem has a meaningful result, it is always exact, so perhaps we should omit the rounding behavior argument? I think we still need a constrained frem intrinsic, though, to handle the exceptional behavior in cases such as "frem inf, x" and "frem x, 0".

Regarding the denormal-handling issue, I agree that it is reasonable to defer support to a subsequent patch. I suspect you are ultimately going to want a separate argument rather than folding denormal-handling into the rounding mode. But we can discuss that in the Bugzilla report that you opened.

docs/LangRef.rst
12147

This question might be beyond the scope of this initial change set, but here goes.

FP negation is currently handled using fsub -0, X. Is that sufficient in a constrained context? If we allow X to be a signaling NaN, -0-X should raise Invalid while -X should not, at least according to IEEE-754.

lib/IR/Verifier.cpp
4265

Rather than duplicating the legal rounding mode & exception behavior strings here (and in the getRoundingMode & getExceptionBehavior methods), would it be better to have a string-->enum function that is called in both places?

scanon added a subscriber: scanon.Jan 4 2017, 9:32 AM
scanon added inline comments.
docs/LangRef.rst
12147

Probably more importantly, -X has a defined signbit (the negation of whatever the signbit of X was), but -0-X does not [assuming the obvious binding of -X to the IEEE 754 negate operation and -0-X to the subtract operation].

When frem has a meaningful result, it is always exact, so perhaps we should omit the rounding behavior argument? I think we still need a constrained frem intrinsic, though, to handle the exceptional behavior in cases such as "frem inf, x" and "frem x, 0".

The implementation is slightly simpler if I give the frem intrinsic a rounding argument, even though it isn't needed. Otherwise, that intrinsic couldn't be handled by the same subclass as the others. I realize that's a fairly weak argument, but I just feel like making this one intrinsic different from the others will make the code ugly.

lib/IR/Verifier.cpp
4265

I do like the idea of having a single location for these strings. On the other hand, I think I'd need to introduce a new enum value (invalid?) that is only used here. I'll think about it a bit and try to consolidate the strings one way or another.

When frem has a meaningful result, it is always exact, so perhaps we should omit the rounding behavior argument? I think we still need a constrained frem intrinsic, though, to handle the exceptional behavior in cases such as "frem inf, x" and "frem x, 0".

The implementation is slightly simpler if I give the frem intrinsic a rounding argument, even though it isn't needed. Otherwise, that intrinsic couldn't be handled by the same subclass as the others. I realize that's a fairly weak argument, but I just feel like making this one intrinsic different from the others will make the code ugly.

Uniformity of implementation is a reasonable argument in favor of the extraneous rounding mode parameter for the frem intrinsic. But could you please make it clear that that is your intent in the language ref description of @llvm.experimental.constrained.frem?

docs/LangRef.rst
12147

Steve, are you referring to the fact that the sign of -0-X is unspecified when X is NaN or something else? I am trying to understand the implications of your comment - whether they are specific to the new constrained FP intrinsics or whether something needs to be done for normal FP LLVM IR.

Another problem worth mentioning with implementing -X as -0-X in a constrained context is that -0-X will produce -0 when X is -0 and the rounding mode is toward -inf.

andrew.w.kaylor edited edge metadata.

-Combined uses of string literals for FP rounding mode and exception behavior.
-Removed extension to StringSwitch since it is no longer needed.
-Added documentation explaining that the rounding mode argument isn't used from the frem intrinsic.

arsenm added inline comments.Jan 10 2017, 5:47 PM
lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
928–935

You can replace this with a switch strict->regular op mapping, and then you avoid having to switch over these again in MutateStrictFPToFP

andrew.w.kaylor marked an inline comment as done.

-Consolidated examination of StrictFP node opcodes.

Is there anything left blocking this patch?

hfinkel accepted this revision.Jan 25 2017, 5:05 PM

I'm happy with this now. We'll need to address adding implicit instruction dependencies (and generating these intrinsics from Clang in appropriate modes) in follow-up. Thanks for working on this.

This revision is now accepted and ready to land.Jan 25 2017, 5:05 PM
This revision was automatically updated to reflect the committed changes.

@andrew.w.kaylor I went through the mailing list thread regarding this change and saw "Eventually, we’ll want to go back and teach specific optimizations to understand the intrinsics so that where possible optimizations can be performed in a manner consistent with dynamic rounding modes and strict exception handling.".
Do you have any references\plans on how to teach specific optimizations on this?

@andrew.w.kaylor I went through the mailing list thread regarding this change and saw "Eventually, we’ll want to go back and teach specific optimizations to understand the intrinsics so that where possible optimizations can be performed in a manner consistent with dynamic rounding modes and strict exception handling.".
Do you have any references\plans on how to teach specific optimizations on this?

It depends on the optimization. Each optimization needs to be evaluated in some way to determine if it is safe to perform the optimization with the floating point constraints. Simon Moll has an idea that he can update the pattern matching to also match closely related intrinsics. Simon intends to work on this for the vector predicated intrinsics, but the idea should extend to constrained floating point intrinsics. However, we will still need to update the optimizations to evaluate whether its intended transformation meets the constraints. For example, pattern matching may find opportunities for constant folding, but we will need to examine the actual constants to determine if the operation being folded would be inexact (and therefore subject to rounding) or otherwise raise an FP exception.