This is an archive of the discontinued LLVM Phabricator instance.

[RFC v2] Allow target to handle STRICT floating-point nodes
ClosedPublic

Authored by uweigand on Dec 10 2018, 4:33 AM.

Details

Summary

This is an alternate approach to the RFC in D45576 / D52785 / D52786. The main difference is that we're no longer attempting to use MachineMemOperand structures to capture floating-point exception status.

Instead, we make the MI codegen explicitly aware of the floating-point exceptions by introducing two new concepts:

  • A new MCID flag "mayRaiseFPException" that the target should set on any instruction that possibly can raise FP exception according to the architecture definition.
  • A new MI flag FPExcept that CodeGen/SelectionDAG will set on any MI instruction resulting from expansion of any constrained FP intrinsic.

Any MI instruction that is *both* marked as mayRaiseFPException *and* FPExcept then needs to be considered as raising exceptions by MI-level codegen (e.g. scheduling).

Now, setting those two new flags is relatively straightforward. The mayRaiseFPException flag is simply set via TableGen by marking all relevant instruction patterns in the .td files.

The FPExcept flag is set in SDNodeFlags when creating the STRICT_ nodes in the SelectionDAG, and gets inherited in the MachineSDNode nodes created from it during instruction selection. The flag is then transfered to an MIFlag when creating the MI from the MachineSDNode. This is handled just like fast-math flags like no-nans are handled today. In a way, we can think of the FPExcept flag like an inverted fast-math flag "no-except" that just defaults to true instead of false.

This should address the concerns that MachineMemOperands might get dropped accidentally. The new mayRaiseException flag is an invariant setting anyway, and the FPExcept flag is a MIFlag, and to my understanding those cannot be dropped during MI codegen anyway.

Diff Detail

Repository
rL LLVM

Event Timeline

uweigand created this revision.Dec 10 2018, 4:33 AM

I like this implementation a lot. Some targets control the FPEnv through a register, not through memory. Modeling instructions with register side-effects through MachineMemOperand wasn't intuitive.

Also, the pattern changes seem concise. This is really good.

This patch does seem FP exception centric and rounding mode agnostic though. Should FPExcept and friends be named something more general to cover both? To be clear, I'm okay with the current naming scheme, so just playing Devil's advocate.

include/llvm/CodeGen/SelectionDAGNodes.h
378 ↗(On Diff #177482)

Nit-picking: the first sentence was a little hard to parse. One alternative:

We assume by default that instructions are considered to not raise

Or maybe a rewrite would be cleaner:

We assume instructions do not raise floating-point exceptions by default, and only those marked explicitly may do so.

Just thinking aloud...

uweigand updated this revision to Diff 178261.Dec 14 2018, 11:49 AM

Updated comment.

uweigand marked an inline comment as done.Dec 14 2018, 11:53 AM

This patch does seem FP exception centric and rounding mode agnostic though. Should FPExcept and friends be named something more general to cover both? To be clear, I'm okay with the current naming scheme, so just playing Devil's advocate.

Well, this is because I'm really only using this feature to handle exceptions (b.t.w. just like the MemOperands in the alternate attempt).

Rounding mode is handled completely in the back-end: we simply make all floating-point instructions (strict or not doesn't even matter here) use the FPC control register, and mark all instructions that change the rounding mode as changing FPC. The one missing piece is that we need to mark all function calls to functions marked as within FENV_ACCESS ON regions as also clobbering FPC -- that can be done easily in the target ABI code as soon as the front-end marks such function calls (which we need anyway).

cameron.mcinally added a comment.EditedDec 14 2018, 12:33 PM

@uweigand, it looks like you submitted some unintended changes with the last Diff update. Just FYI...

[EDIT: Maybe not. I'm not sure what I was seeing, but it seems to be gone after reloading. Apologies for the noise.]

Well, this is because I'm really only using this feature to handle exceptions (b.t.w. just like the MemOperands in the alternate attempt).

Rounding mode is handled completely in the back-end: we simply make all floating-point instructions (strict or not doesn't even matter here) use the FPC control register, and mark all instructions that change the rounding mode as changing FPC. The one missing piece is that we need to mark all function calls to functions marked as within FENV_ACCESS ON regions as also clobbering FPC -- that can be done easily in the target ABI code as soon as the front-end marks such function calls (which we need anyway).

Hm, I may be thinking about this a little differently. I'll elaborate to make sure we're on the same page...

I was envisioning a -frounding-math-like option that is more than just adhering to the current rounding mode. That option would also suppress optimizations affecting rounding results. This would be very much like how we use STRICT_ nodes to suppress optimizations on instructions that may trap. That's why I suggested that mayRaiseException and friends may be a little too narrow.

Is that how you are envisioning the rounding mode support to work too?

Well, mayRaise Exception is purely a MI level flag. I struggle to see where optimizations on the MI level would ever care about rounding modes in the sense you describe: note that currently, MI optimizations don't even know which operation an MI instruction performs -- if you don't even know whether you're dealing with addition or subtraction, why would you care which rounding mode the operation is performed in? MI transformations instead care about what I'd call "structural" properties of the operation: what are the operands, what is input vs. output, which memory may be accessed, which special registers may involved, which other side effects may the operation have. This is the type of knowledge you need for the types of transformations that are done on the MI level: mostly about moving instructions around, arriving at an optimal schedule, de-duplicating identical operations performed multiple times etc. (Even things like simply changing a register operand to a memory operand for the same operation cannot be done solely by common MI optimizations but require per-target support.)

On this level, I do believe that my proposed patch captures all relevant "structural" properties of floating-point instructions: dependency on control registers (including controlling changing rounding modes), and the possibility of floating-point exceptions and traps (affecting whether instruction may be rescheduled or executed speculatively). If you can think of anything I missed here, I'd certainly appreciate to learn more.

Now, of course, earlier passes (on the IR and also SelectionDAG levels) certainly do perform transformations that affect the actual operations performed, and those would certainly care. But at those levels we already have all the extra information we need; at the IR level we have the constrained intrinsics, and at the DAG level we have the STRICT_ nodes. (Currently, those STRICT_ nodes lose a bit of information available at the IR level: we don't specifically distinguish whether a STRICT_ node arose from an IR that is marked as requiring special handling because of just rounding modes, just exceptions, or both. If we ever want to add DAG optimization for STRICT_ node that requires this information, it would certainly be fine with me to add another bit in the SDNodeFlags for example.)

This looks like a promising direction. I particularly like the idea of having a way to intersect information from the backend instruction definitions with the constraints coming from the IR. However, I also have some concerns.

It seems that you're doing two things in this patch -- (1) adding the FCP register to the SystemZ backend, and (2) adding the strict FP handshake mechanism. Could you separate those two change sets?

I'd the new flags being added to make specific reference to FP. People who don't do a lot of FP-specific work are likely to misunderstand any unqualified term such as mayRaiseException. For the MIFlag I'd prefer something like FPStrict because I expect that we will want to extend this to handling rounding mode issues like constant folding or code motion relative to instructions that change the rounding mode.

The unmodeled side effects approach is a lot stronger than we really need. Ideally, these instructions would only act as a barrier to one another and not to other instructions. My concern is that simply marking them as having unmodeled side effects is going to have a bigger hit on optimizations than we want. Granted, the same thing is happening at the IR level with the intrinsics but I have a rough idea of how we'll be able to move forward in that case without rewriting the mechanism.

What I'd really like to see is a way for the backend to be able to completely model the relevant FP register uses/defs but conditionally strip those uses/defs out for non-strict instructions. I explored this with the X86 backend. I had to create an artificial split between the control and status parts of the MXCSR register. I ran into problems because every FP instruction had to be a use and a def of the status bits. That would be fine for strict mode, but it wasn't really acceptable as a default behavior. I only have a rough idea of how this would work.

Another problem I ran into when I tried to add modeling of the MXCSR register is that the machine verifier didn't like seeing uses of this register without ever having seen a def. That didn't show up in my initial testing and I got as far as committing something that modeled the use of MXCSR by the instructions that explicitly read and write it, but the machine verifier issue appeared in post-commit testing. I think it came up with inline assembly that was using the STMXCSR instruction directly. It seems like this could be an issue with the FCP register you're adding for System Z also.

Eventually I'd really like a way to pass the rounding mode argument through to the backend. I've been emphatic in the past that this argument is meant to describe the assumed rounding mode rather than control it, but it occurs to me that for architectures which have FP instructions where the rounding mode can/must be baked into the instruction the backend could use the assumed rounding mode to set the rounding mode operand. In X86 we've only got a few instructions like this, but it's my understanding that some other architectures require it more broadly.

include/llvm/CodeGen/SelectionDAGNodes.h
469 ↗(On Diff #178261)

There has been some discussion in the past about the possibility of using the fast math flags while still preserving strict FP exceptions. I'm not sure we reached a conclusion.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6520 ↗(On Diff #178261)

This really shouldn't be unconditional. It's possible to have constrained intrinsics that are preserving the rounding mode but ignoring FP exceptions. At this point we still have that information.

Well, mayRaise Exception is purely a MI level flag. I struggle to see where optimizations on the MI level would ever care about rounding modes in the sense you describe: note that currently, MI optimizations don't even know which operation an MI instruction performs -- if you don't even know whether you're dealing with addition or subtraction, why would you care which rounding mode the operation is performed in? MI transformations instead care about what I'd call "structural" properties of the operation: what are the operands, what is input vs. output, which memory may be accessed, which special registers may involved, which other side effects may the operation have. This is the type of knowledge you need for the types of transformations that are done on the MI level: mostly about moving instructions around, arriving at an optimal schedule, de-duplicating identical operations performed multiple times etc. (Even things like simply changing a register operand to a memory operand for the same operation cannot be done solely by common MI optimizations but require per-target support.)

Huh, this is interesting and was not clear to me a priori. After digging around a bit, I agree with you that almost all of the MI code is fine wrt rounding.

This looks like a promising direction. I particularly like the idea of having a way to intersect information from the backend instruction definitions with the constraints coming from the IR. However, I also have some concerns.

Thanks for the review!

It seems that you're doing two things in this patch -- (1) adding the FCP register to the SystemZ backend, and (2) adding the strict FP handshake mechanism. Could you separate those two change sets?

Of course. I'm keeping them as separate patches internally anyway, and was planning to commit them separately. I just wanted to show how the full picture would look like in the end.

I'd the new flags being added to make specific reference to FP. People who don't do a lot of FP-specific work are likely to misunderstand any unqualified term such as mayRaiseException. For the MIFlag I'd prefer something like FPStrict because I expect that we will want to extend this to handling rounding mode issues like constant folding or code motion relative to instructions that change the rounding mode.

I'd be fine with using mayRaiseFPException instead. I specifically did not want to mix other aspects (like rounding modes) into it, but rather separate the concerns here. See my previous comments to Cameron about rounding modes in general. But if we do need to track those at the MI level, I'd rather prefer to add another bit in MIFlags, and keep the one I add here strictly about exceptions.

Code motion relative to instructions that change the rounding mode is already handled in my patch, via the dependency on FPC that is added to all FP instructions.

The unmodeled side effects approach is a lot stronger than we really need. Ideally, these instructions would only act as a barrier to one another and not to other instructions. My concern is that simply marking them as having unmodeled side effects is going to have a bigger hit on optimizations than we want. Granted, the same thing is happening at the IR level with the intrinsics but I have a rough idea of how we'll be able to move forward in that case without rewriting the mechanism.

Agreed. In fact, that's really the main point why I want to have the MI flag specifically about exceptions, so that common code can be written to handle the flag just exactly as is required for FP exceptions and nothing else. I just didn't want to complicate this initial patch, and therefore added only a quick (overly conservative, but correct) check to hasUnmodeledSideEffect. Once this is in, we can refine the handling as a follow-on patch. (For example, handle exactly the necessary dependencies for FP exceptions in ScheduleDAGInstrs::buildSchedGraph, which might e.g. allow two FP instructions to be swapped as long as no speculative execution of FP instructions is introduced.)

What I'd really like to see is a way for the backend to be able to completely model the relevant FP register uses/defs but conditionally strip those uses/defs out for non-strict instructions. I explored this with the X86 backend. I had to create an artificial split between the control and status parts of the MXCSR register. I ran into problems because every FP instruction had to be a use and a def of the status bits. That would be fine for strict mode, but it wasn't really acceptable as a default behavior. I only have a rough idea of how this would work.

That's why I was going away from using registers to model exceptions. As you suggest, I'm also in effect performing a split between the control and status parts of the FPC register: I'm using the register at the MI level to model the control part, and using the mayRaiseException flag to model the status part. The side effect implied by the flag includes both the trap and setting of the status bits. This seems more straightforward than adding register defs, in particular since the latter overly constrain scheduling.

Eventually I'd really like a way to pass the rounding mode argument through to the backend. I've been emphatic in the past that this argument is meant to describe the assumed rounding mode rather than control it, but it occurs to me that for architectures which have FP instructions where the rounding mode can/must be baked into the instruction the backend could use the assumed rounding mode to set the rounding mode operand. In X86 we've only got a few instructions like this, but it's my understanding that some other architectures require it more broadly.

I'm not really sure I see where this can help. On SystemZ there are a few instructions that encode rounding modes, but those are mapped to separate DAG opcodes. For example, frint / fnearbyint / ffloor / fceil / ftrunc / fround all map to the same SystemZ instruction, just with a different encoded rounding mode value. The existing DAG instruction selection mechanism seems fine for that.

uweigand marked an inline comment as done.Dec 18 2018, 2:29 AM
uweigand added inline comments.
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6520 ↗(On Diff #178261)

Yes, of course, I'll add that for the final version. That's the main reason why I added the SDNode flag in the first place, instead of just checking for a STRICT_ node later on.

wuzish added a subscriber: wuzish.EditedFeb 24 2019, 11:19 PM

Well, mayRaise Exception is purely a MI level flag. I struggle to see where optimizations on the MI level would ever care about rounding modes in the sense you describe: note that currently, MI optimizations don't even know which operation an MI instruction performs -- if you don't even know whether you're dealing with addition or subtraction, why would you care which rounding mode the operation is performed in? MI transformations instead care about what I'd call "structural" properties of the operation: what are the operands, what is input vs. output, which memory may be accessed, which special registers may involved, which other side effects may the operation have. This is the type of knowledge you need for the types of transformations that are done on the MI level: mostly about moving instructions around, arriving at an optimal schedule, de-duplicating identical operations performed multiple times etc. (Even things like simply changing a register operand to a memory operand for the same operation cannot be done solely by common MI optimizations but require per-target support.)

On this level, I do believe that my proposed patch captures all relevant "structural" properties of floating-point instructions: dependency on control registers (including controlling changing rounding modes), and the possibility of floating-point exceptions and traps (affecting whether instruction may be rescheduled or executed speculatively). If you can think of anything I missed here, I'd certainly appreciate to learn more.

Now, of course, earlier passes (on the IR and also SelectionDAG levels) certainly do perform transformations that affect the actual operations performed, and those would certainly care. But at those levels we already have all the extra information we need; at the IR level we have the constrained intrinsics, and at the DAG level we have the STRICT_ nodes. (Currently, those STRICT_ nodes lose a bit of information available at the IR level: we don't specifically distinguish whether a STRICT_ node arose from an IR that is marked as requiring special handling because of just rounding modes, just exceptions, or both. If we ever want to add DAG optimization for STRICT_ node that requires this information, it would certainly be fine with me to add another bit in the SDNodeFlags for example.)

I think we can have a flag to indicate rounding mode for float operator such as FADD instead of STRICT_FADD, because STRICT_FADD does have side effect(chain), but some target (eg. RISCV) has static rounding mode encoded in the instruction which does not side effect. Then we can optimize specific STRICT_FADD node which ignoring exception into normal FADD with corresponding static round mode.

kpn added a comment.Feb 25 2019, 6:44 AM

I think we can have a flag to indicate rounding mode for float operator such as FADD instead of STRICT_FADD, because STRICT_FADD does have side effect(chain), but some target (eg. RISCV) has static rounding mode encoded in the instruction which does not side effect. Then we can optimize specific STRICT_FADD node which ignoring exception into normal FADD with corresponding static round mode.

WRT the chain: "That's not a bug, that's a feature!"

The chain prevents reordering of floating point instructions by the SelectionDAG. That makes it required.

Some System/Z floating point instructions can take an optional per-instruction rounding mode encoded in the instruction. This does not eliminate the need for strict ordering of floating point operations.

jsji added a subscriber: jsji.Feb 25 2019, 10:40 AM
In D55506#1408913, @kpn wrote:

I think we can have a flag to indicate rounding mode for float operator such as FADD instead of STRICT_FADD, because STRICT_FADD does have side effect(chain), but some target (eg. RISCV) has static rounding mode encoded in the instruction which does not side effect. Then we can optimize specific STRICT_FADD node which ignoring exception into normal FADD with corresponding static round mode.

WRT the chain: "That's not a bug, that's a feature!"

The chain prevents reordering of floating point instructions by the SelectionDAG. That makes it required.

Some System/Z floating point instructions can take an optional per-instruction rounding mode encoded in the instruction. This does not eliminate the need for strict ordering of floating point operations.

Yes, the chain prevents redordering of floating point instructions by SelectionDAG. But it's fine to reorder instructions with static rounding mode(and ignore exception) and no need for chain. Because they do not have side effect, it means the result of such instructions only depend on input (eg, float add including source operands and mode encoding), does not depend on outside state variable that is current rounding mode. It also means the result is unchanged no matter how many times happens so long as the input is same.

Yes, the chain prevents redordering of floating point instructions by SelectionDAG. But it's fine to reorder instructions with static rounding mode(and ignore exception) and no need for chain. Because they do not have side effect, it means the result of such instructions only depend on input (eg, float add including source operands and mode encoding), does not depend on outside state variable that is current rounding mode. It also means the result is unchanged no matter how many times happens so long as the input is same.

I'm not sure exactly what you have in mind here. For instructions which take explicit rounding mode arguments reordering is not always an issue, but when we're building the selection DAG do we know that we will end up with instructions that take an explicit rounding mode argument? For some architectures we do, but not for all. The concern with reordering, in the case where FP status flags are being ignored, is that we need to avoid possibly reordering instructions with respect to instructions that change the rounding mode. If the rounding mode is explicit in the instruction that's not an issue but otherwise it is, and at least in the case of X86 I don't think we can tell ahead of time (at least without doing things we shouldn't be doing) which instructions we'll end up with.

uweigand updated this revision to Diff 199243.May 13 2019, 4:18 AM
uweigand edited the summary of this revision. (Show Details)
uweigand added reviewers: kbarton, scanon, mcberg2017.

Updated to address issues raised during the review:

  • Separated out the introduction of the SystemZ FPC register -- this has now been committed as rev. 360570. The remaining changes in this patch are now solely related to handling FP exceptions.
  • Renamed the new flag to mayRaiseFPException to make clear that this is about floating-point.
  • No longer treat FP exceptions as equivalent to unmodeled side effects, but treat them separately. Specifically, as I mentioned above, I'm now handling instructions that may raise FP exceptions in the scheduler such that they'll only conflict with each other (and global barriers like calls and unmodeled side effects), but not with loads or stores.
  • Respect the constraint intrinsics metadata to the extent that instructions with fpexcept.ignore no longer are marked as "may raise FP exceptions".

This patch, together with the FPC register patch already committed, is to my understanding sufficient to correctly handle the FP exception related aspects of the constrained intrinsics (as to rounding mode, we still need support for function calls that may change the rounding mode). To make progress, I'd like to move forward with committing this piece -- assuming all issues related to this part are now addressed. Comments / reviews welcome!

Aside from two minor comments, I think this looks fine.
However, I don't think I'm qualified to give the final approval for this to land as I'm just starting to learn the background here.

include/llvm/CodeGen/MachineInstr.h
835 ↗(On Diff #199243)

it -> if

include/llvm/CodeGen/SelectionDAGNodes.h
376 ↗(On Diff #199243)

Should this be renamed to NoFPExcept also, to remain consistent with the mayRaiseFPException flag?

Why is the FPBarrierChain getting involved with the BarrierChain for memory objects? Even without this patch I think we might be entangling the strict FP nodes with other chained nodes more than we should.

Are there any cases where mayRaiseFPException should not imply hasUnmodeledSideEffects? If we checked mayRaiseFPException() inside the hasUnmodeledSideEffects() implementation a lot of these changes wouldn't be needed, and it might help with people who didn't think about FP exceptions in future changes.

There is no way to distinguish between fpexcept.maytrap and fpexcept.strict after pattern matching. We could probably live with that, but it's less than ideal.

I'd like to see a solution that could handle rounding mode also. For rounding mode it will be even more important to have more than a single on/off flag. Do you have any ideas for that?

Why is the FPBarrierChain getting involved with the BarrierChain for memory objects? Even without this patch I think we might be entangling the strict FP nodes with other chained nodes more than we should.

The BarrierChain tracks calls, volatile/atomic memory accesses, and UnmodeledSideEffects. I do believe that FP exceptions must indeed be kept stable relative to those.

My patch does *not* entangle FP exceptions with regular memory accesses, which do *not* touch BarrierChain.

Are there any cases where mayRaiseFPException should not imply hasUnmodeledSideEffects? If we checked mayRaiseFPException() inside the hasUnmodeledSideEffects() implementation a lot of these changes wouldn't be needed, and it might help with people who didn't think about FP exceptions in future changes.

Well, my last version did have FP exceptions implying UnmodeledSideEffects, and you didn't like that :-) Specifically, that would mean that FP exceptions could not be moved across any normal memory access (since UnmodeledSideEffect instructions cannot) ...

There is no way to distinguish between fpexcept.maytrap and fpexcept.strict after pattern matching. We could probably live with that, but it's less than ideal.

Well, to be honest I don't really see what the difference between those two at the MI level would be. Can you explain e.g. how scheduling restrictions ought to differ? If we have a need for that, we can just add a second MI flag.

I'd like to see a solution that could handle rounding mode also. For rounding mode it will be even more important to have more than a single on/off flag. Do you have any ideas for that?

As I mentioned in the description, I believe rounding modes can be handled completely in the back-end, by simply modeling the FPC register (which I already committed on SystemZ). Every FP instruction (strict or not) is modeled as a user of FPC, all assembler instruction that modify it (which are only a few special-purpose instruction) are modeled as a definition, and function calls within a FENV_ACCESS section should be marked by the front-end as having a variant ABI, which will cause the back-end to have them marked as clobbering FPC (that last part is still missing, but is completely independent of everything that is done by this patch). What aspect of rounding modes do you think is not covered by this approach?

Why is the FPBarrierChain getting involved with the BarrierChain for memory objects? Even without this patch I think we might be entangling the strict FP nodes with other chained nodes more than we should.

The BarrierChain tracks calls, volatile/atomic memory accesses, and UnmodeledSideEffects. I do believe that FP exceptions must indeed be kept stable relative to those.

My patch does *not* entangle FP exceptions with regular memory accesses, which do *not* touch BarrierChain.

I see. Thanks for the explanation.

Are there any cases where mayRaiseFPException should not imply hasUnmodeledSideEffects? If we checked mayRaiseFPException() inside the hasUnmodeledSideEffects() implementation a lot of these changes wouldn't be needed, and it might help with people who didn't think about FP exceptions in future changes.

Well, my last version did have FP exceptions implying UnmodeledSideEffects, and you didn't like that :-) Specifically, that would mean that FP exceptions could not be moved across any normal memory access (since UnmodeledSideEffect instructions cannot) ...

Sorry about that. I forgot what I said before. I've just re-read my earlier comments, and I think that line of reasoning was sound. As I said before, hasUnmodeledSideEffects is -- at least in theory -- a stronger barrier than mayRaiseFPExceptions. Looking at your current patch, I had the impression that hasUnmodeledSideEffects and mayRaiseFPExceptions were always coinciding, but even if that is currently the case it isn't necessarily required to be so. I apologize for the noise.

There is no way to distinguish between fpexcept.maytrap and fpexcept.strict after pattern matching. We could probably live with that, but it's less than ideal.

Well, to be honest I don't really see what the difference between those two at the MI level would be. Can you explain e.g. how scheduling restrictions ought to differ? If we have a need for that, we can just add a second MI flag.

With regard to scheduling there is no difference between strict and maytrap. The difference comes when we want to eliminate instructions. For example, if we have something like this:

BB1:
  FMP %0, %1
  JCC %BB2, 4
  JMP %BB3
BB2:
  JMP %BB3
BB3:
  ...

Under "fpexcept.strict" we cannot eliminate the compare because it may raise an exception (though we can still get rid of the JCC), but under "fpexcept.maytrap" we can eliminate the compare because our only promise is that we won't raise spurious exceptions.

I'd like to see a solution that could handle rounding mode also. For rounding mode it will be even more important to have more than a single on/off flag. Do you have any ideas for that?

As I mentioned in the description, I believe rounding modes can be handled completely in the back-end, by simply modeling the FPC register (which I already committed on SystemZ). Every FP instruction (strict or not) is modeled as a user of FPC, all assembler instruction that modify it (which are only a few special-purpose instruction) are modeled as a definition, and function calls within a FENV_ACCESS section should be marked by the front-end as having a variant ABI, which will cause the back-end to have them marked as clobbering FPC (that last part is still missing, but is completely independent of everything that is done by this patch). What aspect of rounding modes do you think is not covered by this approach?

First, I'd like to avoid modeling the FP control and status registers when we aren't in a constrained mode. I haven't done any tests to measure the effect of modeling these registers, but it does place some constraints on the backend that don't need to be there in the unconstrained mode and at least has the potential to degrade performance.

Second, there are some additional optimizations we could make in the cases where we know the rounding mode. Constant folding is an example. I'm not sure there are any backend optimizations that do constant folding after instruction selection, but there was a review recently where this exact issue came up for constant folding during ISel and the conclusion was that we could handle it if we knew the rounding mode but if not we'd have to just block the folding.

Third, there are some architectures where the rounding mode can be included as an operand. This is a potentially confusing point so I'm going to be verbose. I've said before that the rounding mode in constrained instructions is only a descriptive hint to the optimizer, declaring what the rounding mode is at that point rather than a prescriptive operator that sets the rounding mode. However, in the case where we do know the rounding mode (i.e. the rounding mode operand is something other than "round.dynamic") we can use that information to select an instruction form that has a rounding mode operand. How important or unimportant this capability is will depend on the architecture, but in some cases I think it could be significant.

With regard to the patch as a whole, let me say that my biggest concern at this point is to make sure we aren't going toward a dead end. If we think that all of the issues I've raised can likely be addressed in some variation of the patch you've proposed here then I would be in favor of committing this patch now without solving all of those problems. I just don't want to put something in place that will need to be ripped out later in order to make progress. Of course I understand that there's a balance to be found here. I know some of what we have already will need to be replaced, and that's just the way things go. That is to say, I just want to make sure we're headed in more or less the right direction.

Thanks for your patience and persistence.

Hopefully modelling all SSE/AVX/AVX512 FP instruction as having an implicit use of the control portion of MXCSR for rounding mode and exception controls won't create a significant constraint on the backend. I think most of the trouble would start if we had it as an implicit def as well. Properly annotating this in tablegen will likely be tedious and error prone unfortunately. This is just due to the complexity of the 3 different encoding forms as well as the shared multiclasses in the tablegen files. I can't promise that some integer and FP stuff aren't using the same multiclasses.

Under "fpexcept.strict" we cannot eliminate the compare because it may raise an exception (though we can still get rid of the JCC), but under "fpexcept.maytrap" we can eliminate the compare because our only promise is that we won't raise spurious exceptions.

Currently, it looks like the MI common optimizers cannot easily make that distinction; for example, the MI-level dead code elimination pass checks "isSafeToMove" to decide whether an instruction may be deleted. Now, instructions that may raise FP exceptions are never safe to move even with fpexcept.maytrap, so that check must always fail. If it ever becomes important to make that distinction, I'd add more fine-grained primitives like isSafeToMove vs. isSafeToDelete, and then add a second MI flag like NoDelete that would be set for fpexcept.strict insns.

As an aside, I'm not sure what this fpexcept.strict semantics is even necessary for; I don't believe it is required for the C standard FENV_ACCESS ON mode ...

First, I'd like to avoid modeling the FP control and status registers when we aren't in a constrained mode. I haven't done any tests to measure the effect of modeling these registers, but it does place some constraints on the backend that don't need to be there in the unconstrained mode and at least has the potential to degrade performance.

As Craig pointed out, just adding a use cannot degrade performance. I've already changed the SystemZ back-end to do this unconditionally, and it generated 100% identical assembler code as before (except in functions that use a built-in to modify the rounding mode, but there we actually want those changes).

Second, there are some additional optimizations we could make in the cases where we know the rounding mode. Constant folding is an example. I'm not sure there are any backend optimizations that do constant folding after instruction selection, but there was a review recently where this exact issue came up for constant folding during ISel and the conclusion was that we could handle it if we knew the rounding mode but if not we'd have to just block the folding.

I do not believe it is ever possible to have constant folding at the MI level. Common MI passes don't even understand whether an instruction is an add or a subtract (they only understand the general "form" of an instruction, e.g. what are input vs. output operands etc.); given that, it is hard to see where knowledge of a rounding mode could ever be useful at this level. Those optimizations have to be done earlier.

Third, there are some architectures where the rounding mode can be included as an operand. This is a potentially confusing point so I'm going to be verbose. I've said before that the rounding mode in constrained instructions is only a descriptive hint to the optimizer, declaring what the rounding mode is at that point rather than a prescriptive operator that sets the rounding mode. However, in the case where we do know the rounding mode (i.e. the rounding mode operand is something other than "round.dynamic") we can use that information to select an instruction form that has a rounding mode operand. How important or unimportant this capability is will depend on the architecture, but in some cases I think it could be significant.

If platforms want to do that, my suggestion would be to make the rounding mode available during instruction selection (at the ISel DAG level) so that the platform can then select *different* MI instruction opcodes. I think this also wouldn't conflict with anything in this current patch and can be added later. There would be no need the to have the generic MI layer somehow encode rounding modes then.

With regard to the patch as a whole, let me say that my biggest concern at this point is to make sure we aren't going toward a dead end. If we think that all of the issues I've raised can likely be addressed in some variation of the patch you've proposed here then I would be in favor of committing this patch now without solving all of those problems. I just don't want to put something in place that will need to be ripped out later in order to make progress. Of course I understand that there's a balance to be found here. I know some of what we have already will need to be replaced, and that's just the way things go. That is to say, I just want to make sure we're headed in more or less the right direction.

Thanks for your patience and persistence.

Thanks for the continued review!

andrew.w.kaylor accepted this revision.Jun 4 2019, 2:12 PM

OK. I'm convinced. This should let us get a correct solution in place, and as you say we can add on something to handle rounding modes later if needed.

You should probably address Kit's minor comments, but otherwise I'd be very happy to see this committed.

This revision is now accepted and ready to land.Jun 4 2019, 2:12 PM
uweigand updated this revision to Diff 203261.Jun 5 2019, 3:11 PM

Addressed review comments and updated to mainline changes.

uweigand marked 2 inline comments as done.Jun 5 2019, 3:14 PM

OK. I'm convinced. This should let us get a correct solution in place, and as you say we can add on something to handle rounding modes later if needed.

You should probably address Kit's minor comments, but otherwise I'd be very happy to see this committed.

Thanks for the review, Andrew!

I've updated the patch to address Kit's comments, and handle recent mainline changes, in particular the new STRICT_FP_ROUND and STRICT_FP_EXTEND nodes. I'll be committing this version shortly.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2019, 3:33 PM

Is there any further step to enable the rounding-mode infra to handle rounding mode metadata in intrinsics? For example, covert rounding mode metadata to a new constant operand of strict_* series op. If some target/platform does not have related machine instruction for different static rounding-mode, then ignore this constant in td selection.

Is there any further step to enable the rounding-mode infra to handle rounding mode metadata in intrinsics? For example, covert rounding mode metadata to a new constant operand of strict_* series op. If some target/platform does not have related machine instruction for different static rounding-mode, then ignore this constant in td selection.

If this is actually needed by some target, I agree that there are further steps required to propagate that information. Instead of using a constant operand, I think it would probably be better to create a new subclass of SDNode (e.g. StrictFPSDNode or the like), and store this information in its associated bits, like e.g. access bits are stored for MemSDNodes.

I have another question about any_*.

For example, I see you use any_fadd to match both fadd and strict_fadd. I guess ZSystem target is similar to POWER that there is no fp addition machine instruction with nearest rounding mode and no exception statically.
But the semantic of fadd node requires. So theoretically, fadd node should be mapped into 2 instructions for correct compiling. One sets rounding mode to nearest and clear exception handler bit(no exception happens), the other is float addition.
Do we not handle fadd like above because there is no function error for such fine-grained instruction semantic mostly? And avoid performance penalty? @uweigand

I have another question about any_*.

For example, I see you use any_fadd to match both fadd and strict_fadd. I guess ZSystem target is similar to POWER that there is no fp addition machine instruction with nearest rounding mode and no exception statically.
But the semantic of fadd node requires. So theoretically, fadd node should be mapped into 2 instructions for correct compiling. One sets rounding mode to nearest and clear exception handler bit(no exception happens), the other is float addition.
Do we not handle fadd like above because there is no function error for such fine-grained instruction semantic mostly? And avoid performance penalty? @uweigand

I believe you may be misunderstanding the semantics of the non-strict operations. These do *not* instruct codegen to actively generate code to switch rounding modes and/or clear exception bits. Rather, the use of a non-strict operation like fadd is an *assertion* that tells codegen that it may *assume* that the rounding mode is set to default, trapping exceptions are switched off, and exceptions flags are don't care (i.e. subsequent code will not check them). If you ever use fadd in a context where this assumption is not true, the behaviour is undefined. This means it is perfectly correct to implement fadd using the "normal" floating-point instruction.

Note that similarly, if a strict operation with a given rounding mode flag is used, this likely does *not* instruct coegen to actively generate code to implement this particular rounding mode. Rather, this is again simply an *assertion* that tells codegen that it may *assume* the current rounding mode is set to this specific value. Again, this means that it is perfectly correct to implement such an operation using the "normal" floating-point instruction that will use the current rounding mode.

I have another question about any_*.

For example, I see you use any_fadd to match both fadd and strict_fadd. I guess ZSystem target is similar to POWER that there is no fp addition machine instruction with nearest rounding mode and no exception statically.
But the semantic of fadd node requires. So theoretically, fadd node should be mapped into 2 instructions for correct compiling. One sets rounding mode to nearest and clear exception handler bit(no exception happens), the other is float addition.
Do we not handle fadd like above because there is no function error for such fine-grained instruction semantic mostly? And avoid performance penalty? @uweigand

I believe you may be misunderstanding the semantics of the non-strict operations. These do *not* instruct codegen to actively generate code to switch rounding modes and/or clear exception bits. Rather, the use of a non-strict operation like fadd is an *assertion* that tells codegen that it may *assume* that the rounding mode is set to default, trapping exceptions are switched off, and exceptions flags are don't care (i.e. subsequent code will not check them). If you ever use fadd in a context where this assumption is not true, the behaviour is undefined. This means it is perfectly correct to implement fadd using the "normal" floating-point instruction.

Note that similarly, if a strict operation with a given rounding mode flag is used, this likely does *not* instruct coegen to actively generate code to implement this particular rounding mode. Rather, this is again simply an *assertion* that tells codegen that it may *assume* the current rounding mode is set to this specific value. Again, this means that it is perfectly correct to implement such an operation using the "normal" floating-point instruction that will use the current rounding mode.

Thank you. I think I got what you said. Dynamic rounding mode can be applied to static rounding mode so long as the environment rounding mode is correct set by other user/system code before. So to avoid the UB, initial work of environment(clear exception and set rounding mode to nearest) can be done at the beginning of program, maybe in runtime library.

Thank you. I think I got what you said. Dynamic rounding mode can be applied to static rounding mode so long as the environment rounding mode is correct set by other user/system code before. So to avoid the UB, initial work of environment(clear exception and set rounding mode to nearest) can be done at the beginning of program, maybe in runtime library.

Yes, that's what the startup code already does on Linux.