Page MenuHomePhabricator

[RFC v2] Allow target to handle STRICT floating-point nodes
Needs ReviewPublic

Authored by uweigand on Dec 10 2018, 4:33 AM.

Details

Summary

This is an alternate approach to the RFC in D45576 / D52785 / D52786. The main difference is that we're no longer attempting to use MachineMemOperand structures to capture floating-point exception status.

Instead, we make the MI codegen explicitly aware of the floating-point exceptions by introducing two new concepts:

  • A new MCID flag "mayRaiseException" that the target should set on any instruction that possibly can raise FP exception according to the architecture definition.
  • A new MI flag FPExcept that CodeGen/SelectionDAG will set on any MI instruction resulting from expansion of any constrained FP intrinsic.

Any MI instruction that is *both* marked as mayRaiseException *and* FPExcept then needs to be considered as raising exceptions by MI-level codegen (e.g. scheduling). Right now, this is done by simply treating and such instruction as if it had hasUnmodeledSideEffects set. In a future update, we can improve on this by handling such instructions in a more tailored way in the scheduler.

Now, setting those two new flags is relatively straightforward. The mayRaiseException flag is simply set via TableGen by marking all relevant instruction patterns in the .td files.

The FPExcept flag is set in SDNodeFlags when creating the STRICT_ nodes in the SelectionDAG, and gets inherited in the MachineSDNode nodes created from it during instruction selection. The flag is then transfered to an MIFlag when creating the MI from the MachineSDNode. This is handled just like fast-math flags like no-nans are handled today. In a way, we can think of the FPExcept flag like an inverted fast-math flag "no-except" that just defaults to true instead of false.

This should address the concerns that MachineMemOperands might get dropped accidentally. The new mayRaiseException flag is an invariant setting anyway, and the FPExcept flag is a MIFlag, and to my understanding those cannot be dropped during MI codegen anyway.

Diff Detail

Event Timeline

uweigand created this revision.Dec 10 2018, 4:33 AM

I like this implementation a lot. Some targets control the FPEnv through a register, not through memory. Modeling instructions with register side-effects through MachineMemOperand wasn't intuitive.

Also, the pattern changes seem concise. This is really good.

This patch does seem FP exception centric and rounding mode agnostic though. Should FPExcept and friends be named something more general to cover both? To be clear, I'm okay with the current naming scheme, so just playing Devil's advocate.

include/llvm/CodeGen/SelectionDAGNodes.h
378

Nit-picking: the first sentence was a little hard to parse. One alternative:

We assume by default that instructions are considered to not raise

Or maybe a rewrite would be cleaner:

We assume instructions do not raise floating-point exceptions by default, and only those marked explicitly may do so.

Just thinking aloud...

uweigand updated this revision to Diff 178261.Dec 14 2018, 11:49 AM

Updated comment.

uweigand marked an inline comment as done.Dec 14 2018, 11:53 AM

This patch does seem FP exception centric and rounding mode agnostic though. Should FPExcept and friends be named something more general to cover both? To be clear, I'm okay with the current naming scheme, so just playing Devil's advocate.

Well, this is because I'm really only using this feature to handle exceptions (b.t.w. just like the MemOperands in the alternate attempt).

Rounding mode is handled completely in the back-end: we simply make all floating-point instructions (strict or not doesn't even matter here) use the FPC control register, and mark all instructions that change the rounding mode as changing FPC. The one missing piece is that we need to mark all function calls to functions marked as within FENV_ACCESS ON regions as also clobbering FPC -- that can be done easily in the target ABI code as soon as the front-end marks such function calls (which we need anyway).

cameron.mcinally added a comment.EditedDec 14 2018, 12:33 PM

@uweigand, it looks like you submitted some unintended changes with the last Diff update. Just FYI...

[EDIT: Maybe not. I'm not sure what I was seeing, but it seems to be gone after reloading. Apologies for the noise.]

Well, this is because I'm really only using this feature to handle exceptions (b.t.w. just like the MemOperands in the alternate attempt).

Rounding mode is handled completely in the back-end: we simply make all floating-point instructions (strict or not doesn't even matter here) use the FPC control register, and mark all instructions that change the rounding mode as changing FPC. The one missing piece is that we need to mark all function calls to functions marked as within FENV_ACCESS ON regions as also clobbering FPC -- that can be done easily in the target ABI code as soon as the front-end marks such function calls (which we need anyway).

Hm, I may be thinking about this a little differently. I'll elaborate to make sure we're on the same page...

I was envisioning a -frounding-math-like option that is more than just adhering to the current rounding mode. That option would also suppress optimizations affecting rounding results. This would be very much like how we use STRICT_ nodes to suppress optimizations on instructions that may trap. That's why I suggested that mayRaiseException and friends may be a little too narrow.

Is that how you are envisioning the rounding mode support to work too?

Well, mayRaise Exception is purely a MI level flag. I struggle to see where optimizations on the MI level would ever care about rounding modes in the sense you describe: note that currently, MI optimizations don't even know which operation an MI instruction performs -- if you don't even know whether you're dealing with addition or subtraction, why would you care which rounding mode the operation is performed in? MI transformations instead care about what I'd call "structural" properties of the operation: what are the operands, what is input vs. output, which memory may be accessed, which special registers may involved, which other side effects may the operation have. This is the type of knowledge you need for the types of transformations that are done on the MI level: mostly about moving instructions around, arriving at an optimal schedule, de-duplicating identical operations performed multiple times etc. (Even things like simply changing a register operand to a memory operand for the same operation cannot be done solely by common MI optimizations but require per-target support.)

On this level, I do believe that my proposed patch captures all relevant "structural" properties of floating-point instructions: dependency on control registers (including controlling changing rounding modes), and the possibility of floating-point exceptions and traps (affecting whether instruction may be rescheduled or executed speculatively). If you can think of anything I missed here, I'd certainly appreciate to learn more.

Now, of course, earlier passes (on the IR and also SelectionDAG levels) certainly do perform transformations that affect the actual operations performed, and those would certainly care. But at those levels we already have all the extra information we need; at the IR level we have the constrained intrinsics, and at the DAG level we have the STRICT_ nodes. (Currently, those STRICT_ nodes lose a bit of information available at the IR level: we don't specifically distinguish whether a STRICT_ node arose from an IR that is marked as requiring special handling because of just rounding modes, just exceptions, or both. If we ever want to add DAG optimization for STRICT_ node that requires this information, it would certainly be fine with me to add another bit in the SDNodeFlags for example.)

This looks like a promising direction. I particularly like the idea of having a way to intersect information from the backend instruction definitions with the constraints coming from the IR. However, I also have some concerns.

It seems that you're doing two things in this patch -- (1) adding the FCP register to the SystemZ backend, and (2) adding the strict FP handshake mechanism. Could you separate those two change sets?

I'd the new flags being added to make specific reference to FP. People who don't do a lot of FP-specific work are likely to misunderstand any unqualified term such as mayRaiseException. For the MIFlag I'd prefer something like FPStrict because I expect that we will want to extend this to handling rounding mode issues like constant folding or code motion relative to instructions that change the rounding mode.

The unmodeled side effects approach is a lot stronger than we really need. Ideally, these instructions would only act as a barrier to one another and not to other instructions. My concern is that simply marking them as having unmodeled side effects is going to have a bigger hit on optimizations than we want. Granted, the same thing is happening at the IR level with the intrinsics but I have a rough idea of how we'll be able to move forward in that case without rewriting the mechanism.

What I'd really like to see is a way for the backend to be able to completely model the relevant FP register uses/defs but conditionally strip those uses/defs out for non-strict instructions. I explored this with the X86 backend. I had to create an artificial split between the control and status parts of the MXCSR register. I ran into problems because every FP instruction had to be a use and a def of the status bits. That would be fine for strict mode, but it wasn't really acceptable as a default behavior. I only have a rough idea of how this would work.

Another problem I ran into when I tried to add modeling of the MXCSR register is that the machine verifier didn't like seeing uses of this register without ever having seen a def. That didn't show up in my initial testing and I got as far as committing something that modeled the use of MXCSR by the instructions that explicitly read and write it, but the machine verifier issue appeared in post-commit testing. I think it came up with inline assembly that was using the STMXCSR instruction directly. It seems like this could be an issue with the FCP register you're adding for System Z also.

Eventually I'd really like a way to pass the rounding mode argument through to the backend. I've been emphatic in the past that this argument is meant to describe the assumed rounding mode rather than control it, but it occurs to me that for architectures which have FP instructions where the rounding mode can/must be baked into the instruction the backend could use the assumed rounding mode to set the rounding mode operand. In X86 we've only got a few instructions like this, but it's my understanding that some other architectures require it more broadly.

include/llvm/CodeGen/SelectionDAGNodes.h
469

There has been some discussion in the past about the possibility of using the fast math flags while still preserving strict FP exceptions. I'm not sure we reached a conclusion.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6520

This really shouldn't be unconditional. It's possible to have constrained intrinsics that are preserving the rounding mode but ignoring FP exceptions. At this point we still have that information.

Well, mayRaise Exception is purely a MI level flag. I struggle to see where optimizations on the MI level would ever care about rounding modes in the sense you describe: note that currently, MI optimizations don't even know which operation an MI instruction performs -- if you don't even know whether you're dealing with addition or subtraction, why would you care which rounding mode the operation is performed in? MI transformations instead care about what I'd call "structural" properties of the operation: what are the operands, what is input vs. output, which memory may be accessed, which special registers may involved, which other side effects may the operation have. This is the type of knowledge you need for the types of transformations that are done on the MI level: mostly about moving instructions around, arriving at an optimal schedule, de-duplicating identical operations performed multiple times etc. (Even things like simply changing a register operand to a memory operand for the same operation cannot be done solely by common MI optimizations but require per-target support.)

Huh, this is interesting and was not clear to me a priori. After digging around a bit, I agree with you that almost all of the MI code is fine wrt rounding.

This looks like a promising direction. I particularly like the idea of having a way to intersect information from the backend instruction definitions with the constraints coming from the IR. However, I also have some concerns.

Thanks for the review!

It seems that you're doing two things in this patch -- (1) adding the FCP register to the SystemZ backend, and (2) adding the strict FP handshake mechanism. Could you separate those two change sets?

Of course. I'm keeping them as separate patches internally anyway, and was planning to commit them separately. I just wanted to show how the full picture would look like in the end.

I'd the new flags being added to make specific reference to FP. People who don't do a lot of FP-specific work are likely to misunderstand any unqualified term such as mayRaiseException. For the MIFlag I'd prefer something like FPStrict because I expect that we will want to extend this to handling rounding mode issues like constant folding or code motion relative to instructions that change the rounding mode.

I'd be fine with using mayRaiseFPException instead. I specifically did not want to mix other aspects (like rounding modes) into it, but rather separate the concerns here. See my previous comments to Cameron about rounding modes in general. But if we do need to track those at the MI level, I'd rather prefer to add another bit in MIFlags, and keep the one I add here strictly about exceptions.

Code motion relative to instructions that change the rounding mode is already handled in my patch, via the dependency on FPC that is added to all FP instructions.

The unmodeled side effects approach is a lot stronger than we really need. Ideally, these instructions would only act as a barrier to one another and not to other instructions. My concern is that simply marking them as having unmodeled side effects is going to have a bigger hit on optimizations than we want. Granted, the same thing is happening at the IR level with the intrinsics but I have a rough idea of how we'll be able to move forward in that case without rewriting the mechanism.

Agreed. In fact, that's really the main point why I want to have the MI flag specifically about exceptions, so that common code can be written to handle the flag just exactly as is required for FP exceptions and nothing else. I just didn't want to complicate this initial patch, and therefore added only a quick (overly conservative, but correct) check to hasUnmodeledSideEffect. Once this is in, we can refine the handling as a follow-on patch. (For example, handle exactly the necessary dependencies for FP exceptions in ScheduleDAGInstrs::buildSchedGraph, which might e.g. allow two FP instructions to be swapped as long as no speculative execution of FP instructions is introduced.)

What I'd really like to see is a way for the backend to be able to completely model the relevant FP register uses/defs but conditionally strip those uses/defs out for non-strict instructions. I explored this with the X86 backend. I had to create an artificial split between the control and status parts of the MXCSR register. I ran into problems because every FP instruction had to be a use and a def of the status bits. That would be fine for strict mode, but it wasn't really acceptable as a default behavior. I only have a rough idea of how this would work.

That's why I was going away from using registers to model exceptions. As you suggest, I'm also in effect performing a split between the control and status parts of the FPC register: I'm using the register at the MI level to model the control part, and using the mayRaiseException flag to model the status part. The side effect implied by the flag includes both the trap and setting of the status bits. This seems more straightforward than adding register defs, in particular since the latter overly constrain scheduling.

Eventually I'd really like a way to pass the rounding mode argument through to the backend. I've been emphatic in the past that this argument is meant to describe the assumed rounding mode rather than control it, but it occurs to me that for architectures which have FP instructions where the rounding mode can/must be baked into the instruction the backend could use the assumed rounding mode to set the rounding mode operand. In X86 we've only got a few instructions like this, but it's my understanding that some other architectures require it more broadly.

I'm not really sure I see where this can help. On SystemZ there are a few instructions that encode rounding modes, but those are mapped to separate DAG opcodes. For example, frint / fnearbyint / ffloor / fceil / ftrunc / fround all map to the same SystemZ instruction, just with a different encoded rounding mode value. The existing DAG instruction selection mechanism seems fine for that.

uweigand marked an inline comment as done.Dec 18 2018, 2:29 AM
uweigand added inline comments.
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6520

Yes, of course, I'll add that for the final version. That's the main reason why I added the SDNode flag in the first place, instead of just checking for a STRICT_ node later on.

wuzish added a subscriber: wuzish.EditedFeb 24 2019, 11:19 PM

Well, mayRaise Exception is purely a MI level flag. I struggle to see where optimizations on the MI level would ever care about rounding modes in the sense you describe: note that currently, MI optimizations don't even know which operation an MI instruction performs -- if you don't even know whether you're dealing with addition or subtraction, why would you care which rounding mode the operation is performed in? MI transformations instead care about what I'd call "structural" properties of the operation: what are the operands, what is input vs. output, which memory may be accessed, which special registers may involved, which other side effects may the operation have. This is the type of knowledge you need for the types of transformations that are done on the MI level: mostly about moving instructions around, arriving at an optimal schedule, de-duplicating identical operations performed multiple times etc. (Even things like simply changing a register operand to a memory operand for the same operation cannot be done solely by common MI optimizations but require per-target support.)

On this level, I do believe that my proposed patch captures all relevant "structural" properties of floating-point instructions: dependency on control registers (including controlling changing rounding modes), and the possibility of floating-point exceptions and traps (affecting whether instruction may be rescheduled or executed speculatively). If you can think of anything I missed here, I'd certainly appreciate to learn more.

Now, of course, earlier passes (on the IR and also SelectionDAG levels) certainly do perform transformations that affect the actual operations performed, and those would certainly care. But at those levels we already have all the extra information we need; at the IR level we have the constrained intrinsics, and at the DAG level we have the STRICT_ nodes. (Currently, those STRICT_ nodes lose a bit of information available at the IR level: we don't specifically distinguish whether a STRICT_ node arose from an IR that is marked as requiring special handling because of just rounding modes, just exceptions, or both. If we ever want to add DAG optimization for STRICT_ node that requires this information, it would certainly be fine with me to add another bit in the SDNodeFlags for example.)

I think we can have a flag to indicate rounding mode for float operator such as FADD instead of STRICT_FADD, because STRICT_FADD does have side effect(chain), but some target (eg. RISCV) has static rounding mode encoded in the instruction which does not side effect. Then we can optimize specific STRICT_FADD node which ignoring exception into normal FADD with corresponding static round mode.

kpn added a comment.Feb 25 2019, 6:44 AM

I think we can have a flag to indicate rounding mode for float operator such as FADD instead of STRICT_FADD, because STRICT_FADD does have side effect(chain), but some target (eg. RISCV) has static rounding mode encoded in the instruction which does not side effect. Then we can optimize specific STRICT_FADD node which ignoring exception into normal FADD with corresponding static round mode.

WRT the chain: "That's not a bug, that's a feature!"

The chain prevents reordering of floating point instructions by the SelectionDAG. That makes it required.

Some System/Z floating point instructions can take an optional per-instruction rounding mode encoded in the instruction. This does not eliminate the need for strict ordering of floating point operations.

jsji added a subscriber: jsji.Feb 25 2019, 10:40 AM
In D55506#1408913, @kpn wrote:

I think we can have a flag to indicate rounding mode for float operator such as FADD instead of STRICT_FADD, because STRICT_FADD does have side effect(chain), but some target (eg. RISCV) has static rounding mode encoded in the instruction which does not side effect. Then we can optimize specific STRICT_FADD node which ignoring exception into normal FADD with corresponding static round mode.

WRT the chain: "That's not a bug, that's a feature!"

The chain prevents reordering of floating point instructions by the SelectionDAG. That makes it required.

Some System/Z floating point instructions can take an optional per-instruction rounding mode encoded in the instruction. This does not eliminate the need for strict ordering of floating point operations.

Yes, the chain prevents redordering of floating point instructions by SelectionDAG. But it's fine to reorder instructions with static rounding mode(and ignore exception) and no need for chain. Because they do not have side effect, it means the result of such instructions only depend on input (eg, float add including source operands and mode encoding), does not depend on outside state variable that is current rounding mode. It also means the result is unchanged no matter how many times happens so long as the input is same.

Yes, the chain prevents redordering of floating point instructions by SelectionDAG. But it's fine to reorder instructions with static rounding mode(and ignore exception) and no need for chain. Because they do not have side effect, it means the result of such instructions only depend on input (eg, float add including source operands and mode encoding), does not depend on outside state variable that is current rounding mode. It also means the result is unchanged no matter how many times happens so long as the input is same.

I'm not sure exactly what you have in mind here. For instructions which take explicit rounding mode arguments reordering is not always an issue, but when we're building the selection DAG do we know that we will end up with instructions that take an explicit rounding mode argument? For some architectures we do, but not for all. The concern with reordering, in the case where FP status flags are being ignored, is that we need to avoid possibly reordering instructions with respect to instructions that change the rounding mode. If the rounding mode is explicit in the instruction that's not an issue but otherwise it is, and at least in the case of X86 I don't think we can tell ahead of time (at least without doing things we shouldn't be doing) which instructions we'll end up with.