This is an archive of the discontinued LLVM Phabricator instance.

[FPEnv] Constrained FCmp intrinsics
ClosedPublic

Authored by uweigand on Oct 21 2019, 2:10 PM.

Details

Summary

This is a continuation of https://reviews.llvm.org/D54649, which was abandoned by @cameron.mcinally

The major differences to Cameron's approach include:

  • I did not actually require any TableGen changes, but was able to represent the overloaded intrinsic types using existing mechanisms. This solution looks correct to me, but if I'm missing anything here, please let me know ...
  • This adds a full set of compare intrinsics for all comparison codes.
  • I've added (mostly) complete SystemZ back-end support to actually generate correct code for all of them -- this is also to verify that the use of Custom expansion of strict operations actually allows the back-end to do what it needs to do.

This patch is still not complete, but I wanted to show it now to ask for feedback.

Some areas that definitely need more work are:

  • Vector type legalization of invalid vector types involving STRICT_FSETCC doesn't work yet.
  • Signalling comparisons are not supported yet.
  • We may also need strict versions of SELECT_CC and BR_CC. I haven't implemented those yet since they aren't really necessary for SystemZ, but some other platforms may require them.
  • The X86 back-end changes are incomplete, they're right now the bare minimum to keep Cameron's two test cases working.
  • There are many SETCC-related transformations and optimizations in the SelectionDAG codegen. Many of these are also valid for the strict FP case. I haven't found a simple way yet to write those transformations so that can easily handle both cases without duplicating a lot of boilerplate code ... Right now the patch implements the minimum set of those to get the SystemZ test cases to work (at least generating the same code as the non-strict versions for extremely simple use cases).

Diff Detail

Event Timeline

uweigand created this revision.Oct 21 2019, 2:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 21 2019, 2:11 PM

I did not actually require any TableGen changes, but was able to represent the overloaded intrinsic types using existing mechanisms. This solution looks correct to me, but if I'm missing anything here, please let me know ...

That sounds right. LLVMScalarOrSameVectorWidth did not exist when D54649 was proposed, IINM.

This adds a full set of compare intrinsics for all comparison codes.

There's really no reasonable alternative, correct? I believe that negating < and > would get messy wrt signals when a NaN is present. I don't think that savings is worth the effort (e.g. readability).

The X86 back-end changes are incomplete, they're right now the bare minimum to keep Cameron's two test cases working.

I'd be ok with leaving these changes to a later patch. Your call...

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7073 ↗(On Diff #225954)

These DAG.getNode changes could probably be broken out into a standalone Diff.

lib/Target/SystemZ/SystemZISelLowering.cpp
2171 ↗(On Diff #225954)

This is probably worthy of a comment.

2655 ↗(On Diff #225954)

I notice the static_cast<SystemZISD::NodeType> has been dropped. Is this ok?

lib/Target/SystemZ/SystemZISelLowering.h
248 ↗(On Diff #225954)

preoduce typo could use a fix.

uweigand updated this revision to Diff 227715.Nov 4 2019, 8:55 AM
uweigand changed the repository for this revision from rL LLVM to rG LLVM Github Monorepo.

I did not actually require any TableGen changes, but was able to represent the overloaded intrinsic types using existing mechanisms. This solution looks correct to me, but if I'm missing anything here, please let me know ...

That sounds right. LLVMScalarOrSameVectorWidth did not exist when D54649 was proposed, IINM.

OK, thanks for your review!

This adds a full set of compare intrinsics for all comparison codes.

There's really no reasonable alternative, correct? I believe that negating < and > would get messy wrt signals when a NaN is present. I don't think that savings is worth the effort (e.g. readability).

Agreed.

The X86 back-end changes are incomplete, they're right now the bare minimum to keep Cameron's two test cases working.

I'd be ok with leaving these changes to a later patch. Your call...

OK, I'm dropping the X86-related changes for now.

Due to the switch to the github monorepo, it seems all the inline comments disappeared -- sorry for that.

Anyway, I believe I've anyway addressed them all:

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7073
These DAG.getNode changes could probably be broken out into a standalone Diff.

Agreed. Since this is just a simple refactoring, I've checked this in as 664f84e246478db82be2871f36fd1a523d9f2731.

lib/Target/SystemZ/SystemZISelLowering.cpp:2171
This is probably worthy of a comment.

Agreed. Added comment.

lib/Target/SystemZ/SystemZISelLowering.cpp:2655
I notice the static_cast<SystemZISD::NodeType> has been dropped. Is this ok?

Since the return value of that function was just plain int anyway, that static_cast didn't really have any point as far as I can see ...

lib/Target/SystemZ/SystemZISelLowering.h:248
preoduce typo could use a fix.

Checked in separately as d4a7855b68d4d53f121209333d5f2796731ab1f5

simoll added a subscriber: simoll.Nov 4 2019, 11:26 AM
pengfei added a subscriber: pengfei.Nov 4 2019, 5:12 PM

This looks like a very good start. I've talked to @pengfei about the x86 backend support for this. As long as x86 doesn't fail horribly, I'd be OK with the x86 work being done in a separate patch that depends on this one.

Do you have a plan for handling quiet versus signaling predicates?

Hmm, something is weird (with my Diff, at least). It looks like changes are listed twice (e.g. lib/IR/Verifier.cpp and llvm/lib/IR/Verifier.cpp). Maybe this is an SVN->GIT side-effect and the patch needs a rebase?

Hmm, something is weird (with my Diff, at least). It looks like changes are listed twice (e.g. lib/IR/Verifier.cpp and llvm/lib/IR/Verifier.cpp). Maybe this is an SVN->GIT side-effect and the patch needs a rebase?

In the current diff I only see the new llvm/lib/... paths. It is indeed the case that the diff I submitted originally had the SVN-style lib/... paths, but I updated it with a new patch after the GIT transition.

Oh, it's gone now. Maybe it was a temporary hiccup. Will review this afternoon...

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1372

This throws away strict-ness for the expanded scalar selects. Is that deliberate?

I would have expected this to produce scalar STRICT_FSETCC nodes instead.

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3790

This else can be removed.

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
2522–2527

Missing space before ?.

uweigand marked an inline comment as done.Nov 12 2019, 8:26 AM

Thanks for the review!

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1372

I don't see how this throws away strict-ness. Note that we do produce scalar STRICT_FSETCC nodes (that's in the ScalarOp = DAG.getNode(Op->getOpcode() ...) line). The additional select just ensures the (integral) output has the same value that a vector SETCC would have, instead of the (difference) value that a scalar SETCC has. Since this is completely an integer operation, strict-ness doesn't apply.

cameron.mcinally marked an inline comment as done.Nov 12 2019, 8:35 AM
cameron.mcinally added inline comments.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1372

Oh, I see it now. I misread this change...

simoll added inline comments.Nov 12 2019, 8:58 AM
llvm/include/llvm/IR/Intrinsics.td
750–791

Out of curiosity: what is your motivation to have one intrinsic per fcmp predicate instead of, say, an i8 immarg?

uweigand marked an inline comment as done.Nov 12 2019, 9:14 AM
uweigand added inline comments.
llvm/include/llvm/IR/Intrinsics.td
750–791

I didn't really make the decision, just took it over from the original patch. But I agree it seems to make sense simply from a perspective of being explicit in the IR: if there's just a random numeric value, it would make the IR harder to read/write.

llvm/include/llvm/IR/Intrinsics.td
750–791

It would be hard to read -- unless we wrote an IR decoding routine somewhere. That seems weird though.

llvm/include/llvm/IR/Intrinsics.td
750–791

This is why we used metadata strings for the constraint arguments. I'm still not entirely satisfied with that. Would it be reasonable to make these arguments constants and add a custom printer and parser for these intrinsics? What about token arguments?

simoll added inline comments.Nov 13 2019, 1:40 AM
llvm/include/llvm/IR/Intrinsics.td
750–791

This is why we used metadata strings for the constraint arguments. I'm still not entirely satisfied with that.

You could invent some fpenv operand bundle for fpexcept, fpround, as in:

%p = llvm.experimental.constrained.fcmp_ugt(%x, %y) [ "fpenv"(metadata !"fpround.tonearest", metadata !"fpexcept.strict") ]

Omission would then imply the default fp env (i'd like that for the VP fp intrinsics).

From https://llvm.org/docs/LangRef.html#operand-bundles

An operand bundle at a call site cannot change the implementation of the called function.

I am not sure what to make out of that statement. I guess it should be ok to also use operand bundles also for the comparison predicate? Otw, it might be possible to change to OpBound specification to allow different implementations.

Personally, i'd prefer we had a more flexible attribute mechanism, so we could tag-on all of these things:

%p = fcmp %x, %y cmp(ugt) fpround(tonearest) fpexcept(strict)
llvm/include/llvm/IR/Intrinsics.td
750–791

I like the idea of operand bundles for the constraint arguments. They seem naturally separate. I'm less comfortable with the comparison predicate being in the operand bundle because it is essential to the semantic meaning of the operation.

I started an experiment yesterday using token arguments instead of metadata to represent the rounding mode and exception behavior. I can easily shift the tokens to an operand bundle. I think we can use a token argument for the fcmp intrinsic predicate. I'll post an RFC to the mailing list soon to solicit opinions about using token arguments this way.

Regarding more flexible attributes, the problem we were trying to solve with the intrinsics was introducing new semantics in a way that wouldn't break existing optimizations that didn't know about the new semantics. Once these intrinsics are well established and well supported we might be able to merge their behavior back into the instructions.

llvm/include/llvm/IR/Intrinsics.td
750–791

I've just uploaded D70261, showing what I have in mind for implementing the FP constraints as tokens in an operand bundle. I think the fcmp predicates could be represented by a new token type in the same way, but as I said previously I'd prefer the predicate to be a proper argument to the constrained fcmp intrinsic.

uweigand updated this revision to Diff 229913.Nov 18 2019, 1:29 PM

Updated patch to only use a single llvm.experimental.constrained.fcmp intrinsic, where the predicate is passed as argument.

I agree that it does look preferable to have a single intrinsic, if only to make potential future IR optimizations simpler structurally to the corresponding operations on fcmp statements.

For now I'm simply using a metadata string to hold the predicate; if we come up with some form of "enum constant" in the IR in the future, it should be straightforward to change this (users are encapsulated).

uweigand updated this revision to Diff 230775.Nov 23 2019, 2:11 PM
uweigand retitled this revision from [FPEnv][WIP] Constrained FCmp intrinsics to [FPEnv] Constrained FCmp intrinsics.

This is now a functionally complete patch to implement constrained floating-point comparison intrinsics.

The major differences to the prior version are:

  • Added documentation for the new intrinsics.
  • Added support for signaling comparisons.
  • Updated for common-code changes (ConstrainedOps.def).
  • Removed (incorrect and unnecessary) optimization in DAGCombine.
  • Reviewed SystemZ back-end optimizations to make sure they respect strict semantics.

There are still opportunities to implement more optimizations in common code, but that can wait for later patches.

To handling signaling comparisons I now added a second intrinsic llvm.experimental.constrained.fcmps, which works otherwise the same as llvm.experiemental.constrained.fcmp. (Another design choice could have been to use a single intrinsics with an extra boolean argument, but that seemed to be less useful.)

I would prefer a single intrinsic/opcode with additional predicates to handle signalling vs. quiet. My main reason is consistency with how the IEEE-754 spec describes comparisons, but that's a weak argument. @craig.topper mentioned to me that there are some additional opcodes that might need to be duplicated if you treat signalling and quiet as separate operations.

I would prefer a single intrinsic/opcode with additional predicates to handle signalling vs. quiet. My main reason is consistency with how the IEEE-754 spec describes comparisons, but that's a weak argument. @craig.topper mentioned to me that there are some additional opcodes that might need to be duplicated if you treat signalling and quiet as separate operations.

I was referring to SELECT_CC and BR_CC. If we have a separate STRICT_FSETCC for signalling and quiet, then we we'll need separate STRICT_SELECT_CC/BR_CC if for signalling and quiet.

In the end, the reason why I chose to separate STRICT_FSETCC and STRICT_FSETCCS is that this makes it straightforward for the back-end to signal availability of the underlying instructions (which may be distinct).

Specifically, on SystemZ we have quiet vector compare instructions in z13, but we only got the signaling vector compare instructions in z14. This means on z13 we have to scalarize signaling (but not quiet) vector compares. With two DAG opcodes, the back end only has to specify this like so:

setOperationAction(ISD::STRICT_FSETCC, VT, Custom);
if (Subtarget.hasVectorEnhancements1())
  setOperationAction(ISD::STRICT_FSETCCS, VT, Custom);

and everything is just handled correctly automatically.

Having just a single DAG opcode would make this more complicated.

On the other hand, I found handling two opcodes in common code quite straightforward, usually simply adding another case statement to a switch. Adding an extra argument to the DAGnode (sort of like FP_ROUND) also need special code all over the place to handle ...

In the end, the reason why I chose to separate STRICT_FSETCC and STRICT_FSETCCS is that this makes it straightforward for the back-end to signal availability of the underlying instructions (which may be distinct).

Specifically, on SystemZ we have quiet vector compare instructions in z13, but we only got the signaling vector compare instructions in z14. This means on z13 we have to scalarize signaling (but not quiet) vector compares. With two DAG opcodes, the back end only has to specify this like so:

setOperationAction(ISD::STRICT_FSETCC, VT, Custom);
if (Subtarget.hasVectorEnhancements1())
  setOperationAction(ISD::STRICT_FSETCCS, VT, Custom);

and everything is just handled correctly automatically.

Having just a single DAG opcode would make this more complicated.

On the other hand, I found handling two opcodes in common code quite straightforward, usually simply adding another case statement to a switch. Adding an extra argument to the DAGnode (sort of like FP_ROUND) also need special code all over the place to handle ...

Just a thought if we merged signaling into the predicate, could we use the setCondCodeAction(might have the name wrong). On X86 for vectors we have signaling on qnan for LT/LE/GT/GE, but eq/ne/ord/unord only signal on snan. At least prior to AVX. Of course since it’s X86 we probably need custom handling for other reasons so maybe it doesn’t matter.

Just a thought if we merged signaling into the predicate, could we use the setCondCodeAction(might have the name wrong).

Hmm, right now it appears this is only checked for scalar operations, not vector.

craig.topper added inline comments.Nov 26 2019, 11:21 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
1329

The || should be on the line before

1364

The || on the line before

I didn't look at the SystemZ specifics, but the rest seemed fine

llvm/docs/LangRef.rst
15734

Can we say "will" rather than "may"?

uweigand updated this revision to Diff 231243.Nov 27 2019, 6:44 AM

Address review comments.

uweigand marked 3 inline comments as done.Nov 27 2019, 6:45 AM
cameron.mcinally added a comment.EditedNov 27 2019, 10:17 AM

I would prefer a single intrinsic/opcode with additional predicates to handle signalling vs. quiet. My main reason is consistency with how the IEEE-754 spec describes comparisons, but that's a weak argument. @craig.topper mentioned to me that there are some additional opcodes that might need to be duplicated if you treat signalling and quiet as separate operations.

I don't have a strong opinion on this, but this would get weird for compareQuietUnordered and compareQuietOrdered. There are no signaling counterparts for those.

Oh, or maybe I misunderstood. I assumed Andy meant a single intrinsics with an extra boolean argument, but now I'm not so sure...

If we're adding new predicates, e.g. SGT/QGT, then that's probably fine.

I would prefer to model the complete set of operations both in a signaling and quiet form. It is true that IEEE does not define a named operation for each of these combinations, but it would still be useful:

  • so we can model the capabilities of the hardware ISA (e.g. SystemZ and Power do have an instruction for any of these)
  • and the compiler could in fact make use of them via optimization (e.g. an "isSignalingEqual" could be result of an optimization of "x <= y && x >= y").

Note that even today, not all of the predicate codes match an IEEE named operation, e.g. there is none corresponding to "one".

Now, assuming we want to model the complete set of operations, there would still be different ways to do so:

  • use distinct intrinsics / opcodes to distinguish signaling vs. quiet operations, each carrying a predicate operand (like this patch currently does)
  • use a single intrinsic / opcode with two operands (the predicate / condition code, and a boolean to indicate signaling vs. quiet)
  • encode signaling vs. quiet status into the condition code, and use a single intrinsic / opcode with just one operand, the extended condition code

These ways would be generally equivalent, so it is a question of what seems easiest to implement. I had initially experimented with using a single intrinsic / opcode with two operands, but in the end found the current solution with two opcodes simpler overall (in particular because of the setOperationAction issue discussed above). I have not tried extended condition codes -- I'm not sure if this would run into issues elsewhere if those "leak" into other code currently handling condition code that is not aware of the new ones ...

To clarify, I was suggesting a single intrinsic with quiet/signaling encoded in the predicate. I think the IR definition is a bit cleaner this way and it maps to both the IEEE specification and at least the most common way (I think) this is implemented in x86 hardware and probably others. That said, if doing it this way makes the implementation more complicated I could go along with separate intrinsics and ISD opcodes.

To clarify, I was suggesting a single intrinsic with quiet/signaling encoded in the predicate. I think the IR definition is a bit cleaner this way and it maps to both the IEEE specification and at least the most common way (I think) this is implemented in x86 hardware and probably others. That said, if doing it this way makes the implementation more complicated I could go along with separate intrinsics and ISD opcodes.

Legacy X87 has separate instructions for signalling and quiet. Originally with FCOM/FUCOM that write to the condition code bits in FPSW. Later with FCOMI and FUCOMI that write to EFLAGS. SSE added similar scalar comparisons that are split between signalling and quiet that produce an EFLAGS result, COMISS/COMISD and UCOMISS/UCOMISD. The flags are defined in a weird way that requires two branches or two setcc instructions and a logic op to implement oeq and une.

SSE also added scalar and vector instructions, CMPPS/CMPD/CMPSS/CMPSD, that produce a alls 0s or all 1s result in each element of a vector register. These instructions have 8 encodings, some that are signalling and some that are quiet. And for some behaviors we have to commute operands. They can't represent all 32 possible behaviors. So for vector comparisons we'll need to scalarize when we don't have the right encoding. With AVX the 8 encodings where extended to 32 encodings that support all possible behaviors.

Having two different ISD opcodes is straightforward for FCOM/FCOMI/FUCOMI/FCOMI/COMISS/COMISD/UCOMISS/UCOMISD selection. A single ISD opcode with merged predicate would also be straightforward.

The CMPPS/CMPD/CMPSS/CMPSD selection in X86 is already in custom code due to the 8 encoding issue. So two ISD opcodes or one probably doesn't make a lot of difference there either.

How can we get to a decision on this? At some point we need to move ahead one way or the other.

I'd still argue that using separate ISD nodes for signaling vs. quiet compares is preferable, both from a complexity of implementation perspective and also conceptually: I feel that a signaling and a quiet compare implement the same predicates (and also return the same values in the absence of traps), it's just that operation is different in which exceptions are signaled ...

Adding an extra argument to the DAGnode (sort of like FP_ROUND) also need special code all over the place to handle ...

Hi @uweigand, I don't quite understand why we need special code all over the place to handle. From what I can see, STRICT_FSETCCS always have the same action with STRICT_FSETCC in common code.

Adding an extra argument to the DAGnode (sort of like FP_ROUND) also need special code all over the place to handle ...

Hi @uweigand, I don't quite understand why we need special code all over the place to handle. From what I can see, STRICT_FSETCCS always have the same action with STRICT_FSETCC in common code.

I had the variant with the extra operand implemented initially, and what I was seeing was that made the node "special" because now we don't have the same set of operands between SETCC and STRICT_FSETCC any more. Usually, the STRICT_ opcodes have the same set of operands as the regular operands, with the exception of the chain. Code tends to assume that this is the case (e.g. when morphing a strict node into a regular node, or when writing a single legalization or other transformation that applies to both the strict and the regular node). Having not just the chain, but one additional extra operand required code changes to handle that operand in various places. Instead, with two DAG opcodes, you usually just have to add two more cases to a switch and that's it.

But those were really minor issues; the primary reason for me to switch was that I needed a way to tell common code which operations are legal, where the ISA has differences between signaling and quiet operations. (On Z we always have both for scalar types, but for vector types we got the signaling operations at a later ISA level than the quiet ones). This is trivial with two opcodes, but would require quite a bit of custom code if when using just a single one.

For X86, I think the one opcode or two is a wash. So I think this is fine.

This revision is now accepted and ready to land.Dec 5 2019, 3:03 PM

Adding an extra argument to the DAGnode (sort of like FP_ROUND) also need special code all over the place to handle ...

Hi @uweigand, I don't quite understand why we need special code all over the place to handle. From what I can see, STRICT_FSETCCS always have the same action with STRICT_FSETCC in common code.

I had the variant with the extra operand implemented initially, and what I was seeing was that made the node "special" because now we don't have the same set of operands between SETCC and STRICT_FSETCC any more. Usually, the STRICT_ opcodes have the same set of operands as the regular operands, with the exception of the chain. Code tends to assume that this is the case (e.g. when morphing a strict node into a regular node, or when writing a single legalization or other transformation that applies to both the strict and the regular node). Having not just the chain, but one additional extra operand required code changes to handle that operand in various places. Instead, with two DAG opcodes, you usually just have to add two more cases to a switch and that's it.

But those were really minor issues; the primary reason for me to switch was that I needed a way to tell common code which operations are legal, where the ISA has differences between signaling and quiet operations. (On Z we always have both for scalar types, but for vector types we got the signaling operations at a later ISA level than the quiet ones). This is trivial with two opcodes, but would require quite a bit of custom code if when using just a single one.

Got it, thanks!

This revision was automatically updated to reflect the committed changes.