Page MenuHomePhabricator

[TargetLowering] Fix another potential FPE in expandFP_TO_UINT
Needs ReviewPublic

Authored by uweigand on Tue, Sep 3, 9:10 AM.

Details

Summary

D53794 introduced code to perform the FP_TO_UINT expansion via FP_TO_SINT in a way that would never expose floating-point exceptions in the intermediate steps. Unfortunately, I just noticed there is still a way this can happen. As discussed in D53794, the compiler now generates this sequence:

// Sel = Src < 0x8000000000000000 
// Val = select Sel, Src, Src - 0x8000000000000000 
// Ofs = select Sel, 0, 0x8000000000000000 
// Result = fp_to_sint(Val) ^ Ofs

The problem is with the Src - 0x8000000000000000 expression. As I mentioned in the original review, that expression can never overflow or underflow if the original value is in range for FP_TO_UINT. But I missed that we can get an Inexact exception in the case where Src is a very small positive value. (In this case the result of the sub is ignored, but that doesn't help.)

Instead, I'd suggest to use the following sequence:

// Sel = Src < 0x8000000000000000 
// FltOfs = select Sel, 0, 0x8000000000000000
// IntOfs = select Sel, 0, 0x8000000000000000
// Result = fp_to_sint(Val - FltOfs) ^ IntOfs

In the case where the value is already in range of FP_TO_SINT, we now simply compute Val - 0, which now definitely cannot trap (unless Val is a NaN in which case we'd want to trap anyway).

In the case where the value is not in range of FP_TO_SINT, but still in range of FP_TO_UINT, the sub can never be inexact, as Val is between 2^(n-1) and (2^n)-1, i.e. always has the 2^(n-1) bit set, and the sub is always simply clearing that bit.

There is a slight complication in the case where Val is a constant, so we know at compile time whether Sel is true or false. In that scenario, the old code would automatically optimize the sub away, while this no longer happens with the new code. Instead, I've added extra code to check for this case and then just fall back to FP_TO_SINT directly. (This seems to catch even slightly more cases.)

Diff Detail

Event Timeline

uweigand created this revision.Tue, Sep 3, 9:10 AM
efriedma added inline comments.Tue, Sep 3, 12:42 PM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
5713

I'm not sure what you're trying to do here. If the input value is a constant, we should just constant-fold the operation.

Ah, good point. Currently, constrained floating-point intrinsics are never constant-folded. In general, this would likely be wrong anyway, since we might not know the current rounding mode that applies.

However, in the specific case of a fptoui, we don't depend on the current rounding mode (it's always round-to-zero even for the constrained case), so we actually could constant fold those in the cases where we the result is in range.

@kpn -- the test cases you added specifically verify that constrained fptoui does not constant-fold but does the runtime conversion. Should those be changed to use nonconstant arguments then?

kpn added a comment.Wed, Sep 4, 9:54 AM

Ah, good point. Currently, constrained floating-point intrinsics are never constant-folded. In general, this would likely be wrong anyway, since we might not know the current rounding mode that applies.

However, in the specific case of a fptoui, we don't depend on the current rounding mode (it's always round-to-zero even for the constrained case), so we actually could constant fold those in the cases where we the result is in range.

@kpn -- the test cases you added specifically verify that constrained fptoui does not constant-fold but does the runtime conversion. Should those be changed to use nonconstant arguments then?

Hmm, today? I'm leaning towards keeping it unchanged for now.

When constant folding of constrained intrinsics is implemented we'll certainly need to have tests for the various cases of NaNs, inexact cases, in-range exact constants, variables, and maybe other cases. Then there's the cases where we're using constrained intrinsics but with exceptions ignored, which would obviously change folding behavior. For now it may be enough to not bother with worrying about it and just leave the constant folding for another day.

uweigand updated this revision to Diff 218739.Wed, Sep 4, 10:12 AM

Remove attempt to handle constant arguments

So this removes the attempted handling of constant arguments. I agree with @efriedma that it doesn't make sense to just do this small optimization; we should really constant fold.

As long as this constant-folding is not yet implemented, the test cases using constant arguments will now have more complex code (similarly to what they'd have with variable arguments). Not sure if that is really an issue.

kpn added a comment.Fri, Sep 6, 12:10 PM

Well, I like it. The problem is solved, the pseudocode is I find a little easier to read, and the System/Z instruction sequence looks like a sideways move performance-wise.

An outsider's view is that the X86 instructions look worse, but in the strict case I think correctness should win. If that strictness isn't required then targets can still use the old non-strict sequence. I am surprised the X86 instruction sequences are that different.

In D67105#1661418, @kpn wrote:

Well, I like it. The problem is solved, the pseudocode is I find a little easier to read, and the System/Z instruction sequence looks like a sideways move performance-wise.

Thanks for having a look.

An outsider's view is that the X86 instructions look worse, but in the strict case I think correctness should win. If that strictness isn't required then targets can still use the old non-strict sequence. I am surprised the X86 instruction sequences are that different.

Note that the large differences in instruction sequences are really only in the vector-constrained-intrinsics.ll test, and those are all due to the fact that the test case uses a constant argument to the intrinsics, and --as discussed above-- while we do not currently perform constant folding for constrained intrinsics, after expanding the intrinsic to the "strict" sequence, constant folding still occurs on part of that resulting sequence, and it just so happens that the "old" sequence allows for more such folding than the "new" sequence. I had initially tried to implement a sort of "partial constant folding" that would get us to the same level as before, but given the discussion with Eli above, I now also think this doesn't make much sense: we really should do the full constant folding, and until such time as this is implemented, we can probably live with the lack of this (accidental) "partial" constant folding we had previously.

The difference in the fp-intrinsic.ll test, which doesn't use constant arguments and therefore didn't have this accidental partial folding to begin with, are more in line with the SystemZ changes, and appear to direcly correspond to the changes in IR. But I'd certainly appreciate X86 experts giving their opinion on the quality of those sequences, in particular since on X86, we're using the "strict" expansion by default even for regular FP opcodes.

xbolva00 added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
5732

Val is Src? New comment is confusing

uweigand updated this revision to Diff 219330.Mon, Sep 9, 5:53 AM

Fixed comment typo.

uweigand marked 2 inline comments as done.Mon, Sep 9, 5:53 AM
uweigand added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
5732

Right, forgot to update this. Should be fixed now, thanks!

spatel added inline comments.Mon, Sep 9, 8:51 AM
test/CodeGen/X86/vector-constrained-fp-intrinsics.ll
4763

This looks fine to me given that there's no constant folding yet.

It is the x86 manifestation of this node:
Creating new node: t15: f64,ch = strict_fsub t0, ConstantFP:f64<4.210000e+01>, ConstantFP:f64<0.000000e+00>