This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Remove pending AssertZext AssertSext in promoting FP_TO_INT
AbandonedPublic

Authored by xiangzhangllvm on Jul 15 2021, 4:01 AM.

Details

Summary

Adding AssertZext AssertSext in promoting FP_TO_INT is not safe.

for example:
t3: v8i8 = fp_to_uint t2 --promote to--> t4: v8i16 = fp_to_sint t2 + t5: v8i16 = AssertZext t86, ValueType:ch:i8
will let following optimization see the "high 8-bits of each element of t5" == 0. (but it is not, it may be undef, when there is overflow in fp_to_uint )

So some setting t5's "high 8-bits of each element" action will be optimized out.
And finally cause calculation error.

Related commit: https://reviews.llvm.org/D40591
Related optimization: https://reviews.llvm.org/rG2a419a0b9957ebac9e11e4b43bc9fbe42a9207df

tests in llvm/test/CodeGen/X86/tmp is just for discussion, I will remove it at last.

Diff Detail

Event Timeline

xiangzhangllvm created this revision.Jul 15 2021, 4:01 AM
xiangzhangllvm requested review of this revision.Jul 15 2021, 4:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 15 2021, 4:01 AM
RKSimon added inline comments.Jul 15 2021, 4:22 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
715

The codegen diff is awful - what happens if you replace this with a SIGN_EXTEND_INREG / ZERO_EXTEND_INREG (AND) pattern?

If the conversion overflows the result is poison according to the IR langref. So is this really a problem?

Does the same code need to be removed from type legalization?

I agree, the existing lowering seems fine. If it's causing problems for C code, there's probably a source bug, which should be caught by -fsanitize=undefined.

To be clear, I'm working on the assumption that AssertZext means "if the operand is not poison, the high N bits are not set". I'm not sure we've formally stated that anywhere, but I think it's consistent with how SelectionDAG optimizations use AssertZext in practice.

To be clear, I'm working on the assumption that AssertZext means "if the operand is not poison, the high N bits are not set". I'm not sure we've formally stated that anywhere, but I think it's consistent with how SelectionDAG optimizations use AssertZext in practice.

Is it possible that user expect some saturate behaviour (get the max value of the int) when convert from FP to INT overflow? AssertZext cause the undefined behaviour be propagated to each optimization that assume the upper bits being zero.

To be clear, I'm working on the assumption that AssertZext means "if the operand is not poison, the high N bits are not set". I'm not sure we've formally stated that anywhere, but I think it's consistent with how SelectionDAG optimizations use AssertZext in practice.

Is it possible that user expect some saturate behaviour (get the max value of the int) when convert from FP to INT overflow? AssertZext cause the undefined behaviour be propagated to each optimization that assume the upper bits being zero.

It's undefined behavior according to C. The code would get different behavior on different platforms and shouldn't be relied on. We have fptosi_sat and fptoui_sat intrinsics that are used by rust that produce a well defined behavior on overflow. It doesn't look like we have a target independent builtin for them.

To be clear, I'm working on the assumption that AssertZext means "if the operand is not poison, the high N bits are not set". I'm not sure we've formally stated that anywhere, but I think it's consistent with how SelectionDAG optimizations use AssertZext in practice.

Is it possible that user expect some saturate behaviour (get the max value of the int) when convert from FP to INT overflow? AssertZext cause the undefined behaviour be propagated to each optimization that assume the upper bits being zero.

It's undefined behavior according to C. The code would get different behavior on different platforms and shouldn't be relied on. We have fptosi_sat and fptoui_sat intrinsics that are used by rust that produce a well defined behavior on overflow. It doesn't look like we have a target independent builtin for them.

I'm not sure you get saturation with this assert removed. The result will be truncated to the original integer type. If the conversion overflowed the original type but not int32, you would just get the lower 8 or 16 bits of the conversion result. If it overflowed the conversion to int32, the cvt instruction produces INT_MIN and the truncate will turn it into 0.

This cause a runfail of a big project, Let me show key context of the test first:

%wide.insert226vector_func.i = shufflevector <32 x i8> %wide.insert224vector_func.i, <32 x i8> %extended.225vector_func.i, <32 x i32> <i32 0, i32 32, i32 undef, i32 3, i32 4, i32 33, i32 undef, i32 7, i32 8, i32 34, i32 undef, i32 11, i32 12, i32 35, i32 undef, i32 15, i32 16, i32 36, i32 undef, i32 19, i32 20, i32 37, i32 undef, i32 23, i32 24, i32 38, i32 undef, i32 27, i32 28, i32 39, i32 undef, i32 31>

it try to convert to (late me trunc the first 4 element of shuffle mask)

%wide.insert226vector_func.i0 = shufflevector <32 x i8> %wide.insert224vector_func.i, <32 x i8> --, <32 x i32> <i32 0, i32 zo, i32 undef, i32 3...> // zo means set here element = 0, -- means "can optimize out this operand"
%wide.insert226vector_func.i1 = shufflevector <32 x i8> %--, <32 x i8> %extended.225vector_func.i, <32 x i32> <i32 zo, i32 33, i32 zo, i32 zo...>
%wide.insert226vector_func.i = %wide.insert226vector_func.i0 | %wide.insert226vector_func.i1

But the first IR %wide.insert226vector_func.i0 can be directly replaced by its operand %wide.insert224vector_func.i when the index 1, 1+4, 1+2*4 ... is zero.
@RKSimon 's patch (optimization: https://reviews.llvm.org/rG2a419a0b9957ebac9e11e4b43bc9fbe42a9207df) will try check if related elements is zero, and @craig.topper 's patch happen generated AssertZext to show this element is zero.
But in fact , this element may be undef.

Let me upload the small reproduce test soon. (if you remove @RKSimon 's patch, that will be more obvious)

I'm working on assumption that AssertZext means "if the operand is not poison, the high N bits are not set"

And In fact, current optimizations on AssertZext don't care about its operand's poison or other details, it just thought its hight N bits are 0.
for example, in

"SelectionDAG::computeKnownBits" 
 3297   case ISD::AssertZext: {
 3298     EVT VT = cast<VTSDNode>(Op.getOperand(1))->getVT();
 3299     APInt InMask = APInt::getLowBitsSet(BitWidth, VT.getSizeInBits());
 3300     Known = computeKnownBits(Op.getOperand(0), Depth+1);
 3301     Known.Zero |= (~InMask);
 3302     Known.One  &= (~Known.Zero);
 3303     break;

You can see the high N bits directly be mark as zero.

This comment was removed by xiangzhangllvm.

And for "AssertZext ", both following DAG optimizations and ISel will not try generate node/MIR to clear its high N bits to zero. ISel will directly ignore it.

(I think it make sense, because it is just a flag/mark, it means its operand's high N bits already be 0)
xiangzhangllvm edited the summary of this revision. (Show Details)

upload llvm/test/CodeGen/X86/tmp

I tend not to commit this test at last, this is just for show the problem in discussion.

The diff is not cause by Current patch on the rep.ll (because it is not obviously)
The diff is caused by revert Simon's https://reviews.llvm.org/rG2a419a0b9957ebac9e11e4b43bc9fbe42a9207df or not.
(Simon's patch self is no problem, it is just show out the problem, the root cause is related with generating AssertZext)

Let me reformat/refine my upper comments:
This cause a runfail of a big project, Let me show key context of the test first:

%wide.insert226vector_func.i = shufflevector <32 x i8> %wide.insert224vector_func.i, <32 x i8> %extended.225vector_func.i, <32 x i32> <i32 0, i32 32, i32 undef, i32 3, i32 4, i32 33, i32 undef, i32 7, i32 8, i32 34, i32 undef, i32 11, i32 12, i32 35, i32 undef, i32 15, i32 16, i32 36, i32 undef, i32 19, i32 20, i32 37, i32 undef, i32 23, i32 24, i32 38, i32 undef, i32 27, i32 28, i32 39, i32 undef, i32 31>

it try to convert to (late me trunc the first 4 element of shuffle mask)

%wide.insert226vector_func.i0 = shufflevector <32 x i8> %wide.insert224vector_func.i, <32 x i8> --, <32 x i32> <i32 0, i32 zo, i32 undef, i32 3...> // zo means set here element = 0, -- means "can optimize out this operand"
%wide.insert226vector_func.i1 = shufflevector <32 x i8> %--, <32 x i8> %extended.225vector_func.i, <32 x i32> <i32 zo, i32 32, i32 zo, i32 zo...>
%wide.insert226vector_func.i = %wide.insert226vector_func.i0 | %wide.insert226vector_func.i1

But the first IR %wide.insert226vector_func.i0 can be directly replaced by its operand %wide.insert224vector_func.i when the element with index 1, 1+4, 1+2*4 ... is zero.

@RKSimon 's patch (optimization: https://reviews.llvm.org/rG2a419a0b9957ebac9e11e4b43bc9fbe42a9207df) will try check if related elements is zero, and @craig.topper 's patch happen generated AssertZext to show this element is zero.
But in fact , this element may be undef.

craig.topper added inline comments.Jul 15 2021, 6:18 PM
llvm/test/CodeGen/X86/tmp/rep.ll
27 ↗(On Diff #359181)

Can you provide the values for %call.i24.i, %i384, %i385, %i386 in your failing case?

craig.topper added inline comments.Jul 15 2021, 6:25 PM
llvm/test/CodeGen/X86/tmp/rep.ll
14 ↗(On Diff #359181)

Or is it this fptoui that overflowed?

xiangzhangllvm added inline comments.Jul 15 2021, 6:48 PM
llvm/test/CodeGen/X86/tmp/rep.ll
14 ↗(On Diff #359181)

Yes, It is. We can't control the load value of %1 %2

27 ↗(On Diff #359181)

This is just a small reproduce, in our project, if we clear the high 8 bits of the related element, the project will run pass.

craig.topper added inline comments.Jul 15 2021, 6:50 PM
llvm/test/CodeGen/X86/tmp/rep.ll
14 ↗(On Diff #359181)

Where did it come from? Does the program fail -fsanitize=undefined?

xiangzhangllvm added inline comments.Jul 15 2021, 6:56 PM
llvm/test/CodeGen/X86/tmp/rep.ll
14 ↗(On Diff #359181)

It come from a long way: a lot of fmul and fadd operations, also include calling some function.
What is the llc option "-fsanitize=undefined" corresponding to ? It is a OCL project with a log of cl files. It is much easy for me to use llc option.

Hello @craig.topper , what do you want to mean ?
Do you want to mean the load of float value should not overflow to i8 ?

craig.topper added inline comments.Jul 15 2021, 7:02 PM
llvm/test/CodeGen/X86/tmp/rep.ll
14 ↗(On Diff #359181)

There isn't an llc option. The option makes clang include extra code in the binary to check the range of inputs on different operations.

How far out of range is the loaded data? Can you provide the values? What does the program expect the result of the fptoui to be?

xiangzhangllvm added inline comments.Jul 15 2021, 7:11 PM
llvm/test/CodeGen/X86/tmp/rep.ll
14 ↗(On Diff #359181)

It is difficult for me to provide the value.
In fact, when I debug the problem, I also want to get the value, but the host program is not built from our project, it make me hard to print the middle value of kernel code.

Can we focus on this small reproduce cast ?
If we assume there is an over flow in this small case. We can compare the *.s I uppload. The rep_old.s is not correct.

Hello @craig.topper , what do you want to mean ?
Do you want to mean the load of float value should not overflow to i8 ?

If the loaded value is not in the range [0.0, 1.0] so that %i375 is in the range [0.0, 255.0], then I don't know what value the fptoui is supposed to produce.

If the loaded value is not in the range [0.0, 1.0] so that %i375 is in the range [0.0, 255.0], then I don't know what value the fptoui is supposed to produce.

Or we can see in this way.
In the edge calculation, we just need to load 4 elements of v8f32, but for performance reason, we usually load full (8) elements. So some of its loaded element has no meaning (may over flow).
So the shuffle don't select these element, but the shuffle may zero there element then for other use. (very just like this small reproduce case )

I'm confused. The AssertZext gets inserted by LegalizeIntegerTypes.cpp for this test case when the 8xi8 fptoui becomes an 8xi16 fptosi. LegalizeVectorOps inserts an AssertSext later when the v8i16 gets promoted to v8i32. This patch only gets rid of the AssertSext.

I'm confused. The AssertZext gets inserted by LegalizeIntegerTypes.cpp for this test case when the 8xi8 fptoui becomes an 8xi16 fptosi. LegalizeVectorOps inserts an AssertSext later when the v8i16 gets promoted to v8i32. This patch only gets rid of the AssertSext.

Because I am happen see your change about adding AssertZext at https://reviews.llvm.org/D40591, let me recheck other adding AssertZext at LegalizeIntegerTypes.cpp. Thanks for your remind!

I'm confused. The AssertZext gets inserted by LegalizeIntegerTypes.cpp for this test case when the 8xi8 fptoui becomes an 8xi16 fptosi. LegalizeVectorOps inserts an AssertSext later when the v8i16 gets promoted to v8i32. This patch only gets rid of the AssertSext.

Because I am happen see your change about adding AssertZext at https://reviews.llvm.org/D40591, let me recheck other adding AssertZext at LegalizeIntegerTypes.cpp. Thanks for your remind!

The same code also exists in X86TargetLowering::ReplaceNodeResults.

My confusion is that at multiple points in this discussion you mentioned AssertZext being what's causing the problem, but this patch doesn't affected the AssertZext in the test.

The same code also exists in X86TargetLowering::ReplaceNodeResults.

My confusion is that at multiple points in this discussion you mentioned AssertZext being what's causing the problem, but this patch doesn't affected the AssertZext in the test.

My mistake, I thought only here adding the AssertZext.
The project is very sensitive, even I fine tuning the order of the IR, it will fail to reproduce. So after I apply this patch, it not reproduce too, So I thought the fix "really" works.

I'll update the patch a little later, Thanks for your reviews.

If I'm understanding correctly, the testcase reduces to something like this:

define <16 x i8> @src(<4 x float> %arg1) {
  %f = fptoui <4 x float> %arg1 to <4 x i8>
  %s = shufflevector <4 x i8> %f, <4 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %ss = shufflevector <16 x i8> %s, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 17, i32 18, i32 19, i32 1, i32 21, i32 22, i32 23, i32 2, i32 25, i32 26, i32 27, i32 3, i32 29, i32 30, i32 31>
  ret <16 x i8> %ss
}

define <16 x i8> @tgt(<4 x float> %arg1) {
  %f = fptosi <4 x float> %arg1 to <4 x i32>
  %ss = bitcast <4 x i32> %f to <16 x i8>
  ret <16 x i8> %ss
}

llc compiles both of these to a single cvttps2dq; alive2 says these are not equivalent. So apparently there's some disagreement within SelectionDAG about what AssertZext actually means.

I'd consider this a bug in the x86 shuffle code. Changing fptoui legalization would basically imply that poison doesn't exist in SelectionDAG, which is not the direction we want to go, I think.

Err, wait, obviously they're not equivalent. Not

llc compiles both of these to a single cvttps2dq; alive2 says these are not equivalent.

Err, wait, they're obviously not equivalent... not sure what I was thinking. Would need a different testcase.

Anyway, the general issue issue I'm getting at is that poison in vectors is supposed to apply separately to each element of the vector. Here, it's leaking across elements.

If you think the small reproduce rep.ll is not small enough, I'll go further to simplify it, but I am not 100% sure I will be successful.

craig.topper added inline comments.Jul 15 2021, 10:45 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
30754 ↗(On Diff #359221)

This only exists because of the code you removed.

30758 ↗(On Diff #359221)

Same with this.

llvm/lib/Target/X86/X86ISelLowering.cpp
30758 ↗(On Diff #359221)

Got it, thanks a lot!!

xiangzhangllvm marked 2 inline comments as done.
efriedma requested changes to this revision.Jul 16 2021, 4:57 PM

Marking request changes pending a response to the question of whether the usage of AssertZext here is actually incorrect.

This revision now requires changes to proceed.Jul 16 2021, 4:57 PM

Marking request changes pending a response to the question of whether the usage of AssertZext here is actually incorrect.

I thought that is clear.

fptoui <8 x float> %1 to <8 x i8>   !=  fptosi <8 x float> %1 to <8 x i16>  +  AssertZext

In right side, when overflow, the high N bits is undef, but we mark them Zero with AssertZext.
The following optimizations (e.g. shuffle combine) tried to reset the high N bits to Zero, but it find AssertZext (which means the high N bits already Zero), so it stop reset them, this makes the error.

The small reproduce test case is just show in this way.

@efriedma We have a job blocked by this patch, so I hope we can quickly got the key point.

Let me duplicate my previous statement:

Or we can see in this way. 
In the edge calculation, we just need to load 4 elements of v8f32, but for performance reason, we usually load full (8) elements. So some of its loaded element has no meaning (may over flow).
So the shuffle don't select these element, but the shuffle may zero these elements then for other use. (very just like this small reproduce case )

+  After appending AssertZext, the shuffle optimization will stop zero these elements, and directly use them.   That is wrong.

+ After appending AssertZext, the shuffle optimization will stop zero these elements, and directly use them. That is wrong.

I agree up to this point.

The question is, is the AssertZext node wrong, or are the shuffle optimizations wrong? That dictates whether we go with this patch, or instead revert 2a419a0b9957 (and any similar optimizations, if they exist).


In any case, we should probably add the following example as a regression test:

define <16 x i8> @src(<4 x float> %arg1) {
  %f = fptoui <4 x float> %arg1 to <4 x i8>
  %s = shufflevector <4 x i8> %f, <4 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %ss = shufflevector <16 x i8> %s, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 17, i32 18, i32 19, i32 1, i32 21, i32 22, i32 23, i32 2, i32 25, i32 26, i32 27, i32 3, i32 29, i32 30, i32 31>
  ret <16 x i8> %ss
}

+ After appending AssertZext, the shuffle optimization will stop zero these elements, and directly use them. That is wrong.

I agree up to this point.

The question is, is the AssertZext node wrong, or are the shuffle optimizations wrong? That dictates whether we go with this patch, or instead revert 2a419a0b9957 (and any similar optimizations, if they exist).

It is the "adding AssertZext node" wrong. AssertZext node should correctly marked node which is really zero in N bits.

In any case, we should probably add the following example as a regression test:
...

No problem, Let me add the test soon, thanks you!

Address @efriedma 's comment, add test fptoui-may-overflow.ll, thanks very much!

Why does this patch not touch the code in DAGTypeLegalizer::PromoteIntRes_FP_TO_XINT? If the code in LegalizeVectorOps is a problem I think that code is too.

aqjune added inline comments.Jul 16 2021, 8:58 PM
llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

According to LangRef (https://llvm.org/docs/LangRef.html#fptoui-to-instruction), the overflow returns poison which means that using the value is invalid.

This is analogous to using an uninitialized variable in C/C++; there is no guarantee that the compiled program will have a reasonable behavior.

int x; // not initialized
printf("%d", x); // Assume that this printed 0xDEADBEEF
... (don't update x)
printf("%d", x); // There is no guarantee that this will also print 0xDEADBEEF

+ After appending AssertZext, the shuffle optimization will stop zero these elements, and directly use them. That is wrong.

I agree up to this point.

The question is, is the AssertZext node wrong, or are the shuffle optimizations wrong? That dictates whether we go with this patch, or instead revert 2a419a0b9957 (and any similar optimizations, if they exist).

It is the "adding AssertZext node" wrong. AssertZext node should correctly marked node which is really zero in N bits.

A lane of a vector is either poisoned, or not poisoned. It doesn't apply to individual bits of a vector.

Extending this to how computeKnownBits works, if computeKnownBits says a bit is "known", it doesn't really mean the bit has to have that value. It means either the bit has that value, or that lane of the vector is poisoned. As far as I know, this applies to both the SelectionDAG and the ValueTracking versions of computeKnownBits.

llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

The comment is wrong. But the CHECK lines are correct. LangRef and alive2 say the following transform is invalid:

define <16 x i8> @src(<4 x float> %arg1) {
  %ss = shufflevector <16 x i8> poison, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 17, i32 18, i32 19, i32 1, i32 21, i32 22, i32 23, i32 2, i32 25, i32 26, i32 27, i32 3, i32 29, i32 30, i32 31>
  ret <16 x i8> %ss
}
=>
define <16 x i8> @tgt(<4 x float> %arg1) {
  ret <16 x i8> poison
}

(On a side-note, alive2 gives some confusing results for fptoui; apparently it thinks fptoui float 31.5 to i32 is poison.)

Why does this patch not touch the code in DAGTypeLegalizer::PromoteIntRes_FP_TO_XINT? If the code in LegalizeVectorOps is a problem I think that code is too.

Let me fix here too, thanks very much for your remind!

llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

Sorry, not much understand, Here the test didn't use poison elements, and we don't know the fptoui will be overflow or not, it is runtime.

Address Craig's comment: fix in PromoteIntRes_FP_TO_XINT too.

These changes cause test failures on other targets.

These changes cause test failures on other targets.

Yes, I first commit the change let you review, my local is build with '-DLLVM_TARGETS_TO_BUILD="all"', I'll update the other target tests after I finish the testing. Thank you again!

If I'm understanding correctly, the testcase reduces to something like this:

define <16 x i8> @src(<4 x float> %arg1) {
  %f = fptoui <4 x float> %arg1 to <4 x i8>
  %s = shufflevector <4 x i8> %f, <4 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %ss = shufflevector <16 x i8> %s, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 17, i32 18, i32 19, i32 1, i32 21, i32 22, i32 23, i32 2, i32 25, i32 26, i32 27, i32 3, i32 29, i32 30, i32 31>
  ret <16 x i8> %ss
}

define <16 x i8> @tgt(<4 x float> %arg1) {
  %f = fptosi <4 x float> %arg1 to <4 x i32>
  %ss = bitcast <4 x i32> %f to <16 x i8>
  ret <16 x i8> %ss
}

Other than reverting the patch, can %f in tgt be frozen? (Unless it is likely to cause performance regression; I found that updating vector ops to deal with freeze is hard work)?

llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

@efriedma Thank you for the report, I made a pull request at Alive2 that fixes the bugs. @nlopes will be back in a few weeks and have a look at it.

@xiangzhangllvm poison is kind of a conceptual value that appears in LLVM IR's abstract machine.
It is used to carry guarantees from C/C++ that e.g., casting big floats to signed integer is not legal.
Of course, one cannot statically determine if a C program will do such cast or not.
As double free raises segmentation fault, the execution can fail or print a bogus value; integer overflow is just less visible to users.

llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

Hi @aqjune ,let's talk on the test. The current problem is

"%f = fptoui <4 x float> %arg to <4 x i8>" will be convert to "fptosi <4 x float> %1 to <4 x i16> + AssertZext"

Do you mean in overflow case, %f should be poison value and report error ? (I agree)
But this convert self is not correct.
And even we mark some elements of %f is poison, but the following shuffle didn't use (read) this poison element. It just write 0 into these elements.
And this action of "write 0" will be remove by AssertZext. So this patch is try to removing this AssertZext.

Update the related tests.

jrtc27 added inline comments.Jul 17 2021, 8:51 AM
llvm/test/CodeGen/AArch64/arm64-convert-v4f64.ll
26–28 ↗(On Diff #359553)

Did the instruction count actually go down or did you just delete the instructions that were no longer present but not add new ones? You've certainly lost the CHECK line for the final instruction that puts all the temporary values into the return register.

llvm/test/CodeGen/RISCV/rv64d-double-convert.ll
25 ↗(On Diff #359553)

All the RISC-V changes look like regressions to me unless I'm missing something...

aqjune added inline comments.Jul 17 2021, 9:22 AM
llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

"%f = fptoui <4 x float> %arg to <4 x i8>" will be convert to "fptosi <4 x float> %1 to <4 x i16> + AssertZext"

To me the transformation seems correct; let me explain why I think so.
I'll use <1 x float> instead of <4 x float> to avoid confusion; it seems the number of elements doesn't matter here.

Let's assume that %arg[0] fit in [0,256); then AssertZext is fine, because 8 MSBs of i16 is anyway zero.
The problematic case is when %arg[0] doesn't fit in [0, 256).
Then, %f[0] is poison; using %f[0] will raise undefined behavior.
Then, the optimized code can do anything including using the register which isn't filled with zero bits...
So in either case, having AssertZext is fine.

To me, efriedma's shufflevector transformation (https://reviews.llvm.org/D106053#2882183) looks fishier.
If 2a419a0b9957 is the root cause of the shufflevector transformation, and reverting it (or fixing it using freeze) solves the problem, then what do you think about the alternative solution?
2a419a0b9957 has small diffs in the tests which implies that it will have a small impact on performance, maybe.

Here's what I think is the issue. After type legalization we have a bitcast from <4 x i32> to <16 x i8>. When computeKnownBits sees a bitcast from a wider element type to a smaller element type, it converts DemandedElts into DemandedBits to try to get an answer from a portion of the wider input element elements.

How does poison interact with a bitcast from a wide element type to a narrower element type? Is what computeKnownBits does for this poison safe?

How does poison interact with a bitcast from a wide element type to a narrower element type? Is what computeKnownBits does for this poison safe?

The rules for bitcast are pretty simple, I think: each element of the output vector depends on some number of elements in the input vector. (In the case of bitcast <4 x i32> to <16 x i8>, that's exactly one element.) If an output element depends on a poisoned input element, the result is poison.

I think computeKnownBits is handling this correctly.

Maybe the following will help clarify: consider the following example:

define <4 x i32> @src(<4 x float> %a) {
  %u = fptoui <4 x float> <float 256.0> to <4 x i8>
  %z = zext <4 x i8> %u to <4 x i32>
  %and = and <4 x i32> %z, <i32 255, i32 255, i32 255, i32 255>
  ret <4 x i32> %and
}

It is legal to optimize this to plain "poison". On the surface, this looks like the same thing as the shufflevector example, but shufflevector itself is sort of special: each element of the output is only poisoned if the corresponding source element is poisoned.

How does poison interact with a bitcast from a wide element type to a narrower element type? Is what computeKnownBits does for this poison safe?

The rules for bitcast are pretty simple, I think: each element of the output vector depends on some number of elements in the input vector. (In the case of bitcast <4 x i32> to <16 x i8>, that's exactly one element.) If an output element depends on a poisoned input element, the result is poison.

I think computeKnownBits is handling this correctly.

Maybe the following will help clarify: consider the following example:

define <4 x i32> @src(<4 x float> %a) {
  %u = fptoui <4 x float> <float 256.0> to <4 x i8>
  %z = zext <4 x i8> %u to <4 x i32>
  %and = and <4 x i32> %z, <i32 255, i32 255, i32 255, i32 255>
  ret <4 x i32> %and
}

It is legal to optimize this to plain "poison". On the surface, this looks like the same thing as the shufflevector example, but shufflevector itself is sort of special: each element of the output is only poisoned if the corresponding source element is poisoned.

Ok that makes sense if the bitcast existed in IR. But this bitcast was created as part of type legalization and potentially spread poison to adjacent elements that didn't happen in the IR.

Ok that makes sense if the bitcast existed in IR. But this bitcast was created as part of type legalization and potentially spread poison to adjacent elements that didn't happen in the IR.

Are you asking if it's legal to turn src into tgt in the following?

define <4 x i8> @src(<1 x float> %a) {
  %u = fptoui <1 x float> %a to <1 x i8>
  %s = shufflevector <1 x i8> %u, <1 x i8> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
  ret <4 x i8> %s
}

define <4 x i8> @tgt(<1 x float> %a) {
  %u = fptoui <1 x float> %a to <1 x i32>
  %and = bitcast <1 x i32> %u to <4 x i8>
  ret <4 x i8> %and
}

I think according to current LangRef rules, it technically isn't; we're turning undef into poison. But we're probably going to change shufflevector so we have "poison" in the shuffle mask instead of undef, so the result of the shufflevector has poison elements. So probably not worth spending time on this at the moment. See discussion on https://bugs.llvm.org/show_bug.cgi?id=44185 .

Ok that makes sense if the bitcast existed in IR. But this bitcast was created as part of type legalization and potentially spread poison to adjacent elements that didn't happen in the IR.

Oh, also, technically speaking, in fptoui-may-overflow.ll, the bitcast isn't getting created by type legalization; there's still a BUILD_VECTOR at that point. The bitcast is getting built by DAGCombine. From the debug dumps:

Type-legalized selection DAG: %bb.0 'fptoui_shuffle:'
SelectionDAG has 27 nodes:
  t0: ch = EntryToken
          t24: i32 = extract_vector_elt t19, Constant:i64<0>
        t25: i8 = truncate t24
          t27: i32 = extract_vector_elt t19, Constant:i64<1>
        t28: i8 = truncate t27
          t30: i32 = extract_vector_elt t19, Constant:i64<2>
        t31: i8 = truncate t30
          t33: i32 = extract_vector_elt t19, Constant:i64<3>
        t34: i8 = truncate t33
      t35: v16i8 = BUILD_VECTOR t25, t28, t31, t34, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8
      t7: v16i8 = BUILD_VECTOR Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>
    t8: v16i8 = vector_shuffle<0,17,18,19,1,21,22,23,2,25,26,27,3,29,30,31> t35, t7
  t11: ch,glue = CopyToReg t0, Register:v16i8 $xmm0, t8
      t2: v4f32,ch = CopyFromReg t0, Register:v4f32 %0
    t17: v4i32 = fp_to_sint t2
  t19: v4i32 = AssertZext t17, ValueType:ch:i8
  t12: ch = X86ISD::RET_FLAG t11, TargetConstant:i32<0>, Register:v16i8 $xmm0, t11:1


Optimized type-legalized selection DAG: %bb.0 'fptoui_shuffle:'
SelectionDAG has 14 nodes:
  t0: ch = EntryToken
            t2: v4f32,ch = CopyFromReg t0, Register:v4f32 %0
          t17: v4i32 = fp_to_sint t2
        t19: v4i32 = AssertZext t17, ValueType:ch:i8
      t36: v16i8 = bitcast t19
      t7: v16i8 = BUILD_VECTOR Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>
    t48: v16i8 = vector_shuffle<0,17,18,19,4,21,22,23,8,25,26,27,12,29,30,31> t36, t7
  t11: ch,glue = CopyToReg t0, Register:v16i8 $xmm0, t48
  t12: ch = X86ISD::RET_FLAG t11, TargetConstant:i32<0>, Register:v16i8 $xmm0, t11:1

Its been a a long week and with that and timezone differences I have been really struggling to keep up with the pace and direction of this patch - are you saying that we can't depend on value tracking in the DAG for manipulation of vector elements? If D106222 make this whole thing go away then so be it, but all its doing is asking if a blend with zero shuffle is necessary or if that element is already known to be zero - we do a lot of other similar combines for shuffles (packs relying on zero/sign upper bits etc.) and otherwise.

The one-line summary: if computeKnownBits says a bit is "known", it means either the bit has that value, or that lane of the vector is poisoned.

For almost all optimizations, the possibility of poison doesn't matter: if the source lane is poison, the result lane is also poison. However, this means the 2a419a0b transform isn't safe: if a lane in the result of a shufflevector is supposed to be zero, it has to actually be zero, not zero-or-poison. We could recover the optimization for the cases where it's legal with some form of isGuaranteedNotToBePoison().

Some other shuffle transforms might also run into trouble with poison? For example, matchShuffleWithPACK is wrong, I think, but the problem isn't the use of MaskedValueIsZero. The issue is the use of getBitcast(): if you bitcast from <4 x i8> to <1 x i32>, the 32-bit lane is poison if any of the input i8 lanes are poison.

OK thanks for getting me up to speed - I'll take a look at matchShuffleWithPACK (creating those bitcasts on the fly has been a problem for ages as it also affects one use tests....)

xiangzhangllvm added inline comments.Jul 17 2021, 4:37 PM
llvm/test/CodeGen/AArch64/arm64-convert-v4f64.ll
26–28 ↗(On Diff #359553)

Sorry for I didn't familiar with aarch64 instructions, I didn't clear about what uzp1 mean, and the result didn't have these uzp1 instructions, So I dele them.
Can I update it with utils/update_llc_test_checks.py ?

llvm/test/CodeGen/RISCV/rv64d-double-convert.ll
25 ↗(On Diff #359553)

Yes, if it base on the PromoteIntRes_FP_TO_XINT

llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

I think the key of our discussion should base on the correctness of the transformation.

The logic of 2a419a0b9957 self is no problem, it base on the how the computeKnownBits handle the AssertZext.
Now the IR already be

"fptosi <4 x float> %1 to <4 x i16> + AssertZext  i8"

If you think the transformation is correct, how can we know there is poison bits ?
AssertZext self do not contain the "poison" meaning that the bits 8-15 is poison.
it should just tell a truth that the high N bits is zero.

efriedma added inline comments.Jul 17 2021, 6:51 PM
llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

It's hard to give good examples involving poison in SelectionDAG, generally. Most of the focus of poison-based optimizations has been at the IR level, IR has better documentation, and Alive2 also only works at the IR level. So it's easier to discuss examples in IR as much as possible.

For now, let's put aside the discussion of the semantics of AssertZext.

In IR, the following transform is illegal, according to LangRef, and Alive2:

define <16 x i8> @src() {
  %and = and <4 x i32> poison, <i32 255, i32 255, i32 255, i32 255>
  %bitcast = bitcast <4 x i32> %and to <16 x i8>
  %ss = shufflevector <16 x i8> %bitcast, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 17, i32 18, i32 19, i32 1, i32 21, i32 22, i32 23, i32 2, i32 25, i32 26, i32 27, i32 3, i32 29, i32 30, i32 31>
  ret <16 x i8> %ss
}

define <16 x i8> @tgt() {
  %and = and <4 x i32> poison, <i32 255, i32 255, i32 255, i32 255>
  %bitcast = bitcast <4 x i32> %and to <16 x i8>
  ret <16 x i8> %bitcast
}

Does this make sense?

I think 2a419a0b9957 performs the equivalent transform on SelectionDAG nodes. Do you think my understanding here is correct? Or is the SelectionDAG transform different somehow?

xiangzhangllvm added inline comments.Jul 17 2021, 7:14 PM
llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

Hi @efriedma, Yes, I agree/understand the "In IR, the following transform is illegal" in your test, because we can obvious see there is poison value. thanks.

But I don't understand why " 2a419a0b9957 performs the equivalent transform on SelectionDAG nodes"

The MaskedElementsAreZero didn't know the element is poison in fptoui-may-overflow.ll. It just check the common bits of DemandedElts, which not mark with poison.

efriedma added inline comments.Jul 17 2021, 9:43 PM
llvm/test/CodeGen/X86/fptoui-may-overflow.ll
5 ↗(On Diff #359521)

The following is also illegal to transform because %a might be poison:

define <16 x i8> @src(<4 x i32> %a) {
  %and = and <4 x i32> %a, <i32 255, i32 255, i32 255, i32 255>
  %bitcast = bitcast <4 x i32> %and to <16 x i8>
  %ss = shufflevector <16 x i8> %bitcast, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 17, i32 18, i32 19, i32 4, i32 21, i32 22, i32 23, i32 8, i32 25, i32 26, i32 27, i32 12, i32 29, i32 30, i32 31>
  ret <16 x i8> %ss
}

define <16 x i8> @tgt(<4 x i32> %a) {
  %and = and <4 x i32> %a, <i32 255, i32 255, i32 255, i32 255>
  %bitcast = bitcast <4 x i32> %and to <16 x i8>
  ret <16 x i8> %bitcast
}

In this situation, computeKnownBits will return that the high bits of %and are known zero.

Why does it do this? We just said that it might be poison, so they're not known. But as it turns out, this relaxed notion of a bit being "known" is generally more useful. Almost all IR transformations involve transforming an expression from one form to another. And for almost all IR expressions, a poison input implies a poison result.

For places that do actually care that a value isn't poison, there are various ways we could express the alternate semantics. We could add an alternate mode to computeKnownBits that doesn't return any known bits for values which might be poison. Or we can change the transform in question to call isGuaranteedNotToBePoison, or something like that. But that currently doesn't exist in SelectionDAG, so someone would need to write it.

@efriedma, thank you for make the issue more clear. So the poison value is different to undef value. For poison value, we should not should assume it is zero value, but for undef value it is not harmful to assume the value is zero. Right?

I notice in the function SelectionDAG::computeKnownBits , the code assume the upper bit is zero (Known.Zero |= (~InMask);). Since the upper bit may be poison value, is it correct?

case ISD::AssertZext: {
  EVT VT = cast<VTSDNode>(Op.getOperand(1))->getVT();
  APInt InMask = APInt::getLowBitsSet(BitWidth, VT.getSizeInBits());
  Known = computeKnownBits(Op.getOperand(0), Depth+1);
  Known.Zero |= (~InMask);
  Known.One  &= (~Known.Zero);
  break;
}

I've raised https://bugs.llvm.org/show_bug.cgi?id=51129 to track adding a llvm::isGuaranteedNotToBePoison(SDValue) helper method

@LuoYuanke AFAICT the current implementation is fine, we've just got to be more explicit that the DAG value tracking doesn't track poison.

From D106222: "I think we're going to have to add something to the SelectionDAG::computeKnownBits/ComputeNumSignBits doxygen comments explaining that the value could have the determined known/signbits BUT it could be a poison value (or if a vector any demanded element could be poison value)."

Random thought - the GlobalISel value tracking is less developed, could we improve handling for potential poison values to it before it gets too complicated?

@efriedma, Thanks you for explain, I am much clear about that. (And sorry for my late respond, I got a long travel yesterday)

Or Maybe we can replace the AssertZext with AssertZextOrPoison in "May Be Poison" case ?
If it is much easier to change all the existed optimizations.

Anyway, I suggest let us first revert 2a419a0b9957 (= D106222) , because we have a job blocked by this problem and we already find it really has problem.

Or Maybe we can replace the AssertZext with AssertZextOrPoison in "May Be Poison" case ?

Not sure renaming AssertZext to AssertZextOrPoison really helps readability, but we should update the documentation in ISDOpcodes.h.

@LuoYuanke AFAICT the current implementation is fine, we've just got to be more explicit that the DAG value tracking doesn't track poison.

From D106222: "I think we're going to have to add something to the SelectionDAG::computeKnownBits/ComputeNumSignBits doxygen comments explaining that the value could have the determined known/signbits BUT it could be a poison value (or if a vector any demanded element could be poison value)."

Random thought - the GlobalISel value tracking is less developed, could we improve handling for potential poison values to it before it gets too complicated?

The implementation of computeKnownBits is very similar to the one in SelectionDAG. Having tried to catch up on these threads, I'm not sure what's really missing though except for some clearer documentation?

@LuoYuanke AFAICT the current implementation is fine, we've just got to be more explicit that the DAG value tracking doesn't track poison.

From D106222: "I think we're going to have to add something to the SelectionDAG::computeKnownBits/ComputeNumSignBits doxygen comments explaining that the value could have the determined known/signbits BUT it could be a poison value (or if a vector any demanded element could be poison value)."

Random thought - the GlobalISel value tracking is less developed, could we improve handling for potential poison values to it before it gets too complicated?

The implementation of computeKnownBits is very similar to the one in SelectionDAG. Having tried to catch up on these threads, I'm not sure what's really missing though except for some clearer documentation?

I've trying to think if there would be a decent way to run the value tracking calls in a 'no poison' mode (working with isGuaranteedNotToBePoison functionality) - and I was wondering if it'd be easier to develop in GISel first which has accumulated less baggage so far.

Can this be abandoned now?

xiangzhangllvm abandoned this revision.Jul 20 2021, 11:18 PM

Can this be abandoned now?

Let me abandon it first.
If you has a new fix patch, Please note here.

For "a decent way", How about set a flag in the IR/Node for the no-poison (or "maybe poison") value, I think it not just useful in DAG optimization, but also in IPO.

Just to be clear - D106222 unblocked the issue you were seeing?

Just to be clear - D106222 unblocked the issue you were seeing?

Yes, D106222 fixed my blocked issue too (a shuffle combine problem), thank you!