Page MenuHomePhabricator

[SLP]Fix undef handling in gather function.
Needs ReviewPublic

Authored by ABataev on Jun 6 2022, 8:40 AM.

Details

Summary

SLP cannot replace undefs with poison without extra checks, plus need to
preserve the original order of scalars, if decided not to shuffle the
final vectorbuild sequence.

Diff Detail

Event Timeline

ABataev created this revision.Jun 6 2022, 8:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2022, 8:40 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
ABataev requested review of this revision.Jun 6 2022, 8:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2022, 8:40 AM
vporpo added inline comments.Jun 6 2022, 11:26 AM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

Doesn't an undef here mean that lane 4 can potentially be poison? Wouldn't that be incorrect?

hvdijk added inline comments.Jun 6 2022, 11:33 AM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

A mask of undef means the result is undef, not poison, even if the input contains poison elements. See https://llvm.org/docs/LangRef.html#shufflevector-instruction (which is less clear than I would have liked, it used to explicitly say undef, it now says "undefined" to be more readable, but it's not obvious that it is still meant to refer to undef rather than poison)

nikic added a subscriber: nikic.Jun 6 2022, 11:37 AM
nikic added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

This will be switched to return a poison element in the future, so it would be best not to rely on it. (cc @nlopes).

nlopes added inline comments.Jun 6 2022, 11:46 AM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

Right, thanks!
We need to switch that behavior to fix a bunch of optimizations.

hvdijk added inline comments.Jun 6 2022, 11:51 AM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

D93818 linked for reference. Okay, are you saying for the purpose of SLPVectorizer, you would prefer that we already treat undef masks as selecting poison? That should always result in code that is already correct today, but it may not always be optimal under the current rules.

vporpo added inline comments.Jun 6 2022, 11:55 AM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

Thanks for the explanation!

38 ↗(On Diff #434535)

SHUFFLE is: LD LD LD Undef
SHUFFLE1 is: LD LD LD LD
Shouldn't we reuse SHUFFLE` for both operands of mul` ?

nlopes added a subscriber: aqjune.Jun 6 2022, 12:05 PM
nlopes added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

That would be awesome, yes!
It would also motivate us to get that done (@aqjune :)

ABataev added inline comments.Jun 6 2022, 12:07 PM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

I'll try to update this patch to follow this new rule

ABataev added inline comments.Jun 6 2022, 12:10 PM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

Just one question. Shall I assume that <0, 0, 0, 0> and <0, 0, 0, undef> can be safely merged to <0, 0, 0, 0> then? Or better to wait for D93818?

hvdijk added inline comments.Jun 6 2022, 12:13 PM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

I think my D127073 actually already achieved that. The concerns there of regressions there are cases where under the current rules, we can use undef masks, but under those proposed new rules of undef masks, we cannot as we need to ensure the result is not poison. However, this diff is intended to also perform some additional optimisations so it may still be worth updating this one.

ABataev added inline comments.Jun 6 2022, 12:15 PM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

Yep, my update would be pretty similar to your original patch, just some situation will be handled a bit better. Just need to clarify the question, if shuffle <ld, ld, ld, undef> can be safely replaced by shuffle <ld, ld, ld, ld>

hvdijk added inline comments.Jun 6 2022, 12:19 PM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

If we ensure that the vectorizer only ever generates a mask of <0, 0, 0, undef> when we element 3 is meant to be poison, it will be valid to merge that with <0, 0, 0, 0> even without waiting for D93818, I think: we do not have to support arbitrary existing shufflevectors here that may rely on undef masks giving undef results, we only have to worry about the new shufflevectors we generate.

nlopes added inline comments.Jun 6 2022, 1:19 PM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

Just need to clarify the question, if shuffle <ld, ld, ld, undef> can be safely replaced by shuffle <ld, ld, ld, ld>

It's correct to do that in just 2 cases:

  • ld is known to be non-poison (eg via ValueTracking), or
  • ld is present in the expression tree already. so x * undef can be replaced with x * x, but not with x * y.

If instead of undef, you've poison, then it can safely be replaced with ld.
We've been moving away from initializing vectors with undef, so hopefully undef should be rare in vectors these days. The only exception left is shufflevector.
It would be nice to duplicate some of the SLP tests to use poison rather than undef to ensure the code is simplifying this appropriately.

hvdijk added inline comments.Jun 6 2022, 1:27 PM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

This explanation is correct for arbitrary existing shuffles, but I wrote earlier that I do not think that is the case we are concerned with. We are concerned with new shuffle instructions that we generate. If we know that the undef mask was generated by SLPVectorizer (because it is a mask that is not even actually the operand to a shufflevector instruction yet, because it is a mask that we have built up in order to potentially generate a shufflevector instruction), and we decide that in SLPVectorizer, we only generate an undef mask if we originally wanted the result to be poison, that is also a case where we can treat an undef mask as selecting poison.

aqjune added inline comments.Jun 6 2022, 5:51 PM
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll
37 ↗(On Diff #434535)

Let me slightly make a progress to that patch and its friends. I think I will have some time for this.

ABataev updated this revision to Diff 434790.Jun 7 2022, 5:34 AM

Address comments

Comparing the changes to the tests in this diff to those in D127073, I am seeing a number of tests where we have more shufflevectors, and none where we have fewer. Are there improvements that are not as obvious to see?

Comparing the changes to the tests in this diff to those in D127073, I am seeing a number of tests where we have more shufflevectors, and none where we have fewer. Are there improvements that are not as obvious to see?

Yes, there are. These extra shuffles caused by changes in performExtractsShuffleAction() and in IsIdenticalOrLessDefined lambda, these changes (they treat UndefMaskElem as possible poison) increase number of shuffles. Without them, there are less shuffles, these extra changes are required for correct handling of UndefMaskElem as posion.

Comparing the changes to the tests in this diff to those in D127073, I am seeing a number of tests where we have more shufflevectors, and none where we have fewer. Are there improvements that are not as obvious to see?

Yes, there are. These extra shuffles caused by changes in performExtractsShuffleAction() and in IsIdenticalOrLessDefined lambda, these changes (they treat UndefMaskElem as possible poison) increase number of shuffles. Without them, there are less shuffles, these extra changes are required for correct handling of UndefMaskElem as posion.

That suggests that D127073 still results in incorrect code in some cases. I was under the impression that it was already correct, just not optimal. Can you point to specific tests where you believe D127073 results in wrong code?

Taking a random example

--- a/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll
+++ b/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll
@@ -19,9 +19,11 @@ define { <2 x float>, <2 x float> } @foo(%struct.sw* %v) {
 ; CHECK-NEXT:    [[TMP8:%.*]] = fadd <4 x float> [[TMP7]], undef
 ; CHECK-NEXT:    [[TMP9:%.*]] = fadd <4 x float> [[TMP8]], undef
 ; CHECK-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float> [[TMP9]], <4 x float> poison, <2 x i32> <i32 0, i32 1>
-; CHECK-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float> [[TMP9]], <4 x float> poison, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:    [[INS1:%.*]] = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> [[TMP10]], 0
-; CHECK-NEXT:    [[INS2:%.*]] = insertvalue { <2 x float>, <2 x float> } [[INS1]], <2 x float> [[TMP11]], 1
+; CHECK-NEXT:    [[TMP11:%.*]] = shufflevector <2 x float> undef, <2 x float> [[TMP10]], <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:    [[TMP12:%.*]] = shufflevector <4 x float> [[TMP9]], <4 x float> poison, <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float> undef, <2 x float> [[TMP12]], <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:    [[INS1:%.*]] = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> [[TMP11]], 0
+; CHECK-NEXT:    [[INS2:%.*]] = insertvalue { <2 x float>, <2 x float> } [[INS1]], <2 x float> [[TMP13]], 1
 ; CHECK-NEXT:    ret { <2 x float>, <2 x float> } [[INS2]]
 ;
 entry:

The extra shufflevectors here, TMP11 and TMP13, are identity shuffles that should not have been generated, they do not change the handling of undef/poison.

Comparing the changes to the tests in this diff to those in D127073, I am seeing a number of tests where we have more shufflevectors, and none where we have fewer. Are there improvements that are not as obvious to see?

Yes, there are. These extra shuffles caused by changes in performExtractsShuffleAction() and in IsIdenticalOrLessDefined lambda, these changes (they treat UndefMaskElem as possible poison) increase number of shuffles. Without them, there are less shuffles, these extra changes are required for correct handling of UndefMaskElem as posion.

That suggests that D127073 still results in incorrect code in some cases. I was under the impression that it was already correct, just not optimal. Can you point to specific tests where you believe D127073 results in wrong code?

Taking a random example

--- a/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll
+++ b/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll
@@ -19,9 +19,11 @@ define { <2 x float>, <2 x float> } @foo(%struct.sw* %v) {
 ; CHECK-NEXT:    [[TMP8:%.*]] = fadd <4 x float> [[TMP7]], undef
 ; CHECK-NEXT:    [[TMP9:%.*]] = fadd <4 x float> [[TMP8]], undef
 ; CHECK-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float> [[TMP9]], <4 x float> poison, <2 x i32> <i32 0, i32 1>
-; CHECK-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float> [[TMP9]], <4 x float> poison, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:    [[INS1:%.*]] = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> [[TMP10]], 0
-; CHECK-NEXT:    [[INS2:%.*]] = insertvalue { <2 x float>, <2 x float> } [[INS1]], <2 x float> [[TMP11]], 1
+; CHECK-NEXT:    [[TMP11:%.*]] = shufflevector <2 x float> undef, <2 x float> [[TMP10]], <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:    [[TMP12:%.*]] = shufflevector <4 x float> [[TMP9]], <4 x float> poison, <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float> undef, <2 x float> [[TMP12]], <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:    [[INS1:%.*]] = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> [[TMP11]], 0
+; CHECK-NEXT:    [[INS2:%.*]] = insertvalue { <2 x float>, <2 x float> } [[INS1]], <2 x float> [[TMP13]], 1
 ; CHECK-NEXT:    ret { <2 x float>, <2 x float> } [[INS2]]
 ;
 entry:

The extra shufflevectors here, TMP11 and TMP13, are identity shuffles that should not have been generated, they do not change the handling of undef/poison.

Yes, I see some extra shuffles, that can be removed safely. Working on the improvements.

ABataev updated this revision to Diff 434939.Jun 7 2022, 1:49 PM

Address comments.

Thanks, I am seeing improvements with this new version. I'll try to go over the changes in more detail later, some initial superficial comments now.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6965

This extra overload does not look like it adds anything: it takes a parameter U if and only if U is Value *. The previous overload takes Value * if and only if U is Value *. That is the same thing; this new overload does not appear to add anything and the file compiles successfully for me if I remove this.

6970

This extra overload goes against the documentation above ValueSelect. The documentation for ValueSelect says it takes a Value *, and if a Value * is wanted, returns that value, otherwise returns a default value. This extra overload is needed because ValueSelect::get now gets called with a const TreeEntry * in the instantiation of performExtractsShuffleAction<const TreeEntry>, but that is contrary to its documentation and it is not clear what it is supposed to mean.

7001

a && b || !a && c is simpler expressed as a ? b : c.

7010

(a ? b : c) == b is simpler expressed as a || c == b.

7016

This comment seems like it does not match how IsCompleteIdentity is set: we can get IsCompleteIdentity to be false even when we select elements only from a single vector.

ABataev updated this revision to Diff 436845.Jun 14 2022, 10:24 AM

Address comments

vporpo added inline comments.Jun 22 2022, 12:20 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9109

These ->getAggregateElement(...) and ->getOperand(...) expressions repeat several times in the code and make this code quite verbose. Is there a way to avoid this repetition?

ABataev updated this revision to Diff 439867.Jun 24 2022, 12:54 PM

Address comments

vporpo added inline comments.Jun 24 2022, 1:37 PM
llvm/test/Transforms/SLPVectorizer/X86/buildvector-same-lane-insert.ll
50

TMP9 seems to be redundant here, it looks like it is a copy of TMP3:
TMP8 is: TMP3[0], undef
TMP9 is: TMP3[0], TMP3[1]

I guess this was an issue even before this patch: TMP8 was a copy of TMP3, so the TMP8 shufflevector was redundant.

ABataev updated this revision to Diff 440605.Jun 28 2022, 7:00 AM

Address comments

vporpo added inline comments.Jun 28 2022, 12:02 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9063

Please add a comment describing the lambda.

9128–9131

This also repeats in line 9054. Perhaps move it to a lambda like getOperandIndex(SI2, VecOp2) ?

hvdijk added inline comments.Jun 28 2022, 12:03 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9128–9131

Actually... That just looks very wrong, = rather than ==.

ABataev updated this revision to Diff 440728.Jun 28 2022, 12:42 PM

Address comments.

ABataev updated this revision to Diff 441713.Jul 1 2022, 9:31 AM

Fixes and improvements.

ABataev updated this revision to Diff 451973.Thu, Aug 11, 1:51 PM

Rebase + fixes